hadoop yarn (in a nutshell)

I was just explaining to a colleague today how Hadoop 2.0 (best known for YARN, which stands for Yet Another Resource Negotiator) differs from Hadoop 1.0. Today's "core Hadoop" consists of HDFS and MapReduce, and each has its own master & worker daemon processes: NameNode & DataNode for HDFS and JobTracker & TaskTracker for MapReduce. This makes sense, as HDFS and MapReduce are focused on two different things.

HDFS offers redundant, reliable storage while the tightly-coupled MapReduce framework concentrates on data processing and the cluster resource management that is needed for such a scalable platform. This model works well when we only have applications that layer on top of MapReduce such as the open-source Pig & Hive frameworks and the commercial offering from Datameer, but these tools will only perform as fast as the underlying batch-oriented MapReduce layer will allow.

There are other offerings that are bringing near real-time responsiveness to the Hadoop "ecosystem". One of the most established is HBase. In projects like this one, an entirely new set of daemons comes into play to handle the necessary data processing as well as cluster resource management. This model starts to become a nightmare on two fronts as more and more Hadoop tools & frameworks become available.

First, we have multiple teams (open source and commercial) building the same kinds of software to handle operating in a clustered environment, which simply leads to way too many solutions to the same general problem. Second, it isn't hard to imagine that we could quickly have several different sets of master & worker daemons running on every machine in our cluster. Hadoop 2.0 / YARN is here to help with both of these concerns.

YARN attacks the underlying resource management problem for all and features an interface that allows data processing frameworks to plug into this shared functionality. MapReduce and HBase in Hadoop 2.0 will sit on top of YARN (w/o requiring app developers to rewrite anything). As always, a picture is worth a thousand words and Hortonworks has presented a few in their Hadoop YARN write-up. In fact, their synopsis is even better than mine, but I thought I'd give it a try anyway.
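
To make the "shared resource management with pluggable frameworks" idea a bit more concrete, here is a toy sketch in Python. It is not YARN's actual API. The names ToyResourceManager, Node, request, release and run_framework are all made up for illustration, and "slots" are a simplified stand-in for whatever resources a real cluster hands out. The only point is that two different frameworks ask one shared negotiator for capacity instead of each bringing its own cluster management code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One machine in the toy cluster (hypothetical, for illustration)."""
    name: str
    free_slots: int


@dataclass
class ToyResourceManager:
    """Toy stand-in for a shared resource negotiator. NOT YARN's real API."""
    nodes: list
    allocations: dict = field(default_factory=dict)

    def request(self, framework: str, slots: int) -> list:
        """Grant up to `slots` slots; returns the node name for each grant."""
        granted = []
        for node in self.nodes:
            while node.free_slots > 0 and len(granted) < slots:
                node.free_slots -= 1
                granted.append(node.name)
        self.allocations.setdefault(framework, []).extend(granted)
        return granted

    def release(self, framework: str) -> None:
        """Return every slot held by `framework` to the cluster."""
        by_name = {node.name: node for node in self.nodes}
        for node_name in self.allocations.pop(framework, []):
            by_name[node_name].free_slots += 1


def run_framework(rm: ToyResourceManager, name: str, slots_needed: int) -> str:
    """Any framework 'plugs in' the same way: it asks the shared manager."""
    granted = rm.request(name, slots_needed)
    return f"{name} got {len(granted)} of {slots_needed} slots"


if __name__ == "__main__":
    rm = ToyResourceManager([Node("n1", 2), Node("n2", 2)])
    print(run_framework(rm, "batch-framework", 3))
    print(run_framework(rm, "serving-framework", 2))
    rm.release("batch-framework")
    print(sum(node.free_slots for node in rm.nodes), "slots free after release")
```

Neither "framework" here knows anything about nodes or bookkeeping; that all lives in one place. That is the general shape of the win, even though the real thing is obviously far more involved.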

Hadoop 2.0 is coming fast and has a great opportunity to be the "app fabric" of tomorrow, as a wise man recently predicted to me. If you have an app or framework that needs to scale to where Hadoop is going (i.e., thousands of nodes), then this is the time to see how you can unwind some of your own cluster management code and take advantage of YARN yourself. If you need some help doing it, drop me a line, as that would be a fun project!!