...
Mount Point | Which Disks & How Configured | Details |
---|---|---|
/ | Use both of the 500GB disks in a mirrored fashion | Ensure the OS can run even if a disk fails |
/var/log | Use two of the 2TB disks in a mirrored fashion | Definitely want the admins' buy-in on this strategy; the goal is to keep the boxes running at all costs and to keep the size of the logs from ever becoming a runtime concern |
/hadoop | Use two of the 2TB disks in a mirrored fashion | |
/master/data/1 | Use two of the 2TB disks in a mirrored fashion | These will be the mount points used by the NameNode and JournalNode processes |
/master/data/2 | Use two of the 2TB disks in a mirrored fashion | Also for the NN & JN; even with an HA NN setup, we want the NN processes to write their fsimage files to more than one (logical) disk – hey, the bunker scene can happen!! I even strongly recommend soft-mounting an NFS directory and periodically backing up the NN's fsimage and JN's edits files. This may seem like overkill, but ensure your operational procedures are rock solid & tested when it comes to recovering the NameNode's fsimage/edits files |
/master/zk | Use two of the 2TB disks in a mirrored fashion | As more and more components start leveraging ZooKeeper, and with the intensity that HBase communicates with it, it makes solid sense to allow ZK to have its own disks |
This leaves two additional 2TB drives in the chassis, which allows some flexibility in system restoration should one of the other drives fail. One might also think that only the appropriate /master/XXX
file systems should be mounted on the "correct" master nodes, but since a robust set of Hadoop master nodes stays small (it is hard to swing it with only two machines), you will not have an unbounded number of them. It makes more sense to build them all out uniformly: with all of the appropriate file systems present on every master node, you can adapt to system failures and/or additional node expansion by reassigning master components far more easily.
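As a concrete sketch of the mirrored layout above, each disk pair can be built as a Linux software RAID-1 set with mdadm. This is only an illustrative provisioning fragment, not something to run verbatim: the device names (/dev/sdb, /dev/sdc), the md device number, and the ext4 file system are all assumptions to adapt to your hardware.

```shell
# Mirror two of the 2TB disks for /master/data/1 with Linux software RAID-1.
# /dev/sdb and /dev/sdc are hypothetical device names; verify with lsblk first.
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext4 /dev/md3
mkdir -p /master/data/1
mount /dev/md3 /master/data/1
# Persist the mount across reboots.
echo '/dev/md3  /master/data/1  ext4  defaults,noatime  0 0' >> /etc/fstab
```

The same recipe repeats for each mirrored mount point in the table, with a different md device and disk pair each time.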
Worker Nodes
The worker nodes have a different restoration policy than the masters and should be set up to maximize storage instead of minimizing the possibility of failure. With that in mind, I prescribe the following based on the hardware described earlier.
Mount Point | Which Disks & How Configured | Details |
---|---|---|
/ | Use one of the 500GB disks | |
/var/log | Use the other 500GB disk | Again, definitely want admins' buy-in on this strategy |
/hadoop | Use one of the 2TB disks | |
/worker/NN | Individually mount the remaining 11 2TB disks | The heart of the Just a Bunch Of Disks (JBOD) strategy |
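The JBOD mounts lend themselves to a small script. The following sketch generates the fstab entries for the 11 individually mounted disks; the device letters (d through n) and partition/file-system choices are hypothetical and must be matched to your actual chassis.

```shell
# Build /etc/fstab entries for 11 individually mounted JBOD data disks.
# Device letters d..n are hypothetical placeholders; confirm with lsblk.
FSTAB_ENTRIES=""
i=1
for dev in d e f g h i j k l m n; do
  FSTAB_ENTRIES="${FSTAB_ENTRIES}/dev/sd${dev}1  /worker/${i}  ext4  defaults,noatime  0 0
"
  i=$((i + 1))
done
printf '%s' "$FSTAB_ENTRIES"
```

Each resulting mount point would then be listed in the DataNode's dfs.datanode.data.dir property so HDFS spreads its block storage across all 11 spindles.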
The real gotcha above is consuming a full "big disk" just for /hadoop. An alternative is to partition it into two entities that would surface as /hadoop and one of the /worker/NN mount points.
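That alternative could look like the following parted fragment. Everything here is an assumption for illustration: the device name (/dev/sdb), the roughly 25%/75% split that carves out about 500GB for /hadoop, and the partition labels.

```shell
# Hypothetical: split one 2TB disk (/dev/sdb) into two partitions --
# ~25% (~500GB) for /hadoop, the remainder for one of the /worker/NN mounts.
parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart hadoop ext4 0% 25%
parted -s /dev/sdb mkpart worker ext4 25% 100%
mkfs.ext4 /dev/sdb1 && mkfs.ext4 /dev/sdb2
```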
Edge Nodes
The edge (and ingestion) nodes are focused on ensuring there is plenty of storage for transient data. Generally speaking, these are replaceable nodes, and the data they house should be recreatable from either the source or from HDFS.
Mount Point | Which Disks & How Configured | Details |
---|---|---|
/ | Use one of the 500GB disks | |
/var/log | Use the other 500GB disk | |
/hadoop | Use one of the 2TB disks | Maintain consistency with the other node stereotypes |
varies (likely the root of home directories) | Use one or more of the remaining (or available) disks in a logical volume that provides some level of protection from a drive failure | It is very likely that a much less resource-rich machine than previously described will be used for the edge nodes; use whatever drives are available to build the best-case storage option for this transient data requirement |
Especially when there are multiple edge nodes and the expectation is that a user (or process) can log into any of them and see a consistent home directory, other alternatives, such as network-mounted home directories, may be more appropriate.
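For the protected logical volume the table calls for, one possibility is an LVM RAID-1 mirror across whatever two disks are free. This is a sketch under stated assumptions: the device names (/dev/sdc, /dev/sdd), the volume group/logical volume names, and the /home mount target are all placeholders.

```shell
# Hypothetical: mirror two spare disks with LVM to back edge-node home dirs.
pvcreate /dev/sdc /dev/sdd
vgcreate edge_vg /dev/sdc /dev/sdd
# --type raid1 with one mirror (-m 1) gives drive-failure protection.
lvcreate --type raid1 -m 1 -l 100%FREE -n home_lv edge_vg
mkfs.ext4 /dev/edge_vg/home_lv
mount /dev/edge_vg/home_lv /home
```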
Parting Thoughts
With all of the varying opinions, experiences, and self-imposed standards at play, it is very unlikely that the individual mount points used for one enterprise's Hadoop cluster will be identical to those of another company. Fortunately, what matters most is understanding what the technology itself is doing, knowing which components will be most utilized, and validating your assumptions to ensure you are being data-driven (not emotionally-driven) when making these important decisions for your cluster. I'm hopeful this information helped a bit.
Happy Hadooping!