Blog from November, 2014

These instructions are for "simple" Hadoop clusters that have no sophisticated PAM and/or Kerberos integrations.  They are ideal for the HDP Sandbox or other such "simple" setups that rely on "local" users, like the one described in building a virtualized 5-node HDP 2.0 cluster (all within a mac).

For all command examples, replace $theNEWusername with the username being created.
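If you would rather copy and paste the commands exactly as written, you can also set a shell variable up front; the username alice below is just a placeholder.  Keep in mind that su - starts a fresh login shell, so the variable has to be set again after switching users.

theNEWusername=alice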

Edge Node / Client Gateway

On the box(es) the user will SSH to and run CLI tools from (this does NOT have to be a dedicated machine; on the Sandbox, for example, there is only one machine), log in as root and execute the following commands to create a local account and set its password.

useradd -m -s /bin/bash $theNEWusername
passwd $theNEWusername

Then create an HDFS home directory for this new user.

su - hdfs
hdfs dfs -mkdir /user/$theNEWusername
hdfs dfs -chown $theNEWusername /user/$theNEWusername
hdfs dfs -chmod -R 755 /user/$theNEWusername
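
As a quick, optional sanity check while still in the hdfs account, list /user to confirm the new directory exists with the expected owner and permissions, then return to the root shell.

hdfs dfs -ls /user
exit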

Master & Worker Nodes

On the remaining cluster nodes (if any), the new user just needs to exist.  There is no need to set a password, as these CLI users will not log into any of these hosts directly.

useradd $theNEWusername
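
If you have more than a handful of nodes, a small loop saves some typing.  Here is a minimal sketch that assumes passwordless SSH as root to each host and a hypothetical hosts.txt file listing one hostname per line.

for host in $(cat hosts.txt); do
  ssh root@$host "useradd $theNEWusername"
done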

User Validation

To validate, users can SSH into the edge node with their new credentials and run the following commands to verify that they can manipulate content in HDFS.  Note: whereas in Linux a user can use "~" to reference their home directory, the FS Shell treats relative paths (i.e. nothing before the initial file or folder name) as being rooted in the user's home folder in HDFS, so they behave just like "~/" does.

hdfs dfs -put /etc/group groupList.txt
hdfs dfs -ls /user/$theNEWusername
hdfs dfs -cat groupList.txt
hdfs dfs -rm -skipTrash /user/$theNEWusername/groupList.txt

Worker Node Sizing

Once you realize how many variables are at play in a "typical" Hadoop cluster (including which components you use and which use cases matter most to you), it is easy to see why there aren't too many node sizing guides published out there.  That said, I'll go out on a limb and offer my personal sizing guide for worker nodes.

Spec out your worker nodes with multiples of the following logical building block.

1 Hard Drive (1-4TB in size)  –  2 CPU Cores  –  10 GB of RAM

With that approach, here are some typical worker node configurations based on these variables.

T-Shirt Size    Drives    Cores    Memory
Small            4         8        48 GB
Medium           8        16        96 GB
Large           12        24       128 GB
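
To make the building-block math concrete, here is a tiny shell sketch; the drive count of 12 is just an example, and the table above rounds memory up from the 10 GB-per-drive baseline to more common configurations.

drives=12
echo "Cores: $(( drives * 2 ))"
echo "RAM:   $(( drives * 10 )) GB (round up to a standard memory configuration)"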

That said, what I see a LOT of is a box with twelve 2TB drives, dual 8-core processors, and 128GB of RAM.

What about master nodes?  Well... keep it simple and order the same boxes, and if you happen to be using the 12-disk configuration, pull out about 6 of the drives to keep as spares for the inevitable drive failures that will occur across your cluster!