These instructions are for "simple" Hadoop clusters that have no sophisticated PAM and/or Kerberos integrations. They are ideal for the HDP Sandbox or other such "simple" setups like the one called out in building a virtualized 5-node HDP 2.0 cluster (all within a mac) that rely on "local" users.
For all command examples, replace $theNEWusername with the username being created.
Edge Node / Client Gateway
On the box(es) where the user will SSH to and utilize CLI tools (this does NOT have to be a dedicated machine; for example, on the Sandbox there is only one machine), login as root and execute the following commands to create a local account and set the password.
useradd -m -s /bin/bash $theNEWusername
passwd $theNEWusername
Then create an HDFS home directory for this new user.
su - hdfs
hdfs dfs -mkdir /user/$theNEWusername
hdfs dfs -chown $theNEWusername /user/$theNEWusername
hdfs dfs -chmod -R 755 /user/$theNEWusername
Master & Worker Nodes
On the remainder of the cluster nodes (if any), we just need the new user to be present. There is no need to set a password, as these CLI users will not need to log into any of these hosts directly.
useradd $theNEWusername
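Since this step has to be repeated on every remaining node, it can help to script it. Here is a minimal sketch of one way to do that; it assumes passwordless root SSH to the workers, and the generate-then-pipe approach (and the example hostnames) are my own convention rather than anything from Hadoop itself:

```shell
# Hypothetical helper: emit one useradd command per hostname read from stdin.
# Review the generated commands first, then pipe them to sh to actually run them.
theNEWusername=alice   # placeholder; replace with the real username

gen_useradd_cmds() {
  while read -r host; do
    echo "ssh root@${host} useradd ${theNEWusername}"
  done
}

# Dry run against a couple of example hostnames:
printf 'worker1\nworker2\n' | gen_useradd_cmds
# prints: ssh root@worker1 useradd alice
#         ssh root@worker2 useradd alice
# When satisfied, feed it your real host list:  gen_useradd_cmds < hosts.txt | sh
```

The dry-run step is deliberate: printing the commands before executing them gives you a chance to catch a bad hostname before it touches the cluster.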
User Validation
To validate, users can SSH into the edge node with their new credentials and run the following commands to verify that they can manipulate content on HDFS. Note: whereas in Linux a user can use "~" to reference their home directory, the FS Shell treats relative references (i.e., paths with nothing before the initial file or folder name) as the equivalent of "~/", meaning everything is resolved against the user's home folder in HDFS.
hdfs dfs -put /etc/group groupList.txt
hdfs dfs -ls /user/$theNEWusername
hdfs dfs -cat groupList.txt
hdfs dfs -rm -skipTrash /user/$theNEWusername/groupList.txt
Once you realize how many variables are at play in a "typical" Hadoop cluster (including which components you use and which use cases matter most to you), it is easy to see why there aren't too many node sizing guides published out there. That said, I'll go out on a limb and offer my personal sizing guide for worker nodes.
Spec out your worker nodes with multiples of the following logical building block.
1 Hard Drive (1-4TB in size) – 2 CPU Cores – 10 GB of RAM
With that approach, here are some typical worker node configurations based on these variables.
| T-Shirt Size | Drives | Cores | Memory |
|---|---|---|---|
| Small | 4 | 8 | 48GB |
| Medium | 8 | 16 | 96GB |
| Large | 12 | 24 | 128GB |
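The building-block math above is easy to script if you want to sanity-check a quote from your hardware vendor. A quick sketch (the function name is mine; note that the table rounds memory up past the strict 10 GB-per-drive figure, e.g. 4 drives gives 40 GB by the formula but 48GB in the table):

```shell
# Scale the 1-drive building block (2 cores, 10 GB RAM per drive) to a node size.
node_spec() {
  drives=$1
  echo "${drives} drives, $((drives * 2)) cores, $((drives * 10)) GB RAM"
}

node_spec 4    # -> 4 drives, 8 cores, 40 GB RAM
node_spec 12   # -> 12 drives, 24 cores, 120 GB RAM
```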
That said, what I see a LOT of is a box with 12 2TB drives, dual 8 core processors, and 128GB of RAM.
What about master nodes? Well... keep it simple and order the same boxes, and if you happen to have the 12-disk setup, pull out about 6 of the drives to keep as spares for the inevitable drive failures that will occur across your cluster!