hadoop superuser (you can have more than 'hdfs')

These corrections were made on 9/2/2015 to this blog posting.

So... time to eat some crow.  I had a customer who is automating the user onboarding process for his Hadoop cluster and wanted to know if he could use a Linux account other than hdfs to create an HDFS user home directory and set the appropriate permissions – see simple hadoop cluster user provisioning process (simple = w/o pam or kerberos).  I told him he was out of luck and that was just the way it was going to be.

Thinking about it a bit later, I realized I had never actually run this one down.  Navigating through the Hadoop site got me to http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#The_Super-User which told me what I've been espousing all along: the user that starts up the NameNode (NN) is the superuser.  Then I saw it – the phrase that let me know I was wrong in my reply...

In addition, the administrator may identify a distinguished group using a configuration parameter. If set, members of this group are also super-users.

Doh!  I was definitely wrong in my thinking and reply to my customer.  Hey, only the second time this month, but we have half a month to go!!

Let's see this in action.  First, we need a test bed to work from.  Let's use hdfs to create a test directory and then lock down the permissions to only the hdfs user.

[root@sandbox ~]# su hdfs 
[hdfs@sandbox root]$ hdfs dfs -mkdir /testSuperUser
[hdfs@sandbox root]$ hdfs dfs -mkdir /testSuperUser/testDirectory
[hdfs@sandbox root]$ hdfs dfs -ls /testSuperUser
Found 1 items
drwxr-xr-x   - hdfs hdfs          0 2014-08-13 22:42 /testSuperUser/testDirectory
[hdfs@sandbox root]$ hdfs dfs -chmod 700 /testSuperUser/testDirectory
[hdfs@sandbox root]$ hdfs dfs -ls /testSuperUser
Found 1 items
drwx------   - hdfs hdfs          0 2014-08-13 22:42 /testSuperUser/testDirectory

Now let's create an animals group with two users in it: cat and bat.

[hdfs@sandbox root]$ exit
exit
[root@sandbox ~]# groupadd animals
[root@sandbox ~]# useradd -ganimals cat
[root@sandbox ~]# useradd -ganimals bat
[root@sandbox ~]# lid -g animals
 cat(uid=1021)
 bat(uid=1022)
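
The lid output above is the operating system's view.  HDFS resolves groups on the NameNode, so on a multi-node cluster it can be worth asking Hadoop directly with hdfs groups.  The following is only a sketch: in_hdfs_group is a hypothetical helper name, HDFS_GROUPS_CMD is an override I added so the function can be exercised without a cluster, and it assumes the usual "user : group1 group2" output format.

```shell
#!/bin/sh
# Assumption: HDFS_GROUPS_CMD exists only for testing; it defaults to the real CLI.
HDFS_GROUPS_CMD="${HDFS_GROUPS_CMD:-hdfs groups}"

# Hypothetical helper: succeeds if `hdfs groups <user>` lists <group>.
# Expected output format (assumed): "cat : animals"
in_hdfs_group() {
    user="$1"
    group="$2"
    $HDFS_GROUPS_CMD "$user" \
        | sed 's/^[^:]*: *//' \
        | tr ' ' '\n' \
        | grep -qxF "$group"
}

# Example usage on a live cluster:
#   in_hdfs_group cat animals && echo "cat is in animals as far as HDFS is concerned"
```

On a single-node sandbox like this one, both views should agree.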

Then make sure they can't do anything that requires superuser access.

[root@sandbox ~]# su cat
[cat@sandbox root]$ hdfs dfs -ls /testSuperUser
Found 1 items
drwx------   - hdfs hdfs          0 2014-08-13 22:42 /testSuperUser/testDirectory
[cat@sandbox root]$ hdfs dfs -chgrp bogus /testSuperUser/testDirectory
chgrp: changing ownership of '/testSuperUser/testDirectory': Permission denied

No joy, but that is as expected.  The instructions at http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Configuration_Parameters let me know I need to make sure a dfs.permissions.superusergroup key-value pair is set in hdfs-site.xml.  This parameter can be found in Ambari at Services > HDFS > Configs > Advanced > dfs.permissions.superusergroup.  For my Hortonworks Sandbox this value is set to hdfs.  This also aligns with the fact that, unless you do a -chgrp, your newly created items have the group set to hdfs on this little pseudo-cluster.  Contrary to what you might expect (i.e. that the group becomes the value of this setting), I found out later that even with a different superusergroup identified, the owning group stayed as hdfs.  Here is what a newly created file looks like for another user, turtle.

[cat@sandbox root]$ exit
exit
[root@sandbox ~]# su turtle
[turtle@sandbox root]$ hdfs dfs -put /etc/group groups.txt
[turtle@sandbox root]$ hdfs dfs -ls 
Found 1 items
-rw-r--r--   1 turtle hdfs       1033 2014-08-13 23:12 groups.txt
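
For a cluster that is not managed through Ambari, the same setting is a property in hdfs-site.xml.  As a rough sketch, the property for the animals group would look like what this snippet prints; render_superusergroup_property is a hypothetical helper name, not something Hadoop provides, and I would expect to restart the NameNode before a changed value takes effect.

```shell
#!/bin/sh
# Hypothetical helper: print the hdfs-site.xml property for a given group name.
render_superusergroup_property() {
    group="$1"
    cat <<EOF
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>${group}</value>
</property>
EOF
}

render_superusergroup_property animals

# On a live cluster, the value the client configuration resolves can be read back with:
#   hdfs getconf -confKey dfs.permissions.superusergroup
```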

After I changed the "superuser" group to be animals, I could then make the changes that I wanted to earlier.

[turtle@sandbox root]$ exit
exit
[root@sandbox ~]# su cat
[cat@sandbox root]$ hdfs dfs -ls /testSuperUser
Found 1 items
drwx------   - hdfs hdfs          0 2014-08-13 22:42 /testSuperUser/testDirectory
[cat@sandbox root]$ hdfs dfs -chgrp bogus /testSuperUser/testDirectory
[cat@sandbox root]$ hdfs dfs -ls /testSuperUser
Found 1 items
drwx------   - hdfs bogus          0 2014-08-13 22:42 /testSuperUser/testDirectory

I also did not screw up the fact that hdfs is my true superuser as shown by my "old" HDFS home directory process.

[cat@sandbox root]$ exit
exit
[root@sandbox ~]# useradd user1
[root@sandbox ~]# su hdfs
[hdfs@sandbox root]$ hdfs dfs -mkdir /user/user1
[hdfs@sandbox root]$ hdfs dfs -ls /user

   ... rm'd some lines ...  NOTICE THAT THE GROUP STILL DEFAULTS TO hdfs, NOT animals

drwxr-xr-x   - hdfs           hdfs           0 2014-08-13 23:49 /user/user1
[hdfs@sandbox root]$ hdfs dfs -chown user1 /user/user1
[hdfs@sandbox root]$ hdfs dfs -chgrp user1 /user/user1
[hdfs@sandbox root]$ hdfs dfs -ls /user

   ... rm'd some lines ...

drwxr-xr-x   - user1          user1          0 2014-08-13 23:49 /user/user1

This can now also be done as a "real" user if set up appropriately.  If bat had appropriate sudo rights, then I could have done the following without starting out as root.

[hdfs@sandbox root]$ exit
exit
[root@sandbox ~]# useradd user2
[root@sandbox ~]# su bat
[bat@sandbox root]$ hdfs dfs -mkdir /user/user2
[bat@sandbox root]$ hdfs dfs -ls /user

   ... rm'd some lines ...  NOTICE THAT THE GROUP STILL DEFAULTS TO hdfs, NOT animals

drwxr-xr-x   - user1          user1          0 2014-08-13 23:49 /user/user1
drwxr-xr-x   - bat            hdfs           0 2014-08-13 23:55 /user/user2
[bat@sandbox root]$ hdfs dfs -chown user2 /user/user2
[bat@sandbox root]$ hdfs dfs -chgrp user2 /user/user2
[bat@sandbox root]$ hdfs dfs -ls /user

   ... rm'd some lines ...

drwxr-xr-x   - user1          user1          0 2014-08-13 23:49 /user/user1
drwxr-xr-x   - user2          user2          0 2014-08-13 23:55 /user/user2
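
For an automated onboarding process, the mkdir/chown/chgrp steps above could be wrapped in a small function.  This is a sketch under assumptions, not a finished script: provision_hdfs_home is a hypothetical name, HDFS_CMD is an override I added for dry runs, and it folds the separate -chown and -chgrp calls into the single owner:group form of -chown.  It has to run as hdfs or as a member of the configured superuser group.

```shell
#!/bin/sh
# Assumption: HDFS_CMD exists for dry runs; set HDFS_CMD="echo hdfs" to print commands.
HDFS_CMD="${HDFS_CMD:-hdfs}"

# Hypothetical helper: create /user/<name> in HDFS and hand it to that user.
# Usage: provision_hdfs_home <user> [group]   (group defaults to the user name)
provision_hdfs_home() {
    user="$1"
    group="${2:-$1}"
    home="/user/${user}"
    # Stop at the first failure so chown never runs against a missing directory.
    $HDFS_CMD dfs -mkdir -p "$home" &&
    $HDFS_CMD dfs -chown "${user}:${group}" "$home"
}

# Example usage as bat (a member of animals):
#   provision_hdfs_home user2
```

A dry run with HDFS_CMD="echo hdfs" prints the two commands instead of executing them, which is handy for checking the script before pointing it at a real cluster.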

As usual, there are many ways to skin this cat and this simple property is the gateway to those choices.  For many, the simple model of just adding the desired Linux user(s) to the existing "superusers" group may be the way to go.  If you are using this today, or might just do so, I'd love to hear your actual, or planned, approach.