
There are plenty of folks who will tell you "Hadoop isn't secure!", but that's not the whole story.  Yes, as Merv Adrian points out, Hadoop in its most bare-bones mode does have a number of concerns.  Fortunately, as with the rest of Hadoop, layers upon layers of software come together in concert to provide various levels of security shielding to address specific requirements.

The Infrastructure Architecture section described how the physical isolation of the cluster allows for network-enforced "perimeter security" from the edge nodes.  This section will walk through the multiple layers of Hadoop security that are available with HDP.  Hadoop started out with no heavy thought to security.  This is because there was a problem to be solved and the "users" of the earliest incarnations of Hadoop all worked together – and all trusted each other.  Fortunately for all of us, especially those of us who made career bets on this technology, Hadoop acceptance and adoption has been growing by leaps and bounds, which only makes security that much more important.  Some of the earliest thinking around Hadoop security was to simply "wall it off".  Yep, wrap it up with network security and only let a few, trusted, folks in.  Of course, then you needed to keep letting a few more folks in, and from that approach of network isolation came the ever-present edge node (aka gateway server, ingestion node, etc) that almost every cluster employs today.  But wait... I'm getting ahead of myself.

My goal for this posting is to cover how Hadoop addresses the AAA (Authentication, Authorization, and Auditing) spectrum that is typically used to describe system security. 


The following diagram presents the categorization that will be used to describe the components that are available; well, as it is in mid-2014, as there is rapid & continuous innovation in this domain.  The sections of this blog posting are oriented around the italicized characteristics in the white boxes.


While the diagram above presents layers, it should be clearly stated that each of these components could be utilized without dependencies on any of the others.  The only exception to that is the base HDFS authentication, which is always present and is foundational as compared to the other components.  The overall security posture is enhanced by employing more (appropriate) components, but additional complexity could be introduced into the overall system when using a component that has no direct alignment with security requirements.

...

  • Host-Based: This is the simplest to understand, but surely the toughest to maintain, as each user and group needs to be created individually on every machine.  There will almost surely be security concerns around password strength & expiration, as well as auditing.
  • Directory-Based: In this approach a centralized resource, such as Active Directory (AD) or LDAP, is used to create users/groups, and either a commercial product or an internally-developed framework integrates with the hosts to ensure a seamless experience for users.  This is often addressed with Pluggable Authentication Modules (PAM) integration.
  • Hybrid Solution: As Hadoop is unaware of the "hows" of user/group provisioning, a custom solution could also be employed.  An example could be PAM integration with AD user objects to handle authentication onto the hosts, while utilizing local groups for the authorization aspects of HDFS; see the sketch just after this list.
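
To make that last point concrete, here is a minimal Java sketch (using Hadoop's UserGroupInformation API) of how Hadoop resolves a user's groups: it simply asks the configured group mapping, which by default delegates to the underlying OS.  The user name "alice" is a hypothetical placeholder; whether the lookup is answered by local /etc/group entries, AD via PAM/SSSD, or a hybrid of the two is invisible to HDFS.

    import org.apache.hadoop.security.UserGroupInformation;

    public class GroupLookupExample {
        public static void main(String[] args) throws Exception {
            // Hadoop resolves groups through the mapping configured by
            // hadoop.security.group.mapping (OS-backed by default), so the
            // provisioning approach chosen above is transparent to HDFS.
            UserGroupInformation ugi = UserGroupInformation.createRemoteUser("alice");
            System.out.println(ugi.getUserName() + " -> "
                + String.join(", ", ugi.getGroupNames()));
        }
    }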


What are people doing in this space?  In my experience, it seems most Hadoop systems are leveraging a hybrid approach such as the example provided for their underlying Linux user & group strategy.

Where do I go for more info?  http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html

User Integrity

With the pervasive reliance on identities as they are presented to the OS, Kerberos allows for even stronger authentication.  Users can more reliably identify themselves and then have that identity propagated throughout the Hadoop cluster.  Kerberos also secures the accounts that run the various elements of the Hadoop ecosystem, thereby preventing malicious systems from "posing as" part of the cluster to gain access to the data.
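
As a rough illustration, here is a minimal Java sketch of what a kerberized client looks like, again using Hadoop's UserGroupInformation API.  The principal, keytab path, and HDFS path are hypothetical placeholders, and in a real deployment the two security properties would come from core-site.xml rather than being set in code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // These are the properties that "kerberizing" a cluster toggles.
            conf.set("hadoop.security.authentication", "kerberos");
            conf.set("hadoop.security.authorization", "true");
            UserGroupInformation.setConfiguration(conf);

            // Hypothetical principal and keytab -- substitute your own.
            UserGroupInformation.loginUserFromKeytab(
                "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab");

            // Subsequent HDFS calls carry the authenticated identity.
            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.getFileStatus(new Path("/user/alice")).getOwner());
        }
    }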

Kerberos can be integrated with corporate directory servers such as Microsoft's Active Directory, which helps tons when there is a significant number of direct Hadoop users.

What are people doing in this space?  In my experience, I'd estimate that about 50% of Hadoop clusters are utilizing Kerberos.  This seems to be because either the use case does not require this additional level of security and/or there is a belief that the integration & maintenance costs are too high.

Where do I go for more info?  http://hortonworks.com/wp-content/uploads/2011/10/security-design_withCover-1.pdf

Additional Authentication

The right combination of secure users & groups, along with the appropriate physical access points described later, provides a solid authentication model for users taking advantage of Command-Line Interface (CLI) tools from the Hadoop ecosystem (e.g. Hive, Pig, MapReduce, Streaming).  Additional access points will need to be secured as well.  These access points could include the following.

  • HDP Access Components
    • HiveServer2 (HS2): Provides JDBC/ODBC connectivity to Hive
    • Hue: Provides a user-friendly web application to the majority of the Hadoop ecosystem tools to complement the CLI approach.
  • 3rd Party Components: Tools that sit “in front” of the Hadoop cluster may use a traditional model where they connect to the Hadoop cluster via a single “system” account (ex: via JDBC) and then present their own AAA implementations. 

Hue and HS2 each have multiple authentication configuration options, including various approaches to impersonation, that can augment, or even replace, whatever approach is being leveraged by the CLI tooling environment.  Additionally, 3rd party tools such as Business Objects can provide their own AAA functionality to manage existing users and utilize a system account to connect to Hive/HS2.
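
For example, a 3rd party tool following the "system account" pattern would connect to HS2 over JDBC roughly like the following Java sketch; the host name, service account, and query are hypothetical placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveServer2Example {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver (older driver versions need this).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HS2 endpoint and system account -- substitute your own.
            String url = "jdbc:hive2://hs2.example.com:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "svc_reporting", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }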

What are people doing in this space?  Generally, it seems to me that teams are configuring authentication in these access points to be aligned with the same authentication that the CLI tools are using.  Some organizations with a more narrowly-focused plan for utilization of their cluster are "locking down" access to the data from first-class Hadoop clients (FS Shell, Pig, Hive/HS2, MapReduce, etc) to only the true exploratory data science and software development communities.  They then take an approach from the application server playbook: having 3rd party BI/reporting tools use a "system account" to interact with relational-oriented data via HS2 and leveraging the rich AAA abilities of these user-facing tools.

Where do I go for more info?  https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2 and https://github.com/cloudera/hue/blob/master/desktop/conf.dist/hue.ini

Additional Authorization

The POSIX-based permission model described earlier will accommodate most security needs, but there is an additional option should the authorization rules become more complex than that model can handle.  HDFS ACLs (Access Control Lists) have surfaced to accommodate this need.  This feature became available in HDP 2.1 and is inherently part of HDFS, so it can be taken advantage of whenever the authorization use case demands it.
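
As a rough sketch of what ACLs add beyond the POSIX bits, the following Java example grants an extra group access to a directory via the FileSystem ACL API (the programmatic equivalent of the hdfs dfs -setfacl command).  The path and group name are hypothetical, and it assumes dfs.namenode.acls.enabled has been set to true in hdfs-site.xml.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.AclEntry;
    import org.apache.hadoop.fs.permission.AclEntryScope;
    import org.apache.hadoop.fs.permission.AclEntryType;
    import org.apache.hadoop.fs.permission.FsAction;

    public class HdfsAclExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path dir = new Path("/data/finance");  // hypothetical path

            // Grant the "auditors" group read/execute beyond what the
            // owning user/group/other POSIX bits allow.
            AclEntry auditors = new AclEntry.Builder()
                .setScope(AclEntryScope.ACCESS)
                .setType(AclEntryType.GROUP)
                .setName("auditors")
                .setPermission(FsAction.READ_EXECUTE)
                .build();
            fs.modifyAclEntries(dir, Arrays.asList(auditors));

            // Equivalent of: hdfs dfs -getfacl /data/finance
            System.out.println(fs.getAclStatus(dir));
        }
    }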

...