UPDATE: Feb 5, 2015

The component referred to below as "XA Secure (Apache Argus)" has finally settled on its final open-source name and is now referred to as Apache Ranger.  I've had some very positive hands-on experiences with Ranger since this blog posting was written and I am very enthusiastic about its place in the Hadoop security stack.  Check out the Apache project or Hortonworks product page for more information on Ranger.

There are plenty of folks who will tell you "Hadoop isn't secure!", but that's not the whole story.  Yes, as Merv Adrian points out, Hadoop in its most bare-bones mode does have a number of concerns.  Fortunately, as with the rest of Hadoop, software layers upon software layers come together in concert to provide various levels of security shielding to address specific requirements.

Hadoop started out with no heavy thought given to security.  This is because there was a problem to be solved and the "users" of the earliest incarnations of Hadoop all worked together – and all trusted each other.  Fortunately for all of us, especially those of us who made career bets on this technology, Hadoop acceptance and adoption has been growing by leaps and bounds, which only makes security that much more important.  Some of the earliest thinking around Hadoop security was to simply "wall it off".  Yep, wrap it up with network security and only let a few trusted folks in.  Of course, then you needed to keep letting a few more folks in, and from that approach of network isolation came the ever-present edge node (aka gateway server, ingestion node, etc.) that almost every cluster employs today.  But wait... I'm getting ahead of myself.

...

As a result of Hortonworks’ recent acquisition of XA Secure, a unified console for authorization across the ecosystem offerings is now available.  The maturity of this product, which is actively being spun into the new incubating project called Apache Argus, should greatly help its eventual promotion to a top-level Apache project.

What are people doing in this space?  Those already using HBase are leveraging its features as appropriate to their projects, but HDFS ACLs and Hive's ATZ-NG are still gaining adoption due to their relatively recent introductions.  The additional ACL abilities can become useful as Mercy’s applications and expertise mature, but the recommendation is to first address data access controls via the base POSIX permission model, as this will be the first line of defense for CLI access to the file system and for ecosystem tools like Pig.  Attention needs to be paid to this level of authorization, especially considering the approach of limiting access on the data, not the tools.
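
To make that concrete, here is a minimal sketch using the standard Hadoop FileSystem Java API: base POSIX permissions are set first, then an HDFS ACL is layered on top for a second group.  The /data/claims path and the "analysts" group name are hypothetical placeholders, and ACLs must be enabled on the NameNode for the last call to succeed.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class AclExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/data/claims");      // hypothetical data directory
        fs.mkdirs(dir);

        // First line of defense: base POSIX permissions (owner rwx, group r-x, other ---).
        fs.setPermission(dir, new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));

        // Layer an ACL on top for a group outside the owning group.
        // Requires dfs.namenode.acls.enabled=true on the NameNode.
        List<AclEntry> aclSpec = Arrays.asList(
            new AclEntry.Builder()
                .setScope(AclEntryScope.ACCESS)
                .setType(AclEntryType.GROUP)
                .setName("analysts")
                .setPermission(FsAction.READ_EXECUTE)
                .build());
        fs.modifyAclEntries(dir, aclSpec);
    }
}
```

The same result is available from the CLI, which is where most administrators will start before reaching for the Java API.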

Where do I go for more info?  http://hbase.apache.org/book/hbase.accesscontrol.configuration.html, https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization, http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.5/bk_system-admin-guide/content/ch_acls-on-hdfs.html and http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.3/bk_HDPSecure_Admin/content/ch_XA-overview.html

Activity Logging

HDFS has a built-in ability to log access requests to the filesystem.  This provides a low-level snapshot of events that occur and who performed them.  These logs are human-readable, but not necessarily reader-friendly.  They are detailed logs that can themselves be used as a basis for reporting and/or ad-hoc querying with Hadoop frameworks and/or other 3rd-party tools.
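
As a rough sketch of what that ad-hoc querying can look like, the snippet below turns HDFS audit-log lines into key=value maps.  The sample line mirrors the default tab-separated audit format; if your log4j pattern differs, the split logic would need adjusting.

```java
import java.util.HashMap;
import java.util.Map;

public class AuditLineParser {
    public static Map<String, String> parse(String line) {
        Map<String, String> fields = new HashMap<>();
        // Skip the log4j preamble; the audit payload starts at "allowed=".
        int start = line.indexOf("allowed=");
        if (start < 0) return fields;
        for (String pair : line.substring(start).split("\t")) {
            int eq = pair.indexOf('=');
            if (eq > 0) fields.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "2015-02-05 12:00:00,123 INFO FSNamesystem.audit: "
            + "allowed=true\tugi=bob (auth:SIMPLE)\tip=/10.0.0.5\t"
            + "cmd=open\tsrc=/data/claims/part-0000\tdst=null\tperm=null";
        Map<String, String> f = parse(sample);
        System.out.println(f.get("ugi") + " -> " + f.get("cmd") + " " + f.get("src"));
    }
}
```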

Hive also has an ability to log metastore API invocations.  Additionally, HBase offers its own audit log.  To pull this all together into a single pane of glass, you can leverage XA Secure / Apache Argus.  This software layer pulls the dispersed audit information into a cohesive user experience.  The audit data can be further broken down into access audit, policy audit and agent audit data, giving auditors granular visibility into users’ access as well as administrative actions within this security portal.

This feature set, coupled with the single pane of glass for authorization rights, seems to be a solid fit for Mercy who, by the nature of the data being stored, will require this level of administrative control.  Additional investigation is warranted to ensure XA Secure will meet the end goals of the auditing requirements that are present, and additional hardware will need to be procured to run this web-tier service.

Mercy should ensure that appropriate levels of auditing information are being produced and maintained to support any requirements or desired reporting opportunities.

What are people doing in this space?  XA Secure / Apache Argus is early in its adoption curve, but it is a feature-rich application, so I expect quick adoption.  While the raw data is there, it seems that few organizations are currently rolling the disparate activity logs into a combined view for audit reporting.

Where do I go for more info?  http://books.google.com/books?id=drbI_aro20oC&pg=PA346&lpg=PA346&dq=hdfs+audit+log&source=bl&ots=t_wnyhn0i4&sig=PEbiD5LdLdkUP0jnjhtUoOoBDMM&hl=en&sa=X&ei=YwsJVKeGDZSBygT5x4D4Cg&sqi=2&ved=0CEIQ6AEwAg#v=onepage&q=hdfs%20audit%20log&f=false and http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.3/bk_HDPSecure_Admin/content/ch_XA-audit.html

Perimeter Security

As mentioned at the beginning of this post, establishing perimeter security has been a foundational Hadoop security approach.  This simply means limiting access into the Hadoop cluster itself and utilizing an “edge” node (aka gateway) as the host that users and systems interact with directly, with all other actions fired from this gateway into the cluster itself.

The Apache Knox Gateway is a recent addition to the HDP stack and provides a software layer intended to perform this perimeter security function.  It has a pluggable, provider-based mechanism to integrate customers' AAA mechanisms.  Not all operations are fully supported yet with Knox to have it completely replace the need for the traditional edge node (ex: HiveServer2 access is only supported via JDBC; ODBC functionality is still lacking).

Mercy could benefit from additional investigation into Knox at this time, but would need to provision additional hosts and wall off the Knox Gateway web server farm with firewalls, similar to the network architecture diagram below.

[Figure: Knox Gateway network architecture]

The traditional edge node also often satisfies the ingestion requirements present in many models that need a landing zone for data before it is persisted into HDFS.  Mercy’s current investment in traditional edge nodes is a meaningful one, but Knox's roadmap addresses the missing functionality.  Its REST API extends the reach to different types of clients and eliminates the need to SSH to a fully configured "Hadoop client" to interact with the Hadoop cluster (or several, as Knox can front multiple clusters).
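
As a minimal sketch of that REST access, the snippet below lists a directory via WebHDFS through a Knox gateway from any host with HTTPS connectivity, with no SSH session or local Hadoop client configuration required.  The gateway host, the "default" topology, and the guest credentials are all hypothetical placeholders; Knox itself authenticates the request via its pluggable providers (e.g., LDAP).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class KnoxListStatus {
    public static void main(String[] args) throws Exception {
        // Knox URLs follow the /gateway/{topology}/webhdfs/v1/{path} pattern.
        URL url = new URL(
            "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // HTTP Basic auth is the simplest option; Knox maps it to the backing AAA provider.
        String creds = Base64.getEncoder()
            .encodeToString("guest:guest-password".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + creds);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON FileStatuses payload from WebHDFS
            }
        }
    }
}
```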

What are people doing in this space?  Early adopters have already deployed Knox, but the majority of clusters still rely heavily on traditional edge nodes.  The interest is clearly present in almost all customers that I work with, and I expect rapid adoption of this technology.

Where do I go for more info?  http://www.dummies.com/how-to/content/edge-nodes-in-hadoop-clusters.html and http://knox.apache.org/

Data Encryption

Data encryption can be broken into two primary scenarios: encryption in-transit and encryption at-rest.  Wire encryption options exist in Hadoop to aid with whatever in-transit needs might be present.  There are multiple options available to protect data as it moves through Hadoop over RPC, HTTP, Data Transfer Protocol (DTP), and JDBC.
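
To give a feel for the knobs involved, here is an illustrative sketch of the common wire-encryption settings expressed as a Hadoop Configuration.  In practice these properties live in core-site.xml and hdfs-site.xml on the cluster; the values shown are the typical "encrypt everything" choices, and the JDBC leg (HiveServer2 SSL) is configured separately.

```java
import org.apache.hadoop.conf.Configuration;

public class WireEncryptionConfig {
    public static Configuration build() {
        Configuration conf = new Configuration(false);

        // RPC: SASL quality-of-protection; "privacy" = authentication + encryption.
        conf.set("hadoop.rpc.protection", "privacy");

        // Data Transfer Protocol: encrypt block traffic between clients and DataNodes.
        conf.setBoolean("dfs.encrypt.data.transfer", true);
        conf.set("dfs.encrypt.data.transfer.algorithm", "3des"); // or "rc4"

        // HTTP: force the web UIs and WebHDFS onto HTTPS.
        conf.set("dfs.http.policy", "HTTPS_ONLY");

        return conf;
    }

    public static void main(String[] args) {
        build().forEach(e -> System.out.println(e.getKey() + " = " + e.getValue()));
    }
}
```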

For encryption at-rest, there are some open-source activities underway, but Hadoop does not inherently have a baseline encryption solution for the data that is persisted within HDFS.  There are several 3rd-party solutions available (including Hortonworks partners) that specifically target this requirement.  Custom development could also be undertaken, but the absolute easiest mechanism to obtain encryption at-rest is to tackle this at the OS or hardware level.  Mercy will need to determine which, if any, of the available encryption options provide the right trade-off of simplicity and security.  This decision will surely be impacted by what other security components and practices are employed.
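
Purely for illustration, here is one shape the custom-development route could take: encrypting a stream client-side with the JDK's crypto APIs before it ever lands in HDFS.  The path is hypothetical, the cipher mode is deliberately simplistic, and a real solution would need managed key storage, which is exactly why the OS/hardware-level and vendor options are usually the easier path.

```java
import java.io.OutputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptBeforeWrite {
    public static void main(String[] args) throws Exception {
        // Demo-only key; in reality the key must come from a managed keystore.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        Cipher cipher = Cipher.getInstance("AES"); // default mode: fine for a sketch, not production
        cipher.init(Cipher.ENCRYPT_MODE, key);

        // Wrap the HDFS output stream so bytes are encrypted before they hit the wire/disk.
        FileSystem fs = FileSystem.get(new Configuration());
        try (OutputStream out = new CipherOutputStream(
                fs.create(new Path("/data/claims/secret.bin.enc")), cipher)) {
            out.write("sensitive record".getBytes("UTF-8"));
        }
    }
}
```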

What are people doing in this space?  My awareness is that few Hadoop administrators have enabled encryption, at-rest or in-transit, at this time.

Where do I go for more info?  http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.3/bk_reference/content/reference_chap-wire-encryption.html

Summary

As you can see, there are many avenues to explore to ensure you create the best security posture for your particular needs.  Remember, the vast majority of these options are not mutually exclusive, allowing multiple approaches to security to be layered together.  This is surely one area where there is still work to be done, especially in pulling together the disparate pieces into tools that are easy for enterprises to adopt.