os patching your hadoop cluster (pre & post rolling upgrades)

A modern Hadoop cluster is a beautifully resilient and robust set of machines working together to bring awesome storage capabilities and processing power.  A modern Hadoop cluster (built upon hardened distributions such as HDP) is also a very complex set of orchestrated software components that themselves leverage even smaller packages/frameworks and, ultimately, the underlying operating system and hardware itself.  The old "don't fix what ain't broke" adage comes to mind once your cluster is running smoothly.  Then... the rest of the data center and the world catches up with you... and you're faced with a (needed or not) set of "patching schedules", "security updates", and maybe even a sprinkling of "corporate policy".

SIDEBAR: On that last one, I even ran into an organization that formalized a policy requiring ALL servers to be restarted every 90 days.  Knowing firsthand where that concept started and fully understanding the unix/linux mindset of running a box forever, I pointed them to my "just reboot it (when was that ever a good idea?)" posting.  (wink)

I'm surely not suggesting that patching is optional.  There are very solid reasons for doing it that still make it a requirement.  I'm simply suggesting that (especially early on in your Hadoop journey) you rethink how this process will occur on the hosts that make up your Hadoop cluster.  The obvious easy answer is to simply take an outage by stopping all Hadoop services, performing the OS patching as it is done today, and then restarting all services, thereby ending the outage.  In most shops this could happen fairly quickly, but it still requires a service outage, which we all want to avoid if possible.
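
For the Ambari-managed clusters most HDP shops run, that full-outage approach boils down to a couple of REST calls.  The sketch below is only a minimal illustration; the Ambari URL, cluster name, and credentials are placeholders you would swap for your own.

import requests

AMBARI = "http://ambari.example.com:8080/api/v1/clusters/mycluster"   # placeholder endpoint
AUTH = ("admin", "admin")                                             # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}                                # required by Ambari's REST API

def set_all_services(state, context):
    # 'INSTALLED' means stopped and 'STARTED' means running in Ambari's vocabulary
    body = {"RequestInfo": {"context": context},
            "Body": {"ServiceInfo": {"state": state}}}
    response = requests.put(AMBARI + "/services", json=body, auth=AUTH, headers=HEADERS)
    response.raise_for_status()

set_all_services("INSTALLED", "Stop all services for OS patching")
# ... perform the OS patching and host reboots out-of-band here ...
set_all_services("STARTED", "Start all services after OS patching")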

Up until now, my strong advice has been to only apply OS level patches & updates when performing a Hadoop platform maintenance activity such as an upgrade.  The intention is to take advantage of the downtime that will be present with an upgrade and, in fact, to encourage platform architects to always be thinking about the next upgrade; especially with the level of innovation (and fixes!) that Hadoop is still undergoing.

Regardless of how you introduce OS patching, no production cluster should ever be upgraded without appropriate testing in at least one pre-production environment, covering not just Hadoop itself but the specific use cases and applications that sit on top of it.  Aligning the OS patching with this cycle can prevent an unwanted side effect from an updated dependent artifact that would otherwise slip into the environment when OS patching is done separately.  This model also forces the OS patches to get some real testing, which in my experience almost never happens otherwise.

Rolling Upgrades Enter Stage Left

Now that Hadoop Rolling Upgrades are rapidly coming at us in sophisticated Hadoop distributions such as HDP, we are entering a new phase where we will no longer have the kind of Hadoop "maintenance activities" that require the system downtime described above.  Let's pause for a second... That's AWESOME news!!  Now, back to the topic at hand...  This means that my current recommendation of coupling OS patching with these outage-generating Hadoop updates is no longer applicable.  No worries, we just need to forget what my "old" advice has been (yes, being a Hadoop consultant makes you embrace change) and revisit the problem.

We still want to prevent updated dependencies from causing problems in the cluster, all while not incurring, or at least limiting, any unwanted system downtime.  Therefore, my new recommendation is to decouple the OS patching from the forthcoming rolling upgrade strategies.  Truthfully, we couldn't really couple them even if we tried.  The cluster is going to be very busy managing two releases of Hadoop at the same time, smoke testing the new one, moving jobs from old to new, and ultimately cleaning up after itself when wrapping up the exercise.  Trying to somehow get in the middle of that with OS patching and possibly restarting servers would only muddy the waters further.

We could obviously leverage the simple model mentioned earlier and just take a system downtime, but we could also use a "rolling OS patching" strategy.  This model could be applied to any Hadoop cluster, actually, but it now becomes the front-runner approach.

Test Everything First

Testing in a pre-production environment means having some servers still on the unpatched version of the OS and some on the freshly patched version.  The logical model would be to roll through the master nodes first and then, when they are all done, progress into your worker nodes.  This creates multiple testing scenarios, shown below, that you can pause at and validate.

Testing Scenario             Master Nodes     Worker Nodes
Master patching in-flight    Half Patched     None Patched
Master patching complete     All Patched      None Patched
Worker patching in-flight    All Patched      Half Patched
Worker patching complete     All Patched      All Patched

Naughty or Nice Shutdowns

As there are usually multiple Hadoop daemons running on each node in the cluster, you need to decide whether you want to use the naughty or the nice strategy for shutting them down.  The "naughty" model (aka the sysadmin approach) is to just reboot the box when you are ready, knowing that the cluster is robust enough to handle this.  The "nice" model (aka Lester's preference) is to script shutdown routines for the services on a particular node before restarting it.  There are multiple ways to go down that path.  If you are an Ambari user, you could leverage its REST API.  If the Ambari Shell eventually includes this functionality, you would have another option.
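
If you go the "nice" route with Ambari, the REST call is similar in spirit to the earlier one but scoped to a single host: ask Ambari to bring every component on that node down to the stopped (INSTALLED) state before you reboot it.  Again, this is only a sketch; the endpoint, credentials, and host name are assumptions.

import requests

AMBARI = "http://ambari.example.com:8080/api/v1/clusters/mycluster"   # placeholder endpoint
AUTH = ("admin", "admin")                                             # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}

def stop_host_components(host):
    # Gracefully stop every Hadoop component that Ambari manages on this one host
    body = {"RequestInfo": {"context": "Stop components for OS patching on " + host},
            "Body": {"HostRoles": {"state": "INSTALLED"}}}
    response = requests.put("{0}/hosts/{1}/host_components".format(AMBARI, host),
                            json=body, auth=AUTH, headers=HEADERS)
    response.raise_for_status()

stop_host_components("worker07.example.com")   # then patch and reboot the box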

Once the box has been restarted, you'll need to restart the Hadoop daemons on it, as they won't come up by default (which, all things considered, is a very good design choice).  This means that to fully automate your rolling OS patching strategy, you'll have to script your start procedure as well, using an option like the Ambari route previously mentioned.
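
The start side of that automation is the mirror image: once the node is back from its reboot, flip its components from INSTALLED back to STARTED.  The same placeholder endpoint and credentials apply here.

import requests

AMBARI = "http://ambari.example.com:8080/api/v1/clusters/mycluster"   # placeholder endpoint
AUTH = ("admin", "admin")                                             # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}

def start_host_components(host):
    # Bring every component on the freshly rebooted host back to the running state
    body = {"RequestInfo": {"context": "Start components after OS patching on " + host},
            "Body": {"HostRoles": {"state": "STARTED"}}}
    response = requests.put("{0}/hosts/{1}/host_components".format(AMBARI, host),
                            json=body, auth=AUTH, headers=HEADERS)
    response.raise_for_status()

start_host_components("worker07.example.com")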

After this level of testing, we can feel much more confident that the production rollout of OS patches will go much more smoothly.  While some system patching can be done without restarting the host, this blog posting assumes you'll want (or be "forced") to perform a restart after the OS patching is complete.

Patch Masters One at a Time

Now that you are ready to perform your "rolling OS patching", my recommendation is to start with the master nodes and to do them one at a time.  While you do want to automate this process, you'll want to build in enough pausing and blocking to ensure that the master services on each host are restored and properly being leveraged before moving on to the next host.  Regardless of the naughty or nice decision, by the nature of their design the HA-enabled components should perform admirably for you and not cause any missed beats with jobs and actions in flight, but unfortunately those services that are either not capable of, or not configured to use, HA options will cause an outage for their respective clients.  Someday... all of the master processes will be HA.  Hopefully, someday soon...
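
One hedged way to build that pausing and blocking into the loop is to poll Ambari after each master comes back and only proceed once every non-client component on that host reports STARTED.  The host names, timeout values, and field choices below are assumptions for illustration.

import time
import requests

AMBARI = "http://ambari.example.com:8080/api/v1/clusters/mycluster"   # placeholder endpoint
AUTH = ("admin", "admin")                                             # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}

def wait_until_host_healthy(host, timeout_s=900, poll_s=15):
    # Block until every non-client component on the host reports STARTED, or give up
    url = ("{0}/hosts/{1}/host_components"
           "?fields=HostRoles/component_name,HostRoles/state").format(AMBARI, host)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        items = requests.get(url, auth=AUTH, headers=HEADERS).json()["items"]
        pending = [c for c in items
                   if not c["HostRoles"]["component_name"].endswith("_CLIENT")  # clients never report STARTED
                   and c["HostRoles"]["state"] != "STARTED"]
        if not pending:
            return
        time.sleep(poll_s)
    raise RuntimeError("Components on {0} did not all reach STARTED in time".format(host))

for master in ["master01.example.com", "master02.example.com", "master03.example.com"]:
    # stop components, patch, reboot, restart components (see the earlier sketches), then...
    wait_until_host_healthy(master)   # ...only move on once this master is fully back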

From a timing perspective, this process will be slow.  Let's just assume it takes five minutes to reboot each machine and restart its Hadoop services.  The good news is that while we can't do these all at once, there is likely a relatively small number of master nodes.  Even if you have the extremely generous number of master nodes detailed near the end of my "a robust set of hadoop master nodes (it is hard to swing it with two machines)" posting, you could still get all of the masters done in a relatively short period of time; a half-dozen masters at five minutes apiece is only about thirty minutes of serialized work.

Patch Workers Two at a Time

The real time killer in this whole process is rolling the OS patches out to the worker nodes.  To ensure that every fully-replicated block in the cluster always has at least one replica available, you can only have dfs.replication - 1 servers offline at any given moment.  For most of us, that means only two servers at a time.  Obviously, you could do more at a time to speed things up, but then you run the risk of a job needing a block whose replicas all live on hosts that are down at that moment.
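
A sketch of that pacing logic is shown below: take the workers down in batches of dfs.replication - 1 and wait for the NameNode to report no missing or under-replicated blocks before starting the next batch.  The NameNode JMX URL and port reflect a typical Hadoop 2.x layout, and the worker list is a placeholder; both may well differ in your environment.

import time
import requests

# Typical Hadoop 2.x NameNode web UI port; adjust for your environment
NAMENODE_JMX = ("http://namenode01.example.com:50070/jmx"
                "?qry=Hadoop:service=NameNode,name=FSNamesystem")

def hdfs_blocks_healthy():
    # True only when the NameNode reports no missing and no under-replicated blocks
    bean = requests.get(NAMENODE_JMX).json()["beans"][0]
    return bean["MissingBlocks"] == 0 and bean["UnderReplicatedBlocks"] == 0

workers = ["worker{0:02d}.example.com".format(n) for n in range(1, 21)]   # placeholder worker list
BATCH = 2   # dfs.replication - 1 with the default replication factor of 3

for i in range(0, len(workers), BATCH):
    batch = workers[i:i + BATCH]
    # stop components, patch, reboot, and restart each host in 'batch' (see the earlier sketches)
    while not hdfs_blocks_healthy():   # let re-replication catch up before the next batch
        time.sleep(30)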

If you are open to having some blocks unavailable at any given time, then I'd go back to the beginning of this posting and just take a cluster-wide outage to perform the OS patching before restoring service.  My guess is that if you made it this far (in what I have to admit is a pretty long and rather dry post) then that is not really an option for you.  If you don't want to do that but you do want to go faster, you could always consider increasing the replication factor.  That has some pretty far-reaching effects you'd have to weigh carefully, and truthfully, how much faster would it really make you, since you aren't going to set dfs.replication = 10 or anything like that?

Wrap-Up Chit-Chat

Hopefully, this post has given you Hadoop administrators out there something to think about and has possibly even helped you shape an operational strategy for performing periodic OS patches.  I know this isn't a fun topic, but that's why we get paid the medium bucks, isn't it?!?!  I'd love to hear what questions, comments, and/or concerns you have with this information, but I'd be even more interested to learn what you are doing now for OS patching and whether you have any future plans as we all begin to embrace Hadoop Rolling Upgrades.