...

The real time killer in this whole process is rolling the OS upgrades out to the worker nodes.  To make sure every fully-replicated block always has at least one replica available, you can only have dfs.replication - 1 servers offline at any given moment.  For most of us, that means only two servers at a time.  Obviously, you could do more at a time to speed things up, but again you'd be at risk of needing a block that is not currently available at the time of need.
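As a rough sketch of that batching rule, the snippet below splits a list of worker hostnames into groups of at most dfs.replication - 1.  The function name `rolling_batches` and the hostnames are made up for illustration, and the default of 3 assumes the usual replication factor; it only plans the batches and does not touch the cluster.

```python
def rolling_batches(workers, replication=3):
    """Split workers into batches of at most (replication - 1) hosts.

    With at most (replication - 1) hosts offline at once, every fully
    replicated block still has at least one replica on a live node.
    """
    if replication < 2:
        raise ValueError("dfs.replication must be at least 2 to take any node offline safely")
    size = replication - 1
    return [workers[i:i + size] for i in range(0, len(workers), size)]


if __name__ == "__main__":
    # Hypothetical hostnames, purely for illustration.
    hosts = [f"worker{n:02d}" for n in range(1, 6)]
    for number, batch in enumerate(rolling_batches(hosts), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

Each batch would be patched, rebooted, and confirmed back in service before moving on to the next one.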

If you are open to not having some blocks available at any given time, then I'd go back to the beginning of this posting and just take a cluster-wide outage to perform the OS patching and then restore service.  My guess is that if you made it this far (in what I have to admit is a pretty long and rather dry post) then that is not really an option for you.  If you don't want to do that, but you want to go faster, then you could always consider increasing the replication factor.  That has some pretty far-reaching effects that you'd have to weigh carefully, though, and truthfully, how much would it really speed you up?  You aren't going to set dfs.replication = 10 or anything like that.
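To put rough numbers on that last question, here is a small back-of-the-envelope calculation.  The 40-node cluster size is purely hypothetical, and `patch_rounds` is just an illustrative helper that counts how many batches of dfs.replication - 1 servers it takes to get through every worker.

```python
import math


def patch_rounds(worker_count, replication):
    """Number of rolling batches needed when at most
    (replication - 1) workers may be offline at once."""
    if replication < 2:
        raise ValueError("dfs.replication must be at least 2")
    return math.ceil(worker_count / (replication - 1))


if __name__ == "__main__":
    workers = 40  # hypothetical cluster size
    for replication in (3, 4, 5):
        print(f"dfs.replication={replication}: {patch_rounds(workers, replication)} rounds")
    # dfs.replication=3: 20 rounds
    # dfs.replication=4: 14 rounds
    # dfs.replication=5: 10 rounds
```

Under those assumptions, going from 3 to 4 replicas trims the rollout from 20 rounds to 14, at the cost of storing an extra copy of every block.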

...