just reboot it (when was that ever a good idea?)

While working my way through Eric Sammer's Hadoop Operations book I came across this call-out from Chapter 9.

On "Reboot It" Syndrome

The propensity for rebooting hosts or restarting daemons without any form of investigation is the opposite of everything discussed thus far. This particular form of disease was born out of a different incarnation of the 80/20 rule, one in which 80% of the users have no desire or need to understand what the problem is or why it exists. As administrators, we exist within the 20% for whom this is not -- nor can we allow it to become -- the case. By defaulting to just restarting things, you're defaulting to the nuclear option, obliterating all information and opportunities to learn from an experience. Without meaningful experience, preventative care simply isn't possible.

Consider for a moment what would happen if doctors opted for the medical version of "reboot it". Maybe they'd just cut off anything that stopped working.

Amen!! I remember when I first encountered this mindset. I had just completed one of the six-month (evenings & weekends) UNIX/C "retooling" programs at SMU that were popular back in the early-to-mid 1990s. I learned a LOT about UNIX in that program from an awesome, but curt and crusty, UNIX administrator named Bobby. With that bald head, long ponytail, anti-government rhetoric, and the wild stories he told about the "early" days of UNIX, I'm sure he was "off the grid" back then and hasn't been back on it yet. I was coming from a mainframe development background at the time. That high-uptime environment, coupled with learning early about the bragging rights that came from running the uptime command, solidified in my brain that systems should (and could!) be run for a long time without a restart.

The skills I learned from those courses (not to mention the tenacity I showed to learn anything and everything I could about web development – mostly CGI back in those days) helped me land a web developer job. I got hired at an ISP that was about to go national and compete against companies like AOL, MSN and NetZero (yes... this was the 90s – anyone remember AltaVista?) and we had a big operations team. The ops team was split about 50/50 between UNIX and Windows administrators and they were all on the same big floor. Although the cube farm layout was almost identical, it was VERY easy to spot which side was which.

As you walked from the elevator through the door that was square in the middle of this large open area, you could immediately see the contrast in styles. On the left was the Windows team. Everyone looked 16 (I'm sure they were in their 20s!) and they all had short & stylish haircuts and wore polo shirts and khaki pants. They were all so eager to be walking around and talking to each other. It was like a scene out of The Stepford Wives, except they were all dudes. By contrast, the UNIX side looked like a dark scene from a Tolkien novel. Most of the fluorescent lights in the suspended ceiling were unscrewed and the primary illumination was from the 21" CRTs most of these admins had on their desks. Some of the folks even rigged up "ceilings" for their cubes made of the flattened cardboard boxes that their servers were shipped in. Long scraggly hair (mostly down) and unshaven faces were the norm.

What really stood out was the big "here there be dragons" sign!! I really did love hanging out on that side of the floor and I learned a lot from these guys.

I only take you down memory lane because this wild variance in administrative "styles" was where I first encountered the "Reboot It" Syndrome that Eric was describing. Most of these young Windows administrators had never seen another platform in their lives beyond a PC, and the UNIX guys had been running mission-critical systems for years. That maturity and craftsmanship were evident in the pride they took in the operational behavior, and yes... UPTIME, of the servers in their charge. Conversely, this Windows administration team was quick to declare "reboot it!" at the first sign of any trouble.

They even came up with alternative phrases such as "kick it", the ever-popular "restart it", and the one that tried to make it sound like a desired process: an "environmental refresh".

Hey... don't get me wrong... machines do need to be restarted sometimes, but executing the nuclear option shouldn't be the first step in your analysis. Many of these young administrators probably went on to develop their skills and experiences to a senior level, but deep down I'm almost positive most still suffer from "Reboot It" Syndrome.

So, the next time your crappy AT&T U-Verse DVR freezes up, go ahead and pull the plug on it to give it that much-needed "environmental refresh". But, if a critical Hadoop daemon in your cluster is throwing some wild Exceptions... PLEASE take a few minutes to investigate first. Restarting it is most likely just going to produce a small outage and then return you to the undesired state you were hoping to fix with "magic". Or, at least tell me you did this before you Reboot It!!
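If you want something concrete for those "few minutes", here is a minimal sketch (my own illustration, not anything from Eric's book) of grabbing evidence before the restart wipes it out. It saves the tail of a daemon's log and, when a PID is given and the JDK's jstack tool happens to be on the PATH, a thread dump as well. The function names, the output directory, and the 200-line default are all assumptions I made up for the example; point it at wherever your logs actually live.

```python
import datetime
import pathlib
import shutil
import subprocess
from collections import deque


def tail_lines(log_path, count=200):
    """Return the last `count` lines of a text file as a list of strings."""
    with open(log_path, "r", errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines in memory.
        return list(deque(handle, maxlen=count))


def snapshot_before_restart(log_path, out_dir, pid=None, count=200):
    """Save a log tail (and optionally a jstack thread dump) into out_dir.

    Hypothetical helper: returns the list of files it wrote.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    saved = []

    # 1. Preserve the most recent log lines, exceptions and all.
    log_copy = out / f"{stamp}-log-tail.txt"
    log_copy.write_text("".join(tail_lines(log_path, count)))
    saved.append(log_copy)

    # 2. If we know the JVM's PID and jstack is installed, keep a thread dump.
    if pid is not None and shutil.which("jstack"):
        result = subprocess.run(
            ["jstack", str(pid)],
            capture_output=True,
            text=True,
            timeout=60,
        )
        dump = out / f"{stamp}-jstack.txt"
        dump.write_text(result.stdout + result.stderr)
        saved.append(dump)

    return saved
```

Run something like this first, skim what it saved, and then decide whether a restart is really the answer. Even if you do end up kicking the daemon, you still have something to learn from afterwards.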