disruptive possibilities (the rise of platform architecture)

Much like too big to ignore (too boring to read), Disruptive Possibilities (Kindle edition is FREE) kicks off with a discussion of how Big Data will change the world. Jeffrey Needham sums it up as follows:

Big data will bring disruptive changes to organizations and vendors, and will reach far beyond networks of friends to the social network that encompasses the planet. But with those changes come possibilities. Big data is not just this season’s trendy hemline; it is a must-have piece that will last for generations.

He then quickly ventures far away from that other book and goes down a path aimed more at the technology crowd than at the general business leader. And yes… he quickly starts talking about Hadoop and, to some degree, NoSQL technologies. As the title suggests, he focuses on the disruptive nature of this technology and how it will affect most of us. Like others in the industry, Needham states the following:

Big data is the most disruptive force this industry has seen since the introduction of the relational database.

The book has quite a bit of material focused on the need to tear down silos within our organizations. Needham’s position is that for Big Data tooling, such as Hadoop, to be successful, we need to value and promote the concept of Platform Architecture. He predicts that because these Big Data platforms/clusters are so big in scale, so new in technology, and so critical for business, the current approach of creating a layered organization based on technologies (storage, network, database, development, operations, etc.) will only get in the way.

Platform Architects are a rare breed; they are the folks who can triage a problem at any level when customers are at their wit’s end. Needham declares there are three tenets of Platform Architecture/Engineering: avoid complexity, prototype perpetually, and optimize everything. He’s not talking about doing this only at the beginning of a project in a POC exercise, or when there seems to be a performance or scalability problem. He’s stating that this is the new status quo and that most organizations are not set up to be successful in this model. We’ll have to tear down some silos to become this aware of “the platform”.

Silos will have to be removed not just in the purely technical spaces, but also around how data is accessed throughout one’s organization and brought together into the data lake/reservoir needed for analytics to run across the entire enterprise’s information. The author is pragmatic enough to realize that this will probably take years at most organizations, as we are very often talking about assembling data from many different mission-critical systems, all with little or no additional funding. He does predict that this problem will self-correct as vendors adapt their technology to use HDFS as their underlying data storage, thus reducing or eliminating the need to copy data to the reservoir.

I was pleased to see a big section of the book devoted to discussing how clouds and (Hadoop) clusters intersect. With Cloud Computing and Big Data surfacing at about the same time, there is an (understandable) “if both are good by themselves, then they have to be good together” mindset out there. I was glad to see some rational discussion of how these two concepts don’t necessarily align. This isn’t to say that Hadoop cannot operate with cloud computing, but rather that there will be performance/scalability trade-offs if you try to get the most out of both approaches. Here's an excerpt from the book:

Conventional clouds consist of thousands of virtual servers and as long as nothing else is on your server beating the daylights out of it, you’re good to go. The problem with running Hadoop on clouds of virtualized servers is that it beats the crap out of bare-iron servers for a living. Hadoop achieves impressive scalability with shared-nothing isolation techniques, so Hadoop clusters hate to share hardware with anything else. Share no data and share no hardware — Hadoop is shared-nothing.

While saving money is a worthwhile pursuit, conventional cloud architecture is still a complicated exercise in scalable platform engineering. Many private conventional clouds will not be able to support Hadoop because they rely on each individual application to provide its own isolation. Hadoop in a cloud means large, pseudo-monolithic supercomputer clusters lumbering around. Hadoop is not elastic like millions of mailboxes can be. It is a 1000-node supercomputer cluster that thinks (or maybe wishes) it is still a Cray. Conventional clouds designed for mailboxes can be elastic but will not elastically accommodate supercomputing clusters that, as a single unit of service, span 25 racks.

Again, Needham isn’t suggesting that Hadoop can’t run on conventional clouds, but that your “SLA mileage will vary a lot”. Obviously, future technology changes from hardware vendors may alter this, but Hadoop was initially designed to run on commodity bare-metal boxes using Just a Bunch Of Disks (JBOD) configurations instead of network-available storage.

I won’t try to paraphrase the next, very interesting section of the book, where we get a peek at the future and at what the author believes needs to change for us to get to ludicrously big data, but it is definitely worth a read.

Needham is generous enough to share some personal experiences to help visualize how he feels things will need to evolve in order to “process zettabytes and yottabytes of data on million-node clusters”. He also makes heavy use of great analogies throughout the book, which always helps with understanding.

My only complaint about this short book is that there was a teaser at the beginning that perked up my privacy-nut ears: “civil liberties and privacy will be compromised as technology improvements make it affordable for any organization (private, public or clandestine) to analyze the patterns of data and behavior of anyone who uses a mobile phone”. I was let down not to find a discussion of this very important topic, which will surely touch all of us who focus on Big Data.

To wrap up, I certainly recommend reading Disruptive Possibilities. It is a well-written and concise book that doesn’t constantly reiterate a small number of basic arguments. If nothing else, the price (i.e., FREE) is right! If you do read it, I’d love to hear your thoughts on its applicability and quality, too.