This blog post introduces the three streaming frameworks that are bundled in the Hortonworks Data Platform (HDP) – Apache Storm, Spark Streaming, and Kafka Streams – and focuses on the topology supervision features offered to the topologies (aka workflows) running with, or within, these particular frameworks. This post does not attempt to fully describe each framework, nor does it provide examples of their usage via working code. The goal is to develop an understanding of what, if any, services are available to help with lifecycle events, scalability, management, and monitoring.

The Frameworks

...

Kafka Streams is tightly coupled with Kafka’s messaging platform, especially for the streaming input data. Kafka Streams is intentionally designed to fit into any Java or Scala application, which gives it plenty of flexibility, but it offers no inherent lifecycle, scaling, management, or monitoring features.

...

Apache Spark’s streaming frameworks allow for a variety of input and output data technologies. Spark Streaming apps are themselves Spark applications which, in a Hadoop cluster at least, run under YARN, which provides good coverage for many of the lifecycle and management features. The Spark framework itself addresses a number of the scaling and monitoring needs.

...

| Category | Feature | Kafka Streams | Spark Streaming | Apache Storm |
| --- | --- | --- | --- | --- |
| Lifecycle Events | Start | (blue star) RYO | (tick) Submitter | (tick) Submitter |
| Lifecycle Events | Stop | (blue star) RYO | (warning) Patterns available | (tick) Available |
| Lifecycle Events | Pause/Restart | (blue star) RYO | (blue star) Not available | (tick) Available |
| Scalability | Initial Parallelization | (blue star) RYO | (tick) Parameterized | (tick) Parameterized |
| Scalability | Runtime Elasticity | (blue star) RYO | (warning) Auto-scaling based on properties for min/max number of executors | (warning) No auto-scaling but each component can be scaled +/- individually |
| Management | Resource Availability | (blue star) No inherent resources | (tick) Managed | (tick) Managed |
| Management | Failure Restart | (blue star) RYO | (tick) Automatic | (tick) Automatic |
| Monitoring | Topology UI | (blue star) RYO | (warning) Combined with all Spark jobs | (tick) Centralized |
| Monitoring | Integration | (blue star) RYO | (tick) JMX | (tick) JMX |
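The Runtime Elasticity entry for Spark Streaming mentions properties for the min/max number of executors. As a rough sketch, and assuming that entry refers to Spark's dynamic allocation settings (`spark.dynamicAllocation.enabled`, `spark.dynamicAllocation.minExecutors`, `spark.dynamicAllocation.maxExecutors`), the bounds could be assembled as shown here. The class and method names are hypothetical, and the code only builds `--conf` flags; it does not talk to Spark.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: assembles executor bounds as spark-submit --conf flags.
// Assumption: the table's "min/max number of executors" maps to Spark's
// dynamic allocation properties.
class SparkElasticityConf {

    /** Returns dynamic allocation settings bounded by the given executor counts. */
    static Map<String, String> dynamicAllocation(int minExecutors, int maxExecutors) {
        if (minExecutors < 0 || maxExecutors < minExecutors) {
            throw new IllegalArgumentException("need 0 <= minExecutors <= maxExecutors");
        }
        Map<String, String> conf = new LinkedHashMap<>(); // keeps insertion order
        conf.put("spark.dynamicAllocation.enabled", "true");
        conf.put("spark.dynamicAllocation.minExecutors", String.valueOf(minExecutors));
        conf.put("spark.dynamicAllocation.maxExecutors", String.valueOf(maxExecutors));
        return conf;
    }

    /** Renders the settings as a space-separated list of --conf key=value flags. */
    static String toSubmitFlags(Map<String, String> conf) {
        StringJoiner flags = new StringJoiner(" ");
        conf.forEach((key, value) -> flags.add("--conf " + key + "=" + value));
        return flags.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSubmitFlags(dynamicAllocation(2, 10)));
    }
}
```

On YARN, dynamic allocation has typically also required the external shuffle service, so treat these three properties as a starting point rather than a complete configuration.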

...

Let me start by pointing out that it looks like Kafka Streams is “all bad”, but that’s not the case. It is built around the concept of writing and deploying standard applications and consciously does not want to be part of a runtime framework. Given that, and the focus of this blog post, it should be obvious why it scored so low on these features. The RYO (Roll Your Own – aka “custom”) callouts I gave are likely a badge of honor to the folks who are bringing us this framework.
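To give a feel for what an RYO callout implies, here is a minimal sketch in plain Java of the start, stop, and failure-restart supervision an application would have to provide for itself. It is not Kafka Streams code: `StreamApp` is a hypothetical stand-in for whatever streaming job the application embeds, and `RyoSupervisor` and `maxRestarts` are names invented for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical roll-your-own supervisor: start, stop, and restart-on-failure.
class RyoSupervisor {

    /** Hypothetical stand-in for an embedded streaming job. */
    interface StreamApp {
        void runUntilStopped() throws Exception; // blocks while processing
        void stop();                              // asks the job to finish
    }

    private final StreamApp app;
    private final int maxRestarts;
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private volatile int restarts = 0;

    RyoSupervisor(StreamApp app, int maxRestarts) {
        this.app = app;
        this.maxRestarts = maxRestarts;
    }

    /** Runs the app, restarting after failures; returns false once maxRestarts is exceeded. */
    boolean run() {
        int failures = 0;
        while (!stopRequested.get()) {
            try {
                app.runUntilStopped();
                return true; // clean exit
            } catch (Exception e) {
                failures++;
                if (failures > maxRestarts) {
                    return false; // give up; an operator has to step in
                }
                restarts = failures; // a real supervisor would also log and back off
            }
        }
        return true; // stop was requested between attempts
    }

    /** Requests a stop; intended to be called from a shutdown hook or an admin endpoint. */
    void stop() {
        stopRequested.set(true);
        app.stop();
    }

    int restartCount() {
        return restarts;
    }

    public static void main(String[] args) {
        final int[] attempts = {0};
        StreamApp flaky = new StreamApp() { // fails twice, then exits cleanly
            @Override public void runUntilStopped() throws Exception {
                if (attempts[0]++ < 2) throw new IllegalStateException("simulated failure");
            }
            @Override public void stop() { }
        };
        RyoSupervisor supervisor = new RyoSupervisor(flaky, 3);
        Runtime.getRuntime().addShutdownHook(new Thread(supervisor::stop));
        boolean clean = supervisor.run();
        System.out.println("clean exit: " + clean + ", restarts: " + supervisor.restartCount());
    }
}
```

Even this small sketch leaves out pause/resume, scaling, and monitoring, which is the point of the RYO label: each of those is left to the application team.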

Kafka Streams also has a lot of early interest and I surely would not discount it for a second. The biggest issue for those teams who stand up a decent-sized Hadoop/Spark cluster is that you don’t get to take advantage of all those nodes to run your Kafka Streams apps on. You’ll need to size out what is needed for each application and ensure that the needed resources are available to run your apps on.

On the other end of the spectrum, one would think that with an almost perfect green checkmark score on the features identified, Storm would be a no-brainer. Storm is the grandpa of the streaming engines, and its event-level isolation provides something the microbatch frameworks can’t. This maturity shines through in all of these supervision features, but on the other hand it is the least “exciting” of the frameworks for folks starting their streaming journey in 2019. If you need to get something into production ASAP and you just need to know it works, all day long and every day, then go with Storm!

This brings me to my personal recommendation of Spark Streaming. Note that this comes from a guy who really does love Apache Storm and values the simplicity & flexibility of Kafka Streams. There is simply too much excitement & focus around Spark in general, and the ability to transition applications between batch and streaming paradigms with minimal coding closes the case. It is still maturing, but its alignment with YARN helps it score high on many of these supervision-oriented features.