Review: Storm’s real-time processing comes at a price

The open source stream processing solution is proven reliable at scale, but difficult to learn and use


Storm, a top-level Apache project, is a Java framework designed to help programmers write real-time applications that run on Hadoop clusters. Designed at Twitter, Storm excels at processing high-volume message streams to collect metrics, detect patterns, or take action when certain conditions appear in the stream. Typical Storm scenarios sit at the intersection of real time and high volume, such as analyzing financial transactions for fraud or monitoring cell-tower traffic to maintain service level agreements.

Traditionally these sorts of systems have been constructed using a network of computers connected by a message bus (such as JMS). What makes Storm different is that it combines the message passing and processing infrastructure into a single conceptual unit known as a “topology” and runs it on a Hadoop cluster. This means that Storm clusters can take advantage of the linear scalability and fault tolerance of Hadoop, without the need to reconfigure the messaging bus when increasing capacity.

When working with teams new to Storm, I have found it helpful to approach system design from three dimensions: operations, topology, and data. These roughly map onto their corresponding dimensions in traditional enterprise applications, but translated into the Hadoop world. A Storm topology is a processing workflow analogous to a set of steps in a processing pipeline that would be managed by Oozie in a multipurpose Hadoop cluster.

The topology is the fundamental unit of deployment in Storm. It consists of two types of objects: spouts (message sources) and bolts (message processors). Spouts are available for many common data sources such as JMS, Kafka, and HBase.
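To make the shape of a topology concrete, here is a minimal wiring sketch using Storm's TopologyBuilder. It assumes the Storm 0.9-era backtype.storm packages (later releases moved to org.apache.storm), and the spout and bolt classes are hypothetical stand-ins for your own components (sketches of a spout and a bolt appear later in this review).

import backtype.storm.generated.StormTopology;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
    // Wires one spout to two bolts. SentenceSpout, SplitSentenceBolt, and
    // WordCountBolt are hypothetical classes standing in for your own components.
    public static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");                // distribute tuples randomly
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word")); // same word always goes to same task
        return builder.createTopology();
    }
}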

Deploying a Storm cluster

Storm clusters typically run multiple Storm applications (topologies) simultaneously. In this sense a Storm cluster is analogous to a Java application server. A developer bundles up a JAR file containing the Storm topology and all of its dependencies, then deploys it to the cluster, where it runs until terminated. Storm application developers do not need to know the specific configuration of the cluster their application runs on, which leaves them free to focus on application logic.
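Submission itself is a single API call. The sketch below is illustrative only, reusing the hypothetical WordCountTopology from the earlier sketch and an arbitrary topology name:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;

public class SubmitWordCount {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setNumWorkers(4);  // worker JVMs to request from the cluster

        // Nimbus distributes the submitted JAR to the supervisors; the topology
        // then runs until it is explicitly killed. "word-count" is an arbitrary name.
        StormSubmitter.submitTopology("word-count", conf, WordCountTopology.build());
    }
}

In practice this class is launched through Storm's storm jar command, which makes the packaged JAR available for Nimbus to distribute around the cluster.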

Installing a Storm cluster these days is straightforward using Ambari. I was able to install Storm and its dependencies in less than 10 minutes using Hortonworks Data Platform. AWS users who want to spin up Storm clusters on Amazon should check out the storm-deploy project. HDInsight users can provision a Storm cluster on Azure with only a few clicks from the Azure portal or from PowerShell. Manual installation of Storm is straightforward but tedious.

During installation, you will typically be asked where you would like the major components of Storm placed. The three master services -- Nimbus, the UI server, and the distributed remote procedure call (DRPC) server -- are usually located together on a single node. Then there are the worker nodes, confusingly called “supervisors.” The more of these you can deploy, the more processing power your Storm cluster will have. Unlike with Hadoop services, the decision of which services to place on which nodes is less critical, because Storm services are fault tolerant.
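Whichever installer you use, the node placement ends up expressed in Storm's storm.yaml configuration file. The following is a minimal sketch only, using Storm 0.9-era property names and placeholder hostnames:

# Minimal storm.yaml sketch (Storm 0.9.x keys); hostnames are placeholders.
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
  - "zk3.example.com"
nimbus.host: "master.example.com"   # Nimbus, UI, and DRPC colocated on one node
drpc.servers:
  - "master.example.com"
ui.port: 8080
storm.local.dir: "/var/storm"
supervisor.slots.ports:             # four worker JVM slots per supervisor node
  - 6700
  - 6701
  - 6702
  - 6703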

Storm cluster architecture

Under the hood, Storm has a lot of moving parts, but because they use Zookeeper to maintain state, they are fault tolerant and can be restarted, even on different nodes, without affecting the running application. Although Zookeeper is a critical component of a Storm cluster, all you need to know at this point is that it’s a centralized service for maintaining configuration information and coordinating distributed applications. Storm stores all of its important information in Zookeeper, so if a component should fail, it can be restarted and will pick up where it left off by reading its state from Zookeeper.

The master service, known as Nimbus, is responsible for resource allocation in the Storm cluster. This daemon, implemented as a Thrift service, accepts incoming topologies, distributes code around the cluster, assigns tasks to supervisors, and handles failed nodes. Although a Nimbus outage will not cause running topologies to fail, Nimbus is needed to reassign work when worker nodes die, so it should be run under some kind of monitoring program that restarts it automatically.

Finally, the worker nodes, or supervisors, are responsible for running spouts and bolts in a pool of JVMs and threads (“workers” and “tasks” in Storm-speak). A bolt or spout executes as one or more threads running in parallel across the cluster. This gives Storm its scalability. Further, Storm worker nodes can be shared with other Hadoop services like Spark or MapReduce.
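The mapping from components to workers and tasks is controlled by parallelism settings when the topology is built. A minimal sketch, again with a hypothetical bolt class and purely illustrative numbers:

import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

public class ParallelismSketch {
    // FraudDetectionBolt is hypothetical; "events" is assumed to be a spout id.
    public static void configure(TopologyBuilder builder, Config conf) {
        conf.setNumWorkers(2);                                  // two worker JVMs across the cluster
        builder.setBolt("detect", new FraudDetectionBolt(), 8)  // eight threads (executors)
               .setNumTasks(16)                                 // sixteen bolt instances (tasks)
               .shuffleGrouping("events");
    }
}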

Managing a Storm cluster

Storm comes with a simple GUI that displays metrics collected by Nimbus. The metrics are useful for knowing when it is time to add nodes to a cluster and for tuning a topology. Although useful, the Storm GUI is not a complete solution, and most production clusters use additional monitoring tools such as JMX, Graphite, or Metrics (Yammer). These tools require advanced configuration and tuning knowledge, and when the need for troubleshooting arises, be prepared to spend a good deal of time grepping through the logs.
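One hook worth knowing about: Storm's built-in metrics can be routed to an external system by registering a metrics consumer in the topology configuration. A minimal sketch using the LoggingMetricsConsumer that ships with Storm (a Graphite or JMX consumer would be registered the same way, with a different class):

import backtype.storm.Config;
import backtype.storm.metric.LoggingMetricsConsumer;

public class MetricsConfig {
    // LoggingMetricsConsumer writes Storm's built-in metrics to the worker logs.
    public static Config withMetrics() {
        Config conf = new Config();
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1); // one consumer task
        return conf;
    }
}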

Storm UI

Storm comes with a simple GUI that provides essential metrics, but a complete monitoring solution will require additional tools.  

Developing a Storm application

Storm is written in a combination of Java and Clojure, though spouts and bolts can be written in any language that supports Thrift, such as LISP, Python, or JavaScript. Developers work in local mode when writing and debugging their topologies. In local mode, threads simulate worker nodes, allowing the developer to set breakpoints, halt execution, inspect variables, and profile a topology before deploying it to a distributed cluster, where all of this is much more difficult.
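A local-mode run takes only a few lines of code. The sketch below is illustrative, reusing the hypothetical WordCountTopology from earlier and the 0.9-era backtype.storm packages:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.utils.Utils;

public class LocalRunner {
    public static void main(String[] args) {
        Config conf = new Config();
        conf.setDebug(true);                       // log every emitted tuple

        // LocalCluster simulates supervisors with in-process threads, so
        // breakpoints and profilers work as they would in any Java program.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count-local", conf, WordCountTopology.build());

        Utils.sleep(60000);                        // let the topology run for a minute
        cluster.killTopology("word-count-local");
        cluster.shutdown();
    }
}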

A topology is implemented as a set of Thrift data structures. In a typical workflow, developers write spouts and bolts that operate over streams of tuples (ordered lists of values of any type), then assemble those components into a topology specific to the problem they are trying to solve. When testing is complete, the bundled JAR is submitted to Nimbus to be executed on the cluster.

Designing a topology is similar in many ways to designing a solution architecture. Each bolt in the processing pipeline modifies the tuples in the stream. The Storm architect’s job is to connect multiple bolts together in a topology to implement a complex set of real-time transformations or analysis. Given a sufficiently rich library of bolts, the process of developing a new application becomes one of simply combining the parts into the right topology and modifying any required parameters.

Data transformation

The fundamental data abstraction in Storm is a stream, or an unbounded sequence of tuples. Each stream in a topology is uniquely identified, as are the tuples within it. Creating a topology is, conceptually, specifying the “flow” of the streams through the bolts. Streams are produced by spouts that act as adapters between external data sources and a Storm cluster.
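To give a feel for the spout side of that contract, here is a toy spout sketch (a hypothetical class with hard-coded data); a real spout would pull from Kafka, JMS, or another external source instead:

import java.util.Map;
import java.util.Random;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Emits random sentences onto the default stream, one field per tuple.
public class SentenceSpout extends BaseRichSpout {
    private static final String[] SENTENCES = {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away"
    };
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm; emit one tuple per call.
        collector.emit(new Values(SENTENCES[random.nextInt(SENTENCES.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence")); // every tuple carries one named field
    }
}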

A bolt’s job is to process tuples as they are delivered by the stream. Bolts do all the heavy lifting in a Storm application. They are used to filter, aggregate, or join tuples, and they can send tuples to external message queues, databases, or HDFS. Bolts can consume any number of input streams and produce any number of output streams. Like spouts, prebuilt bolts are available for common data stores, including MongoDB, Cassandra, HBase, HDFS, and many relational database management systems.
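And the bolt side, as a companion to the spout sketch above: a hypothetical bolt that splits each sentence tuple into word tuples. BaseBasicBolt handles acking automatically, and the field names match the spout and topology sketches earlier.

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Transforms each incoming sentence tuple into one output tuple per word.
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        for (String word : input.getStringByField("sentence").split("\\s+")) {
            collector.emit(new Values(word));   // one output tuple per word
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}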
