Apache Storm 1.0 packs a punch

Apache's streaming data processing system takes on Spark with better performance and more convenient debugging features

When big data mavens debate the merits of using Apache Spark versus Apache Storm for streaming data processing, the argument usually sounds like this: Sure, Storm has great scale and speed, but it's hard to use. Plus, it's slowly being overtaken by Spark, so why go with old and busted when there's new and hot?

Apache Storm 1.0 hopes to turn the ship around, not only by being faster but also by being easier and more convenient to work with.

Apache announced this week that Apache Storm 1.0 can crank out results "up to 16 times faster" than before, with a 60 percent reduction in latency. The announcement adds: "For most use cases users can expect a 3× performance boost over earlier versions."

A collection of strategic fixes provides the performance boosts. Among them is a new distributed cache API that lets data associated with a given Storm setup, or "topology," be shared between nodes and updated from the command line, so it doesn't have to be redeployed by hand to each node. That data can run to many gigabytes. It can be drawn from the local filesystem, but if it is stashed in a Hadoop HDFS store -- a good place to put it -- it can be drawn from there as well.

A new batching methodology also provides a major speed boost -- one micro-benchmark increased fivefold -- with only a very slight increase in latency.
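
To see why batching trades a little latency for a lot of throughput, consider a rough sketch in plain Python. This is not Storm's code or API; the function name send_in_batches and the send callback are hypothetical, invented for illustration. The point is only that grouping messages cuts the number of costly hand-offs, while each message may wait briefly for its batch to fill.

```python
from typing import Callable, Iterable, List


def send_in_batches(messages: Iterable[str], batch_size: int,
                    send: Callable[[List[str]], None]) -> int:
    """Group messages into batches and return how many send() calls were made.

    Hypothetical helper for illustration only; not part of Apache Storm.
    """
    batch: List[str] = []
    calls = 0
    for message in messages:
        batch.append(message)
        if len(batch) >= batch_size:
            send(batch)   # one hand-off covers batch_size messages
            calls += 1
            batch = []    # start a fresh list so sent batches are not mutated
    if batch:             # flush the final, partially filled batch
        send(batch)
        calls += 1
    return calls


# Ten messages sent four at a time need far fewer hand-offs than ten.
received: List[List[str]] = []
messages = [f"tuple-{i}" for i in range(10)]
calls = send_in_batches(messages, batch_size=4, send=received.append)
print(calls, [len(b) for b in received])
```

The first message in each batch is held until the batch fills, which is where the small latency penalty comes from.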

Many of the other changes in version 1.0 make Storm easier to work with. Debugging earlier releases of Storm typically involved writing custom "bolts" (processing functions) to extract live data. With version 1.0, users can sample a percentage of the data moving through Storm, and the sampled data can be viewed in the UI or saved to disk for later inspection. Likewise, a new log-search function lets the user search logs across all the Storm supervisor nodes running a topology.
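
The idea behind percentage sampling can be sketched in a few lines of plain Python. Again, this is an illustration under stated assumptions rather than how Storm implements it: sample_stream and its parameters are hypothetical names, and the events are made-up strings standing in for live data.

```python
import random
from typing import Iterable, Iterator, Optional


def sample_stream(events: Iterable[str], percent: float,
                  rng: Optional[random.Random] = None) -> Iterator[str]:
    """Yield roughly `percent` percent of events, chosen independently at random.

    Hypothetical helper for illustration only; not part of Apache Storm.
    """
    rng = rng or random.Random()
    threshold = percent / 100.0
    for event in events:
        # random() returns a value in [0.0, 1.0), so 100 keeps everything
        # and 0 keeps nothing.
        if rng.random() < threshold:
            yield event


# Inspect about a tenth of the stream instead of every event.
events = [f"event-{i}" for i in range(1000)]
sampled = list(sample_stream(events, percent=10, rng=random.Random(42)))
print(len(sampled), "of", len(events), "events kept for inspection")
```

Sampling keeps the inspection overhead proportional to the chosen percentage, so a live stream can be examined without capturing all of it.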

Storm faces competition from more than Spark alone, both in terms of performance and ease of use. The Project Apex streaming framework, also known as DataTorrent RTS, is meant to be "10 to 100 times faster" than Spark Streaming, and is easier to develop with and deploy than either Spark or Storm.

Serdar Yegulalp is a senior writer at InfoWorld, focused on machine learning, containerization, devops, the Python ecosystem, and periodic reviews.