Spark and Storm face new competition for real-time Hadoop processing

DataTorrent is releasing its real-time data processing engine for Hadoop and beyond as the open source Project Apex

Real-time processing of streaming data in Hadoop typically comes down to choosing between two projects: Storm or Spark. But a third contender, an open source version of a formerly commercial-only product, is about to enter the race, and like those two projects, it may have a future outside of Hadoop.

DataTorrent RTS (real-time streaming) has long been a commercial offering for live data processing, separate from the family of Apache Software Foundation open source projects around Hadoop. But now DataTorrent (the company) is preparing to open-source the core DataTorrent RTS engine, offer it under the same Apache 2.0 licensing as its competitors, and eventually contribute it to the Apache Software Foundation for governance.

Built for business

Project Apex, as the open source version of DataTorrent RTS's engine is to be called, is meant to not only compete with Storm and Spark but to be superior to them -- to run faster (10 to 100 times faster than Spark, it's claimed), to be easier to program, to better support enterprise needs like fault tolerance and scalability, and to make it easier to demonstrate the value of Hadoop to a business owner.

According to DataTorrent VP of Marketing John Fanelli, DataTorrent RTS/Project Apex is meant to ease the stream processing work that developers currently do with Spark. "Spark is very much a development framework," Fanelli said in a phone conversation, "where you have to write everything by hand ... and where you have to think and program in more of a MapReduce paradigm."

Fanelli said that Spark lacks other key features that would be attractive to enterprises, such as event processing, the ability to guarantee the order of events, and fault tolerance at the platform level. Apex doesn't require Scala to program it, meaning existing Java programmers wouldn't need to do as much retooling to use it. (Spark is written in Scala and can be programmed in Scala as well as a few other languages, including Python and Java -- but the best results with Spark generally come from using Scala.)

Fanelli also felt Apex could help Spark users move away from time-consuming, batch-oriented methods of generating insights from existing data. "It's better to use a streaming product to do batch than it is to use a batch product to do streaming," he said.
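
Fanelli's remark about doing batch work on a streaming engine can be illustrated with a toy sketch. It is written in plain Python and does not use any Apex, Spark, or Storm API; the function names and sample data are hypothetical. The idea it assumes is that a batch is just a stream that happens to end, so an operator written to handle one event at a time can also serve a batch job.

```python
from typing import Iterable, Iterator


def running_totals(events: Iterable[int]) -> Iterator[int]:
    """Streaming-style operator (hypothetical): emit an updated total per event."""
    total = 0
    for value in events:
        total += value
        yield total


def batch_total(events: Iterable[int]) -> int:
    """Batch job built on the same operator: drain a finite input, keep the last result."""
    last = 0
    for last in running_totals(events):
        pass
    return last


stored = [3, 5, 2]                   # made-up stored data, treated as a finite stream
print(list(running_totals(stored)))  # [3, 8, 10]
print(batch_total(stored))           # 10
```

Going the other way is harder: a job that needs the whole input before it produces anything has no answer to give while events are still arriving, which is roughly the asymmetry the quote points at.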

Hadoop might only be the beginning

There's little question Apex is being open-sourced in part to entice users toward the commercial DataTorrent RTS product. Many of the commercial product's features -- such as graphical app design and dynamic optimization of workloads, which expand upon the core that Apex offers -- are an attempt to deliver what Fanelli feels are value propositions Hadoop doesn't always communicate well to enterprise customers, like generating real-time actionable insight from ingested data.

If Hadoop isn't taking off in some enterprises because of doubts about its value proposition, no single issue is to blame. Aside from the perception that Hadoop is overkill for the work being done, there's also the notion that Hadoop is too costly or complex to be worth the trouble. Hadoop vendors keep trying to address these issues, but there's reason to believe Hadoop only has so much appeal with enterprises.

Likely less limited is the culture of reuse and development around individual pieces within Hadoop, like Spark -- and now Project Apex. Their real-time processing functionality doesn't have to be coupled with Hadoop to be useful, although that has been the most common way they're used. Having Apex as an open source project will add another option to the real-time processing toolbelt, one that's useful regardless of what happens with Hadoop.

Serdar Yegulalp is a senior writer at InfoWorld, focused on machine learning, containerization, devops, the Python ecosystem, and periodic reviews.