Apache Flink 1.0 takes on Spark in Hadoop processing

Hadoop needs fast and easy-to-use stream processing, and Flink provides that -- but it'll compete with Spark and Storm

Apache Flink, a potential contender for Apache Spark's big-data processing jobs, released its first API-stable 1.0 version this week.

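Spark is mainly for in-memory processing of batch data. Though it has a stream processing engine, Spark Streaming, that engine handles incoming data in small batches rather than one event at a time, and streaming -- processing incoming data in real time -- has not been its strong suit.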

Flink, on the other hand, was built around a stream model, which it can apply to batch and SQL processing jobs as well. It includes libraries for complex event processing (essentially, a pattern detection system for streams), machine learning, and graph processing.
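To make the "batch as a special case of streaming" idea concrete, here is a minimal sketch in plain Python. It does not use Flink's API; the function names `count_words` and `batch_result` are hypothetical and exist only for illustration. The same streaming operator serves both cases: a live stream consumes its updates one at a time, while a batch job treats a finite dataset as a bounded stream and keeps only the final result.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def count_words(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Streaming operator: emit an updated (word, count) pair per incoming word."""
    counts = Counter()
    for line in records:
        for word in line.split():
            counts[word] += 1
            yield word, counts[word]


def batch_result(records: Iterable[str]) -> Dict[str, int]:
    """Batch job: run the same operator over a bounded stream, keep the final counts."""
    final: Dict[str, int] = {}
    for word, count in count_words(records):
        final[word] = count  # later updates overwrite earlier ones
    return final


if __name__ == "__main__":
    # Streaming use: react to each update as it arrives.
    for update in count_words(iter(["to be", "or not to be"])):
        print(update)
    # Batch use: only the end state matters.
    print(batch_result(["to be", "or not to be"]))
```

The only difference between the two modes in this sketch is whether the input ends and whether intermediate updates are observed.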

The streaming model benefits iterative processing -- repeated passes over the same data, as in machine learning applications. Flink can be instructed to process only the parts of the data that have actually changed between passes, which can significantly speed up the job. Spark can perform iterative processing as well, but each iteration has to be scheduled and executed separately.
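The idea of reprocessing only what changed can be sketched in plain Python. This is not Flink code; it is an illustrative take on the general "workset" approach to iteration, assuming a connected-components job on a small undirected graph. The function name `connected_components` and the `touched` counter are hypothetical. Each round visits only the vertices whose label changed in the previous round, rather than the whole graph.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def connected_components(
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> Tuple[Dict[Hashable, Hashable], int]:
    """Label each vertex with the smallest vertex id in its component.

    Returns the labels and the number of vertex visits performed.
    """
    neighbors: Dict[Hashable, Set[Hashable]] = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    label = {v: v for v in neighbors}  # current solution
    workset = set(neighbors)           # vertices that changed last round
    touched = 0

    while workset:
        next_workset: Set[Hashable] = set()
        for v in workset:
            touched += 1
            for n in neighbors[v]:
                # Push a smaller label to the neighbor and mark it as changed.
                if label[v] < label[n]:
                    label[n] = label[v]
                    next_workset.add(n)
        workset = next_workset         # unchanged vertices are skipped next round

    return label, touched


if __name__ == "__main__":
    labels, visits = connected_components([(1, 2), (2, 3), (4, 5)])
    print(labels, visits)
```

As the labels settle, the workset shrinks, so later rounds do far less work than a full pass over the data would.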

Flink manages memory more efficiently than Spark, since it has its own memory management system that reduces the amount of garbage collection performed by the JVM. Spark has done a lot of work to address these issues via its Project Tungsten initiative, but Flink implemented such ideas far earlier in its lifecycle. State that needs to be stored when processing a stream can be held in an instance of RocksDB, an open source key-value store developed by Facebook.
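A rough sketch of what keeping stream state in a key-value store looks like, again in plain Python rather than Flink's API. The class `RunningSum` is hypothetical, and an ordinary `dict` stands in for a store such as RocksDB. Keys and values are kept as serialized bytes, which is an assumption made here to mirror how byte-oriented stores work; it also hints at why holding data in serialized form leaves fewer live objects for a garbage collector to track.

```python
import struct
from typing import MutableMapping


class RunningSum:
    """Per-key running total whose state lives in a pluggable byte store."""

    def __init__(self, store: MutableMapping[bytes, bytes]) -> None:
        self.store = store  # any mapping of bytes to bytes will do

    def process(self, key: str, value: int) -> int:
        k = key.encode("utf-8")
        raw = self.store.get(k)
        # State is stored as an 8-byte big-endian signed integer.
        current = struct.unpack(">q", raw)[0] if raw is not None else 0
        total = current + value
        self.store[k] = struct.pack(">q", total)
        return total


if __name__ == "__main__":
    op = RunningSum({})  # dict used as a stand-in for a real key-value store
    for key, value in [("a", 1), ("b", 5), ("a", 2)]:
        print(key, op.process(key, value))
```

Because the operator only talks to a mapping interface, the in-memory dict could be swapped for a disk-backed store without changing the processing logic.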

Flink is also likely to eclipse Apache Storm, a stream-processing system with a broad development ecosystem. Users can run existing Storm topologies in Flink, which eases the transition between the two. It's a smart move given Storm's reputation for being tough to use.

InfoWorld's Andy Oliver looked at Flink back in mid-2015 and found it to be "more promise than practical experience" -- a good idea that meets real needs, but at that point "a niche technology that people use when Spark or Storm doesn't work out."

Apart from that, Spark's widespread existing popularity means Flink faces the challenges inherent in any project where incumbents already hold the field. But there's clearly a need for real-time processing frameworks that blend Spark's simplicity with Storm's low-latency, stream-first approach.

Serdar Yegulalp is a senior writer at InfoWorld, focused on machine learning, containerization, devops, the Python ecosystem, and periodic reviews.