Yahoo makes TensorFlow and Spark better together

Open source project that merges deep learning and big data frameworks is said to operate more efficiently at scale and require little change to existing Spark apps

Want Google TensorFlow’s deep learning chocolate in your Spark peanut butter? Good news: Yahoo has unveiled TensorFlowOnSpark to satisfy that craving.

Last year Yahoo combined two stars of big data and machine learning, integrating the in-memory data processing framework Spark with the deep learning framework Caffe. Spark applications could embed Caffe’s training functionality or use trained Caffe models to make predictions that weren’t possible with Spark’s native machine learning library.

The latest Yahoo project, TensorFlowOnSpark (TFoS), does exactly what its name says: It brings the TensorFlow deep learning library to Spark.

In a blog post, Yahoo’s Big ML engineering team described how this mingling of deep minds and big data arose from the need to make TensorFlow easier to deploy on existing clusters, like those running Spark. Other projects had already aimed to do that: Databricks’ TensorFrames, which uses GPU acceleration, and the SparkNet project, created at the same Berkeley lab that gave rise to Spark.

TFoS was created partly in response to perceived inadequacies in those projects. “While these approaches are a step in the right direction,” wrote Yahoo, “after examining their code, we learned we would be unable to get the TensorFlow processes to communicate with each other directly, we would not be able to implement asynchronous distributed learning, and we would have to expend significant effort to migrate existing TensorFlow programs.”

TFoS was designed to run on existing Spark and Hadoop clusters and to work with existing Spark libraries such as Spark SQL and Spark’s MLlib machine learning library. Yahoo claims existing TensorFlow programs do not need to be heavily modified to work with TFoS. “Typically, changing fewer than 10 lines of Python code is needed,” it said. Parallel instances of TensorFlow can communicate directly with each other without having to go through Spark itself, and data can be ingested either through TensorFlow’s native facilities for reading from HDFS or through Spark.
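Yahoo’s published TFoS examples suggest the conversion follows a simple pattern: wrap the existing TensorFlow program in a map function, then launch it across Spark executors from a small driver. The following is a minimal sketch of that pattern, not a definitive recipe; main_fun, num_executors, num_ps, and data_rdd are placeholders standing in for your own training code, cluster sizing, and input data:

from pyspark import SparkContext
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # Your existing TensorFlow training code goes here, largely unchanged.
    # ctx tells this process its role in the distributed cluster:
    # ctx.job_name is "ps" or "worker", ctx.task_index is its rank.
    import tensorflow as tf
    # ... build the graph, open a session, run the training loop ...

sc = SparkContext(appName="tfos-sketch")

num_executors = 4  # one TensorFlow process per Spark executor
num_ps = 1         # how many of those processes act as parameter servers

# Reserve executors and start TensorFlow on them. InputMode.SPARK feeds
# training data in as Spark RDDs; InputMode.TENSORFLOW would instead let
# the TensorFlow processes read directly from HDFS themselves.
cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps,
                        False, TFCluster.InputMode.SPARK)

data_rdd = sc.textFile("hdfs:///path/to/training/data")  # placeholder input
cluster.train(data_rdd, 1)  # push the RDD to the TF workers for one epoch
cluster.shutdown()

The same cluster object exposes an inference call that maps an input RDD to predictions, so a trained model can be used for scoring from the same driver program.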

Clustered machine learning frameworks run faster when they can use remote direct memory access (RDMA), which lets one machine read or write another machine’s memory directly over the network, bypassing the remote CPU. The mainline TensorFlow project doesn’t support RDMA as a core feature, though it’s in the works. Rather than wait, Yahoo elected to create its own RDMA support and add it to TensorFlow’s C++ layer; the company is sharing its implementation as alpha-quality code.

Even without Yahoo’s contributions, TensorFlow has been progressing by leaps and bounds. The first full 1.0 version of the framework introduced optimizations that make it possible to deploy it on smartphone-grade hardware, and IBM chose TensorFlow as the deep learning system for its custom machine learning hardware.

When it comes to running at scale, TensorFlow’s most direct competition is MXNet, the deep learning system that Amazon has thrown its weight behind. Amazon claims MXNet scales better than the competition across multiple nodes, so it’s faster to train models if you have the hardware to devote to the problem. It’ll be worth seeing how TensorFlowOnSpark compares, both in how well it runs on big clusters and in how convenient it is to work with.

Copyright © 2017 IDG Communications, Inc.