IBM Invests to Help Open-Source Big Data Software — and Itself

Participants at a Spark programming session sponsored by IBM in San Francisco. Credit: George Nikitin/Feature Photo Service for IBM

The IBM “endorsement effect” has often shaped the computer industry over the years. In 1981, when IBM entered the personal computer business, the company decisively pushed an upstart technology into the mainstream.

In 2000, many corporations viewed the open-source operating system Linux askance, as an oddball creation that was even legally risky to use, since the open-source ethos favors sharing ideas over owning them. But IBM endorsed Linux and poured money and people into accelerating its adoption.

On Monday, IBM is to announce a broadly similar move in big data software. The company is placing a large investment — contributing software developers, technology and education programs — behind an open-source project for real-time data analysis, called Apache Spark.

The commitment, according to Robert Picciano, senior vice president for IBM’s data analytics business, will amount to “hundreds of millions of dollars” a year.

In the big data software market, much of the attention and investment so far has been focused on Apache Hadoop and the companies distributing that open-source software, including Cloudera, Hortonworks and MapR. Hadoop, put simply, is the software that makes it possible to handle and analyze vast volumes of all kinds of data. The technology came out of the pure Internet companies like Google and Yahoo, and is increasingly being used by mainstream companies, which want to do similar big data analysis in their businesses.

But if Hadoop opens the door to probing vast volumes of data, Spark promises speed. Real-time processing is essential for many applications, from analyzing sensor data streaming from machines to sales transactions on online marketplaces. The Spark technology was developed at the Algorithms, Machines and People Lab at the University of California, Berkeley. A group from the Berkeley lab founded a company two years ago, Databricks, which offers Spark software as a cloud service.

Spark, Mr. Picciano said, is crucial technology that will make it possible to “really deliver on the promise of big data.” That promise, he said, is to quickly gain insights from data to save time and costs, and to spot opportunities in fields like sales and new product development.

IBM said it will put more than 3,500 of its developers and researchers to work on Spark-related projects. It will contribute machine-learning technology to the open-source project, and embed Spark in IBM’s data analysis and commerce software. IBM will also offer Spark as a service on its programming platform for cloud software development, Bluemix. The company will open a Spark technology center in San Francisco to pursue Spark-based innovations.

And IBM plans to partner with academic and private education organizations including UC Berkeley’s AMPLab, DataCamp, Galvanize and Big Data University to teach Spark to as many as 1 million data engineers and data scientists.

Ion Stoica, the chief executive of Databricks, who is a Berkeley computer scientist on leave from the university, called the IBM move “a great validation for Spark.” He had talked to IBM people in recent months and knew they planned to back Spark, but, he added, “the magnitude is impressive.”

With its Spark initiative, analysts said, IBM wants to lend a hand to an open-source project, woo developers and strengthen its position in the fast-evolving market for big data software.

By aligning itself with a popular open-source project, IBM, they said, hopes to attract more software engineers to use its big data software tools, too. “It’s first and foremost a play for the minds — and hearts — of developers,” said Dan Vesset, an analyst at IDC.

IBM is investing in its own future as much as it is contributing to Spark. IBM needs a technology ecosystem in which it is a player and has influence, even if it does not immediately profit from it. IBM mainly makes its living selling applications, often tailored to individual companies, that address business challenges like marketing, customer service, supply-chain management and developing new products and services.

“IBM makes its money higher up, building solutions for customers,” said Mike Gualtieri, an analyst for Forrester Research. “That’s ultimately why this makes sense for IBM.”