Q&A: Hortonworks and IBM double down on Hadoop

New alliance brings IBM's DSX data science and machine learning toolkit to the Hortonworks Data Platform

By Paul Krill, Editor at Large, InfoWorld

Hortonworks and IBM recently announced an expanded partnership. The deal pairs IBM’s Data Science Experience (DSX) analytics toolkit and the Hortonworks Data Platform (HDP), with the goal of extending machine learning and data science tools to developers across the Hadoop ecosystem. IBM’s Big SQL, a SQL engine for Hadoop, will be leveraged as well.

InfoWorld Editor at Large Paul Krill recently met with Hortonworks CEO Rob Bearden and IBM Analytics general manager Rob Thomas at the DataWorks Summit conference in Silicon Valley, to talk about the state of big data analytics, machine learning, and Hadoop’s standing among the expanding array of technologies available for large-scale data processing.

InfoWorld: What does IBM Data Science Experience bring to the Hortonworks Data Platform?

Thomas: We launched Data Science Experience last year and the idea was we saw a change coming in the data science market. Traditionally, organizations were either SPSS users or SAS users but the whole market was moving toward open languages. We built Data Science Experience on Jupyter. It’s focused on Python data scientists, R, Spark, Scala programmers. You can use whatever language you want.

And you can use whatever framework you want for the machine learning underneath. You can use TensorFlow or Caffe or Theano … It’s really an open platform for data science. We focus on the collaboration, how you get data scientists working as a team as part of doing that.

Think about Hadoop. Hadoop has had an enormous run in the last five to six years in enterprises. There is a lot of data in Hadoop now. There is not super value for the client by just having data there. Sometimes, there is some cost savings. Where there is super value for the client is when they actually start to change how they’re interacting with that data, how they’re building models, discovering what’s happening in there.

InfoWorld: IBM has well-known experience in machine learning with Watson. Hortonworks has positioned Apache Spark and Hadoop as its entrance into the machine learning space. Can you discuss Hortonworks’ future plans for machine learning, AI, and data science?

Bearden: It’s going to be through the DSX framework and the IBM platforms that come through that. Hadoop and HDP will continue to be the platform. We’ll leverage some of the other processing platforms collectively like Spark and there’s a tremendous amount of work that IBM’s done to advance Spark. We’ll continue to embody that inside of HDP through YARN but then on top of all of these large data sets, we’ll leverage DSX and the rest of the IBM tool suite. We expressed that DSX and the rest of the tool suite from IBM for machine learning, deep learning, and AI will be our strategic platforms going forward and we’re going to co-invest very deeply to make sure all the integration is done properly. That goes back to being able to bring all resources into a focused distribution so that we can not only innovate horizontally but integrate vertically.

InfoWorld: InfoWorld ran a story late last year claiming that Hadoop had peaked, that other big data infrastructure including Spark, MongoDB, Cassandra, and Kafka were marching past it. InfoWorld asked Hortonworks CTO Scott Gnau a similar question last year. What can you say about the continued vitality of Hadoop?

Bearden: We’re a public company and we’re continuing to grow at 24 to 30 percent a year. The way we get paid is by bringing data under management. That’s one vector and it’s just a quantitative data point. I think what you have to then revert backwards to is, is the volume of data growing in the enterprise? According to just about any CIO you’ll speak with or any of the traditional industry analysts, and I think Rob will back this up, about every 18 months the volume of data doubles across the enterprise. About 70 to 80 percent of that data is not going to go into the traditional data platforms, the traditional SQL transactional EDW, etc., and they’re looking for that new area to come to rest, if you will. Hadoop is the right platform and architecture for that to happen. That’s why this partnership is so important. We’re great at landing that data, bringing it under management, securing it, providing the governance, etc., and being able to drive mission-critical workloads on some pretty good economics. But what the enterprise really wants is the ability to gain insight from it, to get access to it, to have visibility, to be able to act on a decision and create an action that drives value for an application.

Thomas: Maybe the hype peaked but the hype always peaks when the hard work starts. I think Hadoop is still in its early days. We’ll look back at some point and it will be like sitting here in 1992 saying relational warehouses have peaked. It was just the start. We’re in the same place but the hard work has begun, which is—all right, now we’ve got the data there, how do I actually integrate this across my whole data landscape, which is why Scott talked a lot about Big SQL and what we’re doing there. That’s a really hard problem and if people don’t solve that then there’s probably a natural limitation to how much they could do with Hadoop. But together we solve that problem to the point of the whole discussion on data science, data governance. When you bring those things to Hadoop and you do it at scale, it again changes the opportunity for how fast and how widely Hadoop can be deployed.

InfoWorld: What’s going to happen with the evolution of YARN? What’s next on the roadmap for it?

Bearden: The notion of containers and having the ability to then take a container-based approach to applications and being able to do that as an extension through YARN is actually part of the roadmap today. We published that and we think that opens up new use cases and applications that can leverage Hadoop.

You go back to the ability to get to existing applications, whether it be fraud detection, money laundering, two of the typical ones that you look at in financial services. Rapid diagnostics in the healthcare world, being able to get to better processing for genomics… analyzing the genome for certain kinds of diseases and being able to take those existing algorithms or applications and moving them over to the data via a container approach. You can do that much cleaner with YARN.

InfoWorld: Is there anything else you want to mention?

Thomas: I’d mention just one more point around data governance. We started working with Hortonworks over the last, oh, 18 months around a project called Atlas. I’d say it’s just coming into form as we’ve both been working with a lot of clients and we view it as a key part of our joint strategy around how we’re going to approach data governance. You use data governance for compliance. You use data governance for insights. There’s a big compliance mandate with things like GDPR (General Data Protection Regulation) that’s happening right now in Europe. I think you’ll see more and more on this topic in the future from us.

Paul Krill is an editor at large at InfoWorld, whose coverage focuses on application development.