Big data is all about the cloud

Picking between Spark and Hadoop isn't the key to big data success. Picking the right infrastructure is

Big data isn't about real-time vs. batch processing. It's not a question of either/or, as Ovum analyst Tony Baer and others stress. Given the broad range of options and workloads that make up a successful big data strategy, this isn't surprising or controversial.

More controversial, though perhaps not surprising, is the nature of the infrastructure required to get the most from big data. For example, AWS (Amazon Web Services) data science chief Matt Wood warns that, while "analytics is addictive," this positive addiction quickly turns sour if your infrastructure can't keep up.

The key to big data success, Wood says, is more than Spark or Hadoop. It's running both on elastic infrastructure.

Hortonworks Vice President of Corporate Strategy Shaun Connolly agrees that the cloud has a big role to play in big data analytics. But Connolly believes the biggest factor in determining where big data processing is done is "data gravity," not elasticity.

The main driver for big data deployments, Connolly says, is to extend and augment traditional on-premises systems, such as data warehouses. Eventually, this leads large organizations to deploy Hadoop and other analytics clusters in multiple locations -- typically on site.

Nevertheless, Connolly acknowledges, the cloud is emerging as an increasingly popular option for the development and testing of new analytics applications and for the processing of big data that is generated "outside the four walls" of the enterprise.

Essential ingredients for big data analytics

While AWS big data customers range from nimble startups like Reddit to massive enterprises like Novartis and Merck, Wood suggests three key components of any analytics system.

  1. A single source of truth. AWS provides multiple ways to store this single source of truth, from S3 storage to databases like DynamoDB or RDS or Aurora to data warehousing solutions like Redshift.
  2. Real-time analytics. Wood says that companies often augment this single source of truth with streaming data, such as website clickstreams or financial transactions. While AWS offers Kinesis for real-time data processing, other options exist like Apache Storm and Spark.
  3. Dedicated task clusters. A task cluster is a group of instances running a distributed framework like Hadoop, but spun up specifically for a dedicated task such as data visualization (see the sketch after this list).
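
To make the third component concrete, here is a minimal sketch of spinning up a dedicated, auto-terminating task cluster on EMR with boto3. The bucket, script path, instance types, and release label are illustrative assumptions, not details from Wood.

```python
import boto3

# Sketch: launch a short-lived EMR "task cluster" that runs one Spark step
# and tears itself down when the step finishes. All names are hypothetical.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="adhoc-visualization-prep",
    ReleaseLabel="emr-4.2.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": False,   # auto-terminate when the work is done
    },
    Steps=[{
        "Name": "prepare-visualization-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/prepare_viz.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched task cluster:", response["JobFlowId"])
```

The cluster exists only as long as the job does, which is the point: compute is provisioned for the task, not the other way around.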

With these components in mind, Wood reiterates that big data isn't a question of batch versus real-time processing, but rather of a broad set of tools that lets you handle data in multifaceted ways:

It's not Spark or Hadoop. It's a question of "and," not "or." If you're using Spark, that shouldn't preclude you from using traditional MapReduce in other areas, or Mahout. You get to choose the right tool for the job, versus fitting a square peg into a round hole.

As Wood sees it, "Real-time data processing absolutely has a role going forward, but it's additive to the big data ecosystem."

This echoes something Hadoop creator Doug Cutting said in an interview last week, in response to a question about whether streaming or real-time data processing would displace options like Hadoop:

I don't think there will be any giant shift toward streaming. Rather streaming now joins the suite of processing options that folks have at their disposal. When they need interactive BI, they use Impala; when they need faceted search, they use Solr; and when they need real-time analytics, they use Spark Streaming, etc. Folks will still perform retrospective batch analytics too. A mature user of the platform will likely use all of these.
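
To give Cutting's "suite of processing options" a concrete face, below is a minimal Spark Streaming sketch in the DStream style of that era: it counts events per key over ten-second micro-batches. The socket source and port are placeholder assumptions; in practice the feed would more likely be Kafka or Kinesis.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Minimal Spark Streaming sketch: read a text stream in 10-second
# micro-batches and count occurrences per event key.
sc = SparkContext("local[2]", "ClickstreamCounts")
ssc = StreamingContext(sc, 10)                      # 10-second batch interval

lines = ssc.socketTextStream("localhost", 9999)     # stand-in for Kafka/Kinesis
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda event: (event, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Nothing about this precludes running a nightly batch job against the same data -- which is exactly Cutting's point.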

Hortonworks' Connolly sees a similar future. Hadoop caught on with enterprises as a way to extend the data warehouse and facilitate analytics across existing application siloes at dramatically lower cost. But as customers become more sophisticated, new data sources, new tools, and often the cloud get added to the mix:

If you think of business use cases around the 360 degree view [that consolidates customer or product data across different siloes], that might be on prem. But your machine learning and data discovery might be in the cloud. You might have new data sets like weather data and census data that you may not have already had in your four walls, so you may want to mix that with some of your existing data to do advanced machine learning.

Because the laws of physics prohibit the easy movement of hundreds of terabytes or petabytes of data across the network, Connolly says customers will have Hadoop clusters on prem and on various clouds to be able to do the appropriate analytics wherever the bulk of the data has landed. His term for that is "data gravity." When the newer data sets -- such as weather data, census data, and machine and sensor data -- originate outside the enterprise, the cloud becomes a natural place to do the processing.
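
A quick back-of-the-envelope calculation, using illustrative numbers rather than anything from Connolly, shows why data gravity matters:

```python
# Illustrative arithmetic: moving 500 TB over a dedicated 10 Gbps link.
# Real-world transfers are slower still due to protocol overhead and contention.
terabytes = 500
total_bits = terabytes * 1e12 * 8     # decimal terabytes, 8 bits per byte
link_bps = 10e9                       # 10 Gbps
seconds = total_bits / link_bps
print(f"{seconds / 86400:.1f} days")  # roughly 4.6 days, before any overhead
```

At petabyte scale the same math stretches into weeks, so it is usually cheaper to move the analytics to wherever the data already sits.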

Building in elasticity and scale

Many mistakenly believe big data is simply a matter of massive data volumes, neglecting the more common complexities inherent in the variety and velocity of data. But even volume isn't as simple as some suspect.

In the opinion of Amazon's Wood, the challenge of big data "is not so much about absolute scale of data but rather relative scale of data." That is, a project like the Human Genome Project might start at gigabyte scale, then quickly grow to terabyte and petabyte scale. "Customers will tool for the scale they're currently experiencing," Wood notes, but when the scale makes a step change, enterprises can be caught completely unprepared.

As Wood told me in a previous conversation, "Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on."

In other words, "Enterprises want a platform that graciously allows them to move from one scale to the next and the next. You just can't get this if you drop a huge chunk of change on a data center that is frozen in time."

As an example, Wood pointed to The Weather Channel, which used to report weather on only a couple of million locations every four hours. Now it covers billions of locations and updates every few minutes on AWS, all with 100 percent uptime. In other words, it's not only about big data processing but also about cloud delivery of that data.
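
The step change Wood describes is easy to quantify with rough, illustrative figures (the exact numbers below are assumptions, not reported by The Weather Channel):

```python
# Rough comparison of the old and new update rates (illustrative numbers).
old_locations, old_interval_s = 2_000_000, 4 * 3600       # millions of locations, every 4 hours
new_locations, new_interval_s = 2_000_000_000, 5 * 60     # billions of locations, every few minutes

old_rate = old_locations / old_interval_s    # ~140 updates per second
new_rate = new_locations / new_interval_s    # ~6.7 million updates per second
print(f"{new_rate / old_rate:,.0f}x more update throughput")
```

Under those assumptions, that is a jump of tens of thousands of times in throughput -- the kind of step change a data center "frozen in time" cannot absorb.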

For Hortonworks' Connolly, the flexibility of the cloud is as important as its elastic scalability. "We're starting to see more dev test where you just spin up ad hoc clusters to do your work around a subset of data," he notes.

Particularly in the case of machine learning, he says, you can push up enough data for the machine learning solution to work against, allowing you to create your decision model in the cloud. That model will then be used in a broader application that might be deployed elsewhere.
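
A minimal sketch of that pattern, assuming hypothetical S3 locations and a scikit-learn model: pull a training sample from object storage, build the decision model in the cloud, then serialize it so the broader application can load it wherever it ends up running.

```python
import boto3
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Sketch of "train in the cloud, deploy the model elsewhere."
# Bucket, keys, and column names are hypothetical.
s3 = boto3.client("s3")
s3.download_file("example-bucket", "samples/training_sample.csv", "training_sample.csv")

df = pd.read_csv("training_sample.csv")
model = LogisticRegression()
model.fit(df.drop(columns=["label"]), df["label"])

# Persist the decision model; the downstream application -- on prem or in the
# cloud -- loads this artifact rather than retraining against the full data set.
joblib.dump(model, "decision_model.joblib")
s3.upload_file("decision_model.joblib", "example-bucket", "models/decision_model.joblib")
```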

"The cloud is great for that front end of 'let me prove my concept, let me get some of my initial applications started,'" he adds. "Once that's done, the question becomes, 'Will this move on premise because that's where the bulk of the data is, or will it remain in the cloud?'"

Ultimately, Connolly says, it's not an "all in on cloud" versus "all in on premises" dilemma. In cases where the bulk of the data is created on prem, the analytics will remain on prem. In other use cases, such as stream processing of machine or sensor data, the cloud is a natural starting point.

"Over the next year or two," Connolly believes, "it's going to be an operational discussion around where do you want to spend the cost and where is the data born and where do you want to run the tech. I think it's going to be a connected hybrid experience, period."

However it shapes up, it's clear that most successful big data strategies will incorporate a range of big data technologies running in the cloud.
