Review: Apache Hive brings real-time queries to Hadoop

Hive's SQL-like query language and vastly improved speed on huge data sets make it the perfect partner for an enterprise data warehouse

Apache Hive is a tool built on top of Hadoop for analyzing large, unstructured data sets using a SQL-like syntax, thus making Hadoop accessible to legions of existing BI and corporate analytics researchers. Developed by Facebook engineers and contributed to the Apache Software Foundation as an open source project, Hive is now at the forefront of big data analysis in commercial environments.

Hive, like the rest of the Hadoop ecosystem, is a fast-moving target. This review covers version 0.13, which addresses several shortcomings in previous versions. It also brings a significant speed boost to SQL-like queries across large-scale Hadoop clusters, building on new capabilities for interactive query introduced in prior releases. 

Hive is fundamentally an operational data store that's also suitable for analyzing large, relatively static data sets where query time is not important. Hive makes an excellent addition to an existing data warehouse, but it is not a replacement. Instead, using Hive to augment a data warehouse is a great way to leverage existing investments while keeping up with the data deluge.

A typical data warehouse includes many expensive hardware and software components such as RAID or SAN storage, optimized ETL (extract, transform, load) procedures for cleaning and inserting data, specialized connectors to ERP and other back-end systems, and schemas designed around the questions an enterprise wants to ask such as sales by geography, product, or channel. The warehouse ecosystem is optimized around bringing enriched data to the CPU to answer the classes of questions the schema was designed for.

By contrast, a Hive data store brings together vast amounts of unstructured data -- such as log files, customer tweets, email messages, geo-data, and CRM interactions -- and stores them in their raw form on cheap commodity hardware. Hive allows analysts to project a database-like structure of tables, columns, and rows onto this data and to write SQL-like queries over it. This means that different schemas may be projected over the same data sets, depending on the nature of the query, allowing the user to ask questions that weren't envisioned when the data was gathered.
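
To make the idea of projecting different schemas over the same data concrete, here is a minimal Python sketch. It is not Hive code and does not use any Hive API; the sample log lines, the `project` helper, and both schemas are hypothetical, invented purely for illustration. The raw lines are stored once, and each "schema" is just a pattern plus column names applied at read time.

```python
import re

# Made-up raw log lines, kept exactly as they arrived (no schema at write time).
RAW_LINES = [
    "2014-05-01 10:02:11 GET /products/42 200 us-east",
    "2014-05-01 10:02:15 GET /cart 500 eu-west",
    "2014-05-01 10:03:02 POST /checkout 200 us-east",
]

def project(lines, pattern, columns):
    """Apply a schema (regex + column names) at read time.

    Lines that do not match the pattern are skipped rather than rejected.
    """
    regex = re.compile(pattern)
    rows = []
    for line in lines:
        match = regex.match(line)
        if match:
            rows.append(dict(zip(columns, match.groups())))
    return rows

# Hypothetical schema A: a traffic view (who requested what, and when).
TRAFFIC = (r"(\S+) (\S+) (\S+) (\S+) \S+ \S+", ["day", "time", "method", "path"])

# Hypothetical schema B: an error view over the very same lines.
ERRORS = (r"(\S+) \S+ \S+ \S+ (\d{3}) (\S+)", ["day", "status", "region"])

traffic_rows = project(RAW_LINES, *TRAFFIC)
error_rows = [row for row in project(RAW_LINES, *ERRORS) if row["status"] == "500"]
```

The second view answers a question (errors by region) that the first view never anticipated, without reloading or restructuring the stored data. That is the property the paragraph above describes.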

Hive queries traditionally had high latency, and even small queries could take some time to run because they were transformed into map-reduce jobs and submitted to the cluster to be run in batch mode. This latency wasn't usually a problem, because the overhead for query planning and starting up the map-reduce job was dwarfed by the processing time for the query itself, at least when running on the very large data sets Hive was designed for. However, users soon found that such long-running queries were inconvenient and troublesome to run in a multi-user environment, where a single job could dominate the cluster.
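
As a rough illustration of why even a small query carried batch overhead, the sketch below shows how a grouped count, conceptually something like `SELECT status, COUNT(*) FROM logs GROUP BY status`, breaks down into map, shuffle, and reduce phases. This is a single-process Python toy under stated assumptions: the `ROWS` data and the function names are hypothetical, and it does not reflect how Hive's planner actually generates jobs.

```python
from collections import defaultdict

# Hypothetical rows of a "logs" table: (path, status).
ROWS = [("/cart", 500), ("/cart", 200), ("/checkout", 200), ("/cart", 500)]

def map_phase(rows):
    """Emit one (key, 1) pair per input row, keyed by the GROUP BY column."""
    for _path, status in rows:
        yield status, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(ROWS)))
```

On a real cluster each phase is scheduled, distributed, and written to disk between steps, which is where the fixed startup cost comes from regardless of how little data the query touches.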

InfoWorld Scorecard: Apache Hive 0.13
Scalability (20%): 10.0
Value (10%): 10.0
Management (25%): 7.0
Availability (20%): 8.0
Performance (25%): 7.0
Overall Score (100%): 8.1
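
The overall score is consistent with a weighted average of the category scores. The short Python check below assumes that is how the total is derived; the scorecard itself does not state the formula, and the variable names are illustrative.

```python
# Category scores and percentage weights as listed in the scorecard.
SCORECARD = {
    "Scalability": (10.0, 20),
    "Value": (10.0, 10),
    "Management": (7.0, 25),
    "Availability": (8.0, 20),
    "Performance": (7.0, 25),
}

# Assumption: the overall score is the weight-averaged category score.
total_weight = sum(weight for _score, weight in SCORECARD.values())
overall = sum(score * weight for score, weight in SCORECARD.values()) / total_weight
```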