The data lake is becoming the new data warehouse

Platforms like AWS Lake Formation and Delta Lake point toward a central hub for decision support and AI-driven decision automation

Are data warehouses relevant again, or are they a dying breed?

You’re forgiven if you’re a bit confused on this issue. On the one hand, data warehousing certainly seems to be on a hot streak. As a longtime industry observer, I’ve seen the industry surge in successive waves of innovation and startup activity.

This trend essentially began when the appliance form factor entered the data warehousing mainstream a decade ago, and then gained new momentum several years ago as the market shifted toward the new generation of cloud data warehouses. In the past few years, one cloud data warehouse vendor—Snowflake—has gained an inordinate amount of traction in the marketplace.

The eclipse of the data warehouse

On the other hand, data warehousing keeps getting eclipsed by new industry paradigms, such as big data, machine learning, and artificial intelligence. This trend has fostered the impression that data warehousing is declining as an enterprise IT priority, but in fact most organizations now have at least one and often multiple data warehouses serving various downstream applications.

The persistence of data warehousing as a core enterprise workload is why, several years ago, I felt I had to contribute my thoughts on why the data warehouse is far from dead. It also probably explains why other observers felt they had to redefine the concept of the data warehouse to keep it relevant in the era of data lakes and cloud computing.

Data warehousing as a practice is not only thriving, but is now perceived as a central addressable growth frontier for the cloud computing industry. However, you would be missing much of the action in this space if you focused strictly on those platforms, such as Snowflake, that go to market under this label.

The rise of the data lake

What many call a “data lake” is rapidly evolving into the next-generation data warehouse. For those unfamiliar with the concept, a data lake is a system or repository of multi-structured data that are stored in their natural formats and schemas, usually as object “blobs” or files.

Data lakes usually function as a single store for all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics, and machine learning. They incorporate a distributed file or object store, a machine learning model library, and highly parallelized clusters of processing and storage resources. And rather than enforce a common schema and semantics on the objects they store, data lakes generally apply schema-on-read and use statistical models to extract meaningful correlations and patterns from it all.
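To make schema-on-read concrete, here is a minimal PySpark sketch; the bucket path and field names are hypothetical, and it assumes a Spark environment with S3 access already configured:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    # The raw JSON objects sit in the lake in their natural format; Spark
    # infers a schema only when the data is read, not when it lands.
    events = spark.read.json("s3a://example-lake/raw/clickstream/")  # hypothetical path
    events.printSchema()

    # A different consumer can project the same raw objects another way,
    # without any shared, up-front schema. Field names are invented here.
    starts = (events
              .where(events.event_type == "session_start")
              .select("user_id", "timestamp"))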

None of this is inconsistent with the core Inmon and Kimball concepts that inform most professionals’ approach to data warehousing. Fundamentally, a data warehouse exists to aggregate, retain, and govern officially sanctioned, “single-version-of-the-truth” data records. This concept is agnostic to the specific application domains of the data being managed and to the particular use cases it serves.

If you doubt what I’m saying on that score, just check out this discussion of Bill Inmon’s definition of a data warehouse and this comparison of Inmon’s and Ralph Kimball’s frameworks. The data warehouse is all about data-driven support of decisioning generally, which makes it quite extensible to the new world of AI-driven inferencing.

The next-generation data warehouses 

In the past year, several high-profile industry announcements have signaled a shift in the role of the data warehouse. Although decision support—also known as business intelligence, reporting, and online analytical processing—remains the core use case of most data warehouses, we’re seeing a steady shift toward decision automation. In other words, data warehouses are now supporting the data science pipeline that builds machine learning applications for data-driven inferencing.

The new generation of data warehouses consists, in fact, of data lakes designed, first and foremost, to govern the cleansed, consolidated, and sanctioned data used to build and train machine learning models. At the Amazon re:Invent conference last fall, for example, Amazon Web Services announced AWS Lake Formation. The express purpose of this new managed service is to simplify and accelerate the setup of secure data lakes. However, AWS Lake Formation has all the hallmarks of a cloud data warehouse, though AWS is not calling it that; indeed, the vendor already offers a classic data warehouse, Amazon Redshift, which is oriented toward decision support applications.

AWS Lake Formation looks, walks, and acts like a data warehouse. Indeed, AWS describes it in a way that invites those comparisons: “A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.”

Indeed, AWS presents AWS Lake Formation as a sort of über data warehouse for both decision support and AI-driven decision automation. Specifically, the vendor states that the service is designed to manage data sets that “your users then leverage… with their choice of analytics and machine learning services, like Amazon EMR for Apache Spark, Amazon Redshift, Amazon Athena, Amazon SageMaker, and Amazon QuickSight.”
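Concretely, governing a lake this way boils down to registering storage locations and granting permissions on cataloged data. Here is a hedged sketch using the boto3 Lake Formation client; every bucket, account, role, database, and table name below is invented for illustration:

    import boto3

    lf = boto3.client("lakeformation")  # assumes appropriate IAM credentials

    # Register an S3 location so Lake Formation can govern it
    # (the bucket is hypothetical).
    lf.register_resource(
        ResourceArn="arn:aws:s3:::example-data-lake",
        UseServiceLinkedRole=True,
    )

    # Grant a principal SELECT on a cataloged table, centralizing access
    # control in the lake rather than in each downstream analytics service.
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier":
                   "arn:aws:iam::123456789012:role/AnalystRole"},
        Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
        Permissions=["SELECT"],
    )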

Another case in point is Databricks’ recently announced Delta Lake open-source project. The express purpose of Delta Lake, which is available now under the Apache 2.0 license, is similar to that of AWS Lake Formation: aggregation, cleansing, curation, and governance of data sets maintained in a data lake to support the machine learning pipeline.

Delta Lake sits on top of an existing on-premises or cloud data storage platform that can be accessed from Apache Spark, such as HDFS, Amazon S3, or Microsoft Azure blob storage. Delta Lake stores data in the Parquet format to provide what Databricks refers to as a “transactional storage layer.” Parquet is an open source columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework. On top of it, Delta Lake supports ACID transactions via optimistic concurrency control, along with snapshot isolation, data versioning, rollback, and schema enforcement.
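The pattern is easy to see in a minimal PySpark sketch, assuming a Spark session launched with the Delta Lake package on the classpath; the table path is hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-demo").getOrCreate()

    # Writing in the "delta" format produces Parquet data files plus a
    # transaction log; the log is what supplies the ACID guarantees on
    # top of plain HDFS, S3, or Azure blob storage.
    df = spark.range(0, 5)  # toy stand-in for curated records
    df.write.format("delta").save("/data/lake/events")

    # Readers always see a consistent snapshot of the table, even while
    # other jobs are appending to it.
    spark.read.format("delta").load("/data/lake/events").show()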

One key difference between Delta Lake and AWS Lake Formation is that Delta Lake processes both batch and streaming data in that pipeline. Another is that Delta Lake supports ACID transactions on all that data, enabling multiple simultaneous writes and reads by hundreds of applications. In addition, developers can access earlier versions of each Delta Lake table for auditing or rollback, or to reproduce the results of their MLflow machine learning experiments.
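That time travel capability is exposed as a simple read option. Continuing the hypothetical table from the sketch above, an earlier snapshot can be reloaded to audit a change or to rerun an experiment against its original inputs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

    # Load version 0 of the table, i.e., its state as of the first commit.
    v0 = (spark.read.format("delta")
               .option("versionAsOf", 0)
               .load("/data/lake/events"))
    v0.show()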

At the broadest level, Delta Lake appears to compete with the most widely adopted open source data warehousing project, Apache Hive, though Hive relies exclusively on HDFS-based storage and lacked support for ACID transactions until recently. Announced a year ago, Hive 3 finally brings ACID support to Hadoop-based data warehouses. Hive 3 provides atomicity and snapshot isolation of operations on transactional CRUD (create, read, update, delete) tables using delta files.
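In Hive 3, a full ACID table must be a managed table, stored as ORC, and flagged as transactional. Here is a hedged sketch that issues HiveQL from Python via the pyhive library, assuming a hypothetical HiveServer2 endpoint and a cluster already configured for transactions:

    from pyhive import hive

    conn = hive.Connection(host="hive.example.com", port=10000)  # hypothetical host
    cur = conn.cursor()

    # Full ACID (CRUD) tables in Hive 3: managed, ORC-backed, transactional.
    cur.execute("""
        CREATE TABLE customers (id INT, name STRING)
        STORED AS ORC
        TBLPROPERTIES ('transactional'='true')
    """)

    # Updates and deletes are recorded as delta files and reconciled
    # by background compaction.
    cur.execute("INSERT INTO customers VALUES (1, 'Acme')")
    cur.execute("UPDATE customers SET name = 'Acme Corp' WHERE id = 1")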

The foundation for AI-driven decision automation

What these recent industry announcements—AWS Lake Formation, Delta Lake, and Hive 3—foretell is the day when data lakes become governance hubs for all decision support and decision automation applications, and also for all transactional data applications. For these trends to accelerate, open-source projects such as Hive 3 and Delta Lake will need to gain broader traction among vendors and users.

The term “data warehousing” will probably endure to refer primarily to governed, multi-domain stores of structured data for business intelligence. However, the underlying data platforms will continue to evolve to provide the core data governance foundation for cloud-based artificial intelligence pipelines.

AI, not BI, is driving the evolution of the enterprise data warehouse.
