What is Apache Spark? The big data platform that crushed Hadoop

Fast, flexible, and developer-friendly, Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing, and machine learning.

InfoWorld |

What is Apache Spark? The big data platform that crushed Hadoop — Life of Pix (CC0)

Table of Contents

Apache Spark defined
What is Spark in big data
Spark RDD
Spark SQL
Spark MLlib and MLflow
Structured Streaming
Delta Lake
Pandas API on Spark
Running Apache Spark
Databricks Lakehouse Platform

Apache Spark defined

Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities are key to the worlds of big data and machine learning, which require the marshalling of massive computing power to crunch through large data stores. Spark also takes some of the programming burdens of these tasks off the shoulders of developers with an easy-to-use API that abstracts away much of the grunt work of distributed computing and big data processing.

What is Spark in big data

When people talk about “big data,” they generally refer to the rapid growth of data of all kinds—structured data in database tables, unstructured data in business documents and emails, and semi-structured data in system log files and web pages. Whereas analytics in years past focused on structured data and revolved around the data warehouse, analytics today gleans insights from all kinds of data, and revolves around the data lake. Apache Spark was purpose-built for this new paradigm.

From its humble beginnings in the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing. You’ll find it used by banks, telecommunications companies, games companies, governments, and all of the major tech giants such as Apple, IBM, Meta, and Microsoft.

Spark RDD

At the heart of Apache Spark is the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. Operations on the RDDs can also be split across the cluster and executed in a parallel batch process, leading to fast and scalable parallel processing. Apache Spark turns the user’s data processing commands into a Directed Acyclic Graph, or DAG. The DAG is Apache Spark’s scheduling layer; it determines what tasks are executed on what nodes and in what sequence.

RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more besides. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.

Spark runs in a distributed fashion by combining a driver core process that splits a Spark application into tasks and distributes them among many executor processes that do the work. These executors can be scaled up and down as required for the application’s needs.

Spark SQL

Spark SQL has become more and more important to the Apache Spark project. It is the interface most commonly used by today’s developers when creating applications. Spark SQL is focused on the processing of structured data, using a dataframe approach borrowed from R and Python (in Pandas). But as the name suggests, Spark SQL also provides a SQL2003-compliant interface for querying data, bringing the power of Apache Spark to analysts as well as developers.

Alongside standard SQL support, Spark SQL provides a standard interface for reading from and writing to other datastores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular data stores—Apache Cassandra, MongoDB, Apache HBase, and many others—can be used by pulling in separate connectors from the Spark Packages ecosystem. Spark SQL allows user-defined functions (UDFs) to be transparently used in SQL queries.

Selecting some columns from a dataframe is as simple as this line of code:

citiesDF.select(“name”, “pop”)

Using the SQL interface, we register the dataframe as a temporary table, after which we can issue SQL queries against it:

citiesDF.createOrReplaceTempView(“cities”)
spark.sql(“SELECT name, pop FROM cities”)

Behind the scenes, Apache Spark uses a query optimizer called Catalyst that examines data and queries in order to produce an efficient query plan for data locality and computation that will perform the required calculations across the cluster. Since Apache Spark 2.x, the Spark SQL interface of dataframes and datasets (essentially a typed dataframe that can be checked at compile time for correctness and take advantage of further memory and compute optimizations at run time) has been the recommended approach for development. The RDD interface is still available, but recommended only if your needs cannot be addressed within the Spark SQL paradigm (such as when you must work at a lower level to wring every last drop of performance out of the system).

Spark MLlib and MLflow

Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale. MLlib includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selections, and transformations on any structured dataset. MLlib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. Models can be trained by data scientists in Apache Spark using R or Python, saved using MLlib, and then imported into a Java-based or Scala-based pipeline for production use.

An open source platform for managing the machine learning life cycle, MLflow is not technically part of the Apache Spark project, but it is likewise a product of Databricks and others in the Apache Spark community. The community has been working on integrating MLflow with Apache Spark to provide MLOps features like experiment tracking, model registries, packaging, and UDFs that can be easily imported for inference at Apache Spark scale and with traditional SQL statements.

Structured Streaming

Structured Streaming is a high-level API that allows developers to create infinite streaming dataframes and datasets. As of Spark 3.0, Structured Streaming is the recommended way of handling streaming data within Apache Spark, superseding the earlier Spark Streaming approach. Spark Streaming (now marked as a legacy component) was full of difficult pain points for developers, especially when dealing with event-time aggregations and late delivery of messages.

All queries on structured streams go through the Catalyst query optimizer, and they can even be run in an interactive manner, allowing users to perform SQL queries against live streaming data. Support for late messages is provided by watermarking messages and three supported types of windowing techniques: tumbling windows, sliding windows, and variable-length time windows with sessions.

In Spark 3.1 and later, you can treat streams as tables, and tables as streams. The ability to combine multiple streams with a wide range of SQL-like stream-to-stream joins creates powerful possibilities for ingestion and transformation. Here’s a simple example of creating a table from a streaming source:

val df = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 20)
  .load()

df.writeStream
  .option("checkpointLocation", "checkpointPath")
  .toTable("streamingTable")

spark.read.table("myTable").show()

Structured Streaming, by default, uses a micro-batching scheme of handling streaming data. But in Spark 2.3, the Apache Spark team added a low-latency Continuous Processing mode to Structured Streaming, allowing it to handle responses with impressive latencies as low as 1ms and making it much more competitive with rivals such as Apache Flink and Apache Beam. Continuous Processing restricts you to map-like and selection operations, and while it supports SQL queries against streams, it does not currently support SQL aggregations. In addition, although Spark 2.3 arrived in 2018, as of Spark 3.3.2 in March 2023, Continuous Processing is still marked as experimental.

Structured Streaming is the future of streaming applications with the Apache Spark platform, so if you’re building a new streaming application, you should use Structured Streaming. The legacy Spark Streaming APIs will continue to be supported, but the project recommends porting over to Structured Streaming, as the new method makes writing and maintaining streaming code a lot more bearable.

Delta Lake

Like MLflow, Delta Lake is technically a separate project from Apache Spark. Over the past couple of years, however, Delta Lake has become an integral part of the Spark ecosystem, forming the core of what Databricks calls the Lakehouse Architecture. Delta Lake augments cloud-based data lakes with ACID transactions, unified querying semantics for batch and stream processing, and schema enforcement, effectively eliminating the need for a separate data warehouse for BI users. Full audit history and scalability to handle exabytes of data are also part of the package.

And using the Delta Lake format (built on top of Parquet files) within Apache Spark is as simple as using the delta format:

df = spark.readStream.format("rate").load()

stream = df 
  .writeStream
  .format("delta") 
  .option("checkpointLocation", "checkpointPath") 
  .start("deltaTable")

Pandas API on Spark

The industry standard for data manipulation and analysis in Python is the Pandas library. With Apache Spark 3.2, a new API was provided that allows a large proportion of the Pandas API to be used transparently with Spark. Now data scientists can simply replace their imports with import pyspark.pandas as pd and be somewhat confident that their code will continue to work, and also take advantage of Apache Spark’s multi-node execution. At the moment, around 80% of the Pandas API is covered, with a target of 90% coverage being aimed for in upcoming releases.

Running Apache Spark

At a fundamental level, an Apache Spark application consists of two main components: a driver, which converts the user’s code into multiple tasks that can be distributed across worker nodes, and executors, which run on those worker nodes and execute the tasks assigned to them. Some form of cluster manager is necessary to mediate between the two.

Out of the box, Apache Spark can run in a stand-alone cluster mode that simply requires the Apache Spark framework and a Java Virtual Machine on each node in your cluster. However, it’s more likely you’ll want to take advantage of a more robust resource management or cluster management system to take care of allocating workers on demand for you.

In the enterprise, this historically meant running on Hadoop YARN (YARN is how the Cloudera and Hortonworks distributions run Spark jobs), but as Hadoop has become less entrenched, more and more companies have turned toward deploying Apache Spark on Kubernetes. This has been reflected in the Apache Spark 3.x releases, which improve the integration with Kubernetes including the ability to define pod templates for drivers and executors and use custom schedulers such as Volcano.

If you seek a managed solution, then Apache Spark offerings can be found on all of the big three clouds: Amazon EMR, Azure HDInsight, and Google Cloud Dataproc.

Databricks Lakehouse Platform

Databricks, the company that employs the creators of Apache Spark, has taken a different approach than many other companies founded on the open source products of the Big Data era. For many years, Databricks has offered a comprehensive managed cloud service that offers Apache Spark clusters, streaming support, integrated web-based notebook development, and proprietary optimized I/O performance over a standard Apache Spark distribution. This mixture of managed and professional services has turned Databricks into a behemoth in the Big Data arena, with a valuation estimated at $38 billion in 2021. The Databricks Lakehouse Platform is now available on all three major cloud providers and is becoming the de facto way that most people interact with Apache Spark.

Apache Spark tutorials

Ready to dive in and learn Apache Spark? We recommend starting with the Databricks learning portal, which will provide a good introduction to the framework, although it will be slightly biased towards the Databricks Platform. For diving deeper, we’d suggest the Spark Workshop, which is a thorough tour of Apache Spark’s features through a Scala lens. Some excellent books are available too. Spark: The Definitive Guide is a wonderful introduction written by two maintainers of Apache Spark. And High Performance Spark is an essential guide to processing data with Apache Spark at massive scales in a performant way. Happy learning!

Next read this:

Ian Pointer is a senior big data and deep learning architect, working with Apache Spark and PyTorch. He has more than 15 years of development and operations experience.