Why you should use Gandiva for Apache Arrow

An execution engine for Arrow-based in-memory processing, Gandiva brings dramatic performance improvements to analytical workloads


Over the past three years Apache Arrow has exploded in popularity across a range of different open source communities. In the Python community alone, Arrow is being downloaded more than 500,000 times a month. The Arrow project is both a specification for representing data in a highly efficient way for in-memory analytics and a set of libraries in a dozen languages for operating on the Arrow columnar format.

In the same way that most automobile manufacturers OEM their transmissions instead of designing and building their own, Arrow provides an optimal way for projects to manage and operate on data in-memory for diverse analytical workloads, including machine learning, artificial intelligence, data frames, and SQL engines.

The Gandiva initiative for Apache Arrow is a new execution kernel for Arrow that is based on LLVM. Gandiva provides significant performance improvements for low-level operations on Arrow buffers. We first included this work in Dremio to improve the efficiency and performance of analytical workloads on our platform, which will become available to users with Dremio 3.0. In this post I will describe the motivation for the initiative, implementation details, some performance results, and some plans for the future.

A note on the name: Gandiva is a mythical bow, from the Indian epic The Mahabharata, used by the hero Arjuna. According to the story, Gandiva is indestructible, and it makes the arrows it fires a thousand times more powerful.

Understanding Apache Arrow

Data, databases, and data analysis have changed dramatically over the past few decades. There are a number of reasons behind these changes:

  • Organizations have increasingly complex requirements for analyzing and using data, and increasingly high standards for query performance.
  • Memory has become inexpensive, enabling a new set of performance strategies based on in-memory analysis.
  • CPUs and GPUs have increased in performance, but have also evolved to optimize the processing of data in parallel.
  • New types of databases have emerged for different use cases, each with its own way of storing and indexing data.
  • New disciplines have emerged, including data engineering and data science, each with dozens of new tools to achieve specific analytical goals.
  • Columnar data representations have become mainstream for analytical workloads because they provide significant advantages in terms of speed and efficiency.

With these trends in mind, a clear opportunity has emerged for a modern, open standard covering both in-memory representation and related low-level data processing libraries that every engine can use. This standard should take advantage of all the new performance strategies that are now available, and should make sharing of data across platforms seamless and efficient. This is the goal of Apache Arrow.

To use an analogy, consider traveling to Europe on vacation before the EU. To visit five countries in seven days, you could count on the fact that you were going to spend a few hours at the border for passport control, and you were going to lose some of your money in the currency exchange.

This is what working with in-memory data is like without Arrow. Enormous inefficiencies arise from serializing and deserializing data structures, and a copy is made in the process, wasting precious memory and CPU/GPU resources. In contrast, Arrow is like visiting Europe after the EU and the Euro. You don’t wait at the border, and one currency is used everywhere.

Arrow combines the benefits of columnar data structures with in-memory computing. It provides the performance benefits of these modern techniques while also providing the flexibility of complex data and dynamic schemas. And it does all of this in an open and standardized way.

It’s important to understand that Apache Arrow itself is not a storage or execution engine. Apache Arrow is a library, as opposed to a system that you install, and it’s designed to serve as a shared foundation for the following types of systems:

  • Data analysis systems (such as Pandas and Spark)
  • Streaming and queueing systems (such as Kafka and Storm)
  • Storage systems (such as Parquet, Kudu, Cassandra, and HBase)
  • SQL execution engines (such as Drill and Impala)

Arrow consists of a number of connected technologies designed to be integrated into storage and execution engines. The key components of Arrow include:

  • Defined data type sets: Both SQL and JSON types, such as Int, BigInt, Decimal, VarChar, Map, Struct and Array.
  • Canonical representations: Columnar in-memory representations of data to support an arbitrarily complex record structure built on top of the defined data types.
  • Common data structures: Arrow-aware companion data structures including pick lists, hash tables, and queues.
  • Inter-process communication: Mechanisms for exchanging data between processes over shared memory, TCP/IP, and RDMA.
  • Data libraries: Libraries for reading and writing columnar data in multiple languages including Java, C++, Python, Ruby, Rust, Go, and JavaScript.
  • Pipeline and SIMD algorithms: Algorithms for various operations including bitmap selection, hashing, filtering, bucketing, sorting, and matching.
  • Columnar in-memory compression: A collection of techniques to increase memory efficiency.
  • Memory persistence tools: Tools for short-term persistence through non-volatile memory, SSD, or HDD.
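
To make the data libraries concrete, here is a minimal sketch using the Arrow C++ builder API. This is a hypothetical example (the function name is invented, and error checks on the returned arrow::Status values are omitted for brevity):

#include <arrow/api.h>

#include <memory>

// A minimal sketch using the Arrow C++ data library: build a nullable int32
// column. The finished array keeps its values and its validity bitmap in
// contiguous columnar buffers that any Arrow-aware engine can consume
// without copying or conversion.
std::shared_ptr<arrow::Array> MakeColumn() {
  arrow::Int32Builder builder;
  builder.Append(1);     // each call returns an arrow::Status; checks omitted here
  builder.AppendNull();  // records a null by clearing the slot's validity bit
  builder.Append(3);
  std::shared_ptr<arrow::Array> array;
  builder.Finish(&array);
  return array;
}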

Gandiva for Apache Arrow

Earlier this year the team at Dremio open sourced Gandiva for Apache Arrow. This is a new execution kernel for Arrow that is based on LLVM. Gandiva provides significant performance improvements for low-level operations on Arrow buffers.

Gandiva is a new open source project licensed under the Apache license and developed in the open on GitHub. It is provided as a standalone C++ library for efficient evaluation of arbitrary expressions on Arrow buffers using runtime code-generation in LLVM. Applications submit an expression tree to the Gandiva compiler, which compiles for the local runtime environment. The request is then handed to the Gandiva execution kernel, which consumes and produces batches of Arrow buffers.
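
To make that flow concrete, here is a minimal sketch against Gandiva’s C++ API. The schema, column names, and wrapper function are invented for illustration, and a real engine would compile the projector once and reuse it across many batches:

#include <arrow/api.h>
#include <gandiva/projector.h>
#include <gandiva/tree_expr_builder.h>

// Evaluate "x + y" over one record batch using a Gandiva-compiled kernel.
arrow::Status ProjectSum(const std::shared_ptr<arrow::RecordBatch>& batch,
                         arrow::ArrayVector* outputs) {
  // Describe the incoming Arrow data: two int32 columns, x and y.
  auto field_x = arrow::field("x", arrow::int32());
  auto field_y = arrow::field("y", arrow::int32());
  auto schema = arrow::schema({field_x, field_y});

  // Build the expression tree for "x + y" with an int32 result.
  auto sum_node = gandiva::TreeExprBuilder::MakeFunction(
      "add",
      {gandiva::TreeExprBuilder::MakeField(field_x),
       gandiva::TreeExprBuilder::MakeField(field_y)},
      arrow::int32());
  auto expr = gandiva::TreeExprBuilder::MakeExpression(
      sum_node, arrow::field("sum", arrow::int32()));

  // Compile the expression for the local runtime environment.
  std::shared_ptr<gandiva::Projector> projector;
  ARROW_RETURN_NOT_OK(gandiva::Projector::Make(schema, {expr}, &projector));

  // The execution kernel consumes a batch of Arrow buffers and
  // produces output arrays.
  return projector->Evaluate(*batch, arrow::default_memory_pool(), outputs);
}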

[Figure: Gandiva architecture. Image: Dremio]

Gandiva Expression Library

As of August 2018, Gandiva supports hundreds of expressions across the filter and project relational operators, including Arrow math, boolean, and date/time functions. Many more expressions are planned for future releases of the project.
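
As a hedged sketch of the filter operator (again with an invented column name and wrapper function), Gandiva compiles a boolean condition and fills a selection vector with the ordinals of the matching records:

#include <arrow/api.h>
#include <gandiva/filter.h>
#include <gandiva/selection_vector.h>
#include <gandiva/tree_expr_builder.h>

// Evaluate the condition "x > 10" over one record batch.
arrow::Status FilterBatch(const std::shared_ptr<arrow::RecordBatch>& batch,
                          std::shared_ptr<gandiva::SelectionVector>* selected) {
  auto field_x = arrow::field("x", arrow::int32());
  auto schema = arrow::schema({field_x});

  // Condition tree: greater_than(x, 10).
  auto condition = gandiva::TreeExprBuilder::MakeCondition(
      gandiva::TreeExprBuilder::MakeFunction(
          "greater_than",
          {gandiva::TreeExprBuilder::MakeField(field_x),
           gandiva::TreeExprBuilder::MakeLiteral(10)},
          arrow::boolean()));

  // Compile the filter once; reuse it for every incoming batch.
  std::shared_ptr<gandiva::Filter> filter;
  ARROW_RETURN_NOT_OK(gandiva::Filter::Make(schema, condition, &filter));

  // The selection vector receives the positions of the matching records.
  ARROW_RETURN_NOT_OK(gandiva::SelectionVector::MakeInt16(
      batch->num_rows(), arrow::default_memory_pool(), selected));
  return filter->Evaluate(*batch, *selected);
}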

Null decomposition and pipelining in Gandiva

[Figure: Null decomposition in Gandiva. Image: Dremio]

Another optimization implemented in Gandiva is something we call null decomposition. Expressions submitted to Gandiva might involve records with null values. For many operations, we know that the result of an expression involving a null value is always null. By separating whether a value is null (validity) from the actual value (data), we can process record batches much more efficiently.

With this optimization, we can determine nullness using bitmap intersections, which significantly reduces CPU branching overhead. Data values can then be batched and submitted for efficient SIMD processing on modern CPUs. It turns out these semantics are very common in SQL-based processing, so this optimization is very effective, especially for SIMD and GPU-based processing.
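
To illustrate the idea, here is a simplified model (not Gandiva’s actual internals): validity is combined a machine word at a time while the data values are computed unconditionally.

#include <cstddef>
#include <cstdint>

// Simplified model of null decomposition. Validity lives in bitmaps, one
// bit per record, separate from the data. For "x + y", a result is valid
// only if both inputs are valid, so the output bitmap is a branch-free
// bitwise AND that handles 64 records per iteration.
void IntersectValidity(const uint64_t* x_valid, const uint64_t* y_valid,
                       uint64_t* out_valid, size_t num_words) {
  for (size_t i = 0; i < num_words; ++i) {
    out_valid[i] = x_valid[i] & y_valid[i];
  }
}

// Data values are computed for every slot with no per-record null checks;
// slots whose validity bit is 0 hold don't-care values that readers ignore.
// The tight, branch-free loop is exactly what SIMD units execute best.
void AddData(const int32_t* x, const int32_t* y, int32_t* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = x[i] + y[i];
  }
}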

Vectorization and SIMD in Gandiva

[Figure: Vectorization in Gandiva. Image: Dremio]

Arrow memory buffers are already well organized for SIMD instructions. Depending on the hardware, Gandiva can process larger batches of values as a single operation. For example, as shown in the diagram above, you might have two sets of two-byte values that need to be combined in an operation.

For CPUs with AVX-128 instructions, Gandiva can process eight pairs of these two-byte values in a single vectorized operation. Newer processors with AVX-512 can use Gandiva to process four times as many values in a single operation. This optimization is performed automatically, and many others are possible.
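
Here is a sketch of what that vectorized add looks like written with 128-bit SIMD intrinsics. Gandiva achieves the equivalent effect automatically through LLVM; the hand-written intrinsics below just make the mechanics visible:

#include <immintrin.h>

#include <cstddef>
#include <cstdint>

// Add two arrays of two-byte (int16) values. Each 128-bit register holds
// eight int16 lanes, so every _mm_add_epi16 combines eight pairs at once.
void AddInt16Simd(const int16_t* a, const int16_t* b, int16_t* out, size_t n) {
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi16(va, vb));
  }
  for (; i < n; ++i) {
    out[i] = a[i] + b[i];  // scalar tail for the leftover values
  }
}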

Asynchronous thread control in Gandiva

In analytics, workloads are typically bound first by memory capacity and then by CPU throughput: once you have enough memory, the CPU becomes your bottleneck. Historically, systems have shared the CPU by creating more threads, delegating resource balancing to the operating system's thread scheduler. There are three challenges with this approach:

  • You have weak or no control over resource priorities.
  • As concurrency increases, context switching dominates CPU utilization.
  • High-priority cluster coordination operations, such as heartbeats, may miss their scheduling deadlines and destabilize cluster cohesion.

Gandiva has been designed to allow for more flexibility in how resources are allocated to each request. For example, imagine you have three different users trying to use the system concurrently as in the figure below. Assuming 11 intervals of time for a single core, you might want to allocate resources to each operation differently.

[Figure: Asynchronous threading in Gandiva. Image: Dremio]

User 1 is allocated a single interval of time, whereas User 3 (e.g., a “premium” user) is allocated significantly more resources than User 1 or User 2. Because Dremio processes jobs asynchronously, it can periodically revisit each thread to rebalance available resources given the priorities of jobs running in the system. The timing is based on a quantum we define, which is designed to ensure that the majority of processor time goes to forward progress instead of context switches.

Gandiva works well with this model, allowing the system to operate asynchronously. It does this by performing small units of work and then suspending, returning control to the calling code. This pattern allows Gandiva to be used in traditional synchronous engines as well as more powerful asynchronous engines.
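
A hedged sketch of that calling pattern from the engine’s side (Source, Sink, and Scheduler are invented engine types, not Gandiva APIs):

#include <memory>

#include <arrow/api.h>
#include <gandiva/projector.h>

// Each Evaluate() call processes one record batch -- a small, bounded unit
// of work -- after which control returns to the caller, so the engine can
// suspend the task and hand the core to a higher-priority job.
template <typename Source, typename Sink, typename Scheduler>
void RunProjection(Source& source, Sink& sink, Scheduler& scheduler,
                   const std::shared_ptr<gandiva::Projector>& projector) {
  std::shared_ptr<arrow::RecordBatch> batch;
  while ((batch = source.NextBatch()) != nullptr) {
    arrow::ArrayVector outputs;
    projector->Evaluate(*batch, arrow::default_memory_pool(), &outputs);
    sink.Consume(outputs);
    scheduler.YieldIfQuantumExpired();  // cooperative suspension point
  }
}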

Gandiva language support

We built Gandiva to be compatible with many different environments. C++ and Java bindings are already available for users today. We hope to work with the community to produce bindings for many other languages such as Python, Go, JavaScript, and Ruby.

[Figure: Language support in Gandiva. Image: Dremio]

To make new primitives available for use in Gandiva, the core C++ library exposes a number of capabilities that are language independent (including a consistent cross-language representation of expression trees). From there, all data is expected in the Arrow format, building on the ability of Arrow to be used across these different languages.

Performance improvements with Gandiva

Generally speaking, Gandiva provides performance advantages across the board with few compromises. First, Gandiva reduces the time to compile most queries to less than 10 milliseconds. Second, in Dremio, Gandiva improves the performance of creating and maintaining Data Reflections, which Dremio’s query planner can use to accelerate queries by orders of magnitude through more intelligent query plans that involve less work.

To assess the benefits of Gandiva, we compared the performance of SQL queries executed through Dremio using standard Java code generation vs. compiling the queries through Gandiva. Note that the performance of Dremio using existing Java-based query compilation is on par with state-of-the-art SQL execution engines.

We selected five simple expressions and recorded the expression evaluation time to process a JSON data set of 500 million records. The tests were run on a Mac desktop (2.7GHz quad-core Intel Core i7 with 16GB of RAM).

In general, the more complex the SQL expression, the greater the advantage of using Gandiva.

Sum

SELECT max(x+N2x+N3x) FROM json.d500

Five output columns

SELECT
sum(x + N2x + N3x),
sum(x * N2x - N3x),
sum(3 * x + 2 * N2x + N3x),
count(x >= N2x - N3x),
count(x + N2x = N3x)
FROM json.d500

Ten output columns

SELECT
sum(x + N2x + N3x),
sum(x * N2x - N3x),
sum(3 * x + 2 * N2x + N3x),
count(x >= N2x - N3x),
count(x + N2x = N3x),
sum(x - N2x + N3x),
sum(x * N2x + N3x),
sum(x + 2 * N2x + 3 * N3x),
count(x <= N2x - N3x),
count(x = N3x - N2x)
FROM json.d500

Case-10

SELECT count(case
when x < 1000000 then x/1000000 + 0
when x < 2000000 then x/2000000 + 1
when x < 3000000 then x/3000000 + 2
when x < 4000000 then x/4000000 + 3
when x < 5000000 then x/5000000 + 4
when x < 6000000 then x/6000000 + 5
when x < 7000000 then x/7000000 + 6
when x < 8000000 then x/8000000 + 7
when x < 9000000 then x/9000000 + 8
when x < 10000000 then x/10000000 + 9
else 10
end)
FROM json.d500

Case-100

Case-100 is similar to Case-10 but with 100 cases and three output columns.

The results of the five tests are shown in the table below. As you can see, Gandiva accelerated even the simpler queries by 4x or better. For the most complex query, Case-100, Gandiva delivered a 90x performance boost.

Gandiva optimizes in-memory processing

Gandiva is a new approach for performing optimized processing of Arrow data structures, making data processing on these structures even more efficient. This is work that each application that implements Arrow would otherwise need to implement on its own, reinventing the wheel. Gandiva gives these apps a speed boost for free.
