Accelerate Spark SQL Queries with Gluten

A New Middle Layer to Offload Spark SQL Queries to Native Engines

Weiting Chen
Intel Analytics Software

--

Intel and Kyligence introduced Gluten, a new open-source project, at the beginning of 2022. At this year's Data + AI Summit, Intel and Kyligence gave an introductory session on Project Gluten. Gazelle, the predecessor to Gluten, was an Intel-developed, Apache Arrow-based native SQL engine for offloading Spark SQL. Gluten replaces that single engine with support for multiple native backends, including the Meta-led Velox vectorized execution engine and a ClickHouse-based execution engine backed by Kyligence. With Gluten and the Velox backend, Apache Spark users can expect performance gains and higher resource utilization.

Apache Spark is a stable, mature project that has been under development for many years. It has proven to be one of the best frameworks for processing petabyte-scale datasets. However, the Spark community has had to address performance challenges through various optimizations over time. A key optimization, introduced in Spark 2.0, replaced the Volcano execution model with whole-stage code generation to achieve a 2x speedup. Most of these optimizations work at the query plan level, but there is a need to address query performance more broadly. This motivated Intel to initiate the Gazelle project, which unleashes Intel Advanced Vector Extensions (Intel AVX) SIMD instructions inside a vectorized SQL engine and lets Apache Spark break through its row-based data processing and JVM limitations.

A key deficiency of Gazelle was its limited community participation, which meant the development burden fell largely on Intel. In the meantime, several vectorized SQL engines with more active open-source communities have emerged. Among these, the Meta-led Velox project is a rising star, providing a vectorized database acceleration library. Popular as these vectorized engines are, before Gluten there was no way to use them from Apache Spark.

Gluten is Latin for glue, and Intel expects Gluten to connect Apache Spark to vectorized SQL engines and libraries (Figure 1). Integrating a vectorized SQL engine or shared library with Spark opens up numerous optimization opportunities, such as offloading functions and operators to a vectorized library, introducing just-in-time compilation engines, and enabling the use of hardware accelerators (e.g., GPUs and FPGAs). Project Gluten allows users to take advantage of these software and hardware innovations from within Spark.

Figure 1. Gluten Layout

The Gluten Plugin

Project Gluten reuses much of the Spark framework: resource scheduling, the Catalyst logical plan optimizer, and so on (Figure 2). Beyond that, Gluten adds several new components that handle the APIs between the JVM and the native libraries.

Figure 2. Gluten Components
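Before looking at the individual components, here is a rough sketch of how a user might enable the plugin when building a Spark session. The option names (the plugin class, the backend selector, and the off-heap settings) follow the Gluten repository at the time of writing and may differ between versions, so treat this as an illustration rather than a definitive configuration.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: option names follow the Gluten README at the time of writing
// and may change between releases; check the repository for your version.
val spark = SparkSession.builder()
  .appName("gluten-velox-example")
  // Load Gluten as a Spark plugin.
  .config("spark.plugins", "io.glutenproject.GlutenPlugin")
  // Select the native backend (Velox here; ClickHouse is the other option).
  .config("spark.gluten.sql.columnar.backend.lib", "velox")
  // Native execution uses off-heap memory, so Spark's off-heap mode must be enabled.
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8g")
  .getOrCreate()

// Queries submitted through this session are planned as usual; supported operators
// are offloaded to the native engine, the rest fall back to Spark's JVM operators.
spark.sql("SELECT 1").show()
```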

Plan Conversion

Gluten uses Substrait (substrait.io) to build the query plan tree. Gluten converts Spark's physical plan into a Substrait plan for each backend, then passes the Substrait plan over JNI to trigger the execution pipeline in the native library.
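A minimal sketch of that flow is shown below. The types and method names are hypothetical, not Gluten's actual classes; the point is only to illustrate the contract: serialize the physical plan into Substrait protobuf bytes on the JVM side, then hand those bytes to the native library through JNI.

```scala
// Hypothetical sketch: names such as PlanNode, SubstraitConverter, and NativeBridge
// are illustrative and do not correspond to Gluten's real API.
final case class PlanNode(operator: String, children: Seq[PlanNode] = Nil)

trait SubstraitConverter {
  /** Serialize a physical plan fragment into Substrait protobuf bytes. */
  def toSubstrait(root: PlanNode): Array[Byte]
}

object NativeBridge {
  // In Gluten this call crosses the JNI boundary into the chosen backend;
  // here it is stubbed so the sketch is self-contained.
  def createPipeline(substraitPlan: Array[Byte]): Unit =
    println(s"native pipeline created from ${substraitPlan.length} bytes of Substrait")
}

def offload(root: PlanNode, converter: SubstraitConverter): Unit =
  NativeBridge.createPipeline(converter.toSubstrait(root))
```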

Fallback Processing

Gluten leverages the existing Spark JVM engine as a safety net: it checks whether each operator is supported by the native library and, if not, falls back to the existing Spark JVM-based operator. This fallback mechanism comes at the cost of columnar-to-row and row-to-columnar data conversions at the boundary.
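The sketch below illustrates the idea; the validation function and operator names are made up, not Gluten's real API. Each operator is validated against the native backend, and anything unsupported stays on the JVM, paying a columnar-to-row (C2R) and row-to-columnar (R2C) transition at the boundary.

```scala
// Illustrative sketch of the fallback decision; not Gluten's actual validation API.
sealed trait Operator { def name: String }
final case class NativeFriendly(name: String) extends Operator // e.g. scan, filter, hash agg
final case class JvmOnly(name: String) extends Operator        // e.g. an unsupported UDF

def nativeSupports(op: Operator): Boolean = op match {
  case _: NativeFriendly => true
  case _                 => false
}

def describeExecution(op: Operator): String =
  if (nativeSupports(op)) s"${op.name}: offloaded to the native library"
  else s"${op.name}: C2R -> Spark JVM operator -> R2C (fallback with conversion cost)"
```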

Memory Management

Gluten leverages Spark's existing memory management system. It calls Spark's memory registration API for every native memory allocation and deallocation. Spark manages the memory budget for each task thread; if a thread needs more memory than is available, Spark can call the spill interface of operators that support spilling. Spark's memory management system also protects against memory leaks and out-of-memory issues.
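A simplified sketch of that contract follows; the class names are hypothetical, but the shape matches the description above: every native allocation and deallocation is reported to a per-task budget, and spill-capable operators are asked to release memory before the task fails.

```scala
// Hypothetical sketch of per-task native memory accounting; not Gluten's real classes.
trait Spillable {
  /** Ask the operator to release up to `bytes` of native memory; returns bytes actually freed. */
  def spill(bytes: Long): Long
}

final class TaskNativeMemory(budgetBytes: Long) {
  private var used = 0L
  private var spillers = List.empty[Spillable]

  def registerSpillable(s: Spillable): Unit = spillers ::= s

  /** Called for every native allocation, mirroring Spark's memory registration API. */
  def acquire(bytes: Long): Unit = {
    var shortfall = used + bytes - budgetBytes
    if (shortfall > 0) {
      // Over budget: ask spill-capable operators to shrink before giving up.
      spillers.foreach { s => if (shortfall > 0) shortfall -= s.spill(shortfall) }
      if (shortfall > 0) throw new OutOfMemoryError("native memory budget exceeded")
    }
    used += bytes
  }

  /** Called for every native deallocation. */
  def release(bytes: Long): Unit = used = math.max(0L, used - bytes)
}
```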

Columnar Shuffle

Gluten reuses its predecessor Gazelle's Apache Arrow-based Columnar Shuffle Manager as the default shuffle manager. A third-party library is responsible for transforming data from the native format to Arrow. Alternatively, developers are free to implement their own shuffle manager.
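In a deployment, selecting the columnar shuffle manager is a configuration choice, roughly as sketched below. The class name follows the Gazelle/Gluten documentation at the time of writing; verify it against the repository for the Gluten version you deploy.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: the shuffle manager class name may differ between Gluten releases.
val session = SparkSession.builder()
  .config("spark.shuffle.manager",
          "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
  .getOrCreate()
```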

Shim Layer

To fully integrate with Spark, Gluten includes a shim layer whose role is to support multiple versions of Spark. Gluten supports Spark 3.x and beyond.
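Conceptually, the shim layer looks like the sketch below: one interface, one implementation per Spark release line, chosen at runtime from the version string. The names are illustrative, not Gluten's actual shim classes.

```scala
// Hypothetical sketch of a shim layer; names are illustrative.
trait SparkShims {
  def versionPrefix: String
  /** Placeholder for behaviour that differs between Spark releases. */
  def describe(): String
}

object Spark31Shims extends SparkShims {
  val versionPrefix = "3.1"
  def describe(): String = "shims for the Spark 3.1 line"
}

object Spark32Shims extends SparkShims {
  val versionPrefix = "3.2"
  def describe(): String = "shims for the Spark 3.2 line"
}

/** Pick the shim implementation that matches the running Spark version. */
def loadShims(sparkVersion: String): SparkShims =
  Seq(Spark32Shims, Spark31Shims)
    .find(s => sparkVersion.startsWith(s.versionPrefix))
    .getOrElse(throw new IllegalStateException(s"unsupported Spark version: $sparkVersion"))
```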

Metrics

Gluten supports Spark's metrics functionality. The default Spark metrics serve Java row-based data processing. Project Gluten extends them with a column-based API and additional metrics that facilitate the use of Gluten and give developers a means of debugging the native libraries.
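For example, a native operator can surface backend-side counters through Spark's standard SQL metrics so they appear in the Spark UI, roughly as below. The metric names here are made up for illustration; Gluten's real metric set lives in its operator implementations.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

// Illustrative metric names for a hypothetical native scan operator.
def nativeScanMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
  "numOutputRows"    -> SQLMetrics.createMetric(sc, "number of output rows"),
  "numOutputBatches" -> SQLMetrics.createMetric(sc, "number of output columnar batches"),
  "nativeWallTime"   -> SQLMetrics.createTimingMetric(sc, "time spent in the native engine")
)
```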

Intel is working closely with Meta and the broader Velox community. At this point, the Gluten-Velox integration has implemented the functions and operators necessary to execute decision-support queries. Gluten with the Velox backend achieves an average speedup of 2.07x, and up to 8.1x on individual queries (Figure 3).

Figure 3. Gluten + Velox backend Performance

Want to join our project? We have big things planned. Like Velox, Gluten is an open-source project with a vibrant developer community. Beyond Gluten's current focus, the project is working toward running industrial-strength decision-support benchmarks such as TPC-H and TPC-DS, as well as customer-specific queries. There is also interest in extending Gluten to support hardware accelerators such as GPUs and FPGAs, as well as the secondary memory tiers enabled by Compute Express Link (CXL).

Useful Links

  1. https://github.com/oap-project/gluten
  2. https://github.com/oap-project/gazelle_plugin
  3. https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  4. https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html
  5. https://databricks.com/dataaisummit/session/gazelle-jni-middle-layer-offload-spark-sql-native-engines-execution
  6. https://blog.csdn.net/csdnopensource/article/details/125842940

Hardware and Software Configuration

One node with two 2.4 GHz Intel Xeon Platinum 8360Y processors and 1024 GB (16 slots / 64 GB / 3200) total DDR4 memory, microcode 0xd0002a0, HT on, Turbo on, Ubuntu 20.04.2 LTS, kernel 5.4.0-117-generic, two 372.6 GB INTEL SSDSC2BA40, one 894.3 GB INTEL SSDSC2KG96, five 1.5 TB INTEL SSDPE2KE016T8, two Ethernet controllers (XXV710) for 25 GbE SFP28, two MT27700 Family [ConnectX-4], two Ethernet controllers (10G X550T), JDK 1.8, GCC 9, Spark 3.1.1, Hadoop 2.10. Tested by Intel on 08/25/2022.

Notices & Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site. Performance results are based on testing as of the dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software, or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

--

Weiting Chen
Intel Analytics Software

Weiting is a senior software engineer at Intel. He has experience with big data and cloud solutions, including Spark, Hadoop, OpenStack, and Kubernetes.