Chukonu: A Fully-Featured High-Performance Big Data Framework That Integrates a Native Compute Engine into Spark

Abstract

Apache Spark is a widely deployed big data analytics framework that offers attractive features such as resiliency, load balancing, and a rich ecosystem. However, there is still plenty of room for improvement in its performance. Although a data-parallel system written in a native programming language can significantly improve performance, it may require re-implementing many of Spark's functionalities to become a full-featured system. It is therefore desirable for a native big data system to implement only the compute engine in a native language to ensure high efficiency, and to reuse the other mature features provided by Spark rather than re-implement everything. But the interaction between the JVM and the native world risks becoming a bottleneck. This paper proposes Chukonu, a native big data framework that reuses critical big data features provided by Spark. Owing to our novel DAG-splitting approach, the potential Spark integration overhead is alleviated, and Chukonu even outperforms existing pure-native big data frameworks. Chukonu splits DAG programs into run-time parts and compile-time parts: the run-time parts are delegated to Spark to offload the complexities of feature implementation, while the compile-time parts are natively compiled. We propose a series of optimization techniques applied to the compile-time parts, such as operator fusion, vectorization, and compaction, to significantly reduce the Spark integration overhead. Our evaluation shows that Chukonu achieves a speedup of up to 71.58X (geometric mean 6.09X) over Apache Spark, and up to 7.20X (geometric mean 2.30X) over pure-native frameworks on six commonly used big data applications. By translating the physical plan produced by SparkSQL into Chukonu programs, Chukonu accelerates SparkSQL's TPC-DS performance by 2.29X.
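
As a rough illustration of what operator fusion means in general, the sketch below contrasts a chain of three operators that each materialize an intermediate collection with a single fused loop that computes the same result in one pass. This is not Chukonu's implementation or API; the function names `unfused` and `fused` and the specific map/filter operators are hypothetical, chosen only to show the idea.

```python
from typing import Iterable, List


def unfused(data: List[int]) -> List[int]:
    """Apply map -> filter -> map, materializing a list after each operator."""
    doubled = [x * 2 for x in data]            # operator 1: map
    kept = [x for x in doubled if x % 3 == 0]  # operator 2: filter
    return [x + 1 for x in kept]               # operator 3: map


def fused(data: Iterable[int]) -> List[int]:
    """Same computation as `unfused`, fused into one loop with no intermediates."""
    out: List[int] = []
    for x in data:
        y = x * 2          # operator 1
        if y % 3 == 0:     # operator 2
            out.append(y + 1)  # operator 3
    return out


if __name__ == "__main__":
    sample = list(range(10))
    # Both versions must agree; fusion changes the execution shape, not the result.
    print(unfused(sample) == fused(sample))
```

In a setting where each operator boundary implies a crossing between the JVM and native code, fusing a chain into one unit would also reduce the number of such crossings, which is the kind of overhead the abstract describes.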

Publication
Proc. VLDB Endow.
Wenguang Chen
Professor