How to Enable WholeStageCodeGen in Spark 2.0

First, what is WholeStageCodeGen?

It is essentially a code-generation technique, based on Thomas Neumann’s seminal VLDB 2011 paper, that lets Spark achieve the performance of hand-written code. Hand-written code is written specifically to run one query and nothing else, so it can take advantage of all the information that is known, leading to optimized code that eliminates virtual function dispatches, keeps intermediate data in CPU registers, and can be further optimized by the underlying hardware. This is available only from Spark 2.0 onward.

Example: I am going to sum roughly 1 billion numbers in Spark 2.0 with WholeStageCodeGen enabled (set to true).

This was run on my Mac with the configuration below.

[Screenshot: machine configuration]

The code is below:
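(The exact listing isn’t reproduced here, so this is a minimal sketch of the driver; the object name, local[1] master, and timing code are illustrative, not the original source.)

import org.apache.spark.sql.SparkSession

object SumBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WholeStageCodeGen sum benchmark")
      .master("local[1]")                              // single partition, matching splits=1 in the plan
      .config("spark.sql.codegen.wholeStage", "true")  // enabled (also the default in 2.0)
      .getOrCreate()

    val start = System.nanoTime()
    // Sum the ids 0 .. 999,999,999 in a single aggregation
    spark.range(0, 1000000000L).selectExpr("sum(id)").show()
    println(s"Total Execution Time ${(System.nanoTime() - start) / 1e9} seconds")

    spark.stop()
  }
}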

16/08/19 22:20:17 INFO CodeGenerator: Code generated in 8.807053 ms
+------------------+
|           sum(id)|
+------------------+
|499999999500000000|
+------------------+

Total Execution Time 3.313332086 seconds
16/08/19 22:20:17 INFO SparkContext: Invoking stop() from shutdown hook

Below is the explain plan (wherever you see a *, it means WholeStageCodeGen has generated hand-written code for that operator):

== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(id#0L)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(id#0L)])
      +- *Range (0, 1000000000, splits=1)
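You can print this physical plan yourself by calling explain() on the same query; a minimal sketch:

// Prints the physical plan; operators prefixed with * run inside
// whole-stage generated code.
spark.range(0, 1000000000L).selectExpr("sum(id)").explain()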

Now, let’s run the same query with WholeStageCodeGen set to false.
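The setting behind this is spark.sql.codegen.wholeStage; a minimal sketch of turning it off, assuming an existing SparkSession named spark:

// Disable whole-stage code generation for this session
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// (Equivalently, pass .config("spark.sql.codegen.wholeStage", "false")
// when building the SparkSession.)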

16/08/19 22:26:43 INFO CodeGenerator: Code generated in 5.217584 ms
+------------------+
|           sum(id)|
+------------------+
|499999999500000000|
+------------------+

Total Execution Time 15.299064145 seconds

Execution took roughly 5 times longer without WholeStageCodeGen.

Below is the explain plan (note that the * symbol is absent here):

== Physical Plan ==
HashAggregate(keys=[], functions=[sum(id#0L)])
+- Exchange SinglePartition
   +- HashAggregate(keys=[], functions=[partial_sum(id#0L)])
      +- Range (0, 1000000000, splits=1)

In Spark 2.0, WholeStageCodeGen is enabled by default, so you don’t have to configure it explicitly.
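If you want to confirm what a session is actually using, you can read the flag back; a minimal sketch:

// Returns "true" on a stock Spark 2.0 session
println(spark.conf.get("spark.sql.codegen.wholeStage"))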
