How to Use Threads in a Spark Job to Achieve Parallel Reads and Writes

Agenda:

When you have a large number of Spark tables or DataFrames to write to persistent storage, you may want to parallelize the writes as much as possible. The approach I used simply spawns plain Scala threads, one per write, so the save jobs run in parallel across the CPU cores instead of one after another.
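Here is a minimal sketch of the pattern. The original code is not reproduced here, so the table names, sample data, and output path below are illustrative assumptions, not the exact job behind the logs. The key point is that SparkContext is thread-safe for job submission, so each thread can call save independently:

import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkMultiThreading {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkMultiThreading")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: 20 small DataFrames that all need to be persisted.
    val tables: Seq[(String, DataFrame)] =
      (1 to 20).map(i => (s"table_$i", Seq((i, s"row_$i")).toDF("id", "value")))

    // One plain Scala (JVM) thread per write. Each thread submits its own
    // Spark job, so the jobs run concurrently instead of sequentially.
    val threads = tables.map { case (name, df) =>
      new Thread(new Runnable {
        override def run(): Unit =
          df.write.mode("overwrite").format("parquet").save(s"/tmp/output/$name")
      })
    }

    threads.foreach(_.start())
    threads.foreach(_.join()) // wait for all writes to finish

    spark.stop()
  }
}

Raw threads are the simplest way to illustrate the idea; in a real job you might prefer Scala Futures backed by a fixed-size ExecutionContext so the number of concurrent writes stays bounded.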


You can see the difference in job performance when you run a similar job without threads.
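For comparison, the sequential version of the same writes (using the same hypothetical names as above) would look like this; each save blocks until the previous one finishes, so Spark starts the jobs strictly one at a time:

tables.foreach { case (name, df) =>
  // This call does not return until the write completes,
  // so only one Spark job is ever running.
  df.write.mode("overwrite").format("parquet").save(s"/tmp/output/$name")
}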

You can verify this from the logs below: all twenty "Starting job" entries appear within the same second, one per thread, whereas a regular Spark job does not show this pattern.

17/02/02 05:57:06 INFO CodeGenerator: Code generated in 165.608313 ms
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO DAGScheduler: Got job 4 (save at SparkMultiThreading.scala:41) with 1 output partitions
17/02/02 05:57:07 INFO DAGScheduler: Final stage: ResultStage 0 (save at SparkMultiThreading.scala:41)
17/02/02 05:57:07 INFO DAGScheduler: Parents of final stage: List()