How to Use Threads in a Spark Job to Achieve Parallel Reads and Writes

Agenda:

When you have many Spark tables or DataFrames to write to persistent storage, you might want to parallelize the operation as much as possible. Below is the approach I used to achieve this: it simply uses plain Scala threads, so the writes run in parallel across the available CPU cores.
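A minimal sketch of this pattern follows. The table names, output path, and Parquet format are illustrative placeholders rather than the exact code from the original job; substitute your own DataFrames and sinks.

import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkMultiThreading {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkMultiThreading")
      .getOrCreate()

    // Placeholder list of source tables to write out in parallel.
    val tables = Seq("table_01", "table_02", "table_03")

    // One plain Scala thread per table. Each thread submits its own
    // save job from the driver, so the scheduler can run the jobs
    // concurrently, subject to available cluster resources.
    val threads = tables.map { t =>
      new Thread(new Runnable {
        def run(): Unit = {
          val df: DataFrame = spark.table(t)
          df.write
            .mode("overwrite")
            .parquet(s"/tmp/output/$t") // placeholder output path
        }
      })
    }

    threads.foreach(_.start())
    threads.foreach(_.join()) // wait for all writes to finish

    spark.stop()
  }
}

Spark's scheduler is thread-safe, so submitting jobs from multiple driver threads like this is supported. If you need to cap the number of concurrent writes, the same idea can be expressed with Scala Futures backed by a fixed-size thread pool.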

 

You can see the difference in job performance when you run a similar job without using threads.

You can verify this in the logs below: many save jobs are submitted within the same second. A regular Spark job, which submits its jobs sequentially, does not show this pattern.

17/02/02 05:57:06 INFO CodeGenerator: Code generated in 165.608313 ms
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO SparkContext: Starting job: save at SparkMultiThreading.scala:41
17/02/02 05:57:07 INFO DAGScheduler: Got job 4 (save at SparkMultiThreading.scala:41) with 1 output partitions
17/02/02 05:57:07 INFO DAGScheduler: Final stage: ResultStage 0 (save at SparkMultiThreading.scala:41)
17/02/02 05:57:07 INFO DAGScheduler: Parents of final stage: List()