How to Broadcast a HDFS File and Convert the values to SchemaRDD in Spark?

Let me explain what is Broadcast Variable in Spark:

Broadcast variables are the ones which allows the program to efficiently send a large, read only value to all Worker Nodes for use in one or more Spark Operations.

For example, Take any Static or Lookup Hive Tables.

This is simply an object of type spark.broadcast.Broadcast[t] class which wraps a value of type T. The value is sent to each node only once using an efficient, Bit Torrent-like communication mechanism.  The transmission of Broadcast variable to worker nodes sub linearly increases as the number of worker nodes increases in the Spark Cluster.

You can find various examples using Scala collections.  But see here how I implemented Broadcasting a large Lookup File.

val fn= sc.textFile(“hdfs://filename”).map(_.split(“|”))

val fn_bc= sc.broadcast(fn)

val fn_bc_row= fn_bc.value.filter(x => x(0)!=”NULL”).map(r => org.apache.spark.sql.Row(r(0),r(1),r(2)))

fn_bc.take(10)

res28: Array[org.apache.spark.sql.catalyst.expressions.Row] = Array([INSERT,PUP,10010.0], [INSERT,HDES,10009.0], [INSERT,COGEN,10008.0], [INSERT,SIS,10007.0], [INSERT,PAS,10006.0], [INSERT,MAIG,10005.0], [INSERT,HUON,10004.0], [INSERT,ADES,10003.0], [INSERT,Automated entry,10002.0], [INSERT,Manual Entered Policy – FNOL,10001.0])

This mechanism is as similar to Hadoop MR Distributed Cache.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s