Let me explain what a broadcast variable is in Spark:
A broadcast variable lets a program efficiently send a large, read-only value to all worker nodes for use in one or more Spark operations.
A typical example is a static or lookup Hive table.
A broadcast variable is simply an object of type spark.broadcast.Broadcast[T], which wraps a value of type T. The value is sent to each node only once, using an efficient, BitTorrent-like communication mechanism, so the cost of distributing it grows sub-linearly as the number of worker nodes in the cluster increases.
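As a minimal sketch of the idiom (assuming an existing SparkContext named sc; the Map contents here are made up for illustration), a small read-only lookup table can be broadcast once and then read by every task through .value:

// A small, read-only lookup table built on the driver
val countryCodes = Map("IN" -> "India", "US" -> "United States")

// Broadcast it once; bc has type org.apache.spark.broadcast.Broadcast[Map[String, String]]
val bc = sc.broadcast(countryCodes)

// Tasks read the shared broadcast copy via .value, instead of
// shipping the Map inside every task's closure
val names = sc.parallelize(Seq("IN", "US"))
  .map(code => bc.value.getOrElse(code, "Unknown"))

Because the table is broadcast rather than captured in the closure, it is serialized and transferred to each executor only once, no matter how many tasks run there.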
You can find various examples that broadcast Scala collections; here is how I broadcast a large lookup file.
val fn = sc.textFile("hdfs://filename").map(_.split("\\|")) // split takes a regex, so the pipe must be escaped
// An RDD cannot itself be broadcast; collect it to the driver first
val fn_bc = sc.broadcast(fn.collect())
val fn_bc_row = fn_bc.value.filter(x => x(0) != "NULL").map(r => org.apache.spark.sql.Row(r(0), r(1), r(2)))
Sample REPL output:
res28: Array[org.apache.spark.sql.catalyst.expressions.Row] = Array([INSERT,PUP,10010.0], [INSERT,HDES,10009.0], [INSERT,COGEN,10008.0], [INSERT,SIS,10007.0], [INSERT,PAS,10006.0], [INSERT,MAIG,10005.0], [INSERT,HUON,10004.0], [INSERT,ADES,10003.0], [INSERT,Automated entry,10002.0], [INSERT,Manual Entered Policy – FNOL,10001.0])
This mechanism is similar to the Distributed Cache in Hadoop MapReduce.