How to handle nested data/array of structures or multiple Explodes in Spark/Scala and PySpark:

Explode

explode() takes an array (or a map) as input and outputs the elements of the array (or map) as separate rows. Hive UDTFs can be used in the SELECT expression list and as part of LATERAL VIEW.
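Before getting into Spark itself, the row-expansion semantics of explode() can be sketched in a few lines of plain Python (this is an illustration of the behavior, not Spark code): each element of the array column becomes its own output row, while the other columns are repeated.

```python
# Plain-Python sketch of explode() on a single array column "b",
# with a scalar column "a" carried along unchanged.
rows = [(1, [2, 3, 4]), (2, [3, 4, 5])]

# One output row per array element; "a" is repeated for each element of "b"
exploded = [(a, elem) for a, arr in rows for elem in arr]
print(exploded)
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5)]
```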

Those who are familiar with LATERAL VIEW EXPLODE in Hive have probably tried the same in Spark. Some of those attempts may have failed, so here is a working approach I put together on Spark 1.5.1.

Below is some sample data I prepared using Scala collections for this demo.

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

/**
* Created by vgiridatabricks on 5/16/16.
*/
object ExplodeDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("csvParser")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(
      (1, Seq(2, 3, 4), Seq(5, 6, 7)),
      (2, Seq(3, 4, 5), Seq(6, 7, 8)),
      (3, Seq(4, 5, 6), Seq(7, 8, 9)))).toDF("a", "b", "c")

    // First explode: one output row per element of b
    val df1 = df.select(df("a"), explode(df("b")).alias("b_columns"), df("c"))
    // Second explode, applied to the already-exploded DataFrame
    val df2 = df1.select(df1("a"), df1("b_columns"), explode(df1("c")).alias("c_columns"))

    df.show()
    df1.show()
    df2.show()
  }
}

df.show() -> Initial Collections

+---+---------+---------+
|  a|        b|        c|
+---+---------+---------+
|  1|[2, 3, 4]|[5, 6, 7]|
|  2|[3, 4, 5]|[6, 7, 8]|
|  3|[4, 5, 6]|[7, 8, 9]|
+---+---------+---------+

df1.show() -> First Exploded Collections

+---+---------+---------+
|  a|b_columns|        c|
+---+---------+---------+
|  1|        2|[5, 6, 7]|
|  1|        3|[5, 6, 7]|
|  1|        4|[5, 6, 7]|
|  2|        3|[6, 7, 8]|
|  2|        4|[6, 7, 8]|
|  2|        5|[6, 7, 8]|
|  3|        4|[7, 8, 9]|
|  3|        5|[7, 8, 9]|
|  3|        6|[7, 8, 9]|
+---+---------+---------+

df2.show() -> Second Exploded Collections

+---+---------+---------+
|  a|b_columns|c_columns|
+---+---------+---------+
|  1|        2|        5|
|  1|        2|        6|
|  1|        2|        7|
|  1|        3|        5|
|  1|        3|        6|
|  1|        3|        7|
|  1|        4|        5|
|  1|        4|        6|
|  1|        4|        7|
|  2|        3|        6|
|  2|        3|        7|
|  2|        3|        8|
|  2|        4|        6|
|  2|        4|        7|
|  2|        4|        8|
|  2|        5|        6|
|  2|        5|        7|
|  2|        5|        8|
|  3|        4|        7|
|  3|        4|        8|
+---+---------+---------+
only showing top 20 rows

Please note that if you apply multiple explodes within the same DataFrame operation, Spark SQL throws an error along the lines of "AnalysisException: Only one generator allowed per select but Generate and explode found." So apply explode to one column at a time, give the result an alias, and run the second explode on the already-exploded DataFrame.
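The effect of chaining the two explodes can be sketched in plain Python (again an illustration, not Spark code): each explode multiplies the row count by the array length, so two explodes over two 3-element arrays yield nine output rows per input row, i.e. the Cartesian product of b and c.

```python
rows = [(1, [2, 3, 4], [5, 6, 7])]

# First "explode": one row per element of b, keeping a and c intact
step1 = [(a, b_el, c) for a, b_arr, c in rows for b_el in b_arr]

# Second "explode", applied to the already-expanded rows: one row per element of c
step2 = [(a, b_el, c_el) for a, b_el, c_arr in step1 for c_el in c_arr]

print(len(step2))  # 9 rows per input row: the Cartesian product of b and c
print(step2[0])    # (1, 2, 5)
```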

PySpark code for the same logic (I have used a different list here):

from pyspark.sql import Row
from pyspark.sql.functions import explode

df = sqlContext.createDataFrame([Row(a=1, b=[1, 2, 3], c=[7, 8, 9]),
                                 Row(a=2, b=[4, 5, 6], c=[10, 11, 12])])
# First explode on b, aliased so it can be referenced again
df1 = df.select(df.a, explode(df.b).alias("b_row"), df.c)
# Second explode on the already-exploded DataFrame
df2 = df1.select(df1.a, df1.b_row, explode(df1.c).alias("c_row"))
df2.show()

The final outputs are attached from a Databricks notebook. If you would like an easy way to learn and run Spark jobs, try the Community Edition of Databricks.


One thought on "How to handle nested data/array of structures or multiple Explodes in Spark/Scala and PySpark"

  1. Thanks, this is very interesting.
     I notice that the explode creates a Cartesian product of the results, which might not be intended. For example, col a contains a primitive type, but col b and col c are arrays, both with 3 values, so you might want an element-wise pairing of b with c (a*b|c) rather than a*b*c.
     Given this input:

     | a|        b|        c|
     | 1|[2, 3, 4]|[5, 6, 7]|

     How would you return this?

     | a| b| c|
     | 1| 2| 5|
     | 1| 3| 6|
     | 1| 4| 7|
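For the element-wise pairing the commenter asks about, the arrays need to be zipped by position rather than cross-joined. Here is a plain-Python sketch of the intended result; in Spark itself, posexplode on each array followed by a join on the position column works on older versions, and Spark 2.4+ adds arrays_zip so a single explode over the zipped column gives the same pairing (neither is shown in the 1.5.1 code above).

```python
row = (1, [2, 3, 4], [5, 6, 7])
a, b_arr, c_arr = row

# Pair b[i] with c[i] instead of taking the full Cartesian product
zipped = [(a, b_el, c_el) for b_el, c_el in zip(b_arr, c_arr)]
print(zipped)
# [(1, 2, 5), (1, 3, 6), (1, 4, 7)]
```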
