How to Transpose Columns to Rows in a Spark DataFrame

Let’s say you have input like this.

    a        b        c
    Titanic  IronMan  JungleBook
    1        10       20
    2        30       40
    1        30       20

and you want output like the one below, i.e. all of these columns transposed into rows using the Spark DataFrame approach. As you may know, there is no direct way to do a transpose in Spark. In some cases pivot can be used, but I haven't tried that part yet.

    col1        col2  col3  col4
    Titanic     1     2     1
    IronMan     10    30    30
    JungleBook  20    40    20

What have I done?

  1. Created a Sequence and converted it to a DataFrame with three column names assigned.
  2. Used the collect_list function to gather each column's values into a list, then concat_ws to join them into a single string.
  3. Split that joined string using a custom delimiter (':').
  4. Read each element of the split result as a separate column in SQL; did the same for all the columns.
  5. Union-all'ed the converted columns and created the final DataFrame.
  6. I have used the Spark SQL approach here.
  7. Please note that this approach only works for a small set of columns. There may be easier ways; if you know one, please reply in a comment here.
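The collect-and-emit-as-rows recipe above is, at its core, a transpose. Outside of Spark, the same transformation on plain Scala collections is just the built-in `transpose`; a minimal sketch using the same sample data (no Spark required, just to illustrate the shape of the result):

```scala
object TransposeSketch {
  def main(args: Array[String]): Unit = {
    // Same sample data as the Spark example below, as rows
    val rows = Seq(
      Seq("Titanic", "IronMan", "JungleBook"),
      Seq("1", "10", "20"),
      Seq("2", "30", "40"),
      Seq("1", "30", "20"))
    // Each original column becomes a row
    val cols = rows.transpose
    cols.foreach(r => println(r.mkString(",")))
    // Titanic,1,2,1
    // IronMan,10,30,30
    // JungleBook,20,40,20
  }
}
```

Spark needs the collect/split workaround because its data is distributed across partitions, so there is no single in-memory collection to call `transpose` on.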

IntelliJ Version of the Code:

I tried this example with Spark 1.6.1.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by vgiridatabricks on 5/28/16.
  */
object SparkTranspose {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("csvParser")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Step 1: build a DataFrame with three named columns
    val df = sc.parallelize(Seq(
      ("Titanic", "IronMan", "JungleBook"),
      ("1", "10", "20"),
      ("2", "30", "40"),
      ("1", "30", "20"))).toDF("a", "b", "c")
    df.registerTempTable("transpose")

    // Steps 2-3: collect each column into a list and join it with ':'
    sqlContext.sql(
      """
      select concat_ws(':', a.cola) as cola_new,
             concat_ws(':', a.colb) as colb_new,
             concat_ws(':', a.colc) as colc_new
      from (select collect_list(a) as cola,
                   collect_list(b) as colb,
                   collect_list(c) as colc
            from transpose) a
      """).registerTempTable("tablesplitter")

    // Steps 4-5: split each joined string back into columns and union the rows
    sqlContext.sql(
      """
      select split(cola_new, ':')[0] as col1, split(cola_new, ':')[1] as col2,
             split(cola_new, ':')[2] as col3, split(cola_new, ':')[3] as col4
      from tablesplitter
      union all
      select split(colb_new, ':')[0] as col1, split(colb_new, ':')[1] as col2,
             split(colb_new, ':')[2] as col3, split(colb_new, ':')[3] as col4
      from tablesplitter
      union all
      select split(colc_new, ':')[0] as col1, split(colc_new, ':')[1] as col2,
             split(colc_new, ':')[2] as col3, split(colc_new, ':')[3] as col4
      from tablesplitter
      """).show()
  }
}

Output:

    +----------+----+----+----+
    |      col1|col2|col3|col4|
    +----------+----+----+----+
    |   Titanic|   1|   2|   1|
    |   IronMan|  10|  30|  30|
    |JungleBook|  20|  40|  20|
    +----------+----+----+----+
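One caveat worth noting (my observation, not covered in the steps above): because the approach joins the collected values with ':', any data value that itself contains a ':' would throw off the split. A quick plain-Scala sketch of what the concat_ws/split round trip does, and how it breaks:

```scala
object DelimiterSketch {
  def main(args: Array[String]): Unit = {
    // What concat_ws(':', collect_list(a)) produces for column a ...
    val colA   = Seq("Titanic", "1", "2", "1")
    val joined = colA.mkString(":")        // "Titanic:1:2:1"
    // ... and what split(cola_new, ':')[i] reads back out
    val parts  = joined.split(":")
    println(parts.mkString(" | "))         // Titanic | 1 | 2 | 1

    // A value containing the delimiter splits into too many pieces
    val bad = Seq("12:30", "1").mkString(":").split(":")
    println(bad.length)                    // 3, not 2
  }
}
```

If the data may contain ':', pick a delimiter that cannot occur in the values.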

Hope you enjoyed this blog!
