Spark DataFrames & Handling \001 delimiters

Agenda:

  • Create a text-formatted Hive table with a \001 delimiter and read the underlying warehouse file using Spark
  • Create a text file with a \001 delimiter and read it using Spark

Create a DataFrame and Register a Temp View:

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"col" syntax (pre-imported when running in spark-shell)
val employee = spark.range(0, 100).select($"id".as("employee_id"), (rand() * 3).cast("int").as("dep_id"), (rand() * 40 + 20).cast("int").as("age"))

employee.createOrReplaceTempView("hive001")
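
//As a quick sanity check (an addition here, not part of the original session), the schema and a few rows
//can be inspected; the exact values differ on every run because rand() is nondeterministic
employee.printSchema()
employee.show(3, false)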

//Save as a Hive table with the \001 delimiter

spark.sql("select concat_ws('\001',employee_id,dep_id,age) as allnew from hive001").repartition(2).write.mode("overwrite").format("text").saveAsTable("hive001_new")
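
//To confirm what was written (a check added here, not from the original post): the saved table
//exposes a single string column "allnew" whose values carry the non-printing \001 separator
spark.table("hive001_new").printSchema()
spark.table("hive001_new").show(3, false)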

//Also save a plain text file with the \001 delimiter, for a second round of verification

spark.sql("select concat_ws('\001',employee_id,dep_id,age) as allnew from hive001").repartition(2).write.mode("overwrite").format("text").save("/tmp/hive001_new")

//The table's underlying files can then be read back with the CSV reader, using \001 as the delimiter

spark.read.format("csv").option("delimiter", "\001").load("/user/hive/warehouse/hive001_new").show(3, false)
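
//The CSV reader assigns generic column names (_c0, _c1, _c2). As a small extra step (an assumption
//here, mirroring the employee DataFrame above), the columns can be renamed and typed.
//Note: "\u0001" is the same character as "\001", written as a Unicode escape
val parsed = spark.read.format("csv").option("delimiter", "\u0001").option("inferSchema", "true").load("/user/hive/warehouse/hive001_new").toDF("employee_id", "dep_id", "age")
parsed.printSchema()
parsed.show(3, false)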

//Split the underlying files on the \001 delimiter. This works as well, and the resulting RDD can be converted to a DataFrame (see the sketch after the outputs below)

spark.sparkContext.textFile("/user/hive/warehouse/hive001_new").map(_.split("\001")).take(3)

res35: Array[Array[String]] = Array(Array(0, 2, 52), Array(2, 2, 30), Array(4, 1, 37))

spark.sparkContext.textFile("/tmp/hive001_new").map(_.split("\001")).take(3)

res36: Array[Array[String]] = Array(Array(0, 2, 52), Array(2, 2, 30), Array(4, 1, 37))
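
//As noted above, the split RDD can be converted back to a DataFrame. A minimal sketch (an addition,
//not from the original session); the column names and Int casts are assumptions that mirror the
//employee DataFrame created at the top
import spark.implicits._
val fromRdd = spark.sparkContext.textFile("/user/hive/warehouse/hive001_new").map(_.split("\u0001")).map(a => (a(0).toInt, a(1).toInt, a(2).toInt)).toDF("employee_id", "dep_id", "age")
fromRdd.printSchema()
fromRdd.show(3, false)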
