Loading Malformed Records in Spark through CSV Reader

There is a Use case I got it from one of my customer. The use case is to parse and process the below records  through csv reader in Spark. Note the records have single and double quotes as present in the records below.

Input.txt:

0|"I have one double quote as the first character of the field
"1|" I have two double quotes as the first character of the field
2|I have one " that is NOT the first character of the field"
3|I have two " that are NOT the first charcters of the field"
4|I have no double quotes
5|' I have one single quote as the first character of the field
6|'' I have two sincle quotes as the first characters of the field
7|I have one ' that is NOT the first character of the field
8|I have two '' that are NOT the first characters of the field

How to Process this in Scala:

val addlOptions = Map((“header”,”false”),(“delimiter”,”|”),(“mode”,”PERMISSIVE”),(“quote”,null))val df = sqlContext.read.format(“csv”).options(addlOptions)
.load(“/FileStore/tables/d5hw1vsq1464390309397/doublequotes.txt”).show()

Please find the Snapshot for the results. I executed this in Databricks Notebook using Spark 1.6.2 Version

csv_double_quotes_issue_giri

How to Process this in Python:

sqlContext.read.format(‘com.databricks.spark.csv’).options(header=’false’, delimiter=’|’, escape=”\\”, quote=”\\”).
load(‘/FileStore/tables/d5hw1vsq1464390309397/doublequotes.txt’ ).show()

Please find the Snapshot for the results. I executed this in Databricks Notebook using Spark 1.6.2 version.

Screen Shot 2016-05-27 at 4.53.46 PM

This way you can actually load all malformed records present in a file by loading through spark-csv package without any data loss.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s