Loading Malformed Records in Spark through CSV Reader

Here is a use case I got from one of my customers: parse and process the records below through the CSV reader in Spark. Note that the records contain single and double quotes, as shown. (A short sketch for recreating this test file locally follows the listing.)

Input.txt:

0|"I have one double quote as the first character of the field
"1|" I have two double quotes as the first character of the field
2|I have one " that is NOT the first character of the field"
3|I have two " that are NOT the first characters of the field"
4|I have no double quotes
5|' I have one single quote as the first character of the field
6|'' I have two single quotes as the first characters of the field
7|I have one ' that is NOT the first character of the field
8|I have two '' that are NOT the first characters of the field
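
If you want to reproduce this yourself, here is a minimal sketch that writes the sample records to a file. The local path /tmp/doublequotes.txt is my assumption; the post itself uses a file uploaded to DBFS.

// Sketch: write the sample records to a local file for testing.
// Assumption: /tmp/doublequotes.txt as the target path (the post itself
// reads from a DBFS upload).
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

val lines = Seq(
  "0|\"I have one double quote as the first character of the field",
  "1|\"\" I have two double quotes as the first characters of the field",
  "2|I have one \" that is NOT the first character of the field\"",
  "3|I have two \" that are NOT the first characters of the field\"",
  "4|I have no double quotes",
  "5|' I have one single quote as the first character of the field",
  "6|'' I have two single quotes as the first characters of the field",
  "7|I have one ' that is NOT the first character of the field",
  "8|I have two '' that are NOT the first characters of the field"
)
Files.write(Paths.get("/tmp/doublequotes.txt"), lines.asJava, StandardCharsets.UTF_8)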

How to Process this in Scala:

val addlOptions = Map(("header", "false"), ("delimiter", "|"), ("mode", "PERMISSIVE"), ("quote", null))
val df = sqlContext.read.format("csv")
  .options(addlOptions)
  .load("/FileStore/tables/d5hw1vsq1464390309397/doublequotes.txt")
df.show()

The snapshot below shows the results; I executed this in a Databricks notebook on Spark 1.6.2. Setting the quote option to null effectively disables quote handling in spark-csv, so the quote characters are loaded as ordinary data.

[Screenshot: results of the Scala CSV read in the Databricks notebook]
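
As a side note: in Spark 2.0 and later the CSV reader is built in, and quote handling is turned off by setting the quote option to an empty string rather than null. A minimal sketch of the equivalent read, assuming a SparkSession named spark, would be:

// Sketch of the equivalent read on Spark 2.x+ (assumption: a SparkSession
// is available as spark; the built-in CSV source replaces spark-csv).
val df2 = spark.read
  .option("header", "false")
  .option("delimiter", "|")
  .option("mode", "PERMISSIVE")
  .option("quote", "")  // an empty string turns off quote handling
  .csv("/FileStore/tables/d5hw1vsq1464390309397/doublequotes.txt")
df2.show(false)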

How to Process this in Python:

sqlContext.read.format('com.databricks.spark.csv') \
  .options(header='false', delimiter='|', escape='\\', quote='\\') \
  .load('/FileStore/tables/d5hw1vsq1464390309397/doublequotes.txt') \
  .show()

The snapshot below shows the results; again, I executed this in a Databricks notebook on Spark 1.6.2. Here the quote (and escape) character is overridden with a backslash, so the default double quote is no longer treated as a quote character.

[Screenshot: results of the Python CSV read in the Databricks notebook]

In this way you can load all of the malformed records in a file through the spark-csv package without any data loss.
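
As a quick sanity check that nothing was dropped, you can compare the raw line count with the parsed row count. This is a sketch, assuming df is the DataFrame from the Scala example above:

// Sketch: confirm that every raw line became a parsed row.
val path = "/FileStore/tables/d5hw1vsq1464390309397/doublequotes.txt"
val rawCount = sqlContext.read.text(path).count() // lines in the raw file
val parsedCount = df.count()                      // rows parsed by spark-csv
assert(rawCount == parsedCount, s"parsed $parsedCount of $rawCount lines")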