Spark Data Frame : Check for Any Column values with ‘N’ and ‘Y’ and Convert the corresponding Column to Boolean using PySpark

Assume there are many columns in a data frame that are of string type but always have a value of “N” or “Y”.  You would like to scan a column to determine if this is true  and  if it is really just Y or N, then you might want to change the column type to boolean and have false/true as the values of the cells.

What the Below Code does:
1. Collects the Column Names and Column Types in a Python List
2. Iterate over a for loop and collect the distinct value of the columns in a two dimensional array
3. In the Loop, check if the Column type is string and values are either ‘N’ or ‘Y’
4. If Yes ,Convert them to Boolean and Print the value as true/false Else Keep the Same type.

PySpark Code:


df = sc.parallelize([(1, 'Y','F',"Giri",'Y'), (2, 'N','V',"Databricks",'N'),(3,'Y','B',"SparkEdge",'Y'),(4,'N','X',"Spark",'N')]).toDF(["id", "flag1","flag2","name","flag3"])
print 'Show Dataframe'
df.show()
print 'Actual Schema of the df'
df.printSchema()
for a_dftype in df.dtypes:
col_name = a_dftype[0]
col_type = a_dftype[1]
# print df.select(col_name).collect()[0][0]
if col_type=='string' and (df.select(col_name).distinct().collect()[0][0] =='N' or df.select(col_name).distinct().collect()[0][0] =='Y'):
df = df.withColumn(col_name,df[col_name].cast("boolean")).drop(df[col_name])
else:
df = df.withColumn(col_name,df[col_name]).drop(df[col_name])
print 'df with True/False Value after Data Type changes'
df.show()
print 'Modified Schema of the df'
df.printSchema()

See the Output:

I have attached the snapshot of the results. Please look at the columns flag1,flag3 converted to true or false and rest of the columns are printed with the original value.

Screen Shot 2016-08-05 at 3.28.57 PM

 

 

4 thoughts on “Spark Data Frame : Check for Any Column values with ‘N’ and ‘Y’ and Convert the corresponding Column to Boolean using PySpark

Leave a comment