Spark DataFrame: Check Whether a Column Contains Only ‘N’ and ‘Y’ Values and Convert It to Boolean Using PySpark

Assume a data frame has many columns that are of string type but only ever hold the value “N” or “Y”. You would like to scan each column to confirm that this is true, and if the column really contains just Y or N, change its type to boolean so that the cells hold false/true instead.

What the code below does:
1. Collects the column names and column types in a Python list.
2. Iterates over the columns in a for loop and collects the distinct values of each column in a two-dimensional array.
3. In the loop, checks whether the column type is string and the values are only ‘N’ or ‘Y’.
4. If so, converts the column to boolean and prints its values as true/false; otherwise keeps the original type.

PySpark Code:
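A minimal sketch of the four steps above, not necessarily the exact listing used for the results below. The helper names `is_yn_flag` and `convert_yn_columns` are hypothetical, nulls are assumed to be ignored when checking a column, and the commented usage assumes Spark 2.0 or later (`SparkSession`) with made-up sample rows.

```python
def is_yn_flag(dtype, distinct_values):
    """Return True when a string column holds only 'Y' / 'N' (nulls ignored)."""
    non_null = {v for v in distinct_values if v is not None}
    return dtype == "string" and bool(non_null) and non_null <= {"Y", "N"}


def convert_yn_columns(df):
    """Return a DataFrame in which every Y/N string column is cast to boolean."""
    # Imported here so is_yn_flag can be used without a Spark installation.
    from pyspark.sql import functions as F

    # Step 1: df.dtypes is a list of (column name, type name) pairs.
    for name, dtype in df.dtypes:
        if dtype != "string":
            continue
        # Step 2: collect the distinct values of this column to the driver.
        distinct = [row[0] for row in df.select(name).distinct().collect()]
        # Steps 3 and 4: convert only when the values are just 'Y' / 'N';
        # 'Y' becomes true, 'N' becomes false, null stays null.
        if is_yn_flag(dtype, distinct):
            df = df.withColumn(name, F.col(name) == "Y")
    return df


# Hypothetical usage (sample data is an assumption):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# df = spark.createDataFrame(
#     [("Y", "a", "N"), ("N", "b", "Y")], ["flag1", "name", "flag3"])
# convert_yn_columns(df).show()
```

Collecting distinct values runs one Spark job per string column, which is fine for a handful of flag columns but can be slow on wide tables; in that case a single aggregation over all string columns would be worth considering.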

See the Output:

I have attached a snapshot of the results. Note that the columns flag1 and flag3 are converted to true or false, while the rest of the columns are printed with their original values.

[Screenshot of the results, 2016-08-05]

 

 
