How to Handle Schema Changes/Evolution in Hive ORC Tables (e.g., Column Deletions at the Source DB)

In the EDW world, schema changes are a frequent activity. But when the same data also lives in Hive as part of the data lake, those changes can break reads and writes in Hive/HDFS if they are not handled carefully.

In this example, I cover column deletions at the source DB, which is MySQL. Avro is the data landing layer, loaded using Sqoop, and ORC is the data staging layer, loaded using a Hive insert.

Before Schema Changes:
#Create a Table in MYSQL
create table test.cc_address_mysql
(claimantname varchar(50),
addressline1 varchar(50),
addressline2 varchar(50),
addressline3 varchar(50),
addressline4 varchar(50),
city varchar(50));

#Insert some Data into it.
INSERT INTO test.cc_address_mysql (claimantname,addressline1,addressline2,addressline3,addressline4,city) values ('giri','20808 n 79th','apt 1028','resident','business','pheonix'),('rajeev','20808 n 80th','apt 1029','resident1','business1','glendale'),('mathan','20808 n 81st','apt 1030','resident2','business2','suncity');

#Create a new HDFS directory that will later store the post-schema-change data.
hadoop fs -mkdir /QA/giri/poc/schemachanges/cas/cc_address/dbchange

#Similarly, create local directories for the old and new Sqoop-generated schema files
mkdir -p /tmp/cc_address/oldschema /tmp/cc_address/newschema

#Sqoop the first-time load as below
sqoop import --connect jdbc:mysql://n01bdl005.aap.csaa.pri:3306/test --username grvrama --password Hadoop123 \
--table cc_address_mysql --outdir /tmp/cc_address/oldschema --target-dir /QA/giri/poc/schemachanges/cas/cc_address/deletecolumns/data --as-avrodatafile --split-by claimantname -m 4;

#Move the AVSC (Original Schema) to HDFS
hadoop fs -copyFromLocal /tmp/cc_address/oldschema/*.avsc /QA/giri/poc/schemachanges/cas/cc_address/schema/

#Create an external table with Avro storage pointing to the HDFS locations.
CREATE EXTERNAL TABLE HIVETEST.ADDRESS_AVRO ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs:///QA/giri/poc/schemachanges/cas/cc_address/deletecolumns/data' TBLPROPERTIES ('avro.schema.url'='hdfs:///QA/giri/poc/schemachanges/cas/cc_address/schema/cc_address_mysql.avsc');

#The table should now show the records as below
hive> select * from HIVETEST.ADDRESS_AVRO;
OK
giri 20808 n 79th apt 1028 resident business pheonix
mathan 20808 n 81st apt 1030 resident2 business2 suncity
rajeev 20808 n 80th apt 1029 resident1 business1 glendale
Time taken: 0.5 seconds, Fetched: 3 row(s)

#Create an ORC table
CREATE EXTERNAL TABLE HIVETEST.ADDRESS_ORC (claimantname string,addressline1 string,addressline2 string,addressline3 string,addressline4 string,city string) PARTITIONED BY (load_year INT,load_month INT,load_date INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' STORED AS ORC LOCATION '/QA/giri/poc/schemachanges/cas/cc_address/deletecolumns';

#Move the Avro data into the ORC table. This table is partitioned by year, month, and date.
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE HIVETEST.ADDRESS_ORC PARTITION (load_year,load_month,load_date)
SELECT CLAIMANTNAME,ADDRESSLINE1,ADDRESSLINE2,ADDRESSLINE3,ADDRESSLINE4,CITY,YEAR(from_unixtime(unix_timestamp())),MONTH(from_unixtime(unix_timestamp())),DAY(from_unixtime(unix_timestamp())) from HIVETEST.ADDRESS_AVRO;

#After loading, verify the ORC table values.
hive> select * from HIVETEST.ADDRESS_ORC;
OK
giri 20808 n 79th apt 1028 resident business pheonix 2015 7 7
rajeev 20808 n 80th apt 1029 resident1 business1 glendale 2015 7 7
mathan 20808 n 81st apt 1030 resident2 business2 suncity 2015 7 7

After Schema Changes at MySQL:

#Drop a column in the MySQL table that we created.
mysql>use test;ALTER TABLE cc_address_mysql drop column ADDRESSLINE4;

#Verify the data after schema changes
mysql> select * from cc_address_mysql;
+--------------+--------------+--------------+--------------+----------+
| claimantname | addressline1 | addressline2 | addressline3 | city     |
+--------------+--------------+--------------+--------------+----------+
| giri         | 20808 n 79th | apt 1028     | resident     | pheonix  |
| rajeev       | 20808 n 80th | apt 1029     | resident1    | glendale |
| mathan       | 20808 n 81st | apt 1030     | resident2    | suncity  |
+--------------+--------------+--------------+--------------+----------+

#Now we need to Sqoop the initial load once again to a different schema directory (--outdir) and data directory (--target-dir)
sqoop import --connect jdbc:mysql://n01bdl005.aap.csaa.pri:3306/test --username grvrama --password Hadoop123 \
--table cc_address_mysql --outdir /tmp/cc_address/newschema --target-dir /QA/giri/poc/schemachanges/cas/cc_address/dbchange --as-avrodatafile --split-by claimantname -m 4;

#Take a backup of the old schema file. You can create a /schema/archive directory to store old schema files.
hadoop fs -cp /QA/giri/poc/schemachanges/cas/cc_address/schema/cc_address_mysql.avsc /QA/giri/poc/schemachanges/cas/cc_address/schema/archive/

#Move the New AVSC Schema File to HDFS.
hadoop fs -copyFromLocal /tmp/cc_address/newschema/*.avsc /QA/giri/poc/schemachanges/cas/cc_address/schema/current

#Please note that column change commands only modify Hive's metadata; they do not modify the data itself. You should make sure the actual data layout of the table/partition conforms with the metadata definition.
#The current Avro data directory still holds data with the old schema. So we need to either alter the table to point to the new HDFS location and new schema, or take a backup of the current Avro table and drop it completely.
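If you prefer the in-place alternative over the backup-and-drop route, it could look roughly like the sketch below. This is only a sketch using the table and paths from this example; `ALTER TABLE ... SET LOCATION` and `SET TBLPROPERTIES` are standard HiveQL, and for Avro-backed tables the schema URL lives in TBLPROPERTIES.

```sql
-- Repoint the external Avro table at the new data directory
ALTER TABLE HIVETEST.ADDRESS_AVRO
SET LOCATION 'hdfs:///QA/giri/poc/schemachanges/cas/cc_address/dbchange';

-- Point the Avro SerDe at the new schema file
ALTER TABLE HIVETEST.ADDRESS_AVRO
SET TBLPROPERTIES ('avro.schema.url'='hdfs:///QA/giri/poc/schemachanges/cas/cc_address/schema/current/cc_address_mysql.avsc');
```

In this walkthrough I take the backup-and-drop route instead, since it leaves the old-schema table available for comparison.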

#The statements below create the backup table under the Hive warehouse directory (e.g. /apps/hive/warehouse/hivetest.db/address_avro_bkp), which is fine.
CREATE TABLE HIVETEST.ADDRESS_AVRO_BKP LIKE HIVETEST.ADDRESS_AVRO;
INSERT OVERWRITE TABLE HIVETEST.ADDRESS_AVRO_BKP SELECT * FROM HIVETEST.ADDRESS_AVRO;
DROP TABLE HIVETEST.ADDRESS_AVRO;

#Recreate the Avro table with the new data location and new schema location set.
CREATE EXTERNAL TABLE HIVETEST.ADDRESS_AVRO ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs:///QA/giri/poc/schemachanges/cas/cc_address/dbchange' TBLPROPERTIES ('avro.schema.url'='hdfs:///QA/giri/poc/schemachanges/cas/cc_address/schema/current/cc_address_mysql.avsc');

#Verify the data in Avro after the schema changes in MySQL.
hive> select * from HIVETEST.ADDRESS_AVRO;
OK
giri 20808 n 79th apt 1028 resident pheonix
rajeev 20808 n 80th apt 1029 resident1 glendale
mathan 20808 n 81st apt 1030 resident2 suncity

#Take a backup of the current ORC table and drop the original ORC table.
CREATE TABLE HIVETEST.ADDRESS_ORC_BKP LIKE HIVETEST.ADDRESS_ORC;
INSERT OVERWRITE TABLE HIVETEST.ADDRESS_ORC_BKP SELECT * FROM HIVETEST.ADDRESS_ORC;
DROP TABLE HIVETEST.ADDRESS_ORC;

#Recreate the ORC table with the new location set. I created the dbchange directory to store the new-schema data files.
CREATE EXTERNAL TABLE HIVETEST.ADDRESS_ORC (claimantname string,addressline1 string,addressline2 string,addressline3 string,city string) PARTITIONED BY (load_year INT,load_month INT,load_date INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' STORED AS ORC LOCATION '/QA/giri/poc/schemachanges/cas/cc_address/dbchange';

#Insert the data into the ORC table
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE HIVETEST.ADDRESS_ORC PARTITION (load_year,load_month,load_date)
SELECT CLAIMANTNAME,ADDRESSLINE1,ADDRESSLINE2,ADDRESSLINE3,CITY,YEAR(from_unixtime(unix_timestamp())),MONTH(from_unixtime(unix_timestamp())),DAY(from_unixtime(unix_timestamp())) from HIVETEST.ADDRESS_AVRO;

#Verify the ORC table after the schema changes. The data looks correct.
hive> select * from HIVETEST.ADDRESS_ORC;
OK
mathan 20808 n 81st apt 1030 resident2 suncity 2015 7 7
rajeev 20808 n 80th apt 1029 resident1 glendale 2015 7 7
giri 20808 n 79th apt 1028 resident pheonix 2015 7 7

From the next day onwards, continue the incremental loads with the new target directory that was created after the schema change. In other words, update --target-dir in the Sqoop incremental scripts to point to the new HDFS location.
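The incremental script change can be sketched as below. The command is assembled and printed rather than run, because the incremental key is site-specific: --check-column id assumes a hypothetical auto-increment column that the demo table does not actually have, so substitute your table's real incremental key and last value.

```shell
# Assemble a post-change incremental Sqoop command (printed, not executed).
# --check-column id / --last-value 0 are placeholders for your real
# incremental key; note --target-dir now points at the new dbchange directory.
SQOOP_CMD="sqoop import \
--connect jdbc:mysql://n01bdl005.aap.csaa.pri:3306/test --username grvrama --password Hadoop123 \
--table cc_address_mysql --outdir /tmp/cc_address/newschema \
--target-dir /QA/giri/poc/schemachanges/cas/cc_address/dbchange \
--as-avrodatafile --split-by claimantname -m 4 \
--incremental append --check-column id --last-value 0"

# Print the command for review before wiring it into the daily job
echo "$SQOOP_CMD"
```

Append mode simply adds new Avro part files under the same target directory, so the external Avro table (and the downstream ORC insert) picks up the new rows without any further DDL.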
