How to Move Hive ORC Tables from One Hadoop Cluster to Another in 3 Steps!

Use Case:

Assume there is a Hive table in Cluster 1 with the partition values shown below. The use case is to transfer everyday incremental data from this Hive table to a Hive table in Cluster 2, which can have the same name or a different one.
For example, Cluster 1 has the Hive table CASSTG.CC_CLAIM, and Cluster 2 can have the same table CASSTG.CC_CLAIM or a new table. We need to transfer day partitions of this table from Cluster 1 to Cluster 2 on a daily basis.

load_year=2015,load_month=06,load_date=26
load_year=2015,load_month=06,load_date=30
load_year=2015,load_month=07,load_date=01
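(This partition list can be produced on Cluster 1 with SHOW PARTITIONS; note that Hive actually prints the values with "/" separators, e.g. load_year=2015/load_month=06/load_date=26.)
hive -e "SHOW PARTITIONS CASSTG.CC_CLAIM;"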

Please note that in order to load an ORC table, you normally need the data in another Hive table (T1) first and then insert from T1 into the ORC table. Imagine the difficulty of building a data lake across multiple clusters with ORC-formatted Hive tables that way. The most important aspect of this approach is that you don't need to run Sqoop or multiple Hive SQL statements to insert the daily load into the ORC tables in the other cluster: you can land data in one cluster every day and move it to the other cluster using the architecture below.
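For context, here is a minimal sketch of the conventional two-step load this approach replaces (and which you would have to repeat in every cluster); the staging table name and columns are hypothetical:

hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Land the raw daily feed in a non-ORC staging table (T1) first...
CREATE TABLE IF NOT EXISTS CASSTG.CC_CLAIM_STG (claim_id STRING, claim_amount DOUBLE)
PARTITIONED BY (load_year INT, load_month INT, load_date INT);

-- ...then run a second Hive SQL to rewrite it into the ORC table.
INSERT INTO TABLE CASSTG.CC_CLAIM PARTITION (load_year, load_month, load_date)
SELECT claim_id, claim_amount, load_year, load_month, load_date
FROM CASSTG.CC_CLAIM_STG;
"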

Architecture:

[Architecture diagram: Hive EXPORT on Cluster 1 → distcp → Hive IMPORT on Cluster 2]

Transfer Initial Load:
First, get the total row count of this table in Cluster 1:
select count(*) from CASSTG.CC_CLAIM;
1466331
Step 1:
On Cluster 1, export all data in the table to an HDFS directory as part of the initial load:
EXPORT TABLE CASSTG.CC_CLAIM TO 'QA/giri/legacy/CC_CLAIM_1STTIME';
Step 2:
hadoop distcp hdfs://nn1:8020/QA/giri/legacy/CC_CLAIM_1STTIME hdfs://nn2:8020/QA/giri/legacy/CC_CLAIM_1STTIME
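For recurring copies, DistCp options such as -update (copy only files that are missing or changed at the target) and -m (cap the number of map tasks) can be worth adding; a sketch, with an arbitrary mapper count:

hadoop distcp -update -m 20 \
  hdfs://nn1:8020/QA/giri/legacy/CC_CLAIM_1STTIME \
  hdfs://nn2:8020/QA/giri/legacy/CC_CLAIM_1STTIME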
Step 3:
On Cluster 2, import the exported HDFS directory into the Hive table. This loads all data present in the imported directory:
IMPORT EXTERNAL TABLE GRVTESTSCHEMA.CC_CLAIM FROM 'QA/giri/legacy/CC_CLAIM_1STTIME';
Verify:
Now, get the total row count of the imported table in Cluster 2:
select count(*) from GRVTESTSCHEMA.CC_CLAIM;
1466331

Transfer Incremental Load (Day 1):
Now we will see how to transfer the incremental data.
Step 1:
On Cluster 1, export only the Day 1 date partition:
EXPORT TABLE CASSTG.CC_CLAIM PARTITION(load_year=2015,load_month=06,load_date=26) TO 'QA/giri/legacy/CC_CLAIM_JUN26_PART';
Step 2:
hadoop distcp hdfs://nn1:8020/QA/giri/legacy/CC_CLAIM_JUN26_PART hdfs://nn2:8020/QA/giri/legacy/CC_CLAIM_JUN26_PART
Step 3:
I had already created the table GRVTESTSCHEMA.CC_CLAIM in Cluster 2, but that is not necessary: the IMPORT statement automatically creates the table in Cluster 2 from the metadata included in the exported HDFS directory.
On Cluster 2, import the exported HDFS directory into the Hive table. This loads all data present in the imported directory:
IMPORT EXTERNAL TABLE GRVTESTSCHEMA.CC_CLAIM PARTITION(load_year=2015,load_month=06,load_date=26) FROM 'QA/giri/legacy/CC_CLAIM_JUN26_PART';
Step 4:
Remove this directory every day after the import completes on Cluster 2 (from the Hive CLI):
dfs -rmr QA/giri/legacy/CC_CLAIM_JUN26_PART;
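(The -rmr option is deprecated on newer Hadoop releases; the equivalent from a regular shell, optionally skipping the trash, is:)

hadoop fs -rm -r -skipTrash QA/giri/legacy/CC_CLAIM_JUN26_PART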
Verify:
Get the imported table count with the Day 1 partition in Cluster 2.
It looks like 20 records were imported (1466331 + 20 = 1466351), which is correct.
select count(*) from GRVTESTSCHEMA.CC_CLAIM;
1466351

Transfer Incremental Load (Day 2):
Step 1:
On Cluster 1, export only the Day 2 date partition:
EXPORT TABLE CASSTG.CC_CLAIM PARTITION(load_year=2015,load_month=06,load_date=30) TO 'QA/giri/legacy/CC_CLAIM_JUN30_PART';
Step 2:
hadoop distcp hdfs://nn1:8020/QA/giri/legacy/CC_CLAIM_JUN30_PART hdfs://nn2:8020/QA/giri/legacy/CC_CLAIM_JUN30_PART
Step 3:
On Cluster 2, import the exported HDFS directory into the Hive table. This loads all data present in the imported directory:
IMPORT EXTERNAL TABLE GRVTESTSCHEMA.CC_CLAIM PARTITION(load_year=2015,load_month=06,load_date=30) FROM 'QA/giri/legacy/CC_CLAIM_JUN30_PART';
Step 4:
Remove this directory every day after the import completes on Cluster 2 (from the Hive CLI):
dfs -rmr QA/giri/legacy/CC_CLAIM_JUN30_PART;
Verify:
Get the imported table count with the Day 2 partition in Cluster 2.
It looks like 20 records were imported (1466351 + 20 = 1466371), which is correct.
select count(*) from GRVTESTSCHEMA.CC_CLAIM;
1466371

Transfer Incremental Load (Day 3):
Step 1:
On Cluster 1, export only the Day 3 date partition:
EXPORT TABLE CASSTG.CC_CLAIM PARTITION(load_year=2015,load_month=07,load_date=01) TO 'QA/giri/legacy/CC_CLAIM_JUL01_PART';
Step 2:
hadoop distcp hdfs://nn1:8020/QA/giri/legacy/CC_CLAIM_JUL01_PART hdfs://nn2:8020/QA/giri/legacy/CC_CLAIM_JUL01_PART
Step 3:
On Cluster 2, import the exported HDFS directory into the Hive table. This loads all data present in the imported directory:
IMPORT EXTERNAL TABLE GRVTESTSCHEMA.CC_CLAIM PARTITION(load_year=2015,load_month=07,load_date=01) FROM 'QA/giri/legacy/CC_CLAIM_JUL01_PART';
Step 4:
Remove this directory every day after the import completes on Cluster 2 (from the Hive CLI):
dfs -rmr QA/giri/legacy/CC_CLAIM_JUL01_PART;
Verify:
Get the imported table count with the Day 3 partition in Cluster 2.
It looks like 20 records were imported (1466371 + 20 = 1466391), which is correct.
select count(*) from GRVTESTSCHEMA.CC_CLAIM;
1466391
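Notice that the three daily runs differ only in the partition values and the export path, so the Cluster 1 half (Steps 1 and 2) is easy to parameterize. A minimal sketch, with a hypothetical script name and variables, assuming GNU date and a Cluster 1 edge node:

#!/bin/bash
# cc_claim_export.sh (hypothetical) -- daily Steps 1 and 2 on Cluster 1.
set -e                        # stop if the export or the copy fails
Y=$(date -d yesterday +%Y)    # yesterday's partition values for the
M=$(date -d yesterday +%m)    # daily incremental run
D=$(date -d yesterday +%d)
DIR="QA/giri/legacy/CC_CLAIM_${Y}${M}${D}_PART"

# Step 1: export the day partition.
hive -e "EXPORT TABLE CASSTG.CC_CLAIM PARTITION(load_year=${Y},load_month=${M},load_date=${D}) TO '${DIR}';"

# Step 2: push the exported directory to Cluster 2.
hadoop distcp "hdfs://nn1:8020/${DIR}" "hdfs://nn2:8020/${DIR}"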

How to Productionize This?

1. Create an Oozie bundle, with workflows and coordinators, that handles the Hive EXPORT and DistCp actions, and schedule it on Cluster 1 (the parameterized sketch above covers these two steps).

2. Create an Oozie bundle, with workflows and coordinators, that handles the Hive IMPORT and fs -rm -r cleanup actions, and schedule it on Cluster 2 (a sketch of its daily actions follows).
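The Cluster 2 bundle's daily actions (Steps 3 and 4) are the mirror image of the export script above; a sketch with the same hypothetical variables:

#!/bin/bash
# cc_claim_import.sh (hypothetical) -- daily Steps 3 and 4 on Cluster 2.
set -e                        # skip cleanup if the import fails
Y=$(date -d yesterday +%Y); M=$(date -d yesterday +%m); D=$(date -d yesterday +%d)
DIR="QA/giri/legacy/CC_CLAIM_${Y}${M}${D}_PART"

# Step 3: import the copied partition into the Cluster 2 table.
hive -e "IMPORT EXTERNAL TABLE GRVTESTSCHEMA.CC_CLAIM PARTITION(load_year=${Y},load_month=${M},load_date=${D}) FROM '${DIR}';"

# Step 4: remove the landing directory once the import succeeds.
hadoop fs -rm -r "${DIR}"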
