How to create ORC tables in Hive – An Analysis

What is ORC format?

An ORC file contains group of rows data which is called as Stripes along with a file footer. ORC format improves the performance when Hive is processing the data.

I have practically achieved the result and have seen the effective performance of hive ORC table.

Step1 : Create a temporary table in Hive
Step 2: Create a ORC foramtted table in Hive
Step 3: Load data to ORC table from the Temp table
Step 4: drop the temporary table.

Step1:
 CREATE TABLE agent_information
 (rec_seg_cd string,
 itm_num string,
 ocur_num string,
 trans_eff_dt string,
 trm_eff_dt string,
 expr_dt string,
 pay_pln_cd string,
 can_rsn_cd string,
 can_typ_cd string,
 genl_agnt_nm string,
 sub_prdcng_agnt_num string,
 prdcng_agnt_nm string,
 tier_fctr string,
 prog_id string,
 note1_txt string,
 note2_txt string,
 trans_num string,
 huon_plcy_incep_dt string,
 insrt_dt string,
 updt_dt string,
 plcy_trans_id string,
 data_src_nm string,
 actv_ind string,
 prior_agnt_nm string,
 huon_agent string,
 branch string,
 bustrnno string,
 plcy_number string,
 sales_rep string,
 trans_dt string )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
 LOCATION '/DEV/IVANS/TODAYS_DATE';
Step 2:
 CREATE EXTERNAL TABLE agent_information_ORC (
 rec_seg_cd string,
 itm_num string,
 ocur_num string,
 trans_eff_dt string,
 trm_eff_dt string,
 expr_dt string,
 pay_pln_cd string,
 can_rsn_cd string,
 can_typ_cd string,
 genl_agnt_nm string,
 sub_prdcng_agnt_num string,
 prdcng_agnt_nm string,
 tier_fctr string,
 prog_id string,
 note1_txt string,
 note2_txt string,
 trans_num string,
 huon_plcy_incep_dt string,
 insrt_dt string,
 updt_dt string,
 plcy_trans_id string,
 data_src_nm string,
 actv_ind string,
 prior_agnt_nm string,
 huon_agent string,
 branch string,
 bustrnno string,
 plcy_number string,
 sales_rep string,
 trans_dt string )
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 STORED AS ORC tblproperties ("orc.compress"="ZLIB");
Step 3:
 INSERT OVERWRITE TABLE agent_information_ORC SELECT * FROM agent_information;
Step 4:
 DROP table agent_information;

Before doing the Step 4, I have actually validated the disk space for both tables. So, agent_information table had 320GB data whereas agent_information_ORC table had 79.5GB of data.

Also while quering the ORC table, aggregations like count,max,min,sum does not require to run the MR jobs as the ORC table itself stores these aggregations at column level.

Below is a comparison details of disk space usage of a Hive DB against regular vs ORC.

Picture1 Picture2

So far ZLIB and Snappy Compression techniques are allowed. WIth Snappy compression, you can save lot of disk space as well as performance in Hive.

5 thoughts on “How to create ORC tables in Hive – An Analysis

  1. Thanks for the post. I was wondering if I am doing something wrong because when I try and take a 35GB compress gz file and first create a RAW table, then try and insert that into a sequentialfile table since I understand I can’t use ORC for GZ files, it seems to take hours, I had to kill the INSERT job. This was on a 6 data node cluster. How much did it take to do your insert of 300GB+ of data into the ORC table?

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s