How to Fix Small File issues in ORC formatted Hive Tables?

Merging all small ORC Files under a partition of testdb.vehicletable table.
Before Merge:
SELECT COUNT(*) FROM testdb.vehicletable WHERE LOAD_YEAR=2015 AND LOAD_MONTH=9 AND LOAD_DATE=9;
673
drwxr-xr-x - svcojob svchdfs 0 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9
-rw-r--r-- 3 svcojob svchdfs 2377 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000000_0
-rw-r--r-- 3 svcojob svchdfs 2455 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000001_0
-rw-r--r-- 3 svcojob svchdfs 2228 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000002_0
-rw-r--r-- 3 svcojob svchdfs 1892 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000003_0
-rw-r--r-- 3 svcojob svchdfs 1886 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000004_0
-rw-r--r-- 3 svcojob svchdfs 1978 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000005_0
-rw-r--r-- 3 svcojob svchdfs 1423 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000006_0
-rw-r--r-- 3 svcojob svchdfs 1694 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000007_0
-rw-r--r-- 3 svcojob svchdfs 1730 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000008_0
-rw-r--r-- 3 svcojob svchdfs 1723 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000009_0
-rw-r--r-- 3 svcojob svchdfs 1967 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000010_0
-rw-r--r-- 3 svcojob svchdfs 2149 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000011_0
-rw-r--r-- 3 svcojob svchdfs 2732 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000012_0
Alter Table using Concatentate:
ALTER TABLE testdb.vehicletable PARTITION (load_year=2015,load_month=9,load_date=9) CONCATENATE;
After Merge:
bash-4.1$ hdfs dfs -ls -R /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9
-rw-r--r-- 3 svclndg svchdfs 6349 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000000_0
-rw-r--r-- 3 svclndg svchdfs 5669 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000001_0
-rw-r--r-- 3 svclndg svchdfs 5055 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000002_0
-rw-r--r-- 3 svclndg svchdfs 3509 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000003_0
-rw-r--r-- 3 svclndg svchdfs 2791 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000004_0
SELECT COUNT(*) FROM testdb.vehicletable WHERE LOAD_YEAR=2015 AND LOAD_MONTH=9 AND LOAD_DATE=9;
673
Count looks good and # of small files went down from 13 to 5 which is a significant improvement. Am still seeing how can we reduce the # of files further.
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s