How to create Hive Table Creation and Insert Scripts (hqls) Automatically in < 5 Minutes for around 1000 RDBMS Tables using Python

Objective:
Creating Hive tables is really an easy task. But when you really want to create 1000 of tables in Hive based on the Source RDBMS tables and it’s data types think about the Development Scripts Creation and Execution.

Pre-requisites:

Good to have Python/Java Knowledge
Knowledge of Hive Internal and External Tables

Step 1:
Get the list of tables from a RDBMS database you would like to access. In this blog, I have explained based on a Oracle info. Execute the below query in Oracle and export the result
in a spreadsheet or text file. Get the Tables details as below.

SCHEMA_NAME.TABLE_NAME
Ex.
TESTDB.TABLE1
TESTDB.TABLE2
TESTDB.TABLE3
.
.
select distinct owner,table_name from all_tab_columns where owner=’CLAIM_CENTER’

Another Side, get the Distinct Types thought out the Database.
select distinct DATA_TYPE from all_tab_columns where data_type not like ‘%$%’;

Now, tables list is ready. Pass this tables list in the below python program. My program reads the metadata of all tables and create Hive Scripts for your tables according to the Hive Data Types

Step 2:

Execute the Below Python script. Make sure you have all the Pyodbc, cx_Oracle packages installed.  Change the connection parameters, db info, etc according to your needs.

import cx_Oracle
#Declare Input and Output Files
tableslist="C:\\RnD\\Tableslist.txt"
createHiveStmt="C:\\RnD\\createHiveORCStmt.txt"
createHiveORCingestion="C:\\RnD\\createHiveORCingestionStmt.txt"
#Declare Connection Parameter here
connstr='username/password@servername:port#/DB#'
#connstr='pasreadonly/pasread4devp1@N01DOL424.tent.trt.csaa.pri/PASDEVP1'
conn = cx_Oracle.connect(connstr)
cur = conn.cursor()
#Define the Input and Output Files here
input_file = open(tableslist,'r')
createHiveStmt_file=open(createHiveStmt,'w')
HiveORCingestion_file=open(createHiveORCingestion,'w')
#Decleare arraylist for getting ColumnNames and Types as well as Column Names
getColValTypes = []
getColNames=[]
#Define the Schema Details
hiveAVROSchema="AVRODB."
hiveORCSchema="ORCDB."
#Define table Creation Scripts (defaults)
tablecreation=["CREATE EXTERNAL TABLE"," PARTITIONED BY (load_year INT,load_month INT,load_date INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY "," STORED AS ORC;"]
ingestionStmt = ["SET tez.queue.name=default;SET hive.exec.dynamic.partition = true;SET hive.exec.dynamic.partition.mode=nonstrict;SET hive.execution.engine=tez;SET mapreduce.framework.name=yarn-tez;SET hive.exec.max.dynamic.partitions=100000;SET hive.exec.max.dynamic.partitions.pernode=100000;","INSERT OVERWRITE TABLE ","PARTITION (LOAD_YEAR,LOAD_MONTH,LOAD_DATE) ",",LOAD_YEAR,LOAD_MONTH,LOAD_DATE from "]
for tableName in input_file:
 sqlStmt='SELECT * FROM {usertable} where rownum < 2'.format(usertable=tableName)
 print(tableName)
 a=cur.execute(sqlStmt)
 tableSchema=cur.description
 for i in tableSchema:
 columnName=str(i[0])
 columnType=str(i[1])
 if 'NUMBER' in columnType:
 columnType="INT"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'STRING' in columnType:
 columnType="STRING"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'FIXED_CHAR' in columnType:
 columnType="STRING"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'TIMESTAMP' in columnType:
 columnType="TIMESTAMP"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'CLOB' in columnType:
 columnType="STRING"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'BLOB' in columnType:
 columnType="STRING"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'VARCHAR' in columnType:
 columnType="STRING"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'CHAR' in columnType:
 columnType="STRING"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'DATE' in columnType:
 columnType="DATE"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'RAW' in columnType:
 columnType="STRING"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'FLOAT' in columnType:
 columnType="DECIMAL"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 elif 'LONG' in columnType:
 columnType="DOUBLE"
 getColValTypes.append(columnName+' '+columnType)
 getColNames.append(columnName)
 else:
 getColValTypes.append(columnName+' '+"STRING")
 getColNames.append(columnName)
 tableName=tableName.split(sep='.')
 createHiveStmt_file.write(tablecreation[0]+' '+hiveORCSchema+tableName[1]+'('+','.join(getColValTypes)+')'+tablecreation[1]+r"'01'"+tablecreation[2]+';'+'\n'+'\n')
 HiveORCingestion_file.write(ingestionStmt[0]+' '+ingestionStmt[1]+hiveORCSchema+tableName[1]+ingestionStmt[2]+'SELECT '+','.join(getColNames)+ingestionStmt[3]+hiveAVROSchema+tableName[1]+';'+'\n'+'\n')
 getColValTypes=[]
 getColNames=[]
createHiveStmt_file.close()
HiveORCingestion_file.close()
cur.close()
conn.close()

This way, I was able to generate Hive Scripts with automatic data type mapping for 1000 tables in < 5 minutes.

Sample Scripts I have attached.    PyConnectOracle

Post for any queries.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s