How to Execute a Hive SQL File in the Spark Engine?

Say you have a *.hql, *.sql, or *.q file for your jobs. With the help of Spark SQL, you can execute it on the Spark engine. You need a Spark installation on your local machine or on YARN, and the Hive metastore connection configured accordingly (for example alongside your spark-defaults.conf).
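The walkthrough below assumes the metastore connection is already in place. As a rough sketch (the host name is only a placeholder; in Spark 1.x this property usually lives in a hive-site.xml placed in the Spark conf directory alongside spark-defaults.conf), the pointer to the metastore looks something like:

<configuration>
  <property>
    <!-- Thrift URI of the Hive metastore; hostname:9083 is only an example -->
    <name>hive.metastore.uris</name>
    <value>thrift://hostname:9083</value>
  </property>
</configuration>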

I have an HQL file on my local filesystem, as shown below.

$ cat /app/hadoop/workspace/horf/hql_in_spark.hql

SET spark.sql.autoBroadcastJoinThreshold=20485760;
SET spark.sql.shuffle.partitions=15;

drop table if exists txn;
create table txn as select status,claimid,txoptype,exposureid,subtype,costtype,costcategory from testdb.txn_rnk;
select count(*) from txn;
select * from txn limit 5;
cache table txn;
select count(*) from txn;
select * from txn limit 5;

Now enter this launch command in your shell:

$ spark-sql --master yarn-client --conf spark.ui.port=40445 --executor-memory 15g --hiveconf load_date=`date +%Y-%m-%d` --driver-memory 10g --queue default --num-executors 20 --conf spark.yarn.executor.memoryOverhead=4096 -i /app/hadoop/workspace/horf/hql_in_spark.hql
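Here -i tells spark-sql to run the file as an initialization script and then keep the shell open for further queries. If you only want to execute the file and exit, the -f option can be used instead; a minimal variant, with the tuning flags trimmed for brevity:

$ spark-sql --master yarn-client --queue default \
    -f /app/hadoop/workspace/horf/hql_in_spark.hql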

Parameters Info:

spark-sql         : launches the Spark SQL shell (mandatory)
--master          : execution mode for the Spark SQL / Hive queries, e.g. local or YARN (mandatory)
--executor-memory : amount of memory allotted to each executor container (optional)
--executor-cores  : number of cores utilized by each executor container (optional)
--queue           : Resource Manager queue to which the job is submitted (optional)
--conf            : additional configuration parameters, e.g. YARN or memory settings (optional)
--hiveconf        : passes variables for parameterized Hive queries (optional); see the example after this list
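For instance, the load_date value passed with --hiveconf in the launch command could be referenced inside the script through Hive-style variable substitution; the load_dt column below is purely illustrative:

-- '${hiveconf:load_date}' is replaced with the value supplied via
-- --hiveconf load_date=... on the spark-sql command line
select count(*)
from testdb.txn_rnk
where load_dt = '${hiveconf:load_date}';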

Now, let's experiment a bit in Spark SQL.

Launch command for spark-sql:

spark-sql --master yarn-client --conf spark.ui.port=40445 --executor-memory 15g --hiveconf load_date=`date +%Y-%m-%d` --driver-memory 10g --queue default --num-executors 20 --conf spark.yarn.executor.memoryOverhead=4096

What happens after you enter this command in your shell:

The Spark context is launched, and the Spark SQL engine connects to the Hive metastore that was configured as part of the spark-defaults.conf file.

16/03/11 20:37:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
16/03/11 20:37:47 INFO metastore: Trying to connect to metastore with URI thrift://hostname:9083
16/03/11 20:37:47 INFO metastore: Connected to metastore.
16/03/11 20:37:47 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/03/11 20:37:47 INFO SessionState: Created local directory: /tmp/a73b9df9-cad3-4a16-bbaf-9c98efec9e77_resources
16/03/11 20:37:47 INFO SessionState: Created HDFS directory: /tmp/hive/xgfp2ram/a73b9df9-cad3-4a16-bbaf-9c98efec9e77
16/03/11 20:37:47 INFO SessionState: Created local directory: /tmp/xgfp2ram/a73b9df9-cad3-4a16-bbaf-9c98efec9e77
16/03/11 20:37:47 INFO SessionState: Created HDFS directory: /tmp/hive/xgfp2ram/a73b9df9-cad3-4a16-bbaf-9c98efec9e77/_tmp_space.db
16/03/11 20:37:47 INFO SparkContext: Running Spark version 1.5.2
16/03/11 20:37:48 INFO Slf4jLogger: Slf4jLogger started
16/03/11 20:37:48 INFO Remoting: Starting remoting

...
16/03/11 20:38:20 INFO SessionState: Created local directory: /tmp/bcb4c098-d5b3-48dd-a140-94f2837ff84a_resources
16/03/11 20:38:20 INFO SessionState: Created HDFS directory: /tmp/hive/xgfp2ram/bcb4c098-d5b3-48dd-a140-94f2837ff84a
16/03/11 20:38:20 INFO SessionState: Created local directory: /tmp/xgfp2ram/bcb4c098-d5b3-48dd-a140-94f2837ff84a
16/03/11 20:38:20 INFO SessionState: Created HDFS directory: /tmp/hive/xgfp2ram/bcb4c098-d5b3-48dd-a140-94f2837ff84a/_tmp_space.db
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1

Note: > represents the spark-sql shell prompt, which is now open for executing your SQL or HiveQL queries.
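A quick way to confirm that the shell is really talking to the metastore is to list the databases and tables it can see (the output depends entirely on your cluster):

>show databases;
>show tables in testdb;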

Some Spark SQL configurations you can set to control in-memory (broadcast) joins and reducer allocation: spark.sql.autoBroadcastJoinThreshold is the maximum size, in bytes, of a table that will be broadcast to all executors for a join (roughly 20 MB here), and spark.sql.shuffle.partitions is the number of partitions used when shuffling data for joins and aggregations.

>SET spark.sql.autoBroadcastJoinThreshold=20485760;
>SET spark.sql.shuffle.partitions=15;
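With the threshold above, a join between the large txn table and a small lookup table would be planned as a broadcast (map-side) join rather than a shuffle join; the dim_costtype table below is purely hypothetical:

-- dim_costtype is a hypothetical lookup table; if its size is under
-- spark.sql.autoBroadcastJoinThreshold, Spark ships a copy to every executor
-- and avoids shuffling the large txn table across the network
select t.claimid, t.costtype, d.costtype_desc
from txn t
join dim_costtype d
on t.costtype = d.costtype;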

CREATE a Hive table and See the Results:

>drop table if exists txn;
16/03/11 20:30:37 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@1441922b
Time taken: 0.05 seconds
16/03/11 20:30:37 INFO CliDriver: Time taken: 0.05 seconds

>create table txn as select status,claimid,txoptype,exposureid,subtype,costtype,costcategory from casstg.cc_transaction_rnk;
16/03/11 20:30:59 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@1052a1e
Time taken: 12.051 seconds
16/03/11 20:30:59 INFO CliDriver: Time taken: 12.051 seconds

>select count(*) from txn;
16/03/11 20:31:06 INFO StatsReportListener: task runtime:(count: 1, mean: 93.000000, stdev: 0.000000, max: 93.000000, min: 93.000000)
19045363
16/03/11 20:31:06 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
16/03/11 20:31:06 INFO StatsReportListener:     93.0 ms 93.0 ms 93.0 ms 93.0 ms 93.0 ms 93.0 ms 93.0 ms 93.0 ms 93.0 ms
Time taken: 0.687 seconds, Fetched 1 row(s)

>select * from txn limit 5;
Time taken: 0.301 seconds, Fetched 5 row(s)
16/03/11 20:33:33 INFO CliDriver: Time taken: 0.301 seconds, Fetched 5 row(s)

CACHE the Hive Table and See the Results:

>cache table txn;
16/03/11 20:34:14 INFO YarnScheduler: Removed TaskSet 22.0, whose tasks have all completed, from pool
16/03/11 20:34:14 INFO DAGScheduler: ResultStage 22 (processCmd at CliDriver.java:376) finished in 0.012 s
16/03/11 20:34:14 INFO DAGScheduler: Job 18 finished: processCmd at CliDriver.java:376, took 0.018106 s
16/03/11 20:34:14 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@6b399681
Time taken: 5.539 seconds
16/03/11 20:34:14 INFO CliDriver: Time taken: 5.539 seconds

>select count(*) from txn;
16/03/11 20:34:41 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@2f65e403
19045363
Time taken: 0.318 seconds, Fetched 1 row(s)

>select * from txn limit 5;
16/03/11 20:34:56 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@5438d40d
Time taken: 0.056 seconds, Fetched 5 row(s)
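Once the comparison is done, the in-memory copy can be released and the scratch table dropped (an optional cleanup step, not part of the session captured above):

uncache table txn;
drop table if exists txn;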

With the help of Spark SQL, you can see a huge performance difference when your use case involves large table joins or complex ETL.

I have uploaded this as documentation: HiveQL on Spark.

