How to Resolve User class threw exception: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory in Spark

16/02/24 10:56:36 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:399)
at org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:867)
at org.apache.hadoop.hive.ql.session.SessionState.getUserFromAuthenticator(SessionState.java:589)
at org.apache.hadoop.hive.ql.metadata.Table.getEmptyTable(Table.java:174)
at org.apache.hadoop.hive.ql.metadata.Table.<init>(Table.java:116)
at org.apache.spark.sql.hive.client.ClientWrapper.org$apache$spark$sql$hive$client$ClientWrapper$$toQlTable(ClientWrapper.scala:238)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getAllPartitions$1.apply(ClientWrapper.scala:298)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getAllPartitions$1.apply(ClientWrapper.scala:297)

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthorizeProviderManager(HiveUtils.java:376)
at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:381)
… 79 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthorizeProviderManager(HiveUtils.java:366)
… 80 more

How did I resolve this?

It looks like the HiveContext in Spark 1.4.1 was pointing at Hive metastore version 0.13.0, and the SQLStdConfOnlyAuthorizerFactory class appears to have been added only in Hive 0.14.0. So I changed the property below back to a 0.13-compatible default value in the hive-site.xml that was on my classpath.
<property>
  <name>hive.security.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider</value>
</property>


How to Resolve ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "5s" in Spark Submit

I ran into another issue, shown below, while submitting my Spark job to the YARN cluster.

ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "5s"
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: "5s"
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:117)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:165)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:163)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:360)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:205)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:204)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:204)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:72)
at com.csaa.horf.initialLoad.CasDS.CasDataStore$.main(CasDataStore.scala:32)
at com.csaa.horf.initialLoad.CasDS.CasDataStore.main(CasDataStore.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:486)
Caused by: java.lang.NumberFormatException: For input string: "5s"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at org.apache.hadoop.conf.Configuration.getInt(Configuration.java:1258)
at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1211)
at org.apache.hadoop.hive.conf.HiveConf.getIntVar(HiveConf.java:1220)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:58)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
… 19 more

How did I resolve this?

Modifying the hive-site.xml file resolved this issue. I changed the time-suffixed values to plain numbers, as shown below.

<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>5s</value>
</property>

to

<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>5</value>
</property>

and

<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1800s</value>
</property>

to

<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1800</value>
</property>

After modifying hive-site.xml, I added the updated file to the Spark classpath with --files and passed the DataNucleus jars as a single comma-separated --jars list, as below.

spark-submit --master yarn-cluster --class com.test.package.program --jars /usr/hdp/2.3.2.0-2950/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/2.3.2.0-2950/spark/lib/datanucleus-core-3.2.10.jar,/usr/hdp/2.3.2.0-2950/spark/lib/datanucleus-rdbms-3.2.9.jar --files /user/giri/sparksubmit/hive-site.xml application-jar application.properties

How to Resolve ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient in Spark-Submit

I ran into this exception today when I submitted my Spark job in yarn-cluster mode.

ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:117)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:165)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:163)
at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:360)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:205)
at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:204)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:204)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:72)
at com.csaa.horf.initialLoad.CasDS.CasDataStore$.main(CasDataStore.scala:32)
at com.csaa.horf.initialLoad.CasDS.CasDataStore.main(CasDataStore.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:486)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1412)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:62)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:72)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
… 19 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
… 24 more
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:497)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:475)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:523)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:397)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:356)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:54)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4944)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:171)
… 29 more
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018)
at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
… 48 more

How did I resolve it?

Adding the DataNucleus jars and the hive-site.xml file to the Spark classpath (the jars as a single comma-separated --jars list, hive-site.xml via --files) resolved this issue.

spark-submit --master yarn-cluster --class com.test.package.program --jars /usr/hdp/2.3.2.0-2950/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/2.3.2.0-2950/spark/lib/datanucleus-core-3.2.10.jar,/usr/hdp/2.3.2.0-2950/spark/lib/datanucleus-rdbms-3.2.9.jar --files /etc/hive/conf/hive-site.xml application-jar application.properties
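If the exception persists even with the jars on the command line, a quick way to confirm whether the DataNucleus JDO implementation actually made it onto the driver classpath is a plain Class.forName probe early in the user class, before the HiveContext is created. A small sketch (illustrative only):

// Sketch: probe for the JDO implementation class that the metastore client needs.
val jdoImpl = "org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
try {
  Class.forName(jdoImpl)
  println(s"$jdoImpl found on the classpath")
} catch {
  case _: ClassNotFoundException =>
    println(s"$jdoImpl missing - expect 'Unable to instantiate HiveMetaStoreClient'")
}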

 

 

How to Fix Small File issues in ORC formatted Hive Tables?

Let's merge all the small ORC files under one partition of the testdb.vehicletable table.
Before Merge:
SELECT COUNT(*) FROM testdb.vehicletable WHERE LOAD_YEAR=2015 AND LOAD_MONTH=9 AND LOAD_DATE=9;
673
drwxr-xr-x - svcojob svchdfs 0 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9
-rw-r--r-- 3 svcojob svchdfs 2377 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000000_0
-rw-r--r-- 3 svcojob svchdfs 2455 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000001_0
-rw-r--r-- 3 svcojob svchdfs 2228 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000002_0
-rw-r--r-- 3 svcojob svchdfs 1892 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000003_0
-rw-r--r-- 3 svcojob svchdfs 1886 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000004_0
-rw-r--r-- 3 svcojob svchdfs 1978 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000005_0
-rw-r--r-- 3 svcojob svchdfs 1423 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000006_0
-rw-r--r-- 3 svcojob svchdfs 1694 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000007_0
-rw-r--r-- 3 svcojob svchdfs 1730 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000008_0
-rw-r--r-- 3 svcojob svchdfs 1723 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000009_0
-rw-r--r-- 3 svcojob svchdfs 1967 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000010_0
-rw-r--r-- 3 svcojob svchdfs 2149 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000011_0
-rw-r--r-- 3 svcojob svchdfs 2732 2015-09-10 01:56 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000012_0
Alter the table using CONCATENATE:
ALTER TABLE testdb.vehicletable PARTITION (load_year=2015,load_month=9,load_date=9) CONCATENATE;
After Merge:
bash-4.1$ hdfs dfs -ls -R /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9
-rw-r--r-- 3 svclndg svchdfs 6349 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000000_0
-rw-r--r-- 3 svclndg svchdfs 5669 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000001_0
-rw-r--r-- 3 svclndg svchdfs 5055 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000002_0
-rw-r--r-- 3 svclndg svchdfs 3509 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000003_0
-rw-r--r-- 3 svclndg svchdfs 2791 2015-09-22 15:13 /user/giri/vehicletable/load_year=2015/load_month=9/load_date=9/000004_0
SELECT COUNT(*) FROM testdb.vehicletable WHERE LOAD_YEAR=2015 AND LOAD_MONTH=9 AND LOAD_DATE=9;
673
The count looks good, and the number of small files went down from 13 to 5, which is a significant improvement. I am still looking at how to reduce the number of files further.
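One possible way to push the count lower is to rewrite the partition through Spark, coalescing it down to a single task before writing it back as ORC; as far as I understand, CONCATENATE merges at the ORC stripe level, which is why a handful of files can remain. A rough, untested sketch against the Spark 1.4/1.5 DataFrame API (the staging path is illustrative):

// Rough sketch: compact one partition into a single ORC file in a staging path.
// Note that select * also materialises the partition columns, so either drop them
// before writing or load the result back through an INSERT rather than copying the
// files straight into the partition directory.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc) // sc: existing SparkContext
val partitionDF = hiveContext.sql(
  "select * from testdb.vehicletable where load_year=2015 and load_month=9 and load_date=9")
partitionDF.coalesce(1)
  .write
  .format("orc")
  .save("/user/giri/vehicletable_staging/load_year=2015/load_month=9/load_date=9")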

How to Read ElasticSearch Index and Load the Data into Hive Table Through Spark SQL DataFrames?

You may have already seen my earlier post on ES-Hive integration: Creating a Hive Table by Reading an Elasticsearch Index through Hive Queries.

Let's see how to read data loaded into an Elasticsearch index through Spark SQL DataFrames and load it into a Hive table. In production environments, we can keep the data in Elasticsearch and Hive in sync this way. I have explained this in a few steps below.

After reading this, you will see how easily we can sync ES and Hive data through the Spark DataFrame API, without needing much schema definition to load the data.

Step 1: Create an ES index

Step 2: Add field mappings to the ES index

Step 3: Load 10 records into the index

Step 4: Write Scala code to perform the ES-to-Hive processing

Step 5: Display the records in the Hive table

Step 1: Create an ES index. I have used players as the index name and cricket as the index type.

PUT /players
{
 "settings": {
 "number_of_shards": 1,
 "analysis": {
 "filter": {
 "autocomplete_filter": {
 "type": "edge_ngram",
 "min_gram": 1,
 "max_gram": 30
 }
 },
 "analyzer": {
 "autocomplete": {
 "type": "custom",
 "tokenizer": "standard",
 "filter": [
 "lowercase",
 "autocomplete_filter"
 ]
 },
 "string_lowercase": {
 "tokenizer": "keyword",
 "filter": "lowercase"
 }
 }
 }
 }
}

Step 2: Add field mappings to the ES index

PUT /players/cricket/_mapping
{
 "cricket": {
 "properties": {
 "playerId": {
 "type": "string",
 "index": "not_analyzed"
 },
 "playerFirstName": {
 "type": "string"
 },
 "playerLastName": {
 "type": "string",
 "index": "not_analyzed"
 },
 "playerAverage": {
 "type": "string",
 "index": "not_analyzed"
 },
 "playerCountry": {
 "type": "string",
 "index": "not_analyzed"
 },
 "playerType": {
 "type": "string"
 },
 "playerBest": {
 "type": "string"
 }
 }
 }
}

Step 3: Load the 10 records below into the ES index

PUT /players/cricket/1
{
 "playerId": "1000",
 "playerFirstName": "Sachin",
 "playerLastName": "Tendulkar",
 "playerAverage": "70.4",
 "playerCountry": "India",
 "playerType": "Batsman",
 "playerBest": "205"
}
PUT /players/cricket/2
{
 "playerId": "1001",
 "playerFirstName": "Jonty",
 "playerLastName": "Rhodes",
 "playerAverage": "50",
 "playerCountry": "South Africa",
 "playerType": "Best Filder",
 "playerBest": "121"
}
PUT /players/cricket/3
{
 "playerId": "1002",
 "playerFirstName": "Giri",
 "playerLastName": "Varatharajan",
 "playerAverage": "99",
 "playerCountry": "India",
 "playerType": "BigDataCricketer",
 "playerBest": "400"
}
PUT /players/cricket/4
{
 "playerId": "1003",
 "playerFirstName": "Malaikannan",
 "playerLastName": "Sankara",
 "playerAverage": "110",
 "playerCountry": "India",
 "playerType": "PredictiveModelCricketer",
 "playerBest": "399"
}
PUT /players/cricket/5
{
 "playerId": "1004",
 "playerFirstName": "Mathan",
 "playerLastName": "Pillai",
 "playerAverage": "99.9",
 "playerCountry": "India",
 "playerType": "ProdSupportCricketer",
 "playerBest": "InfiniteValue"
}
PUT /players/cricket/6
{
 "playerId": "1005",
 "playerFirstName": "Singh",
 "playerLastName": "Jaswinder",
 "playerAverage": "99.7",
 "playerCountry": "India",
 "playerType": "SaamaLeadCricketer",
 "playerBest": "199"
}
PUT /players/cricket/7
{
 "playerId": "1006",
 "playerFirstName": "Abhishek",
 "playerLastName": "Peraka",
 "playerAverage": "599",
 "playerCountry": "India",
 "playerType": "BigDataDirectCricketer",
 "playerBest": "299"
}
PUT /players/cricket/8
{
 "playerId": "1007",
 "playerFirstName": "Parthiban",
 "playerLastName": "Part1",
 "playerAverage": "199",
 "playerCountry": "India",
 "playerType": "BigData10yrsCricketer",
 "playerBest": "99.99"
}
PUT /players/cricket/9
{
 "playerId": "1008",
 "playerFirstName": "Rajeev",
 "playerLastName": "Thanga",
 "playerAverage": "189.99",
 "playerCountry": "India",
 "playerType": "BigData1QACricketer",
 "playerBest": "99.100"
}
PUT /players/cricket/10
{
 "playerId": "1009",
 "playerFirstName": "Robin",
 "playerLastName": "Singh",
 "playerAverage": "199.99",
 "playerCountry": "India",
 "playerType": "LeftHand All Rounder",
 "playerBest": "100"
}

After this, I verified that the data is present in the index through the Head plugin:

[Screenshot: the players index viewed in the Elasticsearch Head plugin]

Step 4: Spark SQL DataFrame code for the ES-to-Hive integration

/*Below APIs are based on Spark 1.4.1 */
scala> val esConfig = Map("pushdown" -> "true", "es.nodes" -> "localhost", "es.port" -> "9200")
scala> val readPlayerESindex = sqlContext.read.format("org.elasticsearch.spark.sql").options(esConfig).load("players/cricket")
scala> readPlayerESindex.printSchema()
/*
root
 |-- playerAverage: string (nullable = true)
 |-- playerBest: string (nullable = true)
 |-- playerCountry: string (nullable = true)
 |-- playerFirstName: string (nullable = true)
 |-- playerId: string (nullable = true)
 |-- playerLastName: string (nullable = true)
 |-- playerType: string (nullable = true) */
scala> readPlayerESindex.take(10).foreach(println)
/*
16/01/30 19:09:08 INFO DAGScheduler: ResultStage 0 (take at <console>:24) finished in 1.210 s
16/01/30 19:09:08 INFO DAGScheduler: Job 0 finished: take at <console>:24, took 1.415645 s
[70.4,205,India,Sachin,1000,Tendulkar,Batsman]
[50,121,South Africa,Jonty,1001,Rhodes,Best Filder]
[99,400,India,Giri,1002,Varatharajan,BigDataCricketer]
[110,399,India,Malaikannan,1003,Sankara,PredictiveModelCricketer]
[99.9,InfiniteValue,India,Mathan,1004,Pillai,ProdSupportCricketer]
[99.7,199,India,Singh,1005,Jaswinder,SaamaLeadCricketer]
[599,299,India,Abhishek,1006,Peraka,BigDataDirectCricketer]
[199,99.99,India,Parthiban,1007,Part1,BigData10yrsCricketer]
[189.99,99.100,India,Rajeev,1008,Thanga,BigData1QACricketer]
[199.99,100,India,Robin,1009,Singh,LeftHand All Rounder] */
scala> readPlayerESindex.registerTempTable("players")
scala> sqlContext.sql("create table imdp.playersESHiveParq (playerAverage string,playerBest string,playerCountry string,playerFirstName string,playerId string,playerLastName string,playerType string) row format delimited fields terminated by ',' stored as parquet")
/*
res3: org.apache.spark.sql.DataFrame = [result: string] */
//I ran into permission issues, so I handled it a different way below. If I had had the proper permissions to write into the warehouse, I could simply have used:
//sqlContext.sql("insert overwrite table imdp.playersESHiveParq select playerAverage, playerBest, playerCountry, playerFirstName, playerId, playerLastName, playerType from players")
//Caused by: MetaException(message:java.security.AccessControlException: Permission denied: user=hive, access=WRITE, inode="/apps/hive/warehouse/playerseshive":xgfp2ram:hdfs:drwxr-xr-x
//Another easy way to store the ES data in a Hive table (the table must already exist before calling this API):
//readPlayerESindex.saveAsTable("imdp.playersESHiveParq")
scala> readPlayerESindex.write.parquet("/user/horf/playerESHiveParq")
scala> sqlContext.sql("load data inpath '/user/horf/playerESHiveParq' into table imdp.playersESHiveParq")

Step 5: View the Final data through the Hive Table

scala> sqlContext.sql("select * from imdp.playersESHiveParq").take(10).foreach(println)
/*Output from Hive Table
[70.4,205,India,Sachin,1000,Tendulkar,Batsman]
[50,121,South Africa,Jonty,1001,Rhodes,Best Filder]
[99,400,India,Giri,1002,Varatharajan,BigDataCricketer]
[110,399,India,Malaikannan,1003,Sankara,PredictiveModelCricketer]
[99.9,InfiniteValue,India,Mathan,1004,Pillai,ProdSupportCricketer]
[99.7,199,India,Singh,1005,Jaswinder,SaamaLeadCricketer]
[599,299,India,Abhishek,1006,Peraka,BigDataDirectCricketer]
[199,99.99,India,Parthiban,1007,Part1,BigData10yrsCricketer]
[189.99,99.100,India,Rajeev,1008,Thanga,BigData1QACricketer]
[199.99,100,India,Robin,1009,Singh,LeftHand All Rounder] */

/*Use this type of spark-shell command:

spark-shell --master yarn-client --jars /app/data/workspace/elasticsearch-spark_2.10-2.1.1.jar --executor-cores 1 --executor-memory 15g --driver-memory 20g --queue default --num-executors 30

*/
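For a compiled job instead of spark-shell, the same ES connection settings can be set on the SparkConf up front. Below is a minimal sketch using the Spark 1.4/1.5 and elasticsearch-spark_2.10 APIs from this post; the object name is illustrative, and the host, port, and paths are the same assumptions as above. The build.sbt that follows is what such a job would be compiled against.

// Sketch of a standalone version of Step 4 (names and paths are illustrative).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object EsToHiveLoad {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EsToHiveLoad")
      .set("es.nodes", "localhost")
      .set("es.port", "9200")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    val readPlayerESindex = sqlContext.read
      .format("org.elasticsearch.spark.sql")
      .option("pushdown", "true")
      .load("players/cricket")

    // Same staging-path approach as in Step 4.
    readPlayerESindex.write.parquet("/user/horf/playerESHiveParq")
    sqlContext.sql("load data inpath '/user/horf/playerESHiveParq' into table imdp.playersESHiveParq")

    sc.stop()
  }
}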

Download the jar file from mvnrepository, or use the build.sbt file below.

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.5.2",
  "org.apache.spark" % "spark-sql_2.10" % "1.5.2",
  "org.apache.hadoop" % "hadoop-common" % "2.6.0",
  "org.apache.spark" % "spark-sql_2.10" % "1.5.2",
  "org.apache.spark" % "spark-hive_2.10" % "1.5.2",
  "org.apache.spark" % "spark-yarn_2.10" % "1.5.2",
  "org.apache.spark" % "spark-streaming_2.10" % "1.5.2",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.5.2",
  "com.fasterxml.jackson.core" % "jackson-core" % "2.6.1",
  "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.1",
  "org.elasticsearch" % "elasticsearch" % "1.7.2",
  "net.sf.opencsv" % "opencsv" % "2.3",
  "org.apache.hadoop" % "hadoop-streaming" % "1.0.4"
)


How to convert Julian Date to Calendar Date in Hive?

Many legacy systems store dates in Julian format. I came across the problem of converting these to calendar dates in Hive. Please see the solution below.

As long as the date is stored in yyyyDDD format, this should work.

You can cross-check the dates against the calendar here; they match.

http://landweb.nascom.nasa.gov/browse/calendar.html

Solution in Hive:

hive> select report_date,substr(from_unixtime(unix_timestamp(cast(cast(report_date as int) as string),'yyyyDDD')),1,10) from datetable limit 20;
OK
2012142       2012-05-21
2012101       2012-04-10
2012042       2012-02-11
2012004       2012-01-04
2011299       2011-10-26
2011251       2011-09-08
2011145       2011-05-25
2011021       2011-01-21
2014076       2014-03-17

Time taken: 0.251 seconds, Fetched:

Java Code for this:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class JulianTest
{
    public static void main(String args[]) throws ParseException {
        String julianDt = "2014076";
        // Declare the input date format; yyyyDDD is the Julian (year + day-of-year) pattern
        SimpleDateFormat sdf = new SimpleDateFormat("yyyyDDD");
        SimpleDateFormat sdfIso = new SimpleDateFormat("yyyy-MM-dd");
        Date dt = sdf.parse(julianDt);
        System.out.println("Julian Date: " + julianDt + " in ISO date format :" + sdfIso.format(dt));
    }
}

Java Output:

Julian Date: 2014076 in ISO date format :2014-03-17

Process finished with exit code 0
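The same conversion also runs unchanged through a HiveContext in Spark SQL, since unix_timestamp and from_unixtime are Hive built-ins. A quick sketch, assuming sqlContext is a HiveContext and the datetable from the Hive example is visible to it:

// Sketch: the Hive expression above, issued through Spark SQL.
val julianToCal = sqlContext.sql(
  "select report_date, " +
  "substr(from_unixtime(unix_timestamp(cast(cast(report_date as int) as string), 'yyyyDDD')), 1, 10) as calendar_date " +
  "from datetable limit 20")
julianToCal.show()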

How to Sqoop Export a Hive ORC Table to an Oracle Database?

I hope you have some idea about Hive ORC tables. Let's create an ORC table here and feed it some sample data in Hive. In the meantime, let's also create the corresponding table structure in Oracle.

We are going to use Sqoop-HCatalog integration here. Just run "sqoop export --help" in Bash to see all the Sqoop export parameters related to HCatalog. I got the details below.

HCatalog arguments:
--hcatalog-database <arg>                   HCatalog database name
--hcatalog-home <hdir>                      Override $HCAT_HOME
--hcatalog-table <arg>                      HCatalog table name
--hive-home <dir>                           Override $HIVE_HOME
--hive-partition-key <partition-key>        Sets the partition key to use when importing to hive
--hive-partition-value <partition-value>    Sets the partition value to use when importing to hive
--map-column-hive <arg>                     Override mapping for specific column to hive types.

Now see the sample code snippets below for a very simple Sqoop Export Process.

Step 1:
Create a Hive table in text format. You can create this in any format, such as Avro, RCFile, etc.
create table customers_txt (customer_id string, customer_name string, city string) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 0.611 seconds

Step 2:
Create a Hive table in ORC format
create table customers(customer_id string, customer_name string, city string) row format delimited fields terminated by ',' stored as orc;
OK
Time taken: 0.594 seconds

Step 3:
Create some test data here
bash-4.1$ vi customer.txt
1 giri nellai
2 giri town
3 giri pheonix
4 parthi glendale
5 ram glendale
6 goutam campbell
:wq!

Step 4:
Load the test data into the text-formatted table
hive> load data local inpath '/tmp/customer.txt' into table customers_txt;
Copying data from file:/tmp/customer.txt
Copying file: file:/tmp/customer.txt
Loading data to table default.customers_txt
Table default.customers_txt stats: [numFiles=1, numRows=0, totalSize=92, rawDataSize=0]
OK
Time taken: 1.301 seconds

Verify the data
hive> select * from customers_txt;
OK
1 giri nellai
2 giri town
3 giri pheonix
4 parthi glendale
5 ram glendale
6 goutam campbell
Time taken: 0.456 seconds, Fetched: 6 row(s)

Step 5: Insert the data into the ORC table:
insert overwrite table customers select * from customers_txt;

Step 6:
Execute the Sqoop export command below
sqoop export --connect jdbc:oracle:thin:@Servername:1521/dbName --username *********** --password ********* --table dbName.CUSTOMERS --hcatalog-table customers
Verify the Sqoop Job Output:
15/09/08 17:02:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1438142065989_98389
15/09/08 17:02:27 INFO impl.YarnClientImpl: Submitted application application_1438142065989_98389
15/09/08 17:02:27 INFO mapreduce.Job: The url to track the job: http://n01bdl303.aap.csaa.pri:8088/proxy/application_1438142065989_98389/
15/09/08 17:02:27 INFO mapreduce.Job: Running job: job_1438142065989_98389
15/09/08 17:02:38 INFO mapreduce.Job: Job job_1438142065989_98389 running in uber mode : false
15/09/08 17:02:38 INFO mapreduce.Job: map 0% reduce 0%
15/09/08 17:02:46 INFO mapreduce.Job: map 100% reduce 0%
15/09/08 17:02:46 INFO mapreduce.Job: Job job_1438142065989_98389 completed successfully
15/09/08 17:02:46 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=269257
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=10756
HDFS: Number of bytes written=0
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5338
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=5338
Total vcore-seconds taken by all map tasks=5338
Total megabyte-seconds taken by all map tasks=24298576
Map-Reduce Framework
Map input records=6
Map output records=6
Input split bytes=10112
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=41
CPU time spent (ms)=1740
Physical memory (bytes) snapshot=337338368
Virtual memory (bytes) snapshot=8627503104
Total committed heap usage (bytes)=2070413312
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
15/09/08 17:02:46 INFO mapreduce.ExportJobBase: Transferred 10.5039 KB in 30.9629 seconds (347.3836 bytes/sec)
15/09/08 17:02:46 INFO mapreduce.ExportJobBase: Exported 6 records.

Oracle Output:

[Screenshot: Oracle query output showing the exported rows]

Debugging:

Below are some of the errors you might face and the corresponding troubleshooting steps.

Root cause 1: The table name does not match the name in Oracle exactly (e.g. the Oracle table is CUSTOMERS in uppercase, but the sqoop export uses customers in lowercase).

Note: use the table name with exactly the same case as in Oracle; otherwise Sqoop export throws the error below. The failing command here uses the lowercase customers, while the working command earlier uses the uppercase CUSTOMERS, matching Oracle.

sqoop export --connect jdbc:oracle:thin:@Servername:1521/dbName --username ******** --password ******** --table dbName.customers --hcatalog-table customers

15/09/08 16:49:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM customers t WHERE 1=0
15/09/08 16:49:23 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.IllegalArgumentException: Attempted to generate class with no columns!
java.lang.IllegalArgumentException: Attempted to generate class with no columns!
at org.apache.sqoop.orm.ClassWriter.generateClassForColumns(ClassWriter.java:1316)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1197)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:222)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:231)
at org.apache.sqoop.Sqoop.main(Sqoop.java:240)

Root cause 2: The table does not exist in Oracle. You have to create the table structure in Oracle before the Sqoop export.

bash-4.1$ sqoop export --connect jdbc:oracle:thin:@Servername:1521/dbName --username ******** --password ******** --table dbName.CUSTOMERS1 --hcatalog-table custom
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/09/08 17:20:45 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4.2.1.5.0-695
15/09/08 17:20:45 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/09/08 17:20:46 INFO manager.SqlManager: Using default fetchSize of 1000
15/09/08 17:20:46 INFO tool.CodeGenTool: Beginning code generation
15/09/08 17:20:47 INFO manager.OracleManager: Time zone has been set to GMT
15/09/08 17:20:47 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM EADS_SIS_HADOOP.CUSTOMERS1 t WHERE 1=0
15/09/08 17:20:47 ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: ORA-00942: table or view does not exist