How to Set Up Your First Spark/Scala Project in the IntelliJ IDE?

Step 1:
Install the Scala/SBT plugin in your IntelliJ IDE (File --> Settings --> Plugins).

Step 2:
New Project --> Scala --> SBT

Step 3:
Select SBT Version and Scala Version


Wait until the entire project structure (src, target, and the other folders) is established after you create the project.


Step 4:
Create an assembly.sbt file under the project folder. You will want this plugin to create an uber JAR for your application.
resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
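
If you use sbt-assembly to build the uber JAR, Spark and Hadoop dependencies often ship overlapping files (for example under META-INF), so you may also need a merge strategy in build.sbt. A minimal sketch, assuming an sbt-assembly version whose keys are auto-imported (older setups may need import sbtassembly.AssemblyKeys._ at the top of build.sbt):

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop conflicting manifest/signature files
  case _                             => MergeStrategy.first    // otherwise keep the first copy found
}

You can then build the uber JAR with sbt assembly instead of sbt package.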

Step 5:

You can find a complete and up-to-date build.sbt file here.

Add the dependencies below to the build.sbt file:
name := "MyFirstSparkProject"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.6.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-sql_2.10" % "1.6.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
  "org.apache.hadoop" % "hadoop-common" % "2.7.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-sql_2.10" % "1.6.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-hive_2.10" % "1.6.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-yarn_2.10" % "1.6.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy")
)
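
As a side note, sbt's %% operator appends the Scala binary version (here _2.10) to the artifact name for you, so the explicit suffixes above can be shortened. A sketch of the equivalent form:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0" exclude ("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
  "org.apache.spark" %% "spark-sql"  % "1.6.0" exclude ("org.apache.hadoop", "hadoop-yarn-server-web-proxy")
)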
    

Step 6:
Remove any unwanted libraries, or choose the proper versions of the libraries if you run into this warning. This step is optional.


Step 7:

Review the dependencies in the Module Settings dialog: right-click the src folder and select Open Module Settings.



Step 8:
Right-click the src folder, choose New --> Scala Class, and select Object (a singleton object).


Step 9:
Write the program as shown below.

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by Giri R Varatharajan on 9/8/2015.
 */
object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // On Windows, point hadoop.home.dir at the folder that contains bin\winutils.exe
    System.setProperty("hadoop.home.dir", "D:\\hadoop\\hadoop-common-2.2.0-bin-master\\")
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Read the input file, split each line into words, and pair every word with a count of 1
    val tf = sc.textFile(args(0))
    val splits = tf.flatMap(line => line.split(" ")).map(word => (word, 1))
    // Sum the counts per word
    val counts = splits.reduceByKey((x, y) => x + y)
    splits.saveAsTextFile(args(1))
    counts.saveAsTextFile(args(2))
  }
}

You might want to download winutils.exe from the web. This exe file acts as a Hadoop client so Spark can talk to the Windows file system. On Unix, this is not required.

args(0) --> Input file
args(1) --> Output directory 1 (the word splits)
args(2) --> Output directory 2 (the word counts)
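
If you want the job to fail fast with a clear usage message when these arguments are missing, you could add a small guard at the top of main (a sketch, not part of the original listing):

if (args.length < 3) {
  System.err.println("Usage: SparkWordCount <input file> <splits output dir> <counts output dir>")
  sys.exit(1)
}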

Let's create a package to run this program through spark-submit.

Step 10:
Install sbt from the SBT Downloads link and add the bin folder of the extracted package to your PATH.
On UNIX, use export SBT_HOME='/etc/sbt/bin' and add $SBT_HOME to your PATH.

Run sbt clean package from the MyFirstSparkProject folder.


After this, you will see the packaged JAR (target\scala-2.10\myfirstsparkproject_2.10-1.0.jar, based on the name, version, and scalaVersion in build.sbt) under the target directory.


Step 11:
Let's submit the Spark job. Run the command below from the bin directory of your Spark installation. You can download Spark from the Spark Downloads link.
Choose a pre-built Hadoop version package and extract it to any location on your hard drive.

D:\Spark\spark-1.5.0-bin-hadoop2.6\bin>spark-submit --class SparkWordCount --master local[*] D:\typesafe-activator-1.3.7-minimal\activator-1.3.7-minimal\MyFirstSparkProject\target\scala-2.10\myfirstsparkproject_2.10-1.0.jar D:\Spark\spark-1.6.0\README.md D:\Spark\spark-1.6.0\SplitOutput D:\Spark\spark-1.6.0\CountOutput

Step 12:
Check the output in the output directories specified in the spark-submit command above.


You can open the part files and check the Word Count Output.
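Each line in a part file is simply the string form of a (word, count) tuple, e.g. something like (Spark,25); the actual words and counts depend entirely on your input file.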

Another option is to execute the same job through the Run command of IntelliJ.

Program:

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by Giri R Varatharajan on 9/8/2015.
 */
object SparkWordCount {
  def main(args:Array[String]) : Unit = {
    System.setProperty("hadoop.home.dir", "D:\\hadoop\\hadoop-common-2.2.0-bin-master\\")
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val tf = sc.textFile("D:\\Spark\\spark-1.6.0\\README.md")
    //val tf = sc.textFile(args(0))
    val splits = tf.flatMap(line => line.split(" ")).map(word =>(word,1))
    val counts = splits.reduceByKey((x,y)=>x+y)
    splits.saveAsTextFile("D:\\Spark\\spark-1.6.0\\SplitOutput")
    counts.saveAsTextFile("D:\\Spark\\spark-1.6.0\\CountOutput")
  }
}


Exception 1:
If you end up with the exception below, download the winutils.exe file. This file serves as a Hadoop client for your Spark job.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/02/02 21:55:19 INFO SparkContext: Running Spark version 1.6.0
16/02/02 21:55:19 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:356)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:371)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:364)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)

Exception 2:
Move the javax.servlet-api JAR to the last position in the Dependencies window, or simply remove this dependency.
Alternatively, you can add the following to your build.sbt file:
"org.apache.spark" % "spark-core_2.10" % "1.6.1" excludeAll ExclusionRule(organization = "javax.servlet"),

java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
    at java.lang.ClassLoader.checkCerts(ClassLoader.java:895)
    at java.lang.ClassLoader.preDefineClass(ClassLoader.java:665)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:758)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)

After this, execute the program with the Run option and you will see the output. You can run any APIs on SQLContext without any issues.
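
For example, a minimal sketch of using SQLContext on top of the word counts (assuming the Spark 1.6 APIs from the build.sbt above; the column and table names here are just illustrative), placed after counts is computed in the program:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Turn the (word, count) pair RDD into a DataFrame and query it with SQL
val countsDF = counts.toDF("word", "total")
countsDF.registerTempTable("word_counts")
sqlContext.sql("SELECT word, total FROM word_counts ORDER BY total DESC LIMIT 10").show()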

Please note that you don't need to set up Hive or Spark on your local machine in order to run the application through the IntelliJ IDE. The execution process automatically creates a metastore_db folder in your project directory and handles all Hive-related artifacts under a warehouse directory. Still, I would recommend using your local Linux sandbox cluster for HiveContext, Spark SQL DataFrames, etc.
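
If you do experiment with HiveContext locally, a minimal sketch (using the spark-hive dependency from build.sbt; the query is just illustrative) looks like this:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("SHOW TABLES").show()  // metadata lives in the local metastore_db folder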

Happy Sparking 🙂
