How to Set Up Your First Spark/Scala Project in IntelliJ IDEA?

Step 1:
Install the Scala and SBT plugins in IntelliJ IDEA (File --> Settings --> Plugins).

Step 2:
New Project --> Scala --> SBT

Step 3:
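Select the SBT version and the Scala version. The Scala version must match the Spark artifacts you depend on (2.10.x for the spark-*_2.10 libraries used below).

After you create the project, wait until the entire project structure (src, target, and the other folders) has been generated.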

Step 4:
Create an assembly.sbt file under the project folder. You will want this plugin if you need to build an uber JAR for your application.
resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
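
Once the plugin is loaded, running sbt assembly builds the uber JAR. Spark's transitive dependencies often ship overlapping files, so a merge strategy is usually needed in build.sbt. Here is a rough sketch, assuming the sbt-assembly 0.13.0 setting keys. The jar name is made up for illustration and the catch-all rule is deliberately blunt, so adjust it for a real project.

```scala
// build.sbt (fragment)

// Hypothetical output name for the uber JAR
assemblyJarName in assembly := "MyFirstSparkProject-assembly.jar"

// Resolve duplicate files pulled in by different dependencies
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard // drop manifests and signature files
  case _                             => MergeStrategy.first   // otherwise keep the first copy found
}
```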

Step 5:

A complete, up-to-date build.sbt file is available here.

Add the dependencies below to your build.sbt file:
name := "MyFirstSparkProject"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.6.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-sql_2.10" % "1.6.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
  "org.apache.hadoop" % "hadoop-common" % "2.7.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-hive_2.10" % "1.6.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy"),
  "org.apache.spark" % "spark-yarn_2.10" % "1.6.0" exclude ("org.apache.hadoop","hadoop-yarn-server-web-proxy")
)
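
The _2.10 suffix in the artifact names has to agree with scalaVersion. One way to keep the two in sync is sbt's %% operator, which appends the Scala binary version for you. A minimal sketch of the same idea, with only two of the libraries and without the exclude rules for brevity:

```scala
// build.sbt (fragment)
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // %% resolves to spark-core_2.10 because scalaVersion is 2.10.x
  "org.apache.spark" %% "spark-core" % "1.6.0",
  "org.apache.spark" %% "spark-sql"  % "1.6.0"
)
```

With this form, changing scalaVersion changes the resolved artifacts along with it, which avoids mixing a 2.11 compiler with 2.10 Spark jars.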
    

Step 6:
If IntelliJ warns about conflicting libraries, remove the ones you don't need and choose the proper version of the rest. This step is optional.

Step 7:

Check the resolved dependencies in the module settings: right-click the src folder and select Open Module Settings.


Step 8:
Right-click the src folder and choose Scala Class, then select Object (a singleton object) as the kind.

Step 9:
Write the program below.

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by Giri R Varatharajan on 9/8/2015.
 */
object SparkWordCount {
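  def main(args: Array[String]): Unit = {
    // Windows only: folder whose bin subfolder contains winutils.exe
    System.setProperty("hadoop.home.dir", "D:\\hadoop\\hadoop-common-2.2.0-bin-master\\")
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // args(0): input file
    val tf = sc.textFile(args(0))
    // One (word, 1) pair per space-separated token
    val splits = tf.flatMap(line => line.split(" ")).map(word => (word, 1))
    // Sum the 1s for each distinct word
    val counts = splits.reduceByKey((x, y) => x + y)
    splits.saveAsTextFile(args(1)) // the (word, 1) pairs
    counts.saveAsTextFile(args(2)) // the word counts
    sc.stop()
  }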
}

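You may need to download winutils.exe from the web. On Windows, the Hadoop libraries that Spark uses call this executable for local file system operations. Place it in the bin subfolder of the directory set as hadoop.home.dir in the program above. On Unix it is not required.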

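args(0) --> Input file
args(1) --> Output directory 1 (the (word, 1) pairs)
args(2) --> Output directory 2 (the word counts)

Neither output directory may exist yet: saveAsTextFile fails if the target directory is already there.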
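
To see what the two transformations in the program do without starting Spark, here is a small sketch that mimics them with plain Scala collections. The object and method names are invented for this illustration, and groupBy plus sum stands in for Spark's reduceByKey.

```scala
object WordCountSketch {
  // Mirrors flatMap + map from the Spark program: one (word, 1) pair per token
  def toPairs(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(line => line.split(" ")).map(word => (word, 1))

  // Mirrors reduceByKey((x, y) => x + y): sum the 1s for each distinct word
  def countWords(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or", "not to be")
    val counts = countWords(toPairs(lines))
    // Print the counts sorted by word, one "word<TAB>count" per line
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The first function corresponds to what ends up in output directory 1, the second to output directory 2.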

Let's create a package to run this Program through Spark-Submit.

Step 10:
Install SBT from the SBT Downloads page and add the bin folder of the extracted package to your PATH environment variable.
On UNIX, for example: export PATH="$PATH:/etc/sbt/bin"

Run sbt clean package from the MyFirstSparkProject folder.


After this, you will find the packaged JAR under the target\scala-2.10 directory.


Step 11:
Let's submit the Spark job. You can download Spark from the Spark Downloads page: choose a package pre-built for Hadoop and extract it to any location on your hard drive.
Then enter the command below from the bin directory of your Spark installation.

D:\Spark\spark-1.5.0-bin-hadoop2.6\bin>spark-submit --class SparkWordCount --master local[*] D:\typesafe-activator-1.3.7-minimal\activator-1.3.7-minimal\MyFirstSparkProject\target\scala-2.10\myfirstsparkproject_2.10-1.0.jar D:\Spark\spark-1.6.0\README.md D:\Spark\spark-1.6.0\SplitOutput D:\Spark\spark-1.6.0\CountOutput

Step 12:
Check the output in the output directories given in the spark-submit command above.

You can open the part files and check the Word Count Output.
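
If you would rather inspect the result programmatically, the part files can be parsed back into pairs. The sketch below assumes each line looks like (word,count), which is how saveAsTextFile writes a (String, Int) tuple. The object and method names are hypothetical helpers, not part of Spark.

```scala
import java.io.File
import scala.io.Source
import scala.util.Try

object ReadPartFiles {
  // Parse one output line such as "(Spark,14)".
  // The word itself may contain commas, so split on the last comma.
  def parseLine(line: String): Option[(String, Int)] = {
    val trimmed = line.trim
    if (trimmed.length < 2 || !trimmed.startsWith("(") || !trimmed.endsWith(")")) None
    else {
      val inner = trimmed.substring(1, trimmed.length - 1)
      val idx = inner.lastIndexOf(',')
      if (idx < 0) None
      else Try(inner.substring(idx + 1).toInt).toOption.map(n => (inner.substring(0, idx), n))
    }
  }

  // Read every part-* file in an output directory; _SUCCESS and .crc files are skipped
  def readCounts(dir: String): Seq[(String, Int)] = {
    val files = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
    files.filter(f => f.isFile && f.getName.startsWith("part-")).sortBy(_.getName).toSeq.flatMap { f =>
      val src = Source.fromFile(f, "UTF-8")
      try src.getLines().toList.flatMap(parseLine) finally src.close()
    }
  }
}
```

For example, ReadPartFiles.readCounts("D:\\Spark\\spark-1.6.0\\CountOutput").sortBy(-_._2).take(10) would list the ten most frequent words.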

Another Option: Execute the Same Job through the Run Command of IntelliJ

Here the input and output paths are hard-coded, so no program arguments are needed.

Program:

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by Giri R Varatharajan on 9/8/2015.
 */
object SparkWordCount {
  def main(args:Array[String]) : Unit = {
    System.setProperty("hadoop.home.dir", "D:\\hadoop\\hadoop-common-2.2.0-bin-master\\")
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Hard-coded input path; swap in the commented line to read it from the program arguments
    val tf = sc.textFile("D:\\Spark\\spark-1.6.0\\README.md")
    //val tf = sc.textFile(args(0))
    val splits = tf.flatMap(line => line.split(" ")).map(word => (word, 1))
    val counts = splits.reduceByKey((x, y) => x + y)
    splits.saveAsTextFile("D:\\Spark\\spark-1.6.0\\SplitOutput")
    counts.saveAsTextFile("D:\\Spark\\spark-1.6.0\\CountOutput")
    sc.stop()
  }
}


Exception 1:
If you end up with the exception below, download winutils.exe and place it in the bin folder of the directory that hadoop.home.dir points to. On Windows, the Hadoop libraries that Spark uses depend on this helper binary.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/02/02 21:55:19 INFO SparkContext: Running Spark version 1.6.0
16/02/02 21:55:19 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:356)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:371)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:364)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
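
The null in null\bin\winutils.exe means that no Hadoop home directory was set when the Hadoop classes were loaded. A small guard like the one below can catch this before the SparkContext is created. It is only a sketch: the object and method names are invented, and it assumes the layout Hadoop expects, namely winutils.exe inside the bin subfolder of hadoop.home.dir.

```scala
import java.io.File

object HadoopHomeCheck {
  // True for os.name values such as "Windows 10"
  def isWindows(osName: String): Boolean = osName.toLowerCase.startsWith("windows")

  // Location where Hadoop looks for the helper binary: <hadoopHome>/bin/winutils.exe
  def winutilsFile(hadoopHome: String): File =
    new File(new File(hadoopHome, "bin"), "winutils.exe")

  // On Windows, set hadoop.home.dir and report whether winutils.exe is present.
  // On other systems nothing is needed, so the check passes trivially.
  def configure(hadoopHome: String): Boolean =
    if (!isWindows(System.getProperty("os.name"))) true
    else {
      System.setProperty("hadoop.home.dir", hadoopHome)
      winutilsFile(hadoopHome).isFile
    }
}
```

Calling HadoopHomeCheck.configure("D:\\hadoop\\hadoop-common-2.2.0-bin-master") at the top of main and failing early when it returns false gives a clearer message than the stack trace above.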

Exception 2: 
If you get the SecurityException below, move the javax.servlet API jar to the end of the list in the Dependencies window, or simply remove that dependency.
Alternatively, exclude it in your build.sbt file:
"org.apache.spark" % "spark-core_2.10" % "1.6.0" excludeAll ExclusionRule(organization = "javax.servlet"),

java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
at java.lang.ClassLoader.checkCerts(ClassLoader.java:895)
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:665)
at java.lang.ClassLoader.defineClass(ClassLoader.java:758)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)

After this, run the program with the Run option and you will see the output. You can also call any of the SQLContext APIs without issues.

Please note that you don't need to set up Hive or Spark on your local machine in order to run the application from IntelliJ. Execution automatically creates a metastore_db folder in your project directory and handles everything Hive-related under a warehouse directory. Still, I would recommend using a local Linux sandbox cluster for HiveContext, Spark SQL DataFrames, and so on.
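
As an illustration of calling a SQLContext API from IntelliJ, here is a minimal sketch that assumes the Spark 1.6.0 dependencies from the build.sbt above are on the classpath. The object name, the sample rows, and the table name words are made up for the example. On Windows the hadoop.home.dir setting from the earlier program may still be needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkSqlSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Brings toDF into scope for RDDs of tuples and case classes
    import sqlContext.implicits._

    // Tiny in-memory data set, only for illustration
    val df = sc.parallelize(Seq(("spark", 2), ("scala", 1))).toDF("word", "n")
    // Spark 1.6 API for exposing a DataFrame to SQL queries
    df.registerTempTable("words")
    sqlContext.sql("SELECT word, n FROM words WHERE n > 1").show()

    sc.stop()
  }
}
```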

Happy Sparking 🙂

5 thoughts on “How to Setup your First Spark/Scala Project in IntelliJ IDE?”

  1. Hello,
    First, thank you for this tutorial, but I've got a problem: executing this code gives an error. Here is the output:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    16/02/11 09:44:18 INFO SparkContext: Running Spark version 1.6.0
    Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
    at org.apache.spark.util.TimeStampedWeakValueHashMap.<init>(TimeStampedWeakValueHashMap.scala:42)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:298)
    at SparkWordCount$.main(SparkWordCount.scala:29)
    at SparkWordCount.main(SparkWordCount.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
    Caused by: java.lang.ClassNotFoundException: scala.collection.GenTraversableOnce$class
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 9 more

    Process finished with exit code 1


  2. The spark-submit option is not working. My project in IntelliJ looks like this:
    package tlf

    import org.apache.spark.{SparkConf, SparkContext}

    /**
    * @author ${user.name}
    */
    object FormatDataTlf {

