How to Connect MongoDB and BSON Files to Spark

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. Because it stores schema-flexible documents rather than rows, MongoDB obviates the need for an Object-Relational Mapping (ORM) layer during development.

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents. MongoDB stores documents in collections. Collections are analogous to tables in relational databases. Unlike a table, however, a collection does not require its documents to have the same schema.
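The nesting rules above can be sketched with plain Scala maps standing in for documents (the field names and values here are hypothetical, purely for illustration):

```scala
// A MongoDB document is a set of field/value pairs; a value may itself
// be an embedded document or an array (plain Scala Map sketch).
val doc: Map[String, Any] = Map(
  "first"   -> "matthew",
  "address" -> Map("city" -> "sydney"),          // embedded document
  "skills"  -> List("scala", "spark", "mongodb") // array of values
)
```

A collection is then just a set of such documents, and they need not all share the same fields.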

For more information, see the official MongoDB documentation.

Installation on OS X:

brew update
brew install mongodb
mkdir -p /data/db
chmod -R 777 /data/db

If the data directory is customized, start the server with bin/mongod --dbpath <path to the data directory>.

Stable Library for Connecting Spark to MongoDB:

Add this to build.sbt (note that the Scala version must match the _2.10 artifact suffixes):

name := "SparkLatest"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.6.1",
  "org.apache.spark" % "spark-sql_2.10" % "1.6.1",
  "org.apache.spark" % "spark-hive_2.10" % "1.6.1",
  "org.apache.spark" % "spark-yarn_2.10" % "1.6.1",
  "com.databricks" % "spark-xml_2.10" % "0.2.0",
  "com.databricks" % "spark-csv_2.10" % "1.4.0",
  "org.apache.spark" % "spark-catalyst_2.10" % "1.6.1",
  "org.apache.spark" % "spark-mllib_2.10" % "1.6.1",
  "com.101tec" % "zkclient" % "0.8",
  "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.2.0",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1",
  "com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1"
)

Spark Scala Code to connect to MongoDB Collection:

import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._

val builder = MongodbConfigBuilder(Map(Host -> List("hostname:27017"), Database -> "test", Collection -> "testcollection", SamplingRatio -> 1.0, WriteConcern -> "normal"))
val readConfig = builder.build()
val mongoRDD = sqlContext.fromMongoDB(readConfig)
mongoRDD.registerTempTable("testtable")
val df = sqlContext.sql("SELECT * FROM testtable")
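Conceptually, the SQL query runs over the collection's documents. A pure-Scala sketch of what a SELECT with a WHERE clause does (plain Maps stand in for documents; the data is hypothetical):

```scala
// Documents from the collection, modeled as plain Scala Maps.
val docs = Seq(
  Map("first" -> "matthew", "occupation" -> "developer"),
  Map("first" -> "james",   "occupation" -> "actor")
)
// Equivalent of: SELECT first FROM testtable WHERE occupation = 'actor'
val actors = docs.filter(_("occupation") == "actor").map(_("first"))
// actors == Seq("james")
```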

How I Created a Sample Database and Collection:

Start the mongod service (bin/mongod, or bin/mongod --dbpath <path to the data directory> for a custom data directory).
After this, I followed this website (net-22879) to insert records into the collection. The database is created automatically as soon as we start inserting records.

{
  first: 'matthew',
  last: 'setter',
  dob: '21/04/1978',
  gender: 'm',
  hair_colour: 'brown',
  occupation: 'developer',
  nationality: 'australian'
},
{
  first: 'james',
  last: 'caan',
  dob: '26/03/1940',
  gender: 'm',
  hair_colour: 'brown',
  occupation: 'actor',
  nationality: 'american'
},
{
  first: 'arnold',
  last: 'schwarzenegger',
  dob: '03/06/1925',
  gender: 'm',
  hair_colour: 'brown',
  occupation: 'actor',
  nationality: 'american'
}

I verified the collection by running a find() query in the mongo shell.


Sample Output after Successful Execution of the Spark Program:

(572bd30135d30869b32e842e,{ "_id" : { "$oid" : "572bd30135d30869b32e842e"} , "first" : "matthew" , "last" : "setter" , "dob" : "21/04/1978" , "gender" : "m" , "hair_colour" : "brown" , "occupation" : "developer" , "nationality" : "australian"})
(572bd30135d30869b32e842f,{ "_id" : { "$oid" : "572bd30135d30869b32e842f"} , "first" : "james" , "last" : "caan" , "dob" : "26/03/1940" , "gender" : "m" , "hair_colour" : "brown" , "occupation" : "actor" , "nationality" : "american"})

Accessing BSON Files in Spark:
As another option, I wanted to read BSON data, MongoDB's native storage format, through a Spark program. After some research, I found that the following library provides solid support for reading BSON files from Spark through Scala.

"org.mongodb.mongo-hadoop" % "mongo-hadoop-core" % "1.5.1"

Scala Code to read BSON File:

import com.mongodb.hadoop.BSONFileInputFormat
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.bson.BSONObject

val mongoRDD = sc.newAPIHadoopFile(
  "/home/vgiri/nettuts.bson",
  classOf[BSONFileInputFormat].asSubclass(classOf[FileInputFormat[Object, BSONObject]]),
  classOf[Object],
  classOf[BSONObject]
).take(2)
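Each record from the BSON file comes back as an (id, BSONObject) pair, so extracting fields is just a map over those pairs. A pure-Scala sketch, with tuples and Maps standing in for the Hadoop types (ids and values taken from the sample output above):

```scala
// Stand-in for the (id, document) pairs returned by take(2) above,
// using plain Scala Maps in place of BSONObject.
val records = Array(
  ("572bd30135d30869b32e842e", Map("first" -> "matthew", "last" -> "setter")),
  ("572bd30135d30869b32e842f", Map("first" -> "james",   "last" -> "caan"))
)
// Equivalent of calling doc.get("first") on each BSONObject:
val firstNames = records.map { case (_, doc) => doc("first") }
```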

This returned the same output shown above.

How to Read BSON File through PySpark:

There is a pymongo_spark package available to support this. I went through its setup instructions and installed the packages accordingly.

Python Code to read BSON File in PySpark:

import pymongo_spark
pymongo_spark.activate()  # patches SparkContext with the Mongo/BSON helpers

bsonFileRdd = sc.BSONFileRDD("/home/vgiri/nettuts.bson")

Importantly, mongo-hadoop-spark.jar and the pymongo_spark module must be added to the classpath when executing this.

Thanks and Have a Nice Learning 🙂 !!!