MongoDB & Spark

HDFS
YARN
Distributed Resources

HDFS
YARN
MapReduce
Distributed Processing

HDFS
YARN
Hive
Pig
Domain Specific Languages
MapReduce

Interactive Shell
Easy (-er)
Caching

HDFS
YARN
Spark
Hadoop
Distributed Processing

HDFS
Standalone
YARN
Spark
Hadoop
Mesos

HDFS
Standalone
YARN
Spark
Hadoop
Mesos
Hive
Pig

HDFS
Standalone
YARN
Spark
Hadoop
Mesos
Hive
Pig
Spark Shell

HDFS
Standalone
YARN
Spark
Hadoop
Mesos
Hive
Pig
Spark Shell
Spark Streaming

HDFS
Standalone
YARN
Spark
Hadoop
Mesos
Hive
Pig
Spark SQL
Spark Shell
Spark Streaming

Standalone
YARN
Spark
Hadoop
Mesos
Hive
Pig
Spark SQL
Spark Shell
Spark Streaming

Spark Streaming
Hive
Spark Shell
Mesos
Hadoop
Pig
Spark SQL
Spark
Standalone
YARN

Standalone
YARN
Spark
Mesos
Spark SQL
Spark Shell
Spark Streaming

executor
Worker Node
executor
Worker Node
Driver
Resilient Distributed Datasets

Parallelization
Parallelize = x

Transformations
Parallelize = x → t(x) = x’ → t(x’) = x’’

Transformations
filter( func )
union( set )
intersection( set )
distinct( n )
map( function )
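
As a rough illustration of what these transformations compute, the sketch below mimics them on ordinary Java lists with java.util.stream. It is an analogy, not Spark code: the class TransformationsSketch and its helper methods are hypothetical, everything runs on one machine, and the partition-count argument that the slide writes as distinct( n ) is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-ins for RDD transformations, built on plain Java lists.
final class TransformationsSketch {

    // filter( func ): keep only the elements for which func returns true.
    static <T> List<T> filter(List<T> x, Predicate<T> func) {
        return x.stream().filter(func).collect(Collectors.toList());
    }

    // map( function ): apply function to every element.
    static <T, R> List<R> map(List<T> x, Function<T, R> function) {
        return x.stream().map(function).collect(Collectors.toList());
    }

    // union( set ): all elements of both inputs, duplicates kept.
    static <T> List<T> union(List<T> x, List<T> other) {
        return Stream.concat(x.stream(), other.stream()).collect(Collectors.toList());
    }

    // intersection( set ): distinct elements present in both inputs.
    static <T> List<T> intersection(List<T> x, List<T> other) {
        Set<T> common = new LinkedHashSet<>(x);
        common.retainAll(other);
        return new ArrayList<>(common);
    }

    // distinct(): drop duplicate elements, keeping first occurrences.
    static <T> List<T> distinct(List<T> x) {
        return x.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> x = List.of(1, 2, 2, 3, 4);
        System.out.println(filter(x, n -> n % 2 == 0));
        System.out.println(map(x, n -> n * 10));
        System.out.println(union(x, List.of(4, 5)));
        System.out.println(intersection(x, List.of(2, 4, 6)));
        System.out.println(distinct(x));
    }
}
```

Each helper returns a new list and leaves its input untouched, which mirrors the way a transformation yields a new dataset x’ from x rather than modifying x.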

Action
Parallelize = x → t(x) = x’ → t(x’) = x’’ → f(x’’) = y

Actions
collect()
count()
first()
take( n )
reduce( function )
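
To make the difference from transformations concrete, the sketch below imitates the listed actions on a plain Java list: each one returns a value to the caller instead of another dataset. Again this is a single-machine analogy using java.util.stream, not Spark code, and the class name ActionsSketch is hypothetical.

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

// Hypothetical stand-ins for RDD actions, built on plain Java lists.
final class ActionsSketch {

    // collect(): hand every element back to the caller.
    static <T> List<T> collect(List<T> x) {
        return List.copyOf(x);
    }

    // count(): number of elements.
    static <T> long count(List<T> x) {
        return x.stream().count();
    }

    // first(): the first element; throws if the input is empty.
    static <T> T first(List<T> x) {
        return x.stream().findFirst().orElseThrow();
    }

    // take( n ): the first n elements.
    static <T> List<T> take(List<T> x, int n) {
        return x.stream().limit(n).collect(Collectors.toList());
    }

    // reduce( function ): combine all elements pairwise into one value.
    static <T> T reduce(List<T> x, BinaryOperator<T> function) {
        return x.stream().reduce(function).orElseThrow();
    }

    public static void main(String[] args) {
        List<Integer> x = List.of(3, 1, 4, 1, 5);
        System.out.println(count(x));
        System.out.println(first(x));
        System.out.println(take(x, 2));
        System.out.println(reduce(x, Integer::sum));
    }
}
```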

Lineage
Parallelize = x → t(x) = x’ → t(x’) = x’’ → f(x’’) = y

Parallelize → Transform → Transform → Action
Lineage
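
One way to picture lineage is a dataset that stores the recipe that produces it rather than the data itself, so a lost result can be rebuilt by replaying the chain from the source. The sketch below is a hypothetical single-machine illustration of that idea; LineageSketch and its methods are invented for this example and are not part of any Spark API.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical illustration: each dataset remembers how it was derived.
final class LineageSketch<T> {
    private final String description;        // human-readable chain of steps
    private final Supplier<List<T>> recipe;  // recomputes the data from the source

    private LineageSketch(String description, Supplier<List<T>> recipe) {
        this.description = description;
        this.recipe = recipe;
    }

    // Parallelize = x: wrap the source data as the root of the lineage.
    static <T> LineageSketch<T> parallelize(List<T> source) {
        List<T> copy = List.copyOf(source);
        return new LineageSketch<>("parallelize", () -> copy);
    }

    // t(x) = x': nothing is computed here; the step is only recorded.
    <R> LineageSketch<R> transform(String name, Function<T, R> t) {
        return new LineageSketch<>(
            description + " -> " + name,
            () -> recipe.get().stream().map(t).collect(Collectors.toList()));
    }

    // Action: replay the recorded steps from the source and return the result.
    List<T> collect() {
        return recipe.get();
    }

    String lineage() {
        return description;
    }

    public static void main(String[] args) {
        LineageSketch<Integer> y = parallelize(List.of(1, 2, 3))
            .transform("double", n -> n * 2)
            .transform("increment", n -> n + 1);
        System.out.println(y.lineage());
        System.out.println(y.collect());
    }
}
```

Calling collect() twice replays the whole chain twice, which is the cost that caching is meant to avoid.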

https://github.com/mongodb/mongo-hadoop

{
  "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
  "body" : "the scrimmage is still up in the air...",
  "subFolder" : "notes_inbox",
  "mailbox" : "bass-e",
  "filename" : "450.",
  "headers" : {
    "X-cc" : "",
    "From" : "michael.simmons@enron.com",
    "Subject" : "Re: Plays and other information",
    "X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
    "Content-Transfer-Encoding" : "7bit",
    "X-bcc" : "",
    "To" : "eric.bass@enron.com",
    "X-Origin" : "Bass-E",
    "X-FileName" : "ebass.nsf",
    "X-From" : "Michael Simmons",
    "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
    "X-To" : "Eric Bass",
    "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}

{
  "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
  "body" : "the scrimmage is still up in the air...",
  "subFolder" : "notes_inbox",
  "lfpwoojjf0wig=-i1qf=q0qif0=i38 -00 1-8" : "bass-e",
  "filename" : "450.",
  "headers" : {
    "X-cc" : "",
    "From" : "michael.simmons@enron.com",
    "Subject" : "Re: Plays and other information",
    "X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
    "Content-Transfer-Encoding" : "7bit",
    "X-bcc" : "",
    "To" : "eric.bass@enron.com",
    "X-Origin" : "Bass-E",
    "X-FileName" : "ebass.nsf",
    "X-From" : "Michael Simmons",
    "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
    "X-To" : "Eric Bass",
    "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}

{
_id : "gretchen.hardeway@enron.com|shirley.crenshaw@enron.com",
value : 2
}
{
_id : "kmccomb@austin-mccomb.com|brian@enron.com",
value : 2
}
{
_id : "sally.beck@enron.com|sandy.stone@enron.com",
value : 2
}
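
Result documents of this shape, a "sender|recipient" key with a count as the value, can be produced by grouping messages on the concatenated pair. The sketch below shows that counting step in plain Java on a few made-up messages. The Mail class, the example.com addresses, and the assumption of one recipient per message are all hypothetical; the deck does not show the job that produced the output above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-machine version of the sender|recipient pair count.
final class PairCountSketch {

    // Minimal message: one sender and one recipient (an assumption).
    static final class Mail {
        final String from;
        final String to;

        Mail(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    // Key each message by "from|to" and count the occurrences of each key.
    static Map<String, Long> countPairs(List<Mail> mails) {
        return mails.stream().collect(Collectors.groupingBy(
            m -> m.from + "|" + m.to,
            TreeMap::new,
            Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Mail> mails = List.of(
            new Mail("alice@example.com", "bob@example.com"),
            new Mail("alice@example.com", "bob@example.com"),
            new Mail("carol@example.com", "alice@example.com"));
        countPairs(mails).forEach((pair, n) ->
            System.out.println("{ _id : \"" + pair + "\", value : " + n + " }"));
    }
}
```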

Eratosthenes
Democritus
Hypatia
Shemp
Euripides

Spark Configuration
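import org.apache.hadoop.conf.Configuration;

// Hadoop configuration consumed by the MongoDB Hadoop connector
Configuration conf = new Configuration();
// Read input through the connector's InputFormat
conf.set(
    "mongo.job.input.format",
    "com.mongodb.hadoop.MongoInputFormat"
);
// Source collection, written as database.collection in the URI
conf.set(
    "mongo.input.uri",
    "mongodb://localhost:27017/db.collection"
);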

Spark Context
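// context is the application's JavaSparkContext.
// Each pair holds a document's _id (key) and the document itself (value).
JavaPairRDD<Object, BSONObject> documents =
    context.newAPIHadoopRDD(
        conf,
        MongoInputFormat.class,
        Object.class,
        BSONObject.class
    );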

Deployment Artifacts
Hadoop Connector Jar
Fat Jar
Java Driver Jar

Spark Submit
/usr/local/spark-1.5.1/bin/spark-submit \
    --class com.mongodb.spark.examples.DataframeExample \
    --master local \
    Examples-1.0-SNAPSHOT.jar

// Convert each (_id, document) pair into a Message bean.
JavaRDD<Message> messages = documents.map(
    new Function<Tuple2<Object, BSONObject>, Message>() {
        public Message call(Tuple2<Object, BSONObject> tuple) {
            // tuple._2 is the BSON document; "headers" is a nested document
            BSONObject header = (BSONObject) tuple._2.get("headers");
            Message m = new Message();
            m.setTo((String) header.get("To"));
            m.setX_From((String) header.get("From"));
            m.setMessage_ID((String) header.get("Message-ID"));
            m.setBody((String) tuple._2.get("body"));
            return m;
        }
    }
);
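
The mapping above relies on a Message class that the deck does not show. A minimal sketch of what such a bean might look like follows. Only the four setter names come from the slide; the field names, the getters, and the fromDocument helper are assumptions. A plain Map stands in for BSONObject so the example runs without the connector on the classpath.

```java
import java.io.Serializable;
import java.util.Map;

// Hypothetical bean matching the setters used on the slide.
final class Message implements Serializable {
    private String to;
    private String xFrom;
    private String messageId;
    private String body;

    public String getTo() { return to; }
    public void setTo(String to) { this.to = to; }

    public String getX_From() { return xFrom; }
    public void setX_From(String xFrom) { this.xFrom = xFrom; }

    public String getMessage_ID() { return messageId; }
    public void setMessage_ID(String messageId) { this.messageId = messageId; }

    public String getBody() { return body; }
    public void setBody(String body) { this.body = body; }

    // Same field extraction as the slide, with a Map standing in for BSONObject.
    @SuppressWarnings("unchecked")
    static Message fromDocument(Map<String, Object> document) {
        Map<String, Object> header = (Map<String, Object>) document.get("headers");
        Message m = new Message();
        m.setTo((String) header.get("To"));
        m.setX_From((String) header.get("From"));
        m.setMessage_ID((String) header.get("Message-ID"));
        m.setBody((String) document.get("body"));
        return m;
    }

    public static void main(String[] args) {
        Message m = fromDocument(Map.of(
            "body", "hello",
            "headers", Map.of(
                "To", "bob@example.com",
                "From", "alice@example.com",
                "Message-ID", "<1@example.com>")));
        System.out.println(m.getX_From() + " -> " + m.getTo());
    }
}
```

The class implements Serializable on the assumption that instances may need to be shipped between the driver and executors.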

THE FUTURE AND BEYOND THE INFINITE

Standalone
YARN
Spark
Mesos
Spark SQL
Spark Shell
Spark Streaming

THANKS!
{
name: 'Bryan Reinero',
role: 'Developer Advocate',
twitter: '@blimpyacht',
email: 'bryan@mongodb.com'
}
