MongoDB + Spark
@blimpyacht
Level Setting
MongoDB & Spark
TROUGH OF DISILLUSIONMENT
HDFS: Distributed Data
YARN: Distributed Resources
MapReduce: Distributed Processing
Hive | Pig: Domain Specific Languages
Interactive Shell
Easy (-er)
Caching
HDFS: Distributed Data
YARN: Distributed Resources
Spark | Hadoop: Distributed Processing
Stand Alone | YARN | Mesos
Hive | Pig | Spark SQL | Spark Shell | Spark Streaming
Final frame of the diagram (HDFS, Hadoop, Hive, and Pig no longer shown):
Stand Alone | YARN | Mesos
Spark
Spark SQL | Spark Shell | Spark Streaming
Driver
Worker Node: executor
Worker Node: executor
Resilient Distributed Datasets
Parallelization
Parallelize = x
Transformations
Parallelize = x    t(x) = x’    t(x’) = x’’
Transformations
filter( func )
union( otherDataset )
intersection( otherDataset )
distinct( [numPartitions] )
map( function )
Action
Parallelize = x    t(x) = x’    t(x’) = x’’    f(x’’) = y
Actions
collect()
count()
first()
take( n )
reduce( function )
Lineage
Parallelize = x    t(x) = x’    t(x’) = x’’    f(x’’) = y
Parallelize → Transform → Transform → Action
(diagram: five parallel rows, each Parallelize → Transform → Transform → Action)
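The split between lazy transformations and eager actions can be illustrated without a cluster. The sketch below is plain Java, not Spark: `LineageSketch`, `Dataset`, and `mapCalls` are hypothetical names invented for this illustration. Each toy dataset stores only a description of how it was derived (its lineage) plus a deferred computation, and nothing runs until an action such as `collect()` is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Toy illustration of lazy transformations and lineage. This is NOT the Spark API. */
public class LineageSketch {

    /** Counts how often the map function actually runs, to make laziness visible. */
    static int mapCalls = 0;

    /** A dataset that remembers how to compute itself instead of holding results. */
    static final class Dataset<T> {
        final String lineage;                      // e.g. "parallelize -> map -> filter"
        private final Supplier<List<T>> compute;   // deferred work, run only by actions

        private Dataset(String lineage, Supplier<List<T>> compute) {
            this.lineage = lineage;
            this.compute = compute;
        }

        /** Source step: wraps a local list (the "Parallelize = x" step). */
        static <T> Dataset<T> parallelize(List<T> data) {
            List<T> copy = new ArrayList<>(data);
            return new Dataset<>("parallelize", () -> copy);
        }

        /** Transformation: returns a new dataset and does no work yet. */
        <R> Dataset<R> map(Function<T, R> f) {
            return new Dataset<>(lineage + " -> map", () -> {
                List<R> out = new ArrayList<>();
                for (T item : compute.get()) {
                    out.add(f.apply(item));
                }
                return out;
            });
        }

        /** Transformation: returns a new dataset and does no work yet. */
        Dataset<T> filter(Predicate<T> p) {
            return new Dataset<>(lineage + " -> filter", () -> {
                List<T> out = new ArrayList<>();
                for (T item : compute.get()) {
                    if (p.test(item)) {
                        out.add(item);
                    }
                }
                return out;
            });
        }

        /** Action: replays the whole lineage and returns the result. */
        List<T> collect() {
            return compute.get();
        }
    }

    /** Builds x'' = filter(map(x)) without triggering any computation. */
    static Dataset<Integer> pipeline() {
        Dataset<Integer> x = Dataset.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
        Dataset<Integer> squared = x.map(n -> { mapCalls++; return n * n; });
        return squared.filter(n -> n % 2 == 0);
    }

    public static void main(String[] args) {
        Dataset<Integer> result = pipeline();
        System.out.println(result.lineage);   // parallelize -> map -> filter
        System.out.println(mapCalls);         // 0, because no action has run yet
        System.out.println(result.collect()); // [4, 16, 36]
        System.out.println(mapCalls);         // 6, the map ran once per element
    }
}
```

Because every dataset keeps the recipe rather than the result, a lost result can be rebuilt by replaying the recipe from the source. That is the idea the lineage slides point at, shown here on a single machine.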
https://github.com/mongodb/mongo-hadoop
{
  "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
  "body" : "the scrimmage is still up in the air. ...",
  "subFolder" : "notes_inbox",
  "mailbox" : "bass-e",
  "filename" : "450.",
  "headers" : {
    "X-cc" : "",
    "From" : "michael.simmons@enron.com",
    "Subject" : "Re: Plays and other information",
    "X-Folder" : "Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
    "Content-Transfer-Encoding" : "7bit",
    "X-bcc" : "",
    "To" : "eric.bass@enron.com",
    "X-Origin" : "Bass-E",
    "X-FileName" : "ebass.nsf",
    "X-From" : "Michael Simmons",
    "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
    "X-To" : "Eric Bass",
    "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}
{
  "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
  "body" : "the scrimmage is still up in the air. ...",
  "subFolder" : "notes_inbox",
  "lfpwoojjf0wig=-i1qf=q0qif0=i38 -00 1-8" : "bass-e",
  "filename" : "450.",
  "headers" : {
    "X-cc" : "",
    "From" : "michael.simmons@enron.com",
    "Subject" : "Re: Plays and other information",
    "X-Folder" : "Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
    "Content-Transfer-Encoding" : "7bit",
    "X-bcc" : "",
    "To" : "eric.bass@enron.com",
    "X-Origin" : "Bass-E",
    "X-FileName" : "ebass.nsf",
    "X-From" : "Michael Simmons",
    "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
    "X-To" : "Eric Bass",
    "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}
{
_id : "gretchen.hardeway@enron.com|shirley.crenshaw@enron.com",
value : 2
}
{
_id : "kmccomb@austin-mccomb.com|brian@enron.com",
value : 2
}
{
_id : "sally.beck@enron.com|sandy.stone@enron.com",
value : 2
}
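The three result documents above look like per-pair message counts keyed by two addresses joined with "|". The slides do not show how they were produced, so the following is only a sketch of one way to compute such counts in plain Java. `PairCountSketch`, `pairKey`, and `countPairs` are hypothetical names, and reading the key as sender first and recipient second is an assumption, as is the simplification that each message has a single recipient.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of counting messages per address pair. Not taken from the slides. */
public class PairCountSketch {

    /** Builds a key in the same "a|b" shape as the _id values shown above. */
    static String pairKey(String from, String to) {
        return from + "|" + to;
    }

    /**
     * Counts messages per pair. Each message is a two-element array
     * {from, to}; this input shape is a hypothetical simplification.
     */
    static Map<String, Integer> countPairs(List<String[]> messages) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] m : messages) {
            // merge() inserts 1 for a new key, otherwise adds 1 to the stored count
            counts.merge(pairKey(m[0], m[1]), 1, Integer::sum);
        }
        return counts;
    }

    /** Small made-up input that reuses addresses appearing in the slides. */
    static List<String[]> sampleMessages() {
        List<String[]> messages = new ArrayList<>();
        messages.add(new String[] {"sally.beck@enron.com", "sandy.stone@enron.com"});
        messages.add(new String[] {"michael.simmons@enron.com", "eric.bass@enron.com"});
        messages.add(new String[] {"sally.beck@enron.com", "sandy.stone@enron.com"});
        return messages;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : countPairs(sampleMessages()).entrySet()) {
            System.out.println("{ _id : \"" + e.getKey() + "\", value : " + e.getValue() + " }");
        }
    }
}
```

On a cluster the same shape would typically be a map to (key, 1) pairs followed by a reduce that sums the values per key; the loop with `merge` is the single-machine equivalent.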
Eratosthenes
Democritus
Hypatia
Shemp
Euripides
Spark Configuration
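// Hadoop configuration read by the mongo-hadoop connector
// (Configuration is org.apache.hadoop.conf.Configuration)
Configuration conf = new Configuration();
// Input format the connector uses to read from MongoDB
conf.set(
    "mongo.job.input.format",
    "com.mongodb.hadoop.MongoInputFormat"
);
// Source collection, written as mongodb://host:port/database.collection
conf.set(
    "mongo.input.uri",
    "mongodb://localhost:27017/db.collection"
);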
Spark Context
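// context is the JavaSparkContext; each record is a (key, document) pair
JavaPairRDD<Object, BSONObject> documents =
    context.newAPIHadoopRDD(
        conf,                    // Hadoop configuration with the mongo.* settings
        MongoInputFormat.class,  // input format from mongo-hadoop
        Object.class,            // key type (the document _id)
        BSONObject.class         // value type (the BSON document)
    );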
mongos mongos
Data
Services
Deployment Artifacts
Hadoop Connector Jar
Fat Jar
Java Driver Jar
Spark Submit
/usr/local/spark-1.5.1/bin/spark-submit \
  --class com.mongodb.spark.examples.DataframeExample \
  --master local Examples-1.0-SNAPSHOT.jar
Stand Alone | YARN | Mesos
Spark
Spark SQL | Spark Shell | Spark Streaming
// Convert each (key, document) pair into a Message
// (Function is org.apache.spark.api.java.function.Function; Tuple2 is scala.Tuple2)
JavaRDD<Message> messages = documents.map(
    new Function<Tuple2<Object, BSONObject>, Message>() {
        public Message call(Tuple2<Object, BSONObject> tuple) {
            // tuple._2 is the BSON document; "headers" is a sub-document
            BSONObject header = (BSONObject) tuple._2.get("headers");
            Message m = new Message();
            m.setTo((String) header.get("To"));
            m.setX_From((String) header.get("From"));
            m.setMessage_ID((String) header.get("Message-ID"));
            m.setBody((String) tuple._2.get("body"));
            return m;
        }
    }
);
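The `Message` class used in the mapping code is not shown in the slides. A minimal sketch of what it might look like follows, assuming a plain serializable bean. The setter names come from the code above; the field names and getters are hypothetical.

```java
import java.io.Serializable;

/**
 * Hypothetical bean matching the setters called in the mapping code.
 * Serializable so that instances can be passed between JVMs.
 */
public class Message implements Serializable {
    private static final long serialVersionUID = 1L;

    private String to;          // headers.To
    private String x_From;      // filled from headers.From in the mapping code
    private String message_ID;  // headers.Message-ID
    private String body;        // top-level body field

    public String getTo() { return to; }
    public void setTo(String to) { this.to = to; }

    public String getX_From() { return x_From; }
    public void setX_From(String x_From) { this.x_From = x_From; }

    public String getMessage_ID() { return message_ID; }
    public void setMessage_ID(String message_ID) { this.message_ID = message_ID; }

    public String getBody() { return body; }
    public void setBody(String body) { this.body = body; }

    public static void main(String[] args) {
        Message m = new Message();
        m.setTo("eric.bass@enron.com");
        m.setX_From("michael.simmons@enron.com");
        m.setMessage_ID("<6884142.1075854677416.JavaMail.evans@thyme>");
        m.setBody("the scrimmage is still up in the air.");
        System.out.println(m.getX_From() + " -> " + m.getTo());
    }
}
```

Keeping the class a simple bean with matching getters and setters is a design assumption; it keeps the mapping function short and leaves the object easy to serialize.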
MongoDB & Spark
code demo
THE FUTURE
AND
BEYOND THE INFINITE
Stand Alone | YARN | Mesos
Spark
Spark SQL | Spark Shell | Spark Streaming
MongoDB & Spark
MongoDB + Spark
THANKS!
{
  name: 'Bryan Reinero',
  role: 'Developer Advocate',
  twitter: '@blimpyacht',
  email: 'bryan@mongodb.com'
}
