MongoDB & Spark
Level Setting
Trough of Disillusionment
HDFS: Distributed Data
YARN: Distributed Resources
MapReduce: Distributed Processing
Hive, Pig: Domain Specific Languages
Interactive Shell
Easy (-er)
Caching
HDFS: Distributed Data
YARN: Distributed Resources
Spark, Hadoop: Distributed Processing
Stand Alone, YARN, Mesos
Hive, Pig
Spark SQL, Spark Shell, Spark Streaming
Driver
Worker Node: executor
Worker Node: executor
Resilient Distributed Datasets
Parallelize = x  →  t(x) = x'  →  t(x') = x''  →  f(x'') = y
Parallelization
Transformations
filter( func )
union( set )
intersection( set )
distinct( n )
map( function )
Actions
collect()
count()
first()
take( n )
reduce( function )
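The split between transformations and actions can be tried locally without a cluster. The sketch below is only an analogy: it uses java.util.stream from the JDK, not Spark's RDD API, and the class and method names (LazyPipelineSketch, transformations, collect) are made up for illustration. It shows the same idea the two lists describe: filter, distinct, and map only describe work, and nothing runs until an action such as collect asks for a result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local analogy only: java.util.stream, not Spark's RDD API.
class LazyPipelineSketch {

    // Records every step that actually executed, so laziness is observable.
    static final List<String> log = new ArrayList<>();

    // "Transformations": build a description of the work without running it.
    static Stream<Integer> transformations(List<Integer> source) {
        return source.stream()
                .filter(x -> { log.add("filter " + x); return x % 2 == 0; })
                .distinct()
                .map(x -> { log.add("map " + x); return x * 10; });
    }

    // "Action": forces the pipeline to run and hands a result back to the caller.
    static List<Integer> collect(Stream<Integer> pipeline) {
        return pipeline.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = transformations(Arrays.asList(1, 2, 2, 3, 4));
        System.out.println("steps run before the action: " + log.size());
        System.out.println("result: " + collect(pipeline));
        System.out.println("steps run after the action: " + log.size());
    }
}
```

The log stays empty until collect is called, which mirrors how a chain of RDD transformations stays idle until an action is invoked.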
Lineage
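Lineage is what lets a lost result be rebuilt: each dataset remembers the step that produced it and the dataset it came from, so the chain can be replayed from the source. The following is a toy model in plain Java under that assumption, not Spark code; the names LineageSketch, Dataset, parallelize, map, compute, and lineage are hypothetical and chosen only to echo the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model of lineage; these classes and methods are hypothetical, not Spark's.
class LineageSketch {

    // A dataset that remembers its parent and the step that derives it.
    static final class Dataset {
        final Dataset parent;                     // null for the source dataset
        final List<Integer> source;               // set only on the source dataset
        final String stepName;                    // label used when printing the chain
        final Function<Integer, Integer> step;    // set only on derived datasets

        private Dataset(Dataset parent, List<Integer> source,
                        String stepName, Function<Integer, Integer> step) {
            this.parent = parent;
            this.source = source;
            this.stepName = stepName;
            this.step = step;
        }

        static Dataset parallelize(List<Integer> data) {
            return new Dataset(null, data, "parallelize", null);
        }

        // Records a new step; nothing is computed here.
        Dataset map(String name, Function<Integer, Integer> f) {
            return new Dataset(this, null, name, f);
        }

        // Rebuilds the values by replaying every recorded step from the source.
        List<Integer> compute() {
            if (parent == null) {
                return new ArrayList<>(source);
            }
            List<Integer> out = new ArrayList<>();
            for (Integer x : parent.compute()) {
                out.add(step.apply(x));
            }
            return out;
        }

        // Human-readable chain of steps, source first.
        String lineage() {
            return parent == null ? stepName : parent.lineage() + " -> " + stepName;
        }
    }

    public static void main(String[] args) {
        Dataset d = Dataset.parallelize(Arrays.asList(1, 2, 3))
                .map("t1", x -> x + 1)
                .map("t2", x -> x * 2);
        System.out.println(d.lineage());
        System.out.println(d.compute());
    }
}
```

Because no intermediate values are stored, calling compute a second time reproduces the same result from the source, which is the property that makes recomputation after a failure possible.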
https://github.com/mongodb/mongo-hadoop
{
    "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
    "body" : "the scrimmage is still up in the air...",
    "subFolder" : "notes_inbox",
    "mailbox" : "bass-e",
    "filename" : "450.",
    "headers" : {
        "X-cc" : "",
        "From" : "michael.simmons@enron.com",
        "Subject" : "Re: Plays and other information",
        "X-Folder" : "Eric_Bass_Dec2000Notes FoldersNotes inbox",
        "Content-Transfer-Encoding" : "7bit",
        "X-bcc" : "",
        "To" : "eric.bass@enron.com",
        "X-Origin" : "Bass-E",
        "X-FileName" : "ebass.nsf",
        "X-From" : "Michael Simmons",
        "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
        "X-To" : "Eric Bass",
        "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Mime-Version" : "1.0"
    }
}
{
"_id" : "gretchen.hardeway@enron.com|shirley.crenshaw@enron.com",
"value" : 2
}
{
"_id" : "kmccomb@austin-mccomb.com|brian@enron.com",
"value" : 2
}
{
"_id" : "sally.beck@enron.com|sandy.stone@enron.com",
"value" : 2
}
Eratosthenes
Democritus
Hypatia
Shemp
Euripides
Spark Configuration
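import org.apache.hadoop.conf.Configuration;

// Hadoop configuration telling the connector which input format to use and where to read from.
Configuration conf = new Configuration();
conf.set(
    "mongo.job.input.format",
    "com.mongodb.hadoop.MongoInputFormat"
);
conf.set(
    "mongo.input.uri",
    "mongodb://localhost:27017/db.collection"
);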
Spark Context
JavaPairRDD<Object, BSONObject> documents =
    context.newAPIHadoopRDD(
        conf,
        MongoInputFormat.class,
        Object.class,
        BSONObject.class
    );
mongos mongos
Data Services
Deployment Artifacts
Hadoop Connector Jar
Fat Jar
Java Driver Jar
Spark Submit
/usr/local/spark-1.5.1/bin/spark-submit \
  --class com.mongodb.spark.examples.DataframeExample \
  --master local Examples-1.0-SNAPSHOT.jar
// Map each (key, BSON document) pair to a Message built from its headers and body.
JavaRDD<Message> messages = documents.map(
    new Function<Tuple2<Object, BSONObject>, Message>() {
        public Message call(Tuple2<Object, BSONObject> tuple) {
            BSONObject header = (BSONObject) tuple._2().get("headers");
            Message m = new Message();
            m.setTo((String) header.get("To"));
            m.setX_From((String) header.get("From"));
            m.setMessage_ID((String) header.get("Message-ID"));
            m.setBody((String) tuple._2().get("body"));
            return m;
        }
    }
);
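The result documents shown earlier pair two addresses in "_id" with a count in "value", which suggests a per-pair message count. The sketch below shows that counting step in plain Java on an in-memory list, assuming the key is sender, then "|", then recipient; that ordering is a guess. The Message class here is a hypothetical stand-in with only two fields, the addresses are placeholders, and countPairs is an invented helper. On an RDD the same shape is usually expressed with mapToPair followed by reduceByKey.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local sketch of a per-pair count; not the deck's actual job.
class PairCountSketch {

    // Hypothetical stand-in for the deck's Message class: only the fields needed here.
    static final class Message {
        final String from;
        final String to;

        Message(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    // Builds a "from|to" key per message and sums one per occurrence.
    static Map<String, Integer> countPairs(List<Message> messages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Message m : messages) {
            counts.merge(m.from + "|" + m.to, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder addresses, not taken from the data set.
        List<Message> messages = Arrays.asList(
                new Message("a@example.com", "b@example.com"),
                new Message("a@example.com", "b@example.com"),
                new Message("b@example.com", "a@example.com"));
        System.out.println(countPairs(messages));
    }
}
```

Keeping the key as a single string matches the "_id" layout of the result documents; a two-field key would work equally well if the pair needs to be queried by either side.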
MongoDB & Spark
DEMO
THANKS!
{ Name: 'Bryan Reinero',
  Title: 'Developer Advocate',
  Twitter: '@blimpyacht',
  Email: 'bryan@mongdb.com' }

Editor's Notes

  • #29: A fault-tolerant collection of elements operated on in parallel, best suited for batch applications.
  • #45: MongoInputFormat allows us to read from a live MongoDB instance. We could also use BSONFileInputFormat to read BSON snapshots.
  • #46-#50: JavaPamongodbConfig