Blazing Fast Analytics with MongoDB & Spark
Muthu Chinnasamy
Senior Solutions Architect
muthu@mongodb.com
Twitter: @MuthuMongo
Agenda
The data challenge
Spark
Use Cases
Connectors
Demo
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.”
Eric Schmidt, 2010
“Apache Spark is the Taylor Swift of big data software.”
Derrick Harris, Fortune
What is Spark?
Fast and general computing engine for clusters
• Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, graph processing
• It’s fundamentally different to what’s come before
Why not just use Hadoop?
• Spark is FAST
–Faster to write.
–Faster to run.
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
A visual comparison
Hadoop Spark
RDD Operations
Transformations: map, filter, flatMap, mapPartitions, sample, union, join, groupByKey, reduceByKey
Actions: reduce, collect, count, save, lookupKey, take, foreach
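The split between transformations and actions can be sketched in plain Python. This is a toy model, not Spark's API: `ToyRDD`, `parallelize` and `flat_map` are names made up for the illustration. It only shows the idea that transformations build up a lazy chain and an action runs it.

```python
from functools import reduce as _reduce


class ToyRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions run the chain."""

    def __init__(self, source):
        # `source` is a zero-argument callable that returns a fresh iterator.
        self._source = source

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    def flat_map(self, fn):
        return ToyRDD(lambda: (y for x in self._source() for y in fn(x)))

    # Actions: pull data through the whole chain and return a result.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def reduce(self, fn):
        return _reduce(fn, self._source())


calls = []


def square(x):
    calls.append(x)  # record when the function really runs
    return x * x


rdd = ToyRDD.parallelize(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)        # nothing has executed so far
odd_squares = rdd.collect()             # the action triggers the computation
total = rdd.reduce(lambda a, b: a + b)  # a second action recomputes the chain
```

Because nothing is cached in this sketch, every action re-runs the whole chain, which is why `square` is called again for `reduce`.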
Spark higher-level libraries
Spark SQL, Spark Streaming, MLlib, GraphX
Spark (core)
Spark + MongoDB
Data Management
OLTP: applications, fine-grained operations, low latency
Offline Processing: analytics, data warehousing, high throughput
Spark + MongoDB top use cases:
– Business Intelligence
– Data Warehousing
– Recommendation
– Log processing
– User Facing Services
– Fraud detection
MongoDB and Spark
Spark reading directly from MongoDB
Aggregation pipeline to Pre-filter
Aggregation pipeline filter: $match
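A minimal sketch of what a `$match` pre-filter does, in plain Python. The sample documents and the `apply_match` helper are hypothetical and handle only simple field-equality conditions. In a real job the pipeline would be handed to the connector so that MongoDB does the filtering before data reaches Spark; how it is passed depends on the connector version and is not shown here.

```python
import json

# Hypothetical documents standing in for a MongoDB collection.
ratings = [
    {"user": "Buzz", "movie": "Toy Story", "rating": 4},
    {"user": "Buzz", "movie": "Frozen", "rating": 5},
    {"user": "Woody", "movie": "Toy Story", "rating": 4},
    {"user": "Jessie", "movie": "Star Wars", "rating": 5},
]

# An aggregation pipeline is a list of stages; $match keeps matching documents.
pipeline = [{"$match": {"movie": "Toy Story"}}]


def apply_match(docs, stage):
    """Emulate a $match stage for simple field-equality conditions only."""
    conditions = stage["$match"]
    return [d for d in docs if all(d.get(k) == v for k, v in conditions.items())]


# Only the matching documents would be shipped to Spark.
prefiltered = apply_match(ratings, pipeline[0])

# Pipelines are commonly passed around as JSON text.
pipeline_json = json.dumps(pipeline)
```

Filtering at the source means fewer documents cross the network and fewer partitions hold data that Spark would discard anyway.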
Spark writing directly to MongoDB
Fraud Detection
I'm so in love!
Me, too<3
Now send me your
CC number
?
Ok, XXXX-123-zzz
$$$
Fraud Detection
Sharing Workloads
Chat App
HDFS HDFS HDFS
Archiving
Data Crunching
Login
User Profile
Contacts
Messages
…
Fraud Detection
Segmentation
Recommendations
Spark
MongoDB + Spark Connector
MongoDB Spark Connector
https://spark-packages.org/?q=official+mongodb
MongoDB
Spark
Connector
MongoDB
Shard
Spark
MongoDB Spark Connector
https://github.com/mongodb/mongo-spark
Spark Streaming
Spark Streaming
Twitter Feed Spark
Spark Streaming
Twitter Feed
{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {
            "text": "freebandnames",
            "indices": [20, 34]
          }
        ],
        "user_mentions": []
      }
    }
  ]
}
Spark Streaming
{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {
            "text": "freebandnames",
            "indices": [20, 34]
          }
        ],
        "user_mentions": []
      }
    }
  ]
}
{
"time": "Mon Sep 24 03:35",
"freebandnames": 1
}
(three more tweets with the "freebandnames" hashtag arrive in the same minute, each with the same structure as the tweet above)
{
"time": "Mon Sep 24 03:35",
"freebandnames": 4
}
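The per-minute hashtag count shown above can be sketched in plain Python. This is not Spark Streaming code: `minute_bucket` and `count_hashtags` are hypothetical helpers, and a single in-memory list stands in for the stream of tweets.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout used in the tweet's "created_at" field.
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def minute_bucket(created_at):
    """Truncate a tweet timestamp to the minute, e.g. 'Mon Sep 24 03:35'."""
    ts = datetime.strptime(created_at, TWITTER_TIME)
    return ts.strftime("%a %b %d %H:%M")


def count_hashtags(tweets):
    """Return {minute: Counter(hashtag -> count)} for a batch of tweets."""
    counts = {}
    for tweet in tweets:
        bucket = minute_bucket(tweet["created_at"])
        for tag in tweet["entities"]["hashtags"]:
            counts.setdefault(bucket, Counter())[tag["text"]] += 1
    return counts


tweet = {
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "id_str": "250075927172759552",
    "entities": {
        "urls": [],
        "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
        "user_mentions": [],
    },
}

# Four tweets in the same minute collapse into one summary document.
counts = count_hashtags([tweet] * 4)
summary = [{"time": minute, **tags} for minute, tags in counts.items()]
```

Each summary document has the same shape as the ones on the slide and could be written back to MongoDB as is.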
Spark
Capped Collection
MongoDB and Spark Streaming feature
{
"time": "Mon Sep 24 03:35",
"freebandnames": 4
}
{
"time": "Mon Nov 5 09:40",
"mongoDBLondon": 400
}
{
"time": "Mon Nov 5 11:50",
"spark": 7556
}
{
"time": "Mon Nov 24 12:50",
"itshappening": 100
}
Tailable Cursor
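A toy model of the capped collection and tailable cursor pairing, in plain Python. `ToyCappedCollection` and `read_from` are hypothetical names, and the sketch is simplified: it bounds the collection by document count only and polls instead of waiting for new data the way a real tailable cursor can.

```python
from collections import deque


class ToyCappedCollection:
    """Toy model: fixed document count, insertion order, oldest evicted first."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)
        self._inserted = 0  # total inserts so far, used as a cursor position

    def insert(self, doc):
        self._docs.append(doc)
        self._inserted += 1

    def read_from(self, position):
        """Return (documents inserted at or after `position`, next position)."""
        oldest = self._inserted - len(self._docs)
        start = max(position, oldest)  # skip anything already evicted
        docs = list(self._docs)[start - oldest:]
        return docs, self._inserted


coll = ToyCappedCollection(max_docs=3)
coll.insert({"time": "Mon Sep 24 03:35", "freebandnames": 4})
coll.insert({"time": "Mon Nov 5 09:40", "mongoDBLondon": 400})
first_batch, position = coll.read_from(0)

# The consumer remembers its position and later picks up only new documents.
coll.insert({"time": "Mon Nov 5 11:50", "spark": 7556})
coll.insert({"time": "Mon Nov 24 12:50", "itshappening": 100})
second_batch, position = coll.read_from(position)
```

The fourth insert pushes the oldest document out, which is the behavior that keeps a capped collection at a fixed size while a consumer keeps tailing it.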
MongoDB + Spark MLlib Demo
Collaborative Filtering
• Two parts
• Collaborative: uses rating preferences from many users
• Filtering: recommends items a user is likely to prefer
UserId / MovieId Star Wars Toy Story Frozen
Buzz 4 4 5
Woody 5 4
Jessie 5 ?
Movie Ratings as a matrix
MLlib ALS
• Approximate into User & Movie latent factor matrices
UserId / MovieId: Frozen, Toy Story, Star Wars
Buzz 4 4 5
Woody 5 4
Jessie 5
User factors:
Buzz x y
Woody x y
Jessie x y
Movie factors (Star Wars, Toy Story, Frozen):
x x x
y y y
rij ≈ f(i) · f(j)
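The factorization idea can be sketched with a tiny alternating least squares loop in plain Python. This is not MLlib's implementation. It uses a single latent factor per user and per movie instead of the two (x, y) on the slide, the regularization value is arbitrary, and the columns chosen for Woody's and Jessie's ratings are an assumption, since the slide text does not preserve them.

```python
# Observed ratings keyed by (user, movie). Which columns Woody's and Jessie's
# ratings belong to is an assumption made for this example.
ratings = {
    ("Buzz", "Star Wars"): 4, ("Buzz", "Toy Story"): 4, ("Buzz", "Frozen"): 5,
    ("Woody", "Star Wars"): 5, ("Woody", "Toy Story"): 4,
    ("Jessie", "Star Wars"): 5,
}
users = ["Buzz", "Woody", "Jessie"]
movies = ["Star Wars", "Toy Story", "Frozen"]


def loss(u, v, reg):
    """Squared error on observed ratings plus an L2 penalty on the factors."""
    fit = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    penalty = reg * (sum(x * x for x in u.values()) + sum(x * x for x in v.values()))
    return fit + penalty


def als_rank1(reg=0.1, iterations=20):
    """Alternate between solving user factors and movie factors in closed form."""
    u = {i: 1.0 for i in users}
    v = {j: 1.0 for j in movies}
    for _ in range(iterations):
        # Hold movie factors fixed; each user's factor is a 1-D least squares fit.
        for i in users:
            rated = [(j, r) for (ui, j), r in ratings.items() if ui == i]
            u[i] = sum(r * v[j] for j, r in rated) / (reg + sum(v[j] ** 2 for j, _r in rated))
        # Hold user factors fixed; solve each movie's factor the same way.
        for j in movies:
            rated = [(i, r) for (i, mj), r in ratings.items() if mj == j]
            v[j] = sum(r * u[i] for i, r in rated) / (reg + sum(u[i] ** 2 for i, _r in rated))
    return u, v


initial_loss = loss({i: 1.0 for i in users}, {j: 1.0 for j in movies}, 0.1)
u, v = als_rank1()
final_loss = loss(u, v, 0.1)

# Predicted rating for a pair that was never observed.
predicted = u["Jessie"] * v["Frozen"]
```

Each half-step minimizes the regularized loss exactly for one side while the other is held fixed, so the loss never increases from one iteration to the next.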
Prediction Process
• Load movie ratings data from MongoDB
• Reflect and Infer the input formats for the ALS algorithm
• Split the data
–80% for training and 20% for validating the model
• Calculate the best model using ALS algorithm
–Build/train a User Movie matrix model
• Combine the data with user preferences and retrain the model
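The split and validation steps can be sketched in plain Python. The rating rows, the seed and both helper names are hypothetical, and this exact 80/20 cut is a simplification made for the example rather than a description of how Spark splits data.

```python
import math
import random


def split_ratings(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows reproducibly and cut them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(pairs):
    """Root mean squared error over (actual, predicted) rating pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


# Hypothetical (user_id, movie_id, rating) rows standing in for data loaded from MongoDB.
rows = [(user, movie, 1 + (user + movie) % 5) for user in range(5) for movie in range(2)]

training, validation = split_ratings(rows)

# A model trained on `training` would be scored on `validation`, for example by RMSE.
example_error = rmse([(4, 4), (5, 3)])
```

Candidate models are trained on the 80% and compared on the held-out 20%, and the one with the lowest validation error is kept as the best model.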
Explore as a Databricks Notebook
http://cdn2.hubspot.net/hubfs/438089/notebooks/MongoDB_guest_blog/Using_MongoDB_Connector_for_Spark.html
MongoDB + Spark Case Study
China Eastern Airlines – Fare Engine
130K seats, 180 million fares & 1.6 billion daily searches
Spark and MongoDB
• An extremely powerful combination
• Many possible use cases
• Some operations are actually faster if performed using the Aggregation Framework
• Evolving all the time
Questions?
Muthu Chinnasamy
muthu@mongodb.com
@muthumongo