MongoDB and Spark
Getting MongoDB and Spark to play nice together
Agenda
What we will unravel today!
• What Is Spark: Overview, Spark Stack
• Spark + MongoDB: How to set up MongoDB and Spark? Integration
• Use Cases / Demo: Data science, Analytics, Others
Howdy!
Who's this guy?
Norberto Leite
Lead Engineer
@nleite
MongoDB
https://university.mongodb.com
What is Spark?
Interactive Shell
Easy[ier] API
Caching
Delivering User Relevancy
• Integrate data from many sources
• Fast-cycle analytics
• Real-time
• Reliable
Wearable Devices
Embedded Systems
Internet of Things
Embedded Medical Devices
Access complete patient history
Avoid conflicting prescriptions
Clinical trials
wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzvf spark-2.4.0-bin-hadoop2.7.tgz
spark-2.4.0-bin-hadoop2.7/bin/pyspark
Python 2.7.10 (default, Aug 17 2018, 17:41:52)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin
...
Using Python version 2.7.10 (default, Aug 17 2018 17:41:52)
SparkSession available as 'spark'.
>>>
Getting started with Spark - pyspark
spark-2.4.0-bin-hadoop2.7/bin/spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
...
scala>
Getting started with Spark - spark-shell (scala)
Spark Stack
• Spark SQL: seamless integration with SQL using the DataFrame API. Also supports Hive SQL.
• Spark Streaming: fast feed data processing API. Designed for fault tolerance; bridges streaming with batch processing.
• MLlib: Spark's bag of tricks for machine learning algorithms.
• GraphX: Spark's graph library.
All four sit on top of Apache Spark.
Distributed Data: HDFS
Distributed Resources: YARN, Mesos, Spark Stand Alone
Distributed Processing: Hadoop, Spark
Domain Specific Languages: Hive, Pig, Spark SQL, Spark Shell, Spark Streaming
How can we use
Spark with MongoDB?
Parallelism
Machine Learning
Stream Processing
Aggregation
Native Processing
Horizontal Scaling
https://github.com/mongodb/mongo-spark
MongoDB Spark Connector
https://docs.mongodb.com/spark-connector
What do we need?
• Apache Spark (Spark SQL, Spark Streaming)
• MongoDB Spark Connector
• Input Cluster
• Output Cluster
Which can be the same!
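The pieces listed above could be wired together roughly as follows. This is only a sketch under assumptions: the helper names, the `mongodb://127.0.0.1/...` URIs, and the database and collection names are hypothetical, and the connector coordinate is assumed to be the 2.4.0 build for Scala 2.11 that pairs with the Spark 2.4.0 download used earlier. `pyspark` is imported lazily so the settings can be inspected without a Spark installation.

```python
# Assumed Maven coordinate of the connector build for Spark 2.4 / Scala 2.11.
CONNECTOR_PACKAGE = "org.mongodb.spark:mongo-spark-connector_2.11:2.4.0"


def mongo_spark_options(input_uri, output_uri):
    """Return the Spark settings that point the connector at the two clusters."""
    return {
        "spark.jars.packages": CONNECTOR_PACKAGE,
        "spark.mongodb.input.uri": input_uri,    # input cluster
        "spark.mongodb.output.uri": output_uri,  # output cluster
    }


def build_session(options, app_name="mongo-spark-sketch"):
    """Create a SparkSession carrying the given settings (needs pyspark)."""
    from pyspark.sql import SparkSession  # lazy import, only needed here

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def copy_collection(spark):
    """Read the input collection as a DataFrame and append it to the output."""
    source = "com.mongodb.spark.sql.DefaultSource"
    df = spark.read.format(source).load()
    df.write.format(source).mode("append").save()


# Hypothetical URIs: both sides may live on the same cluster.
options = mongo_spark_options(
    "mongodb://127.0.0.1/demo.messages",
    "mongodb://127.0.0.1/demo.results",
)
```

With a MongoDB instance reachable at those URIs, `copy_collection(build_session(options))` would copy `demo.messages` into `demo.results`.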
Hive, Pig, Spark SQL, Spark Shell, Spark Streaming
Hadoop, Spark
YARN, Mesos, Spark Stand Alone
MongoDB Spark-Connector
What we can do with
Spark and MongoDB
Spark SQL
DataFrames
https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
https://spark.apache.org/docs/2.4.0/sql-programming-guide.html
MongoDB Spark-Connector
https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Fraud Detection
I'm so in love!
Me, too <3
Now send me your CC number
?
Ok, XXXX-123-zzz
$$$
Workloads
• Chat App: Login, User Profile, Contacts, Messages, …
• Spark: Fraud Detection, Segmentation, Recommendations
• HDFS: Archiving, Data Crunching
Workloads
• Chat App
• Spark: Real-time data processing
• HDFS
Spark Streaming
Spark
Twitter Feed
{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {
            "text": "freebandnames",
            "indices": [20, 34]
          }
        ],
        "user_mentions": []
      }
    }
  ]
}
Spark
One tweet with the hashtag "freebandnames" arrives in the minute:
{
  "time": "Mon Sep 24 03:35",
  "freebandnames": 1
}
Three more tweets with the same payload arrive in the same minute:
{
  "time": "Mon Sep 24 03:35",
  "freebandnames": 4
}
Clear, right?
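To make the aggregation above concrete, here is a minimal sketch in plain Python rather than Spark Streaming. It assumes each payload follows the Twitter-style structure shown in the slides; the function name and the one-minute window are illustrative choices, not something taken from the talk's code.

```python
from collections import Counter
from datetime import datetime

# Format of the "created_at" field, e.g. "Mon Sep 24 03:35:21 +0000 2012".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def hashtag_counts_per_minute(payloads):
    """Count hashtag occurrences per one-minute window across payloads."""
    windows = {}
    for payload in payloads:
        for status in payload["statuses"]:
            created = datetime.strptime(status["created_at"], TWITTER_TIME_FORMAT)
            # Truncate the timestamp to the minute to get the window key.
            window = created.strftime("%a %b %d %H:%M")
            bucket = windows.setdefault(window, Counter())
            for tag in status["entities"]["hashtags"]:
                bucket[tag["text"]] += 1
    # One output document per window: the window label plus one field per hashtag.
    return [{"time": window, **bucket} for window, bucket in windows.items()]


tweet = {
    "statuses": [
        {
            "created_at": "Mon Sep 24 03:35:21 +0000 2012",
            "id_str": "250075927172759552",
            "entities": {
                "urls": [],
                "hashtags": [{"text": "freebandnames", "indices": [20, 34]}],
                "user_mentions": [],
            },
        }
    ]
}

print(hashtag_counts_per_minute([tweet]))
print(hashtag_counts_per_minute([tweet] * 4))
```

In Spark Streaming the same grouping would be expressed with a window over the incoming stream instead of a dictionary held in memory.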
Demo Time
Steps Describing Demo
● SRT text messages in the network
● Spark collects those messages
○ Defines a processing Window
○ Performs word count
● Store DataFrame into MongoDB
https://github.com/nleite/mdb.local
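The demo steps could look roughly like the sketch below. It is not the code from the demo repository: the socket source, host and port, window sizes, MongoDB URI, and function names are all assumptions made for illustration, and the connector coordinate is assumed to be the 2.4.0 build for Scala 2.11. `pyspark` is imported inside `run_demo` so the tokenizer can be tried without a Spark installation.

```python
def split_words(line):
    """Lower-case a message and split it on whitespace."""
    return line.lower().split()


def run_demo(host="localhost", port=9999, window_seconds=30, slide_seconds=10):
    """Word count over a sliding window, appended to MongoDB (needs pyspark)."""
    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext

    spark = (
        SparkSession.builder.appName("wordcount-to-mongodb")
        # Assumed connector coordinate and a hypothetical output collection.
        .config("spark.jars.packages",
                "org.mongodb.spark:mongo-spark-connector_2.11:2.4.0")
        .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/demo.wordcounts")
        .getOrCreate()
    )
    # The batch interval equals the slide, so both window sizes are multiples of it.
    ssc = StreamingContext(spark.sparkContext, slide_seconds)

    # Spark collects the messages (here: lines of text from a socket).
    lines = ssc.socketTextStream(host, port)

    # Define the processing window, then perform the word count.
    counts = (
        lines.window(window_seconds, slide_seconds)
        .flatMap(split_words)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )

    def save(time, rdd):
        """Store each window's counts as a DataFrame in MongoDB."""
        if rdd.isEmpty():
            return
        df = spark.createDataFrame(rdd, ["word", "count"])
        df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

    counts.foreachRDD(save)
    ssc.start()
    ssc.awaitTermination()
```

Calling `run_demo()` would block and process messages until the streaming context is stopped.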
In Short
● An extremely powerful combination
● Many possible use cases
● Evolving all the time
Spark and
MongoDB
Norberto Leite
Lead Engineer
norberto@mongodb.com
@nleite