MongoDB and Spark
Getting MongoDB and Spark to play nice together
MongoDB.local Dallas 2019: MongoDB and Spark
Agenda
What we will unravel today!
What Is Spark
Overview
Spark Stack
Spark + MongoDB
How to set up MongoDB and Spark?
Integration
Use Cases / Demo
Data Science, Analytics
Others
Howdy!
Who's this guy?
Norberto Leite
Lead Engineer
@nleite
MongoDB
https://university.mongodb.com
What is Spark?
Interactive Shell
Easy[ier] API
Caching
Delivering User Relevancy
• Integrate data from many sources
• Fast-cycle analytics
• Real-time
• Reliable
Wearable Devices
Embedded Systems
Internet of Things
Embedded Medical Devices
Access complete patient history
Avoid conflicting prescriptions
Clinical trials
wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzvf spark-2.4.0-bin-hadoop2.7.tgz
spark-2.4.0-bin-hadoop2.7/bin/pyspark
Python 2.7.10 (default, Aug 17 2018, 17:41:52)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin
...
Using Python version 2.7.10 (default, Aug 17 2018 17:41:52)
SparkSession available as 'spark'.
>>>
Getting started with Spark - pyspark
spark-2.4.0-bin-hadoop2.7/bin/spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
...
scala>
Getting started with Spark - spark-shell (scala)
Spark Stack
Apache Spark: Spark SQL, Spark Streaming, MLlib, GraphX
Spark SQL: Seamless integration with SQL using the DataFrame API. Also supports Hive SQL.
Spark Streaming: Fast feed data processing API. Designed for fault tolerance; bridges streaming with batch processing.
MLlib: Spark's bag of tricks for machine learning algorithms.
GraphX: Spark's graph library.
Distributed Data: HDFS
Distributed Resources: Spark Stand Alone, YARN, Mesos
Distributed Processing: Hadoop, Spark
Domain Specific Languages: Hive, Pig, Spark SQL, Spark Shell, Spark Streaming
How can we use
Spark with MongoDB?
Parallelism
Machine Learning
Stream Processing
Aggregation
Native Processing
Horizontal Scaling
https://github.com/mongodb/mongo-spark
MongoDB Spark Connector
https://docs.mongodb.com/spark-connector
What do we need?
Apache Spark: Spark SQL, Spark Streaming
MongoDB Spark Connector
Input Cluster
Output Cluster
Which can be the same!
[Diagram: YARN, Mesos, Spark Stand Alone; Hadoop, Spark; Hive, Pig, Spark SQL, Spark Shell, Spark Streaming; MongoDB Spark-Connector]
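The pieces listed above (Spark, the connector, an input cluster, and an output cluster that may be the same one) could be wired together roughly as follows. This is a minimal sketch, not the speaker's setup: the URI, database, and collection names are placeholders, and the connector coordinates are assumed to match the Spark 2.4.0 / Scala 2.11 download shown earlier. The `spark.mongodb.input.uri` and `spark.mongodb.output.uri` keys and the `mongo` format name follow the 2.x connector documentation as far as I know; check them against the version you install.

```python
# Placeholder URI: host, database ("test") and collection ("messages") are hypothetical.
MONGO_URI = "mongodb://127.0.0.1/test.messages"

# Assumed connector build for Spark 2.4.x with Scala 2.11.
CONNECTOR_PACKAGE = "org.mongodb.spark:mongo-spark-connector_2.11:2.4.0"


def connector_conf(input_uri=MONGO_URI, output_uri=None):
    """Return Spark settings for the connector.

    When no output URI is given, the output cluster is the input cluster.
    """
    return {
        "spark.jars.packages": CONNECTOR_PACKAGE,
        "spark.mongodb.input.uri": input_uri,
        "spark.mongodb.output.uri": output_uri or input_uri,
    }


def build_session(conf, app_name="mongo-spark-sketch"):
    """Create a SparkSession carrying the connector settings (requires pyspark)."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def read_input(spark):
    """Load the collection named in spark.mongodb.input.uri as a DataFrame."""
    return spark.read.format("mongo").load()


def write_output(df):
    """Append a DataFrame to the collection named in spark.mongodb.output.uri."""
    df.write.format("mongo").mode("append").save()
```

With pyspark available, `spark = build_session(connector_conf())` followed by `read_input(spark)` would return a DataFrame backed by the placeholder collection.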
What we can do with
Spark and MongoDB
Spark SQL
DataFrames
https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
https://spark.apache.org/docs/2.4.0/sql-programming-guide.html
MongoDB Spark-Connector
Fraud Detection
I'm so in love!
Me, too <3
Now send me your CC number
?
Ok, XXXX-123-zzz
$$$
Workloads
Chat App
Login
User Profile
Contacts
Messages
…
Spark
Fraud Detection
Segmentation
Recommendations
HDFS HDFS HDFS Archiving
Data Crunching
Workloads
Chat App
Spark
Real-time data processing
HDFS HDFS HDFS
Spark Streaming
Spark
Twitter Feed
{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {
            "text": "freebandnames",
            "indices": [20, 34]
          }
        ],
        "user_mentions": []
      }
    }
  ]
}
Spark
{
  "statuses": [
    {
      "coordinates": null,
      "favorited": false,
      "truncated": false,
      "created_at": "Mon Sep 24 03:35:21 +0000 2012",
      "id_str": "250075927172759552",
      "entities": {
        "urls": [],
        "hashtags": [
          {
            "text": "freebandnames",
            "indices": [20, 34]
          }
        ],
        "user_mentions": []
      }
    }
  ]
}
{
"time": "Mon Sep 24 03:35",
"freebandnames": 1
}
[The same status arrives three more times within the same minute]
{
"time": "Mon Sep 24 03:35",
"freebandnames": 4
}
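The per-minute hashtag counts shown above can be reproduced with a small plain-Python sketch. It is not Spark Streaming code and not taken from the talk; it only illustrates the aggregation, assuming each payload has the `statuses` / `created_at` / `entities.hashtags` shape of the sample and that timestamps use the Twitter format from the slide. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Format of "Mon Sep 24 03:35:21 +0000 2012" as it appears in the sample payload.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def minute_bucket(created_at):
    """Truncate a Twitter timestamp to the minute, e.g. 'Mon Sep 24 03:35'."""
    parsed = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    return parsed.strftime("%a %b %d %H:%M")


def count_hashtags(payloads):
    """Count hashtag occurrences per minute across a sequence of payloads.

    Returns one document per minute: {"time": <minute>, <hashtag>: <count>, ...}.
    A hashtag literally named "time" would overwrite the "time" key; this sketch
    does not guard against that.
    """
    counts = defaultdict(Counter)
    for payload in payloads:
        for status in payload["statuses"]:
            bucket = minute_bucket(status["created_at"])
            for tag in status["entities"]["hashtags"]:
                counts[bucket][tag["text"]] += 1
    return [{"time": bucket, **dict(tags)} for bucket, tags in counts.items()]
```

Feeding the sample status once gives a count of 1 for `freebandnames`; feeding it four times gives 4, matching the two result documents on the slide.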
Clear, right?
Demo Time
Steps Describing Demo
● SRT text messages in the network
● Spark collects those messages
○ Defines a processing Window
○ Performs word count
● Store DataFrame into MongoDB
https://github.com/nleite/mdb.local
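The window-then-count steps above can be sketched in plain Python. This is a stand-in for the idea only: it does not use Spark and is not the code from the demo repository. The input shape (pairs of epoch seconds and message text), the 10-second window, and the output field names are all assumptions made for illustration. Each result is a flat document of the kind that could be stored in a MongoDB collection.

```python
from collections import Counter


def window_start(timestamp, window_seconds):
    """Return the start of the fixed (tumbling) window containing the timestamp."""
    return timestamp - (timestamp % window_seconds)


def windowed_word_count(messages, window_seconds=10):
    """Count words per fixed-size window.

    messages: iterable of (epoch_seconds, text) pairs (hypothetical input shape).
    Returns documents sorted by window start, then by word.
    """
    windows = {}
    for timestamp, text in messages:
        start = window_start(timestamp, window_seconds)
        # Lower-case and split on whitespace; punctuation handling is omitted.
        windows.setdefault(start, Counter()).update(text.lower().split())
    return [
        {"window_start": start, "word": word, "count": count}
        for start in sorted(windows)
        for word, count in sorted(windows[start].items())
    ]
```

In Spark the same grouping would typically be expressed with a window over the event timestamp plus a group-by on the word, and the resulting DataFrame written out through the connector.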
In Short
● An extremely powerful combination
● Many possible use cases
● Evolving all the time
Spark and MongoDB
Norberto Leite
Lead Engineer
norberto@mongodb.com
@nleite