#MDBlocal
Bryan Reinero, Product Manager
Analytics and Machine Learning with Spark and MongoDB
Level Setting
Parallelism
Machine Learning
Stream
Aggregation
Native
Processing
TROUGH OF DISILLUSIONMENT
Interactive Shell
Easy[ier] API
Caching
Distributed Data: HDFS
Distributed Resources: YARN, Mesos, Spark Stand Alone
Distributed Processing: Hadoop, Spark
Domain Specific Languages: Hive, Pig, Spark, Spark Shell, Spark Streaming
https://github.com/mongodb/mongo-spark
Clustering Algorithms
K-Means Clustering
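As a rough illustration of what K-Means clustering does, here is a minimal pure-Python sketch of Lloyd's algorithm on 2-D points. It is not the Spark MLlib implementation used in the demo; the function name `kmeans` and its parameters are assumptions made for this example only.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k randomly chosen input points (illustrative initialisation).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(sample, k=2))
```

The two steps (assign, then recompute means) repeat for a fixed number of iterations here; a production implementation would normally stop once the assignments no longer change.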
Data Refuge
https://www.datarefuge.org/dataset/15-minute-precipitation-data-dsi-3260
{
"_id" : "01006303201304010015QPCP",
"value" : 0,
"station" : "006303",
"time" : "0015",
"state" : "01",
"year" : 2013,
"month" : 4,
"day" : 1
}
Precipitation Data
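In the sample precipitation document, the `_id` string appears to be the other fields concatenated: state, station, year, month, day and time, followed by a `QPCP` suffix. The sketch below splits an `_id` on that assumption. The layout is inferred from this single sample rather than from a published schema, and the key name `element` for the suffix is hypothetical.

```python
def parse_precip_id(doc_id):
    """Split a precipitation _id into its apparent components.

    Assumes the fixed-width layout seen in the sample document:
    state(2) station(6) year(4) month(2) day(2) time(4) suffix.
    """
    return {
        "state": doc_id[0:2],
        "station": doc_id[2:8],
        "year": int(doc_id[8:12]),
        "month": int(doc_id[12:14]),
        "day": int(doc_id[14:16]),
        "time": doc_id[16:20],
        "element": doc_id[20:],  # hypothetical name for the trailing "QPCP"
    }


if __name__ == "__main__":
    print(parse_precip_id("01006303201304010015QPCP"))
```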
{
"_id" : NumberLong(10957),
"NCDCSTN_ID" : "20000282",
"NWSLI_ID" : "BOZA1",
"GHCND_ID" : "USC00010957",
"NAME_COOP_SHORT" : "BOAZ",
"FIPS_COUNTRY_NAME" : "UNITED STATES",
"STATE_PROV" : "AL",
"COUNTY" : "MARSHALL",
"NWS_CLIM_DIV" : "02",
"NWS_CLIM_DIV_NAME" : "APPALACHIAN MOUNTAINS",
"LON_DEC" : -86.1633,
"LAT_DEC" : 34.2008,
"LAT_LON_PRECISION" : "DDdddd",
"ELEV_GROUND" : "1070",
"ELEV_GROUND_UNIT" : "FEET",
"UTC_OFFSET" : -6,
"NWS_REGION" : "SOUTHERN",
"NWS_WFO" : "HUN",
"COOP_SOD" : "Y",
"COOP_HPD" : "Y"
}
Weather Station Data
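Clustering needs numeric vectors, and in the station document `ELEV_GROUND` is stored as a string. One possible way to turn a station document into a feature vector is sketched below. The choice of latitude, longitude and elevation as features is an assumption for illustration; the slides do not say which fields the demo clusters on.

```python
def station_features(station):
    """Return (latitude, longitude, elevation) as floats for one station document."""
    return (
        float(station["LAT_DEC"]),
        float(station["LON_DEC"]),
        float(station["ELEV_GROUND"]),  # stored as a string such as "1070"
    )


if __name__ == "__main__":
    boaz = {"LAT_DEC": 34.2008, "LON_DEC": -86.1633, "ELEV_GROUND": "1070"}
    print(station_features(boaz))
```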
./spark-2.2.0/bin/spark-shell \
  --conf "spark.mongodb.input.uri=mongodb://10.11.12.13/kmeans.log" \
  --conf "spark.mongodb.output.uri=mongodb://10.11.12.13/kmeans.cluster" \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0
./spark-2.1.0-bin-hadoop2.7/bin/spark-shell \
  --conf "spark.mongodb.input.uri=mongodb://10.11.12.13/kmeans.log" \
  --conf "spark.mongodb.output.uri=mongodb://10.11.12.13/kmeans.cluster" \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0
./spark-2.1.0-bin-hadoop2.7/bin/spark-shell \
  --conf "spark.mongodb.input.uri=mongodb://10.0.0.10/kmeans.log" \
  --conf "spark.mongodb.output.uri=mongodb://10.0.0.10/kmeans.cluster" \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0
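The connector URIs in these commands end in a `database.collection` pair, so `kmeans.log` is read as the `log` collection of the `kmeans` database and results go to `kmeans.cluster`. A small sketch that pulls those parts out of such a URI follows; the helper name `split_namespace` is made up for this example, and it handles only the simple URI shape shown on the slides (no credentials or query options).

```python
from urllib.parse import urlsplit


def split_namespace(uri):
    """Return (host, database, collection) from a mongodb://host/db.coll URI."""
    parts = urlsplit(uri)
    # The path looks like "/kmeans.log": database before the first dot,
    # collection after it.
    database, _, collection = parts.path.lstrip("/").partition(".")
    return parts.hostname, database, collection


if __name__ == "__main__":
    print(split_namespace("mongodb://10.11.12.13/kmeans.log"))
    print(split_namespace("mongodb://10.11.12.13/kmeans.cluster"))
```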
Demo Time
Resources
MongoDB University Course M233
Spark Connector Examples in Repo
