MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop 
Luke Lovett 
Software Engineer, MongoDB
Agenda 
• Complementary Approaches to Data 
• MongoDB & Hadoop Use Cases 
• MongoDB Connector Overview and Features 
• Demo
Complementary Approaches to Data
Operational: MongoDB 
(Diagram: a spectrum of use cases: Real-Time Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis.)
MongoDB 
• Store and read data frequently 
• Easy administration 
• Built-in analytical tools 
– aggregation framework 
– JavaScript MapReduce 
– Geo/text indexes
Analytical: Hadoop 
(Diagram: the same spectrum of use cases as on the "Operational: MongoDB" slide.)
Hadoop 
The Apache Hadoop software library is a framework that allows for the 
distributed processing of large data sets across clusters of computers 
using simple programming models. 
• Terabyte and Petabyte datasets 
• Data warehousing 
• Advanced analytics
Operational vs. Analytical: Lifecycle 
(Diagram: the same spectrum of use cases as on the "Operational: MongoDB" slide.)
MongoDB & Hadoop Use Cases
Batch Aggregation 
(Diagram: applications, powered by MongoDB, exchange data with analysis jobs, powered by Hadoop, through the MongoDB Connector for Hadoop.)
● Need more than MongoDB aggregation 
● Need offline processing 
● Results sent back to MongoDB 
● Can be left as BSON on HDFS for further analysis
Commerce 
Applications (powered by MongoDB)
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis (powered by Hadoop)
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
(Diagram: the two sides are linked by the MongoDB Connector for Hadoop.)
Fraud Detection 
(Diagram labels: Payments; Online payments processing; 3rd Party Data Sources; Nightly Analysis; Fraud modeling; MongoDB Connector for Hadoop; Results Cache; Fraud Detection. Two of the connections are marked "query only".)
MongoDB Connector for Hadoop
Connector Overview 
(Diagram: Hadoop (MapReduce, Hive, Pig, Spark) works with text files and BSON files on HDFS / S3, and the Hadoop Connector links it to MongoDB (single node, replica set, or cluster).)
Supported distributions: Apache Hadoop / Cloudera CDH / Hortonworks HDP / Amazon EMR
Data Movement 
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• Dynamic queries
– Work with the most recent data
– Put load on the operational database
• BSON snapshots
– Move load to Hadoop
– Add predictable load to MongoDB
Connector Operation 
1. Split according to given InputFormat
- many options available for reading from live cluster
- configure key pattern, split strategy
2. Write splits file
3. Output to BSON file or live MongoDB
- BSON file splits written automatically for future tasks
- Mongo insertion round-robin across collections
Getting Splits 
• Split on a sharded cluster 
– Split by chunk 
– Split by shard 
• Splits on replica set/standalone
– splitVector command 
• BSON files 
– specify max docs 
– split per input file 
(Diagram: the MongoDB Connector for Hadoop reading from a sharded cluster made up of config servers, a mongos, and three shards, each holding several chunks.)
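The split strategies listed above can be pictured with a small, self-contained sketch. It is not connector code: the function names `doc_splits` and `chunk_splits` and the `cluster` layout are hypothetical, chosen only to show how "specify max docs", "split by chunk", and "split by shard" divide the same input differently.

```python
from typing import Dict, Iterator, List, Tuple


def doc_splits(total_docs: int, max_docs: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) document ranges of at most max_docs each."""
    if max_docs <= 0:
        raise ValueError("max_docs must be positive")
    for start in range(0, total_docs, max_docs):
        yield (start, min(start + max_docs, total_docs))


def chunk_splits(chunks_by_shard: Dict[str, List[str]], by: str = "chunk") -> List[List[str]]:
    """Group chunk ids into splits: one split per chunk, or one per shard."""
    if by == "chunk":
        return [[c] for chunks in chunks_by_shard.values() for c in chunks]
    if by == "shard":
        return [list(chunks) for chunks in chunks_by_shard.values()]
    raise ValueError("by must be 'chunk' or 'shard'")


# Hypothetical cluster layout: two shards with three chunks each.
cluster = {"shard0": ["c0", "c1", "c2"], "shard1": ["c3", "c4", "c5"]}

print(list(doc_splits(10, 4)))             # BSON file: at most 4 docs per split
print(len(chunk_splits(cluster, "chunk")))  # one split per chunk
print(len(chunk_splits(cluster, "shard")))  # one split per shard
```

Splitting by chunk gives more, smaller tasks; splitting by shard gives fewer, larger ones.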
MapReduce Configuration 
• MongoDB input 
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
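One way to supply these settings is a Hadoop configuration XML file, which uses the standard `<configuration>/<property>/<name>/<value>` layout. The sketch below generates one with the Python standard library. The helper name `hadoop_conf_xml` is hypothetical, and the class names assume the connector's `com.mongodb.hadoop` package.

```python
import xml.etree.ElementTree as ET


def hadoop_conf_xml(props: dict) -> str:
    """Render a dict of settings as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Settings from the slide; host, database, and collection names are the
# slide's placeholders, not real endpoints.
mongo_props = {
    "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
    "mongo.input.uri": "mongodb://mydb:27017/db1.collection1",
    "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
    "mongo.output.uri": "mongodb://mydb:27017/db1.collection2",
}

xml_text = hadoop_conf_xml(mongo_props)
print(xml_text)
```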
MapReduce Configuration 
• BSON input/output 
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
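The same properties can also be passed on the command line as Hadoop generic options (`-D name=value`), provided the job's driver parses them, for example via `ToolRunner`. The sketch below only assembles such a command string. The jar name `my-job.jar` and the class `com.example.MyJob` are hypothetical placeholders, and the class names again assume the `com.mongodb.hadoop` package.

```python
import shlex


def hadoop_jar_command(jar: str, main_class: str, props: dict) -> str:
    """Build a 'hadoop jar' command line with one -D option per property."""
    args = ["hadoop", "jar", jar, main_class]
    for name, value in props.items():
        args += ["-D", f"{name}={value}"]
    # Quote each argument so the result is safe to paste into a shell.
    return " ".join(shlex.quote(a) for a in args)


bson_props = {
    "mongo.job.input.format": "com.mongodb.hadoop.BSONFileInputFormat",
    "mapred.input.dir": "hdfs:///tmp/database.bson",
    "mongo.job.output.format": "com.mongodb.hadoop.BSONFileOutputFormat",
    "mapred.output.dir": "hdfs:///tmp/output.bson",
}

command = hadoop_jar_command("my-job.jar", "com.example.MyJob", bson_props)
print(command)
```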
Spark Usage 
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file API
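In PySpark, the three bullets above might look like the following sketch. It relies on `SparkContext.newAPIHadoopRDD` and `RDD.saveAsNewAPIHadoopFile`, which are part of PySpark. The helper names, the key/value class choices (`Text`, `MapWritable`), and the placeholder output path are assumptions, and the connector jars would need to be on Spark's classpath for it to run against a real cluster.

```python
def mongo_input_conf(uri: str) -> dict:
    """Configuration passed to the input format; uri names db.collection."""
    return {"mongo.input.uri": uri}


def load_mongo_rdd(sc, uri: str):
    """Load a collection as an RDD. `sc` is expected to be a pyspark.SparkContext."""
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",          # assumed key type
        valueClass="org.apache.hadoop.io.MapWritable",  # assumed value type
        conf=mongo_input_conf(uri),
    )


def save_mongo_rdd(rdd, uri: str) -> None:
    """Save a pair RDD to a collection through the Hadoop file API."""
    rdd.saveAsNewAPIHadoopFile(
        path="file:///unused",  # placeholder; the destination comes from the URI in conf
        outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
        conf={"mongo.output.uri": uri},
    )
```

A typical call would be `load_mongo_rdd(sc, "mongodb://mydb:27017/db1.collection1")`, reusing the URI form from the MapReduce configuration.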
Hive Support 
CREATE TABLE mongo_users (id int, name string, age int) 
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler" 
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables 
• Use with MongoStorageHandler or BSONSerDe
Hive Support 
● Types given by schema 
● May use structs to project fields out of documents and ease access 
● Can explode nested fields to make them top-level:
{"customer": {"name": "Bart"}}
can be accessed with "customer.name".

MongoDB                              Hive
Primitive type (int, String, etc.)   Primitive type (int, float, etc.)
Document                             Row
Sub-document                         Struct, Map, or exploded field
Array                                Array or exploded field
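To make "exploding" a nested field concrete, here is a minimal sketch of the idea in plain Python. It is an illustration of the concept only, not how the Hive integration is implemented. The function name `explode` is hypothetical.

```python
def explode(doc: dict, prefix: str = "") -> dict:
    """Flatten nested sub-documents into top-level, dot-separated keys."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-documents, carrying the path so far.
            flat.update(explode(value, path))
        else:
            flat[path] = value
    return flat


print(explode({"customer": {"name": "Bart"}}))  # {'customer.name': 'Bart'}
```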
Pig Mappings 
• Input: BSONLoader and MongoLoader 
data = LOAD 'mongodb://mydb:27017/db.collection'
using com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
using com.mongodb.hadoop.pig.BSONStorage;
Pig Mappings 
● Organize and prune documents by specifying a schema 
● Access full document in a Map without needing a schema 

MongoDB                              Pig
Primitive type (int, String, etc.)   Primitive type (int, chararray, etc.)
Document                             Tuple (schema given)
Document                             Tuple containing a Map (no schema)
Sub-document                         Map
Array                                Bag
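The difference between loading with and without a schema can be sketched in plain Python. This is a conceptual illustration under assumed names (`to_pig_tuple` is hypothetical), not the loader's actual code: with a schema, only the named fields are kept, in order. Without one, the whole document travels as a single map inside a one-element tuple.

```python
from typing import Optional, Sequence


def to_pig_tuple(doc: dict, schema: Optional[Sequence[str]] = None) -> tuple:
    """Mimic the document-to-tuple mapping described in the table above."""
    if schema is None:
        # No schema: the full document is available as one map.
        return (dict(doc),)
    # Schema given: organize and prune to the listed fields.
    return tuple(doc.get(field) for field in schema)


doc = {"name": "Bart", "age": 10, "school": "Springfield Elementary"}
print(to_pig_tuple(doc, ["name", "age"]))  # ('Bart', 10)
print(to_pig_tuple(doc))                    # one-element tuple holding the map
```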
Demo!
Questions?
