Mongo-Hadoop Integration
Justin Lee
Software Engineer @ MongoDB
We will cover:
•what it is
•how it works
•a tour of what it can do
A quick briefing on what Mongo
and Hadoop are all about:
(Q+A at the end)
MongoDB: a document-oriented database with dynamic schema;
stores data in JSON-like documents:
{
  _id: "kosmo kramer",
  age: 42,
  location: {
    state: "NY",
    zip: "10024"
  },
  favorite_colors: ["red", "green"]
}
different structure in each document
values can be simple like strings and ints or nested documents
mongodb scales horizontally via
sharding to handle lots of data and load
Hadoop: a Java-based framework for MapReduce
Excels at batch processing on large data sets
by taking advantage of parallelism
MapReduce was created by Google (white paper);
Hadoop is its open-source implementation
Mongo-Hadoop Connector - Why
Lots of people using Hadoop and Mongo
separately but need integration
Custom import/export scripts often
used to get data in+out
Scalability and flexibility with changes in
Hadoop or MongoDB configurations
Need to process data across multiple sources
custom scripts are slow and fragile
Mongo-Hadoop Connector
Turn MongoDB into a Hadoop-enabled filesystem:
use as the input or output for Hadoop
Diagram: input data (MongoDB or .BSON files) flows into the Hadoop cluster; output results flow back out to MongoDB or .BSON files.
(BSON file support is new in 1.1; BSON is the output of mongodump)
Mongo-Hadoop Connector
Benefits + Features
Takes advantage of full multi-core
parallelism to process data in Mongo
Full integration with Hadoop and JVM ecosystems
Can be used with Amazon Elastic MapReduce
Can read and write backup files from local
filesystem, HDFS, or S3
Mongo-Hadoop Connector
Vanilla Java MapReduce -
or, if you don't want to use Java,
support for Hadoop Streaming:
write MapReduce code in Ruby, Python, etc.
(you can also write your own language binding)
Mongo-Hadoop Connector
Support for Pig
high-level scripting language for data analysis and
building MapReduce workflows
Support for Hive
SQL-like language for ad-hoc queries + analysis of data sets on
Hadoop-compatible file systems
Mongo-Hadoop Connector
How it works:
Adapter examines the MongoDB input collection and
calculates a set of splits from the data
Each split gets assigned to a node in the Hadoop cluster
In parallel, Hadoop nodes pull data for splits from
MongoDB (or BSON) and process them locally
Hadoop merges results and streams output back to
MongoDB or BSON
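To make this concrete, here is a minimal sketch (not from the original deck) of how a job might wire the connector in, using the configuration properties shown later in these slides; the EnronJob class and job name are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class EnronJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the connector reads its source and sink from these properties
        conf.set("mongo.input.uri", "mongodb://my-db:27017/enron.messages");
        conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");
        Job job = Job.getInstance(conf, "enron-graph");
        job.setJarByClass(EnronJob.class);
        // splits are calculated from the input collection by the connector
        job.setInputFormatClass(MongoInputFormat.class);
        // results are streamed back into MongoDB
        job.setOutputFormatClass(MongoOutputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...); etc.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}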
Tour of Mongo-Hadoop, by Example
- Using Java MapReduce with Mongo-Hadoop
- Using Hadoop Streaming
- Pig and Hive with Mongo-Hadoop
- Elastic MapReduce + BSON
{
  "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n",
  "filename" : "1.",
  "headers" : {
    "From" : "phillip.allen@enron.com",
    "Subject" : "Forecast Info",
    "X-bcc" : "",
    "To" : "tim.belden@enron.com",
    "X-Origin" : "Allen-P",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}
Input Data: Enron e-mail corpus (501k records, 1.75 GB)
each document is one email; the "From" header holds the sender
and the "To" header holds the recipients
Let's use Hadoop to build a graph of (senders → recipients)
and the count of messages exchanged between each pair:
{"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
{"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
{"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
{"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
{"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
(sample, simplified data)
Diagram: a graph where nodes are people and edges/arrows are labeled with the # of msgs from A to B.
Example 1 - Java MapReduce
Each MongoDB document is passed into Hadoop MapReduce.
Map phase - each input doc gets passed through a Mapper function:
@Override
public void map(NullWritable key, BSONObject val, final Context context)
        throws IOException, InterruptedException {
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            // emit one count per (sender, recipient) pair
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
The input value is a doc from MongoDB; the connector handles the
translation into a BSONObject for you.
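MailPair is a custom Hadoop key type pairing sender and recipient. It isn't shown in the deck; a minimal sketch of what such a WritableComparable might look like (the field and method details are assumptions):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MailPair implements WritableComparable<MailPair> {
    String from;
    String to;

    public MailPair() { } // no-arg constructor required for Hadoop serialization
    public MailPair(String from, String to) { this.from = from; this.to = to; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(from);
        out.writeUTF(to);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        from = in.readUTF();
        to = in.readUTF();
    }

    @Override
    public int compareTo(MailPair other) { // group by from, then to
        int cmp = from.compareTo(other.from);
        return cmp != 0 ? cmp : to.compareTo(other.to);
    }

    @Override // needed so HashPartitioner sends equal keys to the same reducer
    public int hashCode() { return from.hashCode() * 31 + to.hashCode(); }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MailPair)) return false;
        MailPair p = (MailPair) o;
        return from.equals(p.from) && to.equals(p.to);
    }
}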
Example 1 - Java MapReduce (cont)
Reduce phase - the outputs of Map are grouped together by key
(the {to, from} pair) and passed to the Reducer, along with the list
of all the values collected under that key; the output is written
back to MongoDB:

public void reduce(final MailPair pKey,
                   final Iterable<IntWritable> pValues,
                   final Context pContext)
        throws IOException, InterruptedException {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    // build the output document: {f: <from>, t: <to>}
    BSONObject outDoc = BasicDBObjectBuilder.start()
            .add("f", pKey.from)
            .add("t", pKey.to)
            .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write(pkeyOut, new IntWritable(sum));
}
Example 1 - Java MapReduce (cont)
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages
Read from MongoDB
Read from BSON
mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson
(also works with hdfs:///tmp/messages.bson or s3:///tmp/messages.bson)
Example 1 - Java MapReduce (cont)
mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out
Write output to MongoDB
Write output to BSON
mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson
(also works with hdfs:///tmp/results.bson or s3:///tmp/results.bson)
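As a sketch of how these properties might be supplied at submission time (assuming the job's driver goes through Hadoop's ToolRunner so -D options are parsed; the jar and class names are hypothetical):

hadoop jar enron-example.jar com.example.EnronJob \
    -D mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat \
    -D mongo.input.uri=mongodb://my-db:27017/enron.messages \
    -D mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat \
    -D mongo.output.uri=mongodb://my-db:27017/enron.results_out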
Results : Output Data
mongos> db.results_out.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "2586207@www4.imakenews.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "40enron@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..davis@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..hughes@enron.com" }, "count" : 4 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..lindholm@enron.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..schroeder@enron.com" }, "count" : 1 }
...
has more
Example 2 - Hadoop Streaming
Let’s do the same Enron MapReduce job with
Python instead of Java
$ pip install pymongo_hadoop
Example 2 - Hadoop Streaming (cont)
Hadoop passes data to an external process
via STDOUT/STDIN
Diagram: Hadoop (JVM) writes input records to STDIN of an external Python / Ruby / JS interpreter, which runs the map(k, v) / reduce() functions and writes results back over STDOUT.
Example 2 - Hadoop Streaming (cont)
import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
BSONMapper is the pymongo_hadoop layer that translates between
Hadoop Streaming's stdin/stdout protocol and your Python function.
Example 2 - Hadoop Streaming (cont)
import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
Surviving Hadoop:
making MapReduce easier
Pig + Hive
writing MapReduce jobs from scratch can be clunky and cumbersome
Example 3 - Mongo-Hadoop and Pig
Let’s do the same thing yet again, but this
time using Pig
Pig is a powerful language that can generate
sophisticated MapReduce workflows from simple
scripts
Can perform JOIN, GROUP, and execute
user-defined functions (UDFs)
Example 3 - Mongo-Hadoop and Pig (cont)
Pig directives for loading data:
BSONLoader and MongoLoader
Writing data out
BSONStorage and MongoInsertStorage
data = LOAD 'mongodb://localhost:27017/db.collection'
    using com.mongodb.hadoop.pig.MongoLoader;
STORE data INTO 'file:///output.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
Pig has its own special datatypes:
Bags, Maps, and Tuples
Mongo-Hadoop Connector intelligently
converts between Pig datatypes and
MongoDB datatypes
bags -> arrays
maps -> objects
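For example (an illustrative sketch, not from the deck): a Pig tuple with a bag field, like (alice, {(red),(green)}), would come out in MongoDB as { "name" : "alice", "colors" : [ "red", "green" ] }, and a Pig map like [state#NY] becomes a nested object { "state" : "NY" }.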
Example 3 - Mongo-Hadoop and Pig (cont)
raw = LOAD 'hdfs:///messages.bson'
using com.mongodb.hadoop.pig.BSONLoader('','headers:[]') ;
send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;
send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
send_recip_split = FOREACH send_recip_filtered GENERATE
from as from, TRIM(FLATTEN(TOKENIZE(to))) as to;
send_recip_grouped = GROUP send_recip_split BY (from, to);
send_recip_counted = FOREACH send_recip_grouped GENERATE
group, COUNT($1) as count;
STORE send_recip_counted INTO 'file:///enron_results.bson'
using com.mongodb.hadoop.pig.BSONStorage;
Hive with Mongo-Hadoop
Similar idea to Pig - process your data without
needing to write MapReduce code from
scratch
...but with SQL as the language of choice
Hive with Mongo-Hadoop
Sample Data:
db.users
db.users.find()
{ "_id": 1, "name": "Tom", "age": 28 }
{ "_id": 2, "name": "Alice", "age": 18 }
{ "_id": 3, "name": "Bob", "age": 29 }
{ "_id": 101, "name": "Scott", "age": 10 }
{ "_id": 104, "name": "Jesse", "age": 52 }
{ "_id": 110, "name": "Mike", "age": 32 }
...
first, declare the collection to be
accessible in Hive:
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");
Hive with Mongo-Hadoop
...then you can run SQL on it, like a table (Hive supports a subset of SQL):
SELECT name, age FROM mongo_users WHERE id > 100;
you can use GROUP BY:
SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;
or JOIN multiple tables/collections together:
SELECT * FROM mongo_users T1
JOIN user_emails T2
ON (T1.id = T2.id);
Write the output of queries back into new tables:
INSERT OVERWRITE TABLE old_users SELECT id,name,age
FROM mongo_users WHERE age > 100 ;
DROP TABLE mongo_users;
Drop a table in Hive to delete the
underlying collection in MongoDB
use an EXTERNAL table when declaring it in Hive to prevent the collection drop
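A minimal sketch of the EXTERNAL variant of the earlier declaration - with this, DROP TABLE removes only Hive's metadata and leaves the MongoDB collection intact:

CREATE EXTERNAL TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");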
Usage with Amazon Elastic MapReduce
Run mongo-hadoop jobs without
needing to set up or manage your
own Hadoop cluster.
Pig, Hive, and streaming work on EMR, too!
Logs get captured into S3 files
Usage with Amazon Elastic MapReduce
First, make a “bootstrap” script that
fetches dependencies (mongo-hadoop
jar and java drivers)
#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.12.2/mongo-java-driver-2.12.2.jar
wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoop-code/mongo-hadoop-core_1.1.2-1.1.0.jar
this will get executed on each node in
the cluster that EMR builds for us.
(work is in progress on updating the mongo-hadoop artifacts in Maven)
Example 4 - Usage with Amazon Elastic MapReduce
Put the bootstrap script, and all your code,
into an S3 bucket where Amazon can see it.
s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
s3mod s3://$S3_BUCKET/bootstrap.sh public-read
s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
s3mod s3://$S3_BUCKET/enron-example.jar public-read
Example 4 - Usage with Amazon Elastic MapReduce
...then launch the job from the command line, pointing to your S3
locations. Control the type and number of instances in the cluster:
$ elastic-mapreduce --create --jobflow ENRON000
    --instance-type m1.xlarge
    --num-instances 5
    --bootstrap-action s3://$S3_BUCKET/bootstrap.sh
    --log-uri s3://$S3_BUCKET/enron_logs
    --jar s3://$S3_BUCKET/enron-example.jar
    --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
    --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson
    --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT
    --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    # (any additional parameters here)
Example 4 - Usage with Amazon Elastic MapReduce
Easy to kick off a Hadoop job, without needing
to manage a Hadoop cluster
Pig, Hive, and streaming work on EMR, too!
Logs get captured into S3 files
Example 5 - New Feature: MongoUpdateWritable
In previous examples, we wrote job output data
by inserting into a new collection...
...but we can also modify an existing output collection.
Works by applying MongoDB update modifiers:
$push, $pull, $addToSet, $inc, $set, etc.
Can be used to do incremental MapReduce or
"join" two collections.
Example 5 - MongoUpdateWritable
For example,
let’s say we have two collections.
log events:
{
    "_id": ObjectId("51b792d381c3e67b0a18d678"),
    "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),  // refers to which sensor logged the event
    "value": 3328.5895416489802,
    "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
    "loc": [-175.13, 51.658]
}
sensors:
{
    "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
    "name": "730LsRkX",
    "type": "pressure",
    "owner": "steve"
}
For each owner, we want to calculate how many events
were recorded for each type of sensor that logged it.
Plain English:
Bob’s sensors for temperature have stored 1300 readings
Bob’s sensors for pressure have stored 400 readings
Alice’s sensors for humidity have stored 600 readings
Alice’s sensors for temperature have stored 700 readings
etc...
We do this in two stages.
Stage 1 - MapReduce on the sensors collection:
- read from MongoDB (the sensors collection)
- for each sensor, emit: {key: owner+type, value: _id}
- group data from map() under each key, and output: {key: owner+type, val: [list of _ids]}
- insert() the new records into a Results collection in MongoDB
After stage one, the output docs look like this: the _id is the
sensor's owner and type, and "sensors" is the list of IDs of
sensors with that owner and type.
{
    "_id": "alice pressure",
    "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        …
    ]
}
Now we just need to count the total # of log events recorded for
any sensors that appear in the list for each owner/type group.
Stage 2 - MapReduce on the log events collection:
- read from MongoDB (the log events collection)
- for each log event, emit: {key: sensor_id, value: 1}
- group data from map() under each key; for each value in that key:
  update({sensors: key}, {$inc: {logs_count: 1}})
- update() the existing records in the Results collection in MongoDB
context.write(null,
    new MongoUpdateWritable(
        query,    // which documents to modify
        update,   // how to modify ($inc)
        true,     // upsert
        false));  // multi
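The query and update arguments above are plain BSON documents. A minimal sketch of how they might be built for this job (the sensorId and eventCount variables are assumed to come from the reducer's key and summed values; they are not shown in the deck):

import org.bson.BasicBSONObject;

// match every stage-1 result doc whose "sensors" array contains this sensor's id
BasicBSONObject query = new BasicBSONObject("sensors", sensorId);
// increment that doc's logs_count by the number of events counted for this sensor
BasicBSONObject update = new BasicBSONObject(
        "$inc", new BasicBSONObject("logs_count", eventCount));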
Example - MongoUpdateWritable
Result after stage 2
{
    "_id": "1UoTcvnCTz temp",
    "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        …
    ],
    "logs_count": 1050616
}
now populated with the correct count
New Features in v1.2 and beyond
Continually improving Hive support
Performance Improvements - Lazy BSON
Support for multi-collection input sources
API for adding
custom splitter implementations
and more
(primarily focusing on Hive, but Pig is next; updated artifacts are headed for Maven Central)
Recap
Mongo-Hadoop - use Hadoop to do massive computations
on big data sets stored in MongoDB/BSON
Tools and APIs make it easier:
Streaming, Pig, Hive, EMR, etc.
MongoDB becomes a Hadoop-enabled filesystem
Questions?
Examples can be found on GitHub:
https://github.com/mongodb/mongo-hadoop/tree/master/examples
MongoDB World
New York City, June 23-25
Save 25% with 25JustinLee
Register at world.mongodb.com
