Mongo-Hadoop Integration
Justin Lee
Software Engineer @ MongoDB
We will cover:
•what it is
•how it works
•a tour of what it can do
A quick briefing on what Mongo
and Hadoop are all about:
(Q+A at the end)
MongoDB: a document-oriented database with dynamic schema;
stores data in JSON-like documents:
{
  _id: "kosmo kramer",
  age: 42,
  location: {
    state: "NY",
    zip: "10024"
  },
  favorite_colors: ["red", "green"]
}
different structure in each document
values can be simple like strings and ints or nested documents
mongodb scales horizontally via
sharding to handle lots of data and load
Hadoop: a Java-based framework for MapReduce
Excels at batch processing on large data sets
by taking advantage of parallelism
MapReduce was created by Google (white paper);
Hadoop is its open-source implementation
Mongo-Hadoop Connector - Why
Lots of people using Hadoop and Mongo
separately but need integration
Custom import/export scripts often
used to get data in+out
Scalability and flexibility with changes in
Hadoop or MongoDB configurations
Need to process data across multiple sources
custom scripts are slow and fragile
Mongo-Hadoop Connector
Turn MongoDB into a Hadoop-enabled filesystem:
use as the input or output for Hadoop
Diagram: input data (MongoDB or .BSON files) flows into the Hadoop cluster; output results flow back out to MongoDB or .BSON files.
(BSON file support is new in 1.1; BSON is the output of mongodump)
Mongo-Hadoop Connector
Benefits + Features
Takes advantage of full multi-core
parallelism to process data in Mongo
Full integration with Hadoop and JVM ecosystems
Can be used with Amazon Elastic MapReduce
Can read and write backup files from local
filesystem, HDFS, or S3
Mongo-Hadoop Connector
Vanilla Java MapReduce -
or, if you don't want to use Java,
support for Hadoop Streaming:
write MapReduce code in Ruby, Python, etc.
(you can also write your own language binding)
Mongo-Hadoop Connector
Support for Pig
high-level scripting language for data analysis and
building MapReduce workflows
Support for Hive
SQL-like language for ad-hoc queries + analysis of data sets on
Hadoop-compatible file systems
Mongo-Hadoop Connector
How it works:
Adapter examines the MongoDB input collection and
calculates a set of splits from the data
Each split gets assigned to a node in the Hadoop cluster
In parallel, Hadoop nodes pull data for splits from
MongoDB (or BSON) and process them locally
Hadoop merges results and streams output back to
MongoDB or BSON
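To make this concrete, here is a minimal sketch (not from the original deck) of how a job might wire the connector in, using the configuration properties shown later in these slides; the EnronJob class and job name are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class EnronJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the connector reads its source and sink from these properties
        conf.set("mongo.input.uri", "mongodb://my-db:27017/enron.messages");
        conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");
        Job job = Job.getInstance(conf, "enron-graph");
        job.setJarByClass(EnronJob.class);
        // splits are calculated from the input collection by the connector
        job.setInputFormatClass(MongoInputFormat.class);
        // results are streamed back into MongoDB
        job.setOutputFormatClass(MongoOutputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...); etc.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}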
Tour of Mongo-Hadoop, by Example
- Using Java MapReduce with Mongo-Hadoop
- Using Hadoop Streaming
- Pig and Hive with Mongo-Hadoop
- Elastic MapReduce + BSON
{
  "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n",
  "filename" : "1.",
  "headers" : {
    "From" : "phillip.allen@enron.com",
    "Subject" : "Forecast Info",
    "X-bcc" : "",
    "To" : "tim.belden@enron.com",
    "X-Origin" : "Allen-P",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}
Input Data: Enron e-mail corpus (501k records, 1.75 GB)
each document is one email; the "From" header holds the sender
and the "To" header holds the recipients
Let's use Hadoop to build a graph of (senders → recipients)
and the count of messages exchanged between each pair:
{"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
{"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
{"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
{"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
{"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
(sample, simplified data)
Diagram: a graph where nodes are people and edges/arrows are labeled with the # of msgs from A to B.
Example 1 - Java MapReduce
Each MongoDB document is passed into Hadoop MapReduce.
Map phase - each input doc gets passed through a Mapper function:
@Override
public void map(NullWritable key, BSONObject val, final Context context)
        throws IOException, InterruptedException {
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            // emit one count per (sender, recipient) pair
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
The input value is a doc from MongoDB; the connector handles the
translation into a BSONObject for you.
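MailPair is a custom Hadoop key type pairing sender and recipient. It isn't shown in the deck; a minimal sketch of what such a WritableComparable might look like (the field and method details are assumptions):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MailPair implements WritableComparable<MailPair> {
    String from;
    String to;

    public MailPair() { } // no-arg constructor required for Hadoop serialization
    public MailPair(String from, String to) { this.from = from; this.to = to; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(from);
        out.writeUTF(to);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        from = in.readUTF();
        to = in.readUTF();
    }

    @Override
    public int compareTo(MailPair other) { // group by from, then to
        int cmp = from.compareTo(other.from);
        return cmp != 0 ? cmp : to.compareTo(other.to);
    }

    @Override // needed so HashPartitioner sends equal keys to the same reducer
    public int hashCode() { return from.hashCode() * 31 + to.hashCode(); }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MailPair)) return false;
        MailPair p = (MailPair) o;
        return from.equals(p.from) && to.equals(p.to);
    }
}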
Example 1 - Java MapReduce (cont)
Reduce phase - the outputs of Map are grouped together by key
(the {to, from} pair) and passed to the Reducer, along with the list
of all the values collected under that key; the output is written
back to MongoDB:

public void reduce(final MailPair pKey,
                   final Iterable<IntWritable> pValues,
                   final Context pContext)
        throws IOException, InterruptedException {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    // build the output document: {f: <from>, t: <to>}
    BSONObject outDoc = BasicDBObjectBuilder.start()
            .add("f", pKey.from)
            .add("t", pKey.to)
            .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write(pkeyOut, new IntWritable(sum));
}
Example 1 - Java MapReduce (cont)
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages
Read from MongoDB
Read from BSON
mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson
(also works with hdfs:///tmp/messages.bson or s3:///tmp/messages.bson)
Example 1 - Java MapReduce (cont)
mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out
Write output to MongoDB
Write output to BSON
mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson
(also works with hdfs:///tmp/results.bson or s3:///tmp/results.bson)
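As a sketch of how these properties might be supplied at submission time (assuming the job's driver goes through Hadoop's ToolRunner so -D options are parsed; the jar and class names are hypothetical):

hadoop jar enron-example.jar com.example.EnronJob \
    -D mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat \
    -D mongo.input.uri=mongodb://my-db:27017/enron.messages \
    -D mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat \
    -D mongo.output.uri=mongodb://my-db:27017/enron.results_out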
Results : Output Data
mongos> db.results_out.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "2586207@www4.imakenews.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "40enron@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..davis@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..hughes@enron.com" }, "count" : 4 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..lindholm@enron.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com",
"f" : "a..schroeder@enron.com" }, "count" : 1 }
...
has more
Example 2 - Hadoop Streaming
Let’s do the same Enron MapReduce job with
Python instead of Java
$ pip install pymongo_hadoop
Example 2 - Hadoop Streaming (cont)
Hadoop passes data to an external process
via STDOUT/STDIN
Diagram: Hadoop (JVM) writes input records to STDIN of an external Python / Ruby / JS interpreter, which runs the map(k, v) / reduce() functions and writes results back over STDOUT.
Example 2 - Hadoop Streaming (cont)
import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
BSONMapper is the pymongo_hadoop layer that translates between
Hadoop Streaming's stdin/stdout protocol and your Python function.
Example 2 - Hadoop Streaming (cont)
import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
Surviving Hadoop:
making MapReduce easier
Pig + Hive
writing MapReduce jobs from scratch can be clunky and cumbersome
Example 3 - Mongo-Hadoop and Pig
Let’s do the same thing yet again, but this
time using Pig
Pig is a powerful language that can generate
sophisticated MapReduce workflows from simple
scripts
Can perform JOIN, GROUP, and execute
user-defined functions (UDFs)
Example 3 - Mongo-Hadoop and Pig (cont)
Pig directives for loading data:
BSONLoader and MongoLoader
Writing data out
BSONStorage and MongoInsertStorage
data = LOAD 'mongodb://localhost:27017/db.collection'
    using com.mongodb.hadoop.pig.MongoLoader;
STORE data INTO 'file:///output.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
Pig has its own special datatypes:
Bags, Maps, and Tuples
Mongo-Hadoop Connector intelligently
converts between Pig datatypes and
MongoDB datatypes
bags -> arrays
maps -> objects
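For example (an illustrative sketch, not from the deck): a Pig tuple with a bag field, like (alice, {(red),(green)}), would come out in MongoDB as { "name" : "alice", "colors" : [ "red", "green" ] }, and a Pig map like [state#NY] becomes a nested object { "state" : "NY" }.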
Example 3 - Mongo-Hadoop and Pig (cont)
raw = LOAD 'hdfs:///messages.bson'
using com.mongodb.hadoop.pig.BSONLoader('','headers:[]') ;
send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;
send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
send_recip_split = FOREACH send_recip_filtered GENERATE
from as from, TRIM(FLATTEN(TOKENIZE(to))) as to;
send_recip_grouped = GROUP send_recip_split BY (from, to);
send_recip_counted = FOREACH send_recip_grouped GENERATE
group, COUNT($1) as count;
STORE send_recip_counted INTO 'file:///enron_results.bson'
using com.mongodb.hadoop.pig.BSONStorage;
Hive with Mongo-Hadoop
Similar idea to Pig - process your data without
needing to write MapReduce code from
scratch
...but with SQL as the language of choice
Hive with Mongo-Hadoop
Sample Data:
db.users
db.users.find()
{ "_id": 1, "name": "Tom", "age": 28 }
{ "_id": 2, "name": "Alice", "age": 18 }
{ "_id": 3, "name": "Bob", "age": 29 }
{ "_id": 101, "name": "Scott", "age": 10 }
{ "_id": 104, "name": "Jesse", "age": 52 }
{ "_id": 110, "name": "Mike", "age": 32 }
...
first, declare the collection to be
accessible in Hive:
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");
Hive with Mongo-Hadoop
...then you can run SQL on it, like a table (Hive supports a subset of SQL):
SELECT name, age FROM mongo_users WHERE id > 100;
you can use GROUP BY:
SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;
or JOIN multiple tables/collections together:
SELECT * FROM mongo_users T1
JOIN user_emails T2
ON (T1.id = T2.id);
Write the output of queries back into new tables:
INSERT OVERWRITE TABLE old_users SELECT id,name,age
FROM mongo_users WHERE age > 100 ;
DROP TABLE mongo_users;
Drop a table in Hive to delete the
underlying collection in MongoDB
use an EXTERNAL table when declaring it in Hive to prevent the collection drop
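A minimal sketch of the EXTERNAL variant of the earlier declaration - with this, DROP TABLE removes only Hive's metadata and leaves the MongoDB collection intact:

CREATE EXTERNAL TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");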
Usage with Amazon Elastic MapReduce
Run mongo-hadoop jobs without
needing to set up or manage your
own Hadoop cluster.
Pig, Hive, and streaming work on EMR, too!
Logs get captured into S3 files
Usage with Amazon Elastic MapReduce
First, make a “bootstrap” script that
fetches dependencies (mongo-hadoop
jar and java drivers)
#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.12.2/mongo-java-driver-2.12.2.jar
wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoop-code/mongo-hadoop-core_1.1.2-1.1.0.jar
this will get executed on each node in
the cluster that EMR builds for us.
(work is in progress on updating the mongo-hadoop artifacts in Maven)
Example 4 - Usage with Amazon Elastic MapReduce
Put the bootstrap script, and all your code,
into an S3 bucket where Amazon can see it.
s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
s3mod s3://$S3_BUCKET/bootstrap.sh public-read
s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
s3mod s3://$S3_BUCKET/enron-example.jar public-read
Example 4 - Usage with Amazon Elastic MapReduce
...then launch the job from the command line, pointing to your S3
locations. Control the type and number of instances in the cluster:
$ elastic-mapreduce --create --jobflow ENRON000
    --instance-type m1.xlarge
    --num-instances 5
    --bootstrap-action s3://$S3_BUCKET/bootstrap.sh
    --log-uri s3://$S3_BUCKET/enron_logs
    --jar s3://$S3_BUCKET/enron-example.jar
    --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
    --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson
    --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT
    --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    # (any additional parameters here)
Example 4 - Usage with Amazon Elastic MapReduce
Easy to kick off a Hadoop job, without needing
to manage a Hadoop cluster
Pig, Hive, and streaming work on EMR, too!
Logs get captured into S3 files
Example 5 - New Feature: MongoUpdateWritable
In previous examples, we wrote job output data
by inserting into a new collection...
...but we can also modify an existing output collection.
Works by applying MongoDB update modifiers:
$push, $pull, $addToSet, $inc, $set, etc.
Can be used to do incremental MapReduce or
"join" two collections.
Example 5 - MongoUpdateWritable
For example,
let’s say we have two collections.
log events:
{
    "_id": ObjectId("51b792d381c3e67b0a18d678"),
    "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),  // refers to which sensor logged the event
    "value": 3328.5895416489802,
    "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
    "loc": [-175.13, 51.658]
}
sensors:
{
    "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
    "name": "730LsRkX",
    "type": "pressure",
    "owner": "steve"
}
For each owner, we want to calculate how many events
were recorded for each type of sensor that logged it.
Plain English:
Bob’s sensors for temperature have stored 1300 readings
Bob’s sensors for pressure have stored 400 readings
Alice’s sensors for humidity have stored 600 readings
Alice’s sensors for temperature have stored 700 readings
etc...
We do this in two stages.
Stage 1 - MapReduce on the sensors collection:
- read from MongoDB (the sensors collection)
- for each sensor, emit: {key: owner+type, value: _id}
- group data from map() under each key, and output: {key: owner+type, val: [list of _ids]}
- insert() the new records into a Results collection in MongoDB
After stage one, the output docs look like this: the _id is the
sensor's owner and type, and "sensors" is the list of IDs of
sensors with that owner and type.
{
    "_id": "alice pressure",
    "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        …
    ]
}
Now we just need to count the total # of log events recorded for
any sensors that appear in the list for each owner/type group.
Stage 2 - MapReduce on the log events collection:
- read from MongoDB (the log events collection)
- for each log event, emit: {key: sensor_id, value: 1}
- group data from map() under each key; for each value in that key:
  update({sensors: key}, {$inc: {logs_count: 1}})
- update() the existing records in the Results collection in MongoDB
context.write(null,
    new MongoUpdateWritable(
        query,    // which documents to modify
        update,   // how to modify ($inc)
        true,     // upsert
        false));  // multi
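The query and update arguments above are plain BSON documents. A minimal sketch of how they might be built for this job (the sensorId and eventCount variables are assumed to come from the reducer's key and summed values; they are not shown in the deck):

import org.bson.BasicBSONObject;

// match every stage-1 result doc whose "sensors" array contains this sensor's id
BasicBSONObject query = new BasicBSONObject("sensors", sensorId);
// increment that doc's logs_count by the number of events counted for this sensor
BasicBSONObject update = new BasicBSONObject(
        "$inc", new BasicBSONObject("logs_count", eventCount));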
Example - MongoUpdateWritable
Result after stage 2
{
    "_id": "1UoTcvnCTz temp",
    "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        …
    ],
    "logs_count": 1050616
}
now populated with the correct count
New Features in v1.2 and beyond
Continually improving Hive support
Performance Improvements - Lazy BSON
Support for multi-collection input sources
API for adding
custom splitter implementations
and more
(primarily focusing on Hive, but Pig is next; updated artifacts are headed for Maven Central)
Recap
Mongo-Hadoop - use Hadoop to do massive computations
on big data sets stored in MongoDB/BSON
Tools and APIs make it easier:
Streaming, Pig, Hive, EMR, etc.
MongoDB becomes a Hadoop-enabled filesystem
Questions?
Examples can be found on GitHub:
https://github.com/mongodb/mongo-hadoop/tree/master/examples
MongoDB World
New York City, June 23-25
Save 25% with 25JustinLee
Register at world.mongodb.com
