Massimo Brignoli
Principal Solutions Architect
massimo@mongodb.com
@massimobrignoli
Analytics in MongoDB
Agenda
• Analytics in MongoDB?
• Aggregation Framework
• Aggregation Pipeline Stages
• Aggregation Framework in
Action
• Joins in MongoDB 3.2
• Integrations
• Analytical Architectures
Relational
Expressive Query Language
& Secondary Indexes
Strong Consistency
Enterprise Management
& Integrations
The World Has Changed
Volume
Velocity
Variety
Iterative
Agile
Short Cycles
Always On
Secure
Global
Open-Source
Cloud
Commodity
Data Time
Risk Cost
Scalability
& Performance
Always On,
Global Deployments
Flexibility
Expressive Query Language
& Secondary Indexes
Strong Consistency
Enterprise Management
& Integrations
NoSQL
Nexus Architecture
Scalability
& Performance
Always On,
Global Deployments
Flexibility
Expressive Query Language
& Secondary Indexes
Strong Consistency
Enterprise Management
& Integrations
Some Common MongoDB Use Cases
Single View Internet of Things Mobile Real-Time Analytics
Catalog Personalization Content Management
MongoDB in Research
Analytics in MongoDB?
Create
Read
Update
Delete
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort
Analytics on MongoDB Data
• Extract data from MongoDB and
perform complex analytics with
Hadoop
– Batch rather than real-time
– Extra nodes to manage
• Direct access to MongoDB from
Spark
• MongoDB BI Connector
– Direct SQL Access from BI Tools
• MongoDB aggregation pipeline
– Real-time
– Live, operational data set
– Narrower feature set
Hadoop
Connector
MapReduce & HDFS
SQL
Connector
For Example: US Census Data
• Census data from 1990, 2000, 2010
• Question:
– Which US Division has the fastest growing population density?
– We only want to include states with more than 1M people
– We only want to include divisions larger than 100K square miles
– Division = a group of US States
– Population density = # of people / area of division
– Data is provided at the state level
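Population density is just the number of people divided by the area, so the quantity we will rank divisions by can be sketched in plain JavaScript. The helper and the round numbers below are made up for illustration, not real census figures:

```javascript
// Population density = number of people / area (people per square mile).
// Hypothetical helper with made-up round numbers, not real census data.
function populationDensity(totalPop, areaSqMi) {
  return totalPop / areaSqMi;
}

const density1990 = populationDensity(30_000_000, 150_000); // 200 people/sq mi
const density2010 = populationDensity(36_000_000, 150_000); // 240 people/sq mi
console.log(density2010 - density1990); // 40: the growth we rank divisions by
```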
US Regions and Divisions
How would we solve this in SQL?
• SELECT GROUP BY HAVING
Aggregation Framework
What is an Aggregation Pipeline?
• A Series of Document Transformations
– Executed in stages
– Original input is a collection
– Output as a cursor or a collection
• Rich Library of Functions
– Filter, compute, group, and summarize data
– Output of one stage sent to input of next
– Operations executed in sequential order
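The stage-by-stage flow can be sketched in plain JavaScript as function composition; this is illustrative only, not how MongoDB executes pipelines internally:

```javascript
// Each stage consumes the documents emitted by the previous stage.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Two illustrative stages modeled on $match and $project.
const match = predicate => docs => docs.filter(predicate);
const project = shape => docs => docs.map(shape);

const result = runPipeline(
  [{ region: "West", areaM: 300 }, { region: "South", areaM: 100 }],
  [match(d => d.region === "West"), project(d => ({ area: d.areaM }))]
);
console.log(result); // -> [ { area: 300 } ]
```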
Aggregation Pipeline
[Diagram, built up over several slides: a stream of documents flows through the pipeline one stage at a time. $match first filters the incoming documents, $project then reshapes each one, $lookup embeds matching documents from a second collection as arrays, and finally $group collapses them into summary documents.]
Aggregation Pipeline Stages
• $match
Filter documents
• $geoNear
Geospatial (spherical) query
• $project
Reshape documents
• $lookup
New – Left-outer equi joins
• $unwind
Expand documents
• $group
Summarize documents
• $sample
New – Randomly selects a subset of
documents
• $sort
Order documents
• $skip
Jump over a number of documents
• $limit
Limit number of documents
• $redact
Restrict documents
• $out
Sends results to a new collection
Aggregation Framework in Action
(let’s play with the census data)
MongoDB State Collection
• Document For Each State
• Name
• Region
• Division
• Census Data For 1990, 2000, 2010
– Population
– Housing Units
– Occupied Housing Units
• Census Data is an array with three subdocuments
Document Model
{ "_id" : ObjectId("54e23c7b28099359f5661525"),
"name" : "California",
"region" : "West",
"data" : [
{ "totalPop" : 33871648,
"totalHouse" : 12214549,
"occHouse" : 11502870,
"year" : 2000},
{ "totalPop" : 37253956,
"totalHouse" : 13680081,
"occHouse" : 12577498,
"year" : 2010},
{ "totalPop" : 29760021,
"totalHouse" : 11182882,
"occHouse" : 29008161,
"year" : 1990}
],
…
}
Total US Area
db.cData.aggregate([
{"$group" :
{"_id" : null,
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"}}}])
$group
• Group documents by value
– Field reference, object, constant
– Other output fields are computed
• $max, $min, $avg, $sum
• $addToSet, $push
• $first, $last
– Processes all data in memory by default
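What the $sum and $avg accumulators do can be sketched in plain JavaScript over documents shaped like the census examples. The grouping logic here is illustrative, not MongoDB's implementation:

```javascript
// Plain-JS sketch of $group: { _id: "$region", totalArea: {$sum: "$areaM"},
// numStates: {$sum: 1}, avgArea: {$avg: "$areaM"} }
const states = [
  { name: "New York", region: "North East", areaM: 218 },
  { name: "New Jersey", region: "North East", areaM: 90 },
  { name: "California", region: "West", areaM: 300 },
];

function groupByRegion(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const g = groups.get(doc.region) ?? { _id: doc.region, totalArea: 0, numStates: 0 };
    g.totalArea += doc.areaM; // $sum : "$areaM"
    g.numStates += 1;         // $sum : 1
    groups.set(doc.region, g);
  }
  // $avg is derived once every document in the group has been seen
  return [...groups.values()].map(g => ({ ...g, avgArea: g.totalArea / g.numStates }));
}

console.log(groupByRegion(states));
// -> North East: totalArea 308, avgArea 154; West: totalArea 300, avgArea 300
```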
Area By Region
db.cData.aggregate([{
"$group" : {
"_id" : "$region",
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"},
"numStates" : {$sum : 1},
"states" : {$push : "$name"}}}])
Calculating Average State Area By Region
{state: "New York",
areaM: 218,
region: "North East"
}
{state: "New Jersey",
areaM: 90,
region: "North East"
}
{state: "California",
areaM: 300,
region: "West"
}
{ $group: {
_id: "$region",
avgAreaM: {$avg: "$areaM" }
}}
{ _id: "North East",
avgAreaM: 154}
{_id: "West",
avgAreaM: 300}
Calculating Total Area and State Count
{state: "New York",
areaM: 218,
region: "North East"
}
{state: "New Jersey",
areaM: 90,
region: "North East"
}
{state: "California",
areaM: 300,
region: "West"
}
{ $group: {
_id: "$region",
totArea: {$sum: "$areaM" },
sCount : {$sum : 1}
}}
{ _id: "North East",
totArea: 308,
sCount: 2}
{ _id: "West",
totArea: 300,
sCount: 1}
Total US Population By Year
db.cData.aggregate([
{$unwind : "$data"},
{$group : {
"_id" : "$data.year",
"totalPop" : {$sum :"$data.totalPop"}}},
{$sort : {"totalPop" : 1}}
])
$unwind
• Operate on an array field
– Create documents from array elements
• Array replaced by element value
• Missing/empty fields → no output
• Non-array fields → error
– Pipe to $group to aggregate
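The effect of $unwind can be sketched with Array.prototype.flatMap: one output document per array element, with the array field replaced by the element value (an empty or missing array produces nothing). Documents follow the slide's shape:

```javascript
// Plain-JS sketch of { $unwind: "$census" }, illustrative only.
const docs = [
  { state: "New York", census: [1990, 2000, 2010] },
  { state: "New Jersey", census: [1990, 2000] },
  { state: "Delaware", census: [] }, // empty array: no output documents
];

const unwound = docs.flatMap(d =>
  (d.census ?? []).map(year => ({ ...d, census: year }))
);

console.log(unwound.length); // 5 -- Delaware contributed nothing
```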
$unwind
{ state: "New York",
census: [1990, 2000,
2010]}
{ state: "New Jersey",
census: [1990, 2000]}
{ state: "California",
census: [1980, 1990, 2000,
2010]}
{ state: "Delaware",
census: [1990, 2000]}
{ $unwind: "$census" }
{ state: "New York", census: 1990}
{ state: "New York", census: 2000}
{ state: "New York", census: 2010}
{ state: "New Jersey", census: 1990}
{ state: "New Jersey", census: 2000}
Southern State Population By Year
db.cData.aggregate([
{$match : {"region" : "South"}},
{$unwind : "$data"},
{$group : { "_id" : "$data.year",
"totalPop" : {"$sum" :"$data.totalPop"}}}
])
$match
• Filter documents
– Uses existing query syntax, same as .find()
$match
{state: "New York",
areaM: 218,
region: "North East"
}
{state: "Oregon",
areaM: 245,
region: "West"
}
{state: "California",
areaM: 300,
region: "West"
}
{state: "Oregon",
areaM: 245,
region: "West"}
{state: "California",
areaM: 300,
region: "West"}
{ $match:
{ "region" : "West" }
}
Population Delta By State from 1990 to 2010
db.cData.aggregate([
{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : { "_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"}}},
{$project : { "_id" : 0,
"name" : "$_id",
"delta" : {"$subtract" : ["$pop2010", "$pop1990"]},
"pop1990" : 1,
"pop2010" : 1}
}])
$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior
$first, $last
• Collection operations like $push and $addToSet
• Must be used in $group
• $first and $last determined by document order
• Typically used with $sort to ensure ordering is known
$project
• Reshape Documents
– Include, exclude or rename fields
– Inject computed fields
– Create sub-document fields
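Reshaping with $project can be sketched as a map over documents: drop _id, rename it to name, and inject a computed delta field. The numbers are the Virginia and South Dakota figures from the slides:

```javascript
// Plain-JS sketch of: {$project: {_id: 0, name: "$_id",
//   delta: {"$subtract": ["$pop2010", "$pop1990"]}, pop1990: 1, pop2010: 1}}
const grouped = [
  { _id: "Virginia", pop1990: 6187358, pop2010: 8001024 },
  { _id: "South Dakota", pop1990: 696004, pop2010: 814180 },
];

const projected = grouped.map(d => ({
  name: d._id,                  // rename: "name" : "$_id"
  pop1990: d.pop1990,           // include: "pop1990" : 1
  pop2010: d.pop2010,           // include: "pop2010" : 1
  delta: d.pop2010 - d.pop1990, // compute: {"$subtract": [...]}
}));

console.log(projected[0].delta); // 1813666
```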
Including and Excluding Fields
{
"_id" : "Virginia",
"pop1990" : 6187358,
"pop2010" : 8001024
}
{
"_id" : "South Dakota",
"pop1990" : 696004,
"pop2010" : 814180
}
{ $project:
{ "_id" : 0,
"pop1990" : 1,
"pop2010" : 1}
}
{"pop1990" : 6187358,
"pop2010" : 8001024}
{"pop1990" : 696004,
"pop2010" : 814180}
Renaming and Computing Fields
{ $project:
{ "_id" : 0,
"name" : "$_id",
"delta" :
{"$subtract" :
["$pop2010",
"$pop1990"]}}
}
{
"_id" : "Virginia",
"pop1990" : 6187358,
"pop2010" : 8001024
}
{
"_id" : "South Dakota",
"pop1990" : 696004,
"pop2010" : 814180
} {"name" : "Virginia",
"delta" : 1813666}
{"name" : "South Dakota",
"delta" : 118176}
Compare number of people living within 500KM of
Memphis, TN in 1990, 2000, 2010
db.cData.aggregate([
{$geoNear : { "near" : {"type" : "Point", "coordinates" : [-90, 35]},
"distanceField" : "dist.calculated",
"maxDistance" : 500000,
"includeLocs" : "dist.location",
"spherical": true }},
{$unwind : "$data"},
{$group : { "_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}}
])
$geoNear
• Order/Filter Documents by Location
– Requires a geospatial index
– Output includes physical distance
– Must be first aggregation stage
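The kind of spherical-distance check $geoNear performs can be roughly sketched with the haversine formula (this is an approximation for illustration, not MongoDB's exact computation). Coordinates are GeoJSON-style [longitude, latitude]; 6371 km is the mean Earth radius; the Nashville coordinates are approximate:

```javascript
// Great-circle distance in meters between two [lon, lat] points.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const toRad = deg => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
            Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371000 * Math.asin(Math.sqrt(a));
}

const memphis = [-90, 35];
const nashville = [-86.78, 36.17]; // roughly 320 km away
console.log(haversineMeters(memphis, nashville) < 500000); // true: inside 500 km
```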
$geoNear
{"_id" : "Virginia",
"pop1990" : 6187358,
"pop2010" : 8001024,
"center" :
{"type" : "Point",
"coordinates" :
[-78.6, 37.5]}}
{ "_id" : "Tennessee",
"pop1990" : 4877185,
"pop2010" : 6346105,
"center" :
{"type" : "Point",
"coordinates" :
[-86.6, 37.8]}}
{"_id" : "Tennessee",
"pop1990" : 4877185,
"pop2010" : 6346105,
"center" :
{"type" : "Point",
"coordinates" :
[-86.6, 37.8]}}
{$geoNear : {
"near" : {"type" : "Point",
"coordinates" :
[-90, 35]},
maxDistance : 500000,
spherical : true }}
What if I want to save the results to a collection?
db.cData.aggregate([
{$geoNear : { "near" : {"type" : "Point", "coordinates" : [-90, 35]},
"distanceField" : "dist.calculated",
"maxDistance" : 500000,
"includeLocs" : "dist.location",
"spherical" : true }},
{$unwind : "$data"},
{$group : { "_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}},
{$out : "peopleNearMemphis"}
])
$out
db.cData.aggregate([<pipeline stages>,
{"$out" : "resultsCollection"}])
• Save aggregation results to a new collection
• New aggregation uses:
– Transform documents (ETL)
Back To The Original Question
• Which US Division has the fastest growing population density?
– We only want to include states with more than 1M people
– We only want to include divisions larger than 100K square miles
Division with Fastest Growing Pop Density
db.cData.aggregate(
[{$match : {"data.totalPop" : {"$gt" : 1000000}}},
{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"},
"areaM" : {"$first" : "$areaM"},
"division" : {"$first" : "$division"}}},
{$group : { "_id" : "$division",
"totalPop1990" : {"$sum" : "$pop1990"},
"totalPop2010" : {"$sum" : "$pop2010"},
"totalAreaM" : {"$sum" : "$areaM"}}},
{$match : {"totalAreaM" : {"$gt" : 100000}}},
{$project : {"_id" : 0,
"division" : "$_id",
"density1990" : {"$divide" : ["$totalPop1990", "$totalAreaM"]},
"density2010" : {"$divide" : ["$totalPop2010", "$totalAreaM"]},
"denDelta" : {"$subtract" : [{"$divide" : ["$totalPop2010", "$totalAreaM"]}, {"$divide" : ["$totalPop1990","$totalAreaM"]}]},
"totalAreaM" : 1,
"totalPop1990" : 1,
"totalPop2010" : 1}},
{$sort : {"denDelta" : -1}}])
Aggregate Options
db.cData.aggregate([<pipeline stages>],
{'explain' : false,
'allowDiskUse' : true,
'cursor' : {'batchSize' : 5}})
• explain – similar to find().explain()
• allowDiskUse – enable use of disk to store intermediate
results
• cursor – specify the batch size of the initial result set
Aggregation and Sharding
Sharding
• Workload split between shards
– Shards execute pipeline up to a
point
– Primary shard merges cursors and
continues processing*
– Use explain to analyze pipeline
split
– Early $match can exclude shards
– Potential CPU and memory
implications for primary shard host
*Prior to v2.6 second stage pipeline processing was
done by mongos
MongoDB 3.2: Joins and other improvements
Existing Alternatives to Joins
{ "_id": 10000,
"items": [
{ "productName": "laptop",
"unitPrice": 1000,
"weight": 1.2,
"remainingStock": 23},
{ "productName": "mouse",
"unitPrice": 20,
"weight": 0.2,
"remainingStock": 276}],
…
}
• Option 1: Include all data for
an order in the same document
– Fast reads
• One find delivers all the required data
– Captures full description at the time of the
event
– Consumes extra space
• Details of each product stored in many
order documents
– Complex to maintain
• A change to any product attribute must be
propagated to all affected orders
orders
The Winner?
• In general, Option 1 wins
– Performance and containment of everything in same place beats space
efficiency of normalization
– There are exceptions
• e.g. Comments in a blog post -> unbounded size
• However, analytics benefit from combining data from
multiple collections
– Keep listening...
Existing Alternatives to Joins
{
"_id": 10000,
"items": [
12345,
54321
],
...
}
• Option 2: Order document
references product documents
– Slower reads
• Multiple trips to the database
– Space efficient
• Product details stored once
– Lose point-in-time snapshot of full record
– Extra application logic
• Must iterate over product IDs in the order
document and find the product documents
• RDBMS would automate through a JOIN
orders
{
"_id": 12345,
"productName": "laptop",
"unitPrice": 1000,
"weight": 1.2,
"remainingStock": 23
}
{
"_id": 54321,
"productName": "mouse",
"unitPrice": 20,
"weight": 0.2,
"remainingStock": 276
}
products
$lookup
• Left-outer join
– Includes all documents from
the left collection
– For each document in the left
collection, find the matching
documents from the right
collection and embed them
Left Collection Right Collection
$lookup
db.leftCollection.aggregate([{
$lookup:
{
from: "rightCollection",
localField: "leftVal",
foreignField: "rightVal",
as: "embeddedData"
}
}])
Left Collection Right Collection
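The left-outer equi join $lookup performs can be sketched in plain JavaScript: every left document is kept, and matching right documents are embedded as an array (empty when nothing matches). The documents mirror the home-sales/postcodes example; the second sale uses a made-up postcode to show the unmatched case:

```javascript
// Plain-JS sketch of $lookup; illustrative, not MongoDB's implementation.
const homeSales = [
  { amount: 3000000, postcode: "SL6 5ND" },
  { amount: 9000, postcode: "ZZ9 9ZZ" }, // hypothetical: no matching postcode doc
];
const postcodes = [
  { postcode: "SL6 5ND", location: { type: "Point", coordinates: [51.549516, -0.80702] } },
];

function lookup(left, right, localField, foreignField, as) {
  return left.map(doc => ({
    ...doc, // every left document is kept (left-outer)
    [as]: right.filter(r => r[foreignField] === doc[localField]),
  }));
}

const joined = lookup(homeSales, postcodes, "postcode", "postcode", "postcode_docs");
console.log(joined[0].postcode_docs.length); // 1
console.log(joined[1].postcode_docs.length); // 0 -- left doc still present
```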
Worked Example – Data Set
db.postcodes.findOne()
{
"_id" : ObjectId("5600521e50fa77da54dfc0d2"),
"postcode": "SL6 0AA",
"location": {
"type": "Point",
"coordinates": [
51.525605,
-0.700974
]}}
db.homeSales.findOne()
{
"_id":ObjectId("56005dd980c3678b19792b7f"),
"amount": 9000,
"date": ISODate("1996-09-19T00:00:00Z"),
"address": {
"nameOrNumber": 25,
"street": "NORFOLK PARK COTTAGES",
"town": "MAIDENHEAD",
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 7DR"
}
}
Reduce Data Set First
db.homeSales.aggregate([
{$match: {
amount: {$gte:3000000}}
}
])
…
{
"_id": ObjectId("56005dda80c3678b19799e52"),
"amount": 3000000,
"date": ISODate("2012-04-19T00:00:00Z"),
"address": {
"nameOrNumber": "TEMPLE FERRY PLACE",
"street": "MILL LANE",
"town": "MAIDENHEAD",
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 5ND"
}
},…
Join (left-outer-equi) Results With Second Collection
db.homeSales.aggregate([
{$match: {
amount: {$gte:3000000}}
},
{$lookup: {
from: "postcodes",
localField: "address.postcode",
foreignField: "postcode",
as: "postcode_docs"}
}
])
...
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 5ND"
},
"postcode_docs": [
{
"_id": ObjectId("560053e280c3678b1978b293"),
"postcode": "SL6 5ND",
"location": {
"type": "Point",
"coordinates": [
51.549516,
-0.80702
]
}}]}, ...
Refactor Each Resulting Document
...},
{$project: {
_id: 0,
saleDate: "$date",
price: "$amount",
address: 1,
location:
{$arrayElemAt:
["$postcode_docs.location", 0]}}
])
{ "address": {
"nameOrNumber": "TEMPLE FERRY PLACE",
"street": "MILL LANE",
"town": "MAIDENHEAD",
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 5ND"
},
"saleDate": ISODate("2012-04-19T00:00:00Z"),
"price": 3000000,
"location": {
"type": "Point",
"coordinates": [
51.549516,
-0.80702
]}},...
Sort on Sale Price & Write to Collection
...},
{$sort:
{price: -1}},
{$out: "hotSpots"}
])
…{"address": {
"nameOrNumber": "2 - 3",
"street": "THE SWITCHBACK",
"town": "MAIDENHEAD",
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 7RJ"
},
"saleDate": ISODate("1999-03-15T00:00:00Z"),
"price": 5425000,
"location": {
"type": "Point",
"coordinates": [
51.536848,
-0.735835
]}},...
Aggregated Statistics
db.homeSales.aggregate([
{$group:
{ _id:
{$year: "$date"},
highestPrice:
{$max: "$amount"},
lowestPrice:
{$min: "$amount"},
averagePrice:
{$avg: "$amount"},
amountStdDev:
{$stdDevPop: "$amount"}
}}
])
...
{
"_id": 1995,
"highestPrice": 1000000,
"lowestPrice": 12000,
"averagePrice": 114059.35206869633,
"amountStdDev": 81540.50490801703
},
{
"_id": 1996,
"highestPrice": 975000,
"lowestPrice": 9000,
"averagePrice": 118862,
"amountStdDev": 79871.07569783277
}, ...
Clean Up Output
...,
{$project:
{
_id: 0,
year: "$_id",
highestPrice: 1,
lowestPrice: 1,
averagePrice:
{$trunc: "$averagePrice"},
priceStdDev:
{$trunc: "$amountStdDev"}
}
}
])
...
{
"highestPrice": 1000000,
"lowestPrice": 12000,
"averagePrice": 114059,
"year": 1995,
"priceStdDev": 81540
},
{
"highestPrice": 2200000,
"lowestPrice": 10500,
"averagePrice": 307372,
"year": 2004,
"priceStdDev": 199643
},...
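The $stdDevPop accumulator used in the $group stage above computes the population standard deviation: the square root of the mean squared deviation from the mean. A plain-JavaScript sketch with a classic textbook data set:

```javascript
// Population standard deviation, as $stdDevPop computes it.
function stdDevPop(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// Classic example: mean 5, variance 4, standard deviation 2.
console.log(stdDevPop([2, 4, 4, 4, 5, 5, 7, 9])); // 2
```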
Integrations
Hadoop Connector
[Diagram: input data flows from MongoDB, or from .BSON backup files, through the connector into a Hadoop cluster.]
Mongo-Hadoop Connector
• Turn MongoDB into a Hadoop-enabled filesystem: use as the
input or output for Hadoop
• Works with MongoDB backup files (.bson)
Benefits and Features
• Takes advantage of full multi-core parallelism to process data
in Mongo
• Full integration with Hadoop and JVM ecosystems
• Can be used with Amazon Elastic MapReduce
• Can read and write backup files from local filesystem, HDFS,
or S3
Benefits and Features
• Vanilla Java MapReduce
• If you don’t want to use Java, support for Hadoop Streaming.
• Write MapReduce code in scripting languages such as Python or Ruby via Streaming
Benefits and Features
• Support for Pig
– high-level scripting language for data analysis and building map/reduce
workflows
• Support for Hive
– SQL-like language for ad-hoc queries + analysis of data sets on Hadoop-compatible file systems
How It Works
• Adapter examines the MongoDB input collection and
calculates a set of splits from the data
• Each split gets assigned to a node in Hadoop cluster
• In parallel, Hadoop nodes pull data for splits from MongoDB
(or BSON) and process them locally
• Hadoop merges results and streams output back to MongoDB
or BSON
BI Connector
MongoDB Connector for BI
Visualize and explore multi-dimensional
documents using SQL-based BI tools. The
connector does the following:
• Provides the BI tool with the schema of the
MongoDB collection to be visualized
• Translates SQL statements issued by the BI tool
into equivalent MongoDB queries that are sent
to MongoDB for processing
• Converts the results into the tabular format
expected by the BI tool, which can then
visualize the data based on user requirements
Location & Flow of Data
[Diagram: application data and mapping meta-data flow from MongoDB through the BI Connector, which converts documents (e.g. {name: "Andrew", address: {street:… }}) into tables for analytics & visualization in the BI tool.]
Defining Data Mapping
mongodrdl --host 192.168.1.94 --port 27017 -d myDbName \
  -o myDrdlFile.drdl
mongobischema import myCollectionName myDrdlFile.drdl
[Diagram: mongodrdl generates a DRDL mapping file from MongoDB; mongobischema imports it into PostgreSQL, which reaches MongoDB through a MongoDB-specific Foreign Data Wrapper.]
Optionally Manually Edit DRDL File
• Redact attributes
• Use more appropriate types
(sampling can get it wrong)
• Rename tables (v1.1+)
• Rename columns (v1.1+)
• Build new views using
MongoDB Aggregation
Framework
• e.g., $lookup to join 2 tables
- table: homesales
  collection: homeSales
  pipeline: []
  columns:
  - name: _id
    mongotype: bson.ObjectId
    sqlname: _id
    sqltype: varchar
  - name: address.county
    mongotype: string
    sqlname: address_county
    sqltype: varchar
  - name: address.nameOrNumber
    mongotype: int
    sqlname: address_nameornumber
    sqltype: varchar
Summary
Analytics in MongoDB?
Create
Read
Update
Delete
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort
YES!
Framework Use Cases
• Complex aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing and reshaping data
Questions?
MongoDB 3.2  - Analytics

More Related Content

What's hot (20)

PDF
MongoDB for Analytics
MongoDB
 
PPTX
Webinar: General Technical Overview of MongoDB for Dev Teams
MongoDB
 
PPTX
Getting Started with MongoDB and NodeJS
MongoDB
 
PDF
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Henrik Ingo
 
PPTX
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
MongoDB
 
PPTX
Back to Basics Webinar 5: Introduction to the Aggregation Framework
MongoDB
 
PPTX
Back to Basics Webinar 2: Your First MongoDB Application
MongoDB
 
PPTX
2014 bigdatacamp asya_kamsky
Data Con LA
 
PPTX
MongoDB - Aggregation Pipeline
Jason Terpko
 
PPTX
How to leverage what's new in MongoDB 3.6
Maxime Beugnet
 
PPTX
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB
 
PPTX
MongoDB San Francisco 2013: Hash-based Sharding in MongoDB 2.4 presented by B...
MongoDB
 
PPTX
Webinaire 2 de la série « Retour aux fondamentaux » : Votre première applicat...
MongoDB
 
PDF
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB
 
PPTX
Back to Basics Webinar 3: Schema Design Thinking in Documents
MongoDB
 
PPTX
Conceptos básicos. Seminario web 5: Introducción a Aggregation Framework
MongoDB
 
PPTX
PistonHead's use of MongoDB for Analytics
Andrew Morgan
 
PPTX
Introduction to MongoDB and Hadoop
Steven Francia
 
PPTX
Joins and Other Aggregation Enhancements Coming in MongoDB 3.2
MongoDB
 
PPTX
Back to Basics Webinar 4: Advanced Indexing, Text and Geospatial Indexes
MongoDB
 
MongoDB for Analytics
MongoDB
 
Webinar: General Technical Overview of MongoDB for Dev Teams
MongoDB
 
Getting Started with MongoDB and NodeJS
MongoDB
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Henrik Ingo
 
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
MongoDB
 
Back to Basics Webinar 5: Introduction to the Aggregation Framework
MongoDB
 
Back to Basics Webinar 2: Your First MongoDB Application
MongoDB
 
2014 bigdatacamp asya_kamsky
Data Con LA
 
MongoDB - Aggregation Pipeline
Jason Terpko
 
How to leverage what's new in MongoDB 3.6
Maxime Beugnet
 
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB
 
MongoDB San Francisco 2013: Hash-based Sharding in MongoDB 2.4 presented by B...
MongoDB
 
Webinaire 2 de la série « Retour aux fondamentaux » : Votre première applicat...
MongoDB
 
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB
 
Back to Basics Webinar 3: Schema Design Thinking in Documents
MongoDB
 
Conceptos básicos. Seminario web 5: Introducción a Aggregation Framework
MongoDB
 
PistonHead's use of MongoDB for Analytics
Andrew Morgan
 
Introduction to MongoDB and Hadoop
Steven Francia
 
Joins and Other Aggregation Enhancements Coming in MongoDB 3.2
MongoDB
 
Back to Basics Webinar 4: Advanced Indexing, Text and Geospatial Indexes
MongoDB
 

Similar to MongoDB 3.2 - Analytics (20)

PPTX
Agg framework selectgroup feb2015 v2
MongoDB
 
PPTX
Webinar: Exploring the Aggregation Framework
MongoDB
 
PPTX
The Aggregation Framework
MongoDB
 
PDF
Aggregation Framework MongoDB Days Munich
Norberto Leite
 
PPTX
The Aggregation Framework
MongoDB
 
PPTX
MongoDB's New Aggregation framework
Chris Westin
 
PPTX
Joins and Other MongoDB 3.2 Aggregation Enhancements
Andrew Morgan
 
PDF
MongoDB Aggregation Framework
Caserta
 
PPTX
1403 app dev series - session 5 - analytics
MongoDB
 
PPTX
mongodb-aggregation-may-2012
Chris Westin
 
PDF
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB
 
PPTX
Webinar: Applikationsentwicklung mit MongoDB : Teil 5: Reporting & Aggregation
MongoDB
 
PDF
Mongo db aggregation guide
Deysi Gmarra
 
PDF
Mongo db aggregation-guide
Dan Llimpe
 
PDF
MongoDB .local Toronto 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...
MongoDB
 
PDF
MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...
MongoDB
 
PDF
MongoDB .local Munich 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pip...
MongoDB
 
PPTX
Beyond the Basics 2: Aggregation Framework
MongoDB
 
PDF
Webinar: Data Processing and Aggregation Options
MongoDB
 
PPTX
Aggregation Presentation for databses (1).pptx
plvdravikumarit
 
Agg framework selectgroup feb2015 v2
MongoDB
 
Webinar: Exploring the Aggregation Framework
MongoDB
 
The Aggregation Framework
MongoDB
 
Aggregation Framework MongoDB Days Munich
Norberto Leite
 
The Aggregation Framework
MongoDB
 
MongoDB's New Aggregation framework
Chris Westin
 
Joins and Other MongoDB 3.2 Aggregation Enhancements
Andrew Morgan
 
MongoDB Aggregation Framework
Caserta
 
1403 app dev series - session 5 - analytics
MongoDB
 
mongodb-aggregation-may-2012
Chris Westin
 
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB
 
Webinar: Applikationsentwicklung mit MongoDB : Teil 5: Reporting & Aggregation
MongoDB
 
Mongo db aggregation guide
Deysi Gmarra
 
Mongo db aggregation-guide
Dan Llimpe
 
MongoDB .local Toronto 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...
MongoDB
 
MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...
MongoDB
 
MongoDB .local Munich 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pip...
MongoDB
 
Beyond the Basics 2: Aggregation Framework
MongoDB
 
Webinar: Data Processing and Aggregation Options
MongoDB
 
Aggregation Presentation for databses (1).pptx
plvdravikumarit
 
Ad

Recently uploaded (20)

PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
PDF
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
PDF
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 
PPTX
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
PPTX
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
Timothy Rottach - Ramp up on AI Use Cases, from Vector Search to AI Agents wi...
AWS Chicago
 
PDF
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
PDF
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PDF
Transcript: New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
PDF
July Patch Tuesday
Ivanti
 
PDF
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PDF
Python basic programing language for automation
DanialHabibi2
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
PDF
CIFDAQ Weekly Market Wrap for 11th July 2025
CIFDAQ
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Timothy Rottach - Ramp up on AI Use Cases, from Vector Search to AI Agents wi...
AWS Chicago
 
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
Transcript: New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
July Patch Tuesday
Ivanti
 
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
Python basic programing language for automation
DanialHabibi2
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
CIFDAQ Weekly Market Wrap for 11th July 2025
CIFDAQ
 
Ad

MongoDB 3.2 - Analytics

  • 1. Massimo Brignoli Principal Solutions Architect [email protected] @massimobrignoli Analytics in MongoDB
  • 2. Agenda • Analytics in MongoDB? • Aggregation Framework • Aggregation Pipeline Stages • Aggregation Framework in Action • Joins in MongoDB 3.2 • Integrations • Analytical Architectures
  • 3. Relational Expressive Query Language & Secondary Indexes Strong Consistency Enterprise Management & Integrations
  • 4. The World Has Changed Volume Velocity Variety Iterative Agile Short Cycles Always On Secure Global Open-Source Cloud Commodity Data Time Risk Cost
  • 5. Scalability & Performance Always On, Global Deployments FlexibilityExpressive Query Language & Secondary Indexes Strong Consistency Enterprise Management & Integrations NoSQL
  • 6. Nexus Architecture Scalability & Performance Always On, Global Deployments FlexibilityExpressive Query Language & Secondary Indexes Strong Consistency Enterprise Management & Integrations
  • 7. Some Common MongoDB Use Cases Single View Internet of Things Mobile Real-Time Analytics Catalog Personalization Content Management
  • 11. Analytics on MongoDB Data • Extract data from MongoDB and perform complex analytics with Hadoop – Batch rather than real-time – Extra nodes to manage • Direct access to MongoDB from SPARK • MongoDB BI Connector – Direct SQL Access from BI Tools • MongoDB aggregation pipeline – Real-time – Live, operational data set – Narrower feature set Hadoop Connector MapReduce & HDFS SQL Connector
  • 12. For Example: US Census Data • Census data from 1990, 2000, 2010 • Question: – Which US Division has the fastest growing population density? – We only want to include data states with more than 1M people – We only want to include divisions larger than 100K square miles – Division = a group of US States – Population density = Area of division/# of people – Data is provided at the state level
  • 13. US Regions and Divisions
  • 14. How would we solve this in SQL? • SELECT GROUP BY HAVING
  • 17. What is an Aggregation Pipeline? • A Series of Document Transformations – Executed in stages – Original input is a collection – Output as a cursor or a collection • Rich Library of Functions – Filter, compute, group, and summarize data – Output of one stage sent to input of next – Operations executed in sequential order
  • 23. Aggregation Pipeline $match $project $lookup {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {} {★ds} {★ds} {★ds} {★} {★} {★} {★} {★} {★} {★} {=d+s}
  • 24. Aggregation Pipeline $match $project $lookup {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {} {★ds} {★ds} {★ds} {★} {★} {★} {★} {★} {★} {★} {=d+s} {★[]} {★[]} {★}
  • 25. Aggregation Pipeline $match $project $lookup $group {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {★ds} {} {★ds} {★ds} {★ds} {★} {★} {★} {★} {★} {★} {★} {=d+s} { Σ λ σ} { Σ λ σ} { Σ λ σ} {★[]} {★[]} {★}
  • 26. Aggregation Pipeline Stages • $match Filter documents • $geoNear Geospherical query • $project Reshape documents • $lookup New – Left-outer equi joins • $unwind Expand documents • $group Summarize documents • $sample New – Randomly selects a subset of documents • $sort Order documents • $skip Jump over a number of documents • $limit Limit number of documents • $redact Restrict documents • $out Sends results to a new collection
  • 27. Aggregation Framework in Action (let’s play with the census data)
  • 28. MongoDB State Collection • Document For Each State • Name • Region • Division • Census Data For 1990, 2000, 2010 – Population – Housing Units – Occupied Housing Units • Census Data is an array with three subdocuments
  • 29. Document Model
  {
    "_id" : ObjectId("54e23c7b28099359f5661525"),
    "name" : "California",
    "region" : "West",
    "data" : [
      { "totalPop" : 33871648, "totalHouse" : 12214549, "occHouse" : 11502870, "year" : 2000 },
      { "totalPop" : 37253956, "totalHouse" : 13680081, "occHouse" : 12577498, "year" : 2010 },
      { "totalPop" : 29760021, "totalHouse" : 11182882, "occHouse" : 29008161, "year" : 1990 }
    ],
    …
  }
  • 30. Total US Area
  db.cData.aggregate([
    {"$group" : {
      "_id" : null,
      "totalArea" : {$sum : "$areaM"},
      "avgArea" : {$avg : "$areaM"}}}
  ])
  • 31. $group • Group documents by value – Field reference, object, constant – Other output fields are computed • $max, $min, $avg, $sum • $addToSet, $push • $first, $last – Processes all data in memory by default
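The $group semantics above can be sketched without a database: bucket documents by the _id expression, then fold each bucket with the accumulators. This is a plain-Python illustration (the function name `group_by_region` is invented for the example, not a MongoDB call), mirroring $sum, $avg, and {$sum : 1} counting:

```python
from collections import defaultdict

def group_by_region(docs):
    """Mimic {$group: {_id: "$region", totalArea: {$sum: "$areaM"},
    avgArea: {$avg: "$areaM"}, numStates: {$sum: 1}}}."""
    buckets = defaultdict(list)
    for d in docs:
        buckets[d["region"]].append(d["areaM"])
    return {
        region: {
            "totalArea": sum(areas),
            "avgArea": sum(areas) / len(areas),
            "numStates": len(areas),
        }
        for region, areas in buckets.items()
    }

states = [
    {"name": "New York", "region": "North East", "areaM": 218},
    {"name": "New Jersey", "region": "North East", "areaM": 90},
    {"name": "California", "region": "West", "areaM": 300},
]
# group_by_region(states)["North East"]
#   == {"totalArea": 308, "avgArea": 154.0, "numStates": 2}
```

Like this sketch, $group holds all of its accumulator state in memory by default, which is why allowDiskUse exists for large groupings.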
  • 32. Area By Region
  db.cData.aggregate([{
    "$group" : {
      "_id" : "$region",
      "totalArea" : {$sum : "$areaM"},
      "avgArea" : {$avg : "$areaM"},
      "numStates" : {$sum : 1},
      "states" : {$push : "$name"}}}
  ])
  • 33. Calculating Average State Area By Region
  { state: "New York", areaM: 218, region: "North East" }
  { state: "New Jersey", areaM: 90, region: "North East" }
  { state: "California", areaM: 300, region: "West" }
  { $group: { _id: "$region", avgAreaM: { $avg: "$areaM" } } }
  { _id: "North East", avgAreaM: 154 }
  { _id: "West", avgAreaM: 300 }
  • 34. Calculating Total Area and State Count
  { state: "New York", areaM: 218, region: "North East" }
  { state: "New Jersey", areaM: 90, region: "North East" }
  { state: "California", areaM: 300, region: "West" }
  { $group: { _id: "$region", totArea: { $sum: "$areaM" }, sCount: { $sum: 1 } } }
  { _id: "North East", totArea: 308, sCount: 2 }
  { _id: "West", totArea: 300, sCount: 1 }
  • 35. Total US Population By Year
  db.cData.aggregate([
    {$unwind : "$data"},
    {$group : {
      "_id" : "$data.year",
      "totalPop" : {$sum : "$data.totalPop"}}},
    {$sort : {"totalPop" : 1}}
  ])
  • 36. $unwind • Operate on an array field – Create documents from array elements • Array replaced by element value • Missing/empty fields → no output • Non-array fields → error – Pipe to $group to aggregate
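The $unwind behavior described above (one output document per array element, nothing emitted for missing or empty arrays) can be sketched in plain Python; `unwind` here is an illustrative helper, not a driver API:

```python
def unwind(docs, field):
    """Mimic $unwind: emit one document per array element; documents where
    the field is missing or the array is empty produce no output, matching
    the stage's default behavior."""
    out = []
    for d in docs:
        for value in d.get(field) or []:
            copy = dict(d)
            copy[field] = value  # the array is replaced by the element value
            out.append(copy)
    return out

states = [
    {"state": "New York", "census": [1990, 2000, 2010]},
    {"state": "New Jersey", "census": []},  # empty array -> no output
]
# unwind(states, "census") yields three New York documents and none for New Jersey
```

Note that (unlike this permissive sketch) the real stage errors on a non-array field in the MongoDB versions discussed here.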
  • 37. $unwind
  { state: "New York", census: [1990, 2000, 2010] }
  { state: "New Jersey", census: [1990, 2000] }
  { state: "California", census: [1980, 1990, 2000, 2010] }
  { state: "Delaware", census: [1990, 2000] }
  { $unwind: "$census" }
  { state: "New York", census: 1990 }
  { state: "New York", census: 2000 }
  { state: "New York", census: 2010 }
  { state: "New Jersey", census: 1990 }
  { state: "New Jersey", census: 2000 }
  …
  • 38. Southern State Population By Year
  db.cData.aggregate([
    {$match : {"region" : "South"}},
    {$unwind : "$data"},
    {$group : {
      "_id" : "$data.year",
      "totalPop" : {"$sum" : "$data.totalPop"}}}
  ])
  • 39. $match • Filter documents – Uses existing query syntax, same as .find()
  • 40. $match
  { state: "New York", areaM: 218, region: "North East" }
  { state: "Oregon", areaM: 245, region: "West" }
  { state: "California", areaM: 300, region: "West" }
  { $match: { "region" : "West" } }
  { state: "Oregon", areaM: 245, region: "West" }
  { state: "California", areaM: 300, region: "West" }
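In plain-Python terms, $match is just a filter over the document stream; `match` below is an illustrative helper (not a MongoDB call), with the query document stood in by a predicate function:

```python
def match(docs, predicate):
    """Mimic $match: pass through only documents satisfying the filter."""
    return [d for d in docs if predicate(d)]

states = [
    {"state": "New York", "region": "North East"},
    {"state": "Oregon", "region": "West"},
    {"state": "California", "region": "West"},
]
west = match(states, lambda d: d["region"] == "West")
# west contains the Oregon and California documents only
```

Because it only discards documents, placing $match first lets the server use indexes and shrinks everything downstream, which is why the census examples lead with it.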
  • 41. Population Delta By State from 1990 to 2010
  db.cData.aggregate([
    {$unwind : "$data"},
    {$sort : {"data.year" : 1}},
    {$group : {
      "_id" : "$name",
      "pop1990" : {"$first" : "$data.totalPop"},
      "pop2010" : {"$last" : "$data.totalPop"}}},
    {$project : {
      "_id" : 0,
      "name" : "$_id",
      "delta" : {"$subtract" : ["$pop2010", "$pop1990"]},
      "pop1990" : 1,
      "pop2010" : 1}}
  ])
  • 42. $sort, $limit, $skip • Sort documents by one or more fields – Same order syntax as cursors – Waits for earlier pipeline operator to return – In-memory unless early and indexed • Limit and skip follow cursor behavior
  • 43. $first, $last
  • Accumulator operators, like $push and $addToSet
  • Must be used in $group
  • $first and $last are determined by document order
  • Typically used with $sort to ensure the ordering is known
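The $sort + $group + $first/$last pattern from the population-delta pipeline can be sketched in plain Python; `first_last_per_group` is an invented name for illustration:

```python
def first_last_per_group(docs, key, sort_field, value_field):
    """Mimic $sort followed by $group with $first/$last: sort once, then
    record the first and last value seen for each group."""
    result = {}
    for d in sorted(docs, key=lambda d: d[sort_field]):
        g = result.setdefault(d[key], {})
        g.setdefault("first", d[value_field])  # only set on the first document
        g["last"] = d[value_field]             # overwritten until the last
    return result

# Unwound census documents for one state (values from the slides).
census = [
    {"name": "Virginia", "year": 2010, "totalPop": 8001024},
    {"name": "Virginia", "year": 1990, "totalPop": 6187358},
]
# first_last_per_group(census, "name", "year", "totalPop")
#   == {"Virginia": {"first": 6187358, "last": 8001024}}
```

As in the real pipeline, $first/$last are meaningless without the preceding sort: remove `sorted(...)` and the result depends on arrival order.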
  • 44. $project • Reshape Documents – Include, exclude or rename fields – Inject computed fields – Create sub-document fields
  • 45. Including and Excluding Fields
  { "_id" : "Virginia", "pop1990" : 6187358, "pop2010" : 8001024 }
  { "_id" : "South Dakota", "pop1990" : 696004, "pop2010" : 814180 }
  { $project: { "_id" : 0, "pop1990" : 1, "pop2010" : 1 } }
  { "pop1990" : 6187358, "pop2010" : 8001024 }
  { "pop1990" : 696004, "pop2010" : 814180 }
  • 46. Renaming and Computing Fields
  { "_id" : "Virginia", "pop1990" : 6187358, "pop2010" : 8001024 }
  { "_id" : "South Dakota", "pop1990" : 696004, "pop2010" : 814180 }
  { $project: { "_id" : 0, "name" : "$_id", "delta" : {"$subtract" : ["$pop2010", "$pop1990"]} } }
  (in inclusion mode, fields not mentioned are dropped automatically)
  { "name" : "Virginia", "delta" : 1813666 }
  { "name" : "South Dakota", "delta" : 118176 }
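The renaming-and-computing projection above has a direct plain-Python analogue; `project_delta` is an invented helper name, and the arithmetic mirrors $subtract:

```python
def project_delta(doc):
    """Mimic the slide's $project: drop _id, expose it as "name",
    and compute delta = pop2010 - pop1990."""
    return {
        "name": doc["_id"],
        "delta": doc["pop2010"] - doc["pop1990"],
    }

# project_delta({"_id": "Virginia", "pop1990": 6187358, "pop2010": 8001024})
#   == {"name": "Virginia", "delta": 1813666}
```

Reshaping per document like this is exactly why $project is cheap: it touches one document at a time and never needs to buffer the stream.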
  • 47. Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
  • 48. Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
  db.cData.aggregate([
    {$geoNear : {
      "near" : {"type" : "Point", "coordinates" : [90, 35]},
      "distanceField" : "dist.calculated",
      "maxDistance" : 500000,
      "includeLocs" : "dist.location",
      "spherical" : true
    }},
    {$unwind : "$data"},
    {$group : {
      "_id" : "$data.year",
      "totalPop" : {"$sum" : "$data.totalPop"},
      "states" : {"$addToSet" : "$name"}}},
    {$sort : {"_id" : 1}}
  ])
  • 49. $geoNear • Order/Filter Documents by Location – Requires a geospatial index – Output includes physical distance – Must be first aggregation stage
  • 50. $geoNear
  Input:
  { "_id" : "Virginia", "pop1990" : 6187358, "pop2010" : 8001024,
    "center" : {"type" : "Point", "coordinates" : [78.6, 37.5]} }
  { "_id" : "Tennessee", "pop1990" : 4877185, "pop2010" : 6346105,
    "center" : {"type" : "Point", "coordinates" : [86.6, 37.8]} }
  {$geoNear : {
    "near" : {"type" : "Point", "coordinates" : [90, 35]},
    "maxDistance" : 500000,
    "spherical" : true }}
  Output:
  { "_id" : "Tennessee", "pop1990" : 4877185, "pop2010" : 6346105,
    "center" : {"type" : "Point", "coordinates" : [86.6, 37.8]} }
  • 51. What if I want to save the results to a collection?
  db.cData.aggregate([
    {$geoNear : {
      "near" : {"type" : "Point", "coordinates" : [90, 35]},
      "distanceField" : "dist.calculated",
      "maxDistance" : 500000,
      "includeLocs" : "dist.location",
      "spherical" : true
    }},
    {$unwind : "$data"},
    {$group : {
      "_id" : "$data.year",
      "totalPop" : {"$sum" : "$data.totalPop"},
      "states" : {"$addToSet" : "$name"}}},
    {$sort : {"_id" : 1}},
    {$out : "peopleNearMemphis"}
  ])
  • 52. $out
  db.cData.aggregate([<pipeline stages>, {"$out" : "resultsCollection"}])
  • Saves aggregation results to a new collection
  • Enables new uses of aggregation, e.g. transforming documents (ETL)
  • 53. Back To The Original Question • Which US Division has the fastest growing population density? – We only want to include data states with more than 1M people – We only want to include divisions larger than 100K square miles
  • 54. Division with Fastest Growing Pop Density
  db.cData.aggregate([
    {$match : {"data.totalPop" : {"$gt" : 1000000}}},
    {$unwind : "$data"},
    {$sort : {"data.year" : 1}},
    {$group : {
      "_id" : "$name",
      "pop1990" : {"$first" : "$data.totalPop"},
      "pop2010" : {"$last" : "$data.totalPop"},
      "areaM" : {"$first" : "$areaM"},
      "division" : {"$first" : "$division"}}},
    {$group : {
      "_id" : "$division",
      "totalPop1990" : {"$sum" : "$pop1990"},
      "totalPop2010" : {"$sum" : "$pop2010"},
      "totalAreaM" : {"$sum" : "$areaM"}}},
    {$match : {"totalAreaM" : {"$gt" : 100000}}},
    {$project : {
      "_id" : 0,
      "division" : "$_id",
      "density1990" : {"$divide" : ["$totalPop1990", "$totalAreaM"]},
      "density2010" : {"$divide" : ["$totalPop2010", "$totalAreaM"]},
      "denDelta" : {"$subtract" : [
        {"$divide" : ["$totalPop2010", "$totalAreaM"]},
        {"$divide" : ["$totalPop1990", "$totalAreaM"]}]},
      "totalAreaM" : 1,
      "totalPop1990" : 1,
      "totalPop2010" : 1}},
    {$sort : {"denDelta" : -1}}
  ])
  • 56. Aggregate options
  db.cData.aggregate([<pipeline stages>],
    {'explain' : false,
     'allowDiskUse' : true,
     'cursor' : {'batchSize' : 5}})
  • explain – similar to find().explain()
  • allowDiskUse – enables use of disk to store intermediate results
  • cursor – specifies the size of the initial result batch
  • 58. Sharding • Workload split between shards – Shards execute pipeline up to a point – Primary shard merges cursors and continues processing* – Use explain to analyze pipeline split – Early $match can exclude shards – Potential CPU and memory implications for primary shard host *Prior to v2.6 second stage pipeline processing was done by mongos
  • 59. MongoDB 3.2: Joins and other improvements
  • 60. Existing Alternatives to Joins
  orders:
  { "_id": 10000,
    "items": [
      { "productName": "laptop", "unitPrice": 1000, "weight": 1.2, "remainingStock": 23 },
      { "productName": "mouse", "unitPrice": 20, "weight": 0.2, "remainingStock": 276 } ],
    … }
  • Option 1: Include all data for an order in the same document
  – Fast reads: one find delivers all the required data
  – Captures full description at the time of the event
  – Consumes extra space: details of each product stored in many order documents
  – Complex to maintain: a change to any product attribute must be propagated to all affected orders
  • 61. The Winner? • In general, Option 1 wins – Performance and containment of everything in same place beats space efficiency of normalization – There are exceptions • e.g. Comments in a blog post -> unbounded size • However, analytics benefit from combining data from multiple collections – Keep listening...
  • 62. Existing Alternatives to Joins
  orders:
  { "_id": 10000, "items": [ 12345, 54321 ], ... }
  products:
  { "_id": 12345, "productName": "laptop", "unitPrice": 1000, "weight": 1.2, "remainingStock": 23 }
  { "_id": 54321, "productName": "mouse", "unitPrice": 20, "weight": 0.2, "remainingStock": 276 }
  • Option 2: Order document references product documents
  – Slower reads: multiple trips to the database
  – Space efficient: product details stored once
  – Lose point-in-time snapshot of the full record
  – Extra application logic: must iterate over product IDs in the order document and find the product documents; an RDBMS would automate this through a JOIN
  • 63. $lookup
  • Left-outer join
  – Includes all documents from the left collection
  – For each document in the left collection, finds the matching documents from the right collection and embeds them as an array
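The left-outer-join semantics just described can be sketched in plain Python; `lookup` is an invented helper whose parameters deliberately mirror the stage's `localField`, `foreignField`, and `as` options:

```python
def lookup(left_docs, right_docs, local_field, foreign_field, as_field):
    """Mimic $lookup: every left document is kept (left-outer), with all
    matching right documents embedded as an array under as_field."""
    out = []
    for left in left_docs:
        joined = dict(left)
        joined[as_field] = [r for r in right_docs
                            if r.get(foreign_field) == left.get(local_field)]
        out.append(joined)
    return out

# Hypothetical mini data set in the shape of the homeSales/postcodes example.
sales = [{"postcode": "SL6 5ND"}, {"postcode": "ZZ9 9ZZ"}]
postcodes = [{"postcode": "SL6 5ND", "location": [51.5, -0.8]}]
joined = lookup(sales, postcodes, "postcode", "postcode", "postcode_docs")
# first sale gains one embedded postcode doc; the unmatched sale keeps an
# empty array, which is what "left-outer" means here
```

The embedded result is always an array, even for a unique match, hence the `$arrayElemAt` cleanup in the worked example later.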
  • 64. $lookup
  db.leftCollection.aggregate([{
    $lookup: {
      from: "rightCollection",
      localField: "leftVal",
      foreignField: "rightVal",
      as: "embeddedData"
    }
  }])
  • 65. Worked Example – Data Set
  db.postcodes.findOne()
  { "_id" : ObjectId("5600521e50fa77da54dfc0d2"),
    "postcode" : "SL6 0AA",
    "location" : {
      "type" : "Point",
      "coordinates" : [ 51.525605, -0.700974 ] } }
  db.homeSales.findOne()
  { "_id" : ObjectId("56005dd980c3678b19792b7f"),
    "amount" : 9000,
    "date" : ISODate("1996-09-19T00:00:00Z"),
    "address" : {
      "nameOrNumber" : 25,
      "street" : "NORFOLK PARK COTTAGES",
      "town" : "MAIDENHEAD",
      "county" : "WINDSOR AND MAIDENHEAD",
      "postcode" : "SL6 7DR" } }
  • 66. Reduce Data Set First
  db.homeSales.aggregate([
    {$match: { amount: {$gte: 3000000}}}
  ])
  …
  { "_id": ObjectId("56005dda80c3678b19799e52"),
    "amount": 3000000,
    "date": ISODate("2012-04-19T00:00:00Z"),
    "address": {
      "nameOrNumber": "TEMPLE FERRY PLACE",
      "street": "MILL LANE",
      "town": "MAIDENHEAD",
      "county": "WINDSOR AND MAIDENHEAD",
      "postcode": "SL6 5ND" } }, …
  • 67. Join (left-outer-equi) Results With Second Collection
  db.homeSales.aggregate([
    {$match: { amount: {$gte: 3000000}}},
    {$lookup: {
      from: "postcodes",
      localField: "address.postcode",
      foreignField: "postcode",
      as: "postcode_docs"}}
  ])
  ...
    "county": "WINDSOR AND MAIDENHEAD",
    "postcode": "SL6 5ND" },
    "postcode_docs": [
      { "_id": ObjectId("560053e280c3678b1978b293"),
        "postcode": "SL6 5ND",
        "location": {
          "type": "Point",
          "coordinates": [ 51.549516, -0.80702 ] }}]}, ...
  • 68. Refactor Each Resulting Document
  ...},
    {$project: {
      _id: 0,
      saleDate: "$date",
      price: "$amount",
      address: 1,
      location: {$arrayElemAt: ["$postcode_docs.location", 0]}}}
  ])
  { "address": {
      "nameOrNumber": "TEMPLE FERRY PLACE",
      "street": "MILL LANE",
      "town": "MAIDENHEAD",
      "county": "WINDSOR AND MAIDENHEAD",
      "postcode": "SL6 5ND" },
    "saleDate": ISODate("2012-04-19T00:00:00Z"),
    "price": 3000000,
    "location": {
      "type": "Point",
      "coordinates": [ 51.549516, -0.80702 ]}}, ...
  • 69. Sort on Sale Price & Write to Collection
  ...},
    {$sort: {price: -1}},
    {$out: "hotSpots"}
  ])
  …
  { "address": {
      "nameOrNumber": "2 - 3",
      "street": "THE SWITCHBACK",
      "town": "MAIDENHEAD",
      "county": "WINDSOR AND MAIDENHEAD",
      "postcode": "SL6 7RJ" },
    "saleDate": ISODate("1999-03-15T00:00:00Z"),
    "price": 5425000,
    "location": {
      "type": "Point",
      "coordinates": [ 51.536848, -0.735835 ]}}, ...
  • 70. Aggregated Statistics
  db.homeSales.aggregate([
    {$group: {
      _id: {$year: "$date"},
      highestPrice: {$max: "$amount"},
      lowestPrice: {$min: "$amount"},
      averagePrice: {$avg: "$amount"},
      amountStdDev: {$stdDevPop: "$amount"}
    }}
  ])
  ...
  { "_id": 1995, "highestPrice": 1000000, "lowestPrice": 12000,
    "averagePrice": 114059.35206869633, "amountStdDev": 81540.50490801703 },
  { "_id": 1996, "highestPrice": 975000, "lowestPrice": 9000,
    "averagePrice": 118862, "amountStdDev": 79871.07569783277 }, ...
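The per-year statistics above can be reproduced on a plain list of amounts; `yearly_stats` is an invented helper, and the standard deviation mirrors $stdDevPop (population form: divide by N, not N-1):

```python
import math

def yearly_stats(amounts):
    """Mimic the $max/$min/$avg/$stdDevPop accumulators over one
    year's "amount" values."""
    avg = sum(amounts) / len(amounts)
    return {
        "highestPrice": max(amounts),
        "lowestPrice": min(amounts),
        "averagePrice": avg,
        "amountStdDev": math.sqrt(
            sum((a - avg) ** 2 for a in amounts) / len(amounts)),
    }

# yearly_stats([10, 20, 30])["averagePrice"] == 20.0
```

MongoDB also provides $stdDevSamp for the sample (N-1) form; which one you want depends on whether each year's sales are the whole population or a sample.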
  • 71. Clean Up Output
  ...,
    {$project: {
      _id: 0,
      year: "$_id",
      highestPrice: 1,
      lowestPrice: 1,
      averagePrice: {$trunc: "$averagePrice"},
      priceStdDev: {$trunc: "$amountStdDev"}
    }}
  ])
  ...
  { "highestPrice": 1000000, "lowestPrice": 12000, "averagePrice": 114059,
    "year": 1995, "priceStdDev": 81540 },
  { "highestPrice": 2200000, "lowestPrice": 10500, "averagePrice": 307372,
    "year": 2004, "priceStdDev": 199643 }, ...
  • 74. Mongo-Hadoop Connector
  • Turns MongoDB into a Hadoop-enabled filesystem: use it as the input or output for Hadoop jobs
  • Works with MongoDB backup files (.bson)
  • 75. Benefits and Features • Takes advantage of full multi-core parallelism to process data in Mongo • Full integration with Hadoop and JVM ecosystems • Can be used with Amazon Elastic MapReduce • Can read and write backup files from local filesystem, HDFS, or S3
  • 76. Benefits and Features
  • Vanilla Java MapReduce
  • If you don't want to use Java, support for Hadoop Streaming lets you write MapReduce code in other languages
  • 77. Benefits and Features
  • Support for Pig – high-level scripting language for data analysis and building map/reduce workflows
  • Support for Hive – SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems
  • 78. How It Works • Adapter examines the MongoDB input collection and calculates a set of splits from the data • Each split gets assigned to a node in Hadoop cluster • In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally • Hadoop merges results and streams output back to MongoDB or BSON
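The split-calculation idea can be sketched very roughly in plain Python. This is only an illustration of partitioning a key range into contiguous chunks that nodes can pull in parallel; `compute_splits` is an invented name, and the real connector derives splits from server metadata (e.g. chunk boundaries on sharded clusters), not simple arithmetic:

```python
def compute_splits(min_key, max_key, num_splits):
    """Partition a numeric key range [min_key, max_key) into num_splits
    contiguous (start, end) ranges, one per Hadoop worker."""
    step = (max_key - min_key) / num_splits
    return [(min_key + i * step, min_key + (i + 1) * step)
            for i in range(num_splits)]

# compute_splits(0, 100, 4) partitions the range into 4 equal chunks
```

Each worker then queries only the documents whose key falls in its own range, so no two workers process the same document.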
  • 80. MongoDB Connector for BI Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following: • Provides the BI tool with the schema of the MongoDB collection to be visualized • Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing • Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements
  • 81. Location & Flow of Data
  (diagram: an analytics & visualization tool exchanges tables with the MongoDB BI Connector, which holds the mapping meta-data and translates to and from application documents such as {name: "Andrew", address: {street: … }} stored in MongoDB)
  • 82. Defining Data Mapping
  mongodrdl --host 192.168.1.94 --port 27017 -d myDbName -o myDrdlFile.drdl
  mongobischema import myCollectionName myDrdlFile.drdl
  • mongodrdl samples the collection and generates a DRDL schema file
  • mongobischema imports that file into the connector's PostgreSQL layer (a MongoDB-specific Foreign Data Wrapper)
  • 83. Optionally Manually Edit DRDL File
  • Redact attributes
  • Use more appropriate types (sampling can get it wrong)
  • Rename tables (v1.1+)
  • Rename columns (v1.1+)
  • Build new views using the MongoDB Aggregation Framework, e.g. $lookup to join 2 tables
  - table: homesales
    collection: homeSales
    pipeline: []
    columns:
    - name: _id
      mongotype: bson.ObjectId
      sqlname: _id
      sqltype: varchar
    - name: address.county
      mongotype: string
      sqlname: address_county
      sqltype: varchar
    - name: address.nameOrNumber
      mongotype: int
      sqlname: address_nameornumber
      sqltype: varchar
  • 86. Framework Use Cases • Complex aggregation queries • Ad-hoc reporting • Real-time analytics • Visualizing and reshaping data