Agg framework selectgroup feb2015 v2

Exploring the Aggregation Framework
Jay Runkel
Solutions Architect
jay.runkel@mongodb.com
@jayrunkel

Agenda
1. Analytics in MongoDB?
2. Aggregation Framework
3. Aggregation Framework in Action
– US Census Data
4. Aggregation Framework Options

Analytics in MongoDB?
Create
Read
Update
Delete
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort

For Example: US Census Data
• Census data from 1990, 2000, 2010
• Question:
Which US Division has the fastest growing population density?
– We only want to include data from states with more than 1M people
– We only want to include divisions larger than 100K square miles
Division = a group of US States
Population density = # of people / Area of division
Data is provided at the state level
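The arithmetic behind the question can be sketched in plain JavaScript. The function names and the sample figures are illustrative assumptions, not census values.

```javascript
// Population density = number of people / area (people per square mile).
function density(population, areaSqMiles) {
  return population / areaSqMiles;
}

// Change in density between two census years for one division.
function densityDelta(popStart, popEnd, areaSqMiles) {
  return density(popEnd, areaSqMiles) - density(popStart, areaSqMiles);
}

// Hypothetical division: 200,000 square miles, growing from 10M to 12M people.
const delta = densityDelta(10000000, 12000000, 200000);
console.log(delta); // 10 more people per square mile
```

The division with the largest delta is the answer to the question.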

How would we solve this in SQL?
• SELECT GROUP BY HAVING

What is an Aggregation Pipeline?
• A Series of Document Transformations
– Executed in stages
– Original input is a collection
– Output as a cursor or a collection
• Rich Library of Functions
– Filter, compute, group, and summarize data
– Output of one stage sent to input of next
– Operations executed in sequential order
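As a rough mental model only (this is not how MongoDB implements it), a pipeline can be pictured as a list of functions applied in order to an array of documents. The names runPipeline, matchWest and projectName are hypothetical.

```javascript
// Each stage is a function from an array of documents to a new array.
const matchWest = docs => docs.filter(d => d.region === "West");
const projectName = docs => docs.map(d => ({ name: d.name }));

// Run the stages in order, feeding each stage's output into the next one.
function runPipeline(collection, stages) {
  return stages.reduce((docs, stage) => stage(docs), collection);
}

const states = [
  { name: "California", region: "West" },
  { name: "New York", region: "North East" },
  { name: "Oregon", region: "West" }
];

const result = runPipeline(states, [matchWest, projectName]);
console.log(result); // [ { name: 'California' }, { name: 'Oregon' } ]
```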

Pipeline Operators
• $match
Filter documents
• $project
Reshape documents
• $group
Summarize documents
• $unwind
Expand documents
• $sort
Order documents
• $limit/$skip
Paginate documents
• $redact
Restrict documents
• $geoNear
Proximity sort documents
• $let,$map
Define variables

Aggregation Framework in Action
(let’s play with the census data)

MongoDB State Collection
• Document For Each State
• Name
• Region
• Division
• Census Data For 1990, 2000, 2010
– Population
– Housing Units
– Occupied Housing Units
• Census Data is an array with three subdocuments

Document Model
{ "_id" : ObjectId("54e23c7b28099359f5661525"),
"name" : "California",
"region" : "West",
"data" : [
{"totalPop" : 33871648,
"totalHouse" : 12214549,
"occHouse" : 11502870,
"year" : 2000},
{"totalPop" : 37253956,
"occHouse" : 12577498,
"year" : 2010},
{"totalPop" : 29760021,
"occHouse" : 29008161,
"year" : 1990}
],
…
}

Total US Area
db.cData.aggregate([
{"$group" : {"_id" : null,
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"}}}])

$group
• Group documents by value
– Field reference, object, constant
– Other output fields are computed
• $max, $min, $avg, $sum
• $addToSet, $push
• $first, $last
– Processes all data in memory by default
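To make the accumulators concrete, here is a plain JavaScript sketch of what a $group by region computes. It only imitates $sum, $avg and $push in memory; groupByRegion is a hypothetical helper, and the areas are toy numbers.

```javascript
// Group documents by region and compute $sum, $avg and $push style outputs.
function groupByRegion(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.region)) {
      groups.set(doc.region, { _id: doc.region, totalArea: 0, numStates: 0, states: [] });
    }
    const g = groups.get(doc.region);
    g.totalArea += doc.areaM;   // like {$sum : "$areaM"}
    g.numStates += 1;           // like {$sum : 1}
    g.states.push(doc.name);    // like {$push : "$name"}
  }
  // like {$avg : "$areaM"}: computed once every document has been seen
  return [...groups.values()].map(g => ({ ...g, avgArea: g.totalArea / g.numStates }));
}

const grouped = groupByRegion([
  { name: "New York", region: "North East", areaM: 218 },
  { name: "New Jersey", region: "North East", areaM: 90 },
  { name: "California", region: "West", areaM: 300 }
]);
console.log(grouped);
```

Every group is held in the Map until the input is exhausted, which mirrors the in-memory default noted above.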

Area By Region
db.cData.aggregate([
{"$group" : {"_id" : "$region",
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"},
"numStates" : {$sum : 1},
"states" : {$push : "$name"}}}
])

Calculating Average State Area By Region
Stage:
{ $group: {
    _id: "$region",
    avgAreaM: {$avg: "$areaM"}
}}
Input documents:
{ state: "New York", areaM: 218, region: "North East" }
{ state: "New Jersey", areaM: 90, region: "North East" }
{ state: "California", areaM: 300, region: "West" }
Output documents:
{ _id: "North East", avgAreaM: 154 }
{ _id: "West", avgAreaM: 300 }

Calculating Total Area and State Count
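Stage:
{ $group: {
    _id: "$region",
    totArea: {$sum: "$areaM"},
    sCount: {$sum: 1}
}}
Input documents:
{ state: "New York", areaM: 218, region: "North East" }
{ state: "New Jersey", areaM: 90, region: "North East" }
{ state: "California", areaM: 300, region: "West" }
Output documents:
{ _id: "North East", totArea: 308, sCount: 2 }
{ _id: "West", totArea: 300, sCount: 1 }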

Total US Population By Year
db.cData.aggregate(
[{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {$sum : "$data.totalPop"}}},
{$sort : {"totalPop" : 1}}
])

$unwind
• Operate on an array field
– Create documents from array elements
• Array replaced by element value
• Missing/empty fields → no output
• Non-array fields → error
– Pipe to $group to aggregate
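The rules listed for $unwind can be imitated in a few lines of plain JavaScript. This is a sketch of the behaviour described on the slide, not MongoDB code; the unwind helper and the "Nowhere" document are hypothetical.

```javascript
// Emit one output document per array element; the array field is replaced
// by the element. Missing or empty arrays produce no output, and a
// non-array value raises an error, as the bullets above describe.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    if (value === undefined || value === null) continue;
    if (!Array.isArray(value)) throw new Error(field + " is not an array");
    for (const element of value) out.push({ ...doc, [field]: element });
  }
  return out;
}

const unwound = unwind(
  [
    { state: "Delaware", census: [1990, 2000] },
    { state: "Nowhere", census: [] } // hypothetical: empty array, no output
  ],
  "census"
);
console.log(unwound);
// [ { state: 'Delaware', census: 1990 }, { state: 'Delaware', census: 2000 } ]
```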

$unwind
Stage:
{ $unwind: "$census" }
Input documents:
{ state: "New York", census: [1990, 2000, 2010] }
{ state: "New Jersey", census: [1990, 2000] }
{ census: [1980, 1990, 2000, 2010] }
{ state: "Delaware", census: [1990, 2000] }
Output documents:
{ state: "New York", census: 1990 }
{ state: "New York", census: 2000 }
{ state: "New York", census: 2010 }
{ state: "New Jersey", census: 1990 }
{ state: "New Jersey", census: 2000 }
…

Southern State Population By Year
db.cData.aggregate(
[{$match : {"region" : "South"}},
{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {"$sum" :
"$data.totalPop"}}}])

$match
• Filter documents
– Uses existing query syntax
– No $where (server-side JavaScript)
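For intuition, $match behaves like a filter driven by a query document. The sketch below supports only equality and $gt, a tiny assumed subset of the real query syntax, and the matches/match helpers are hypothetical.

```javascript
// Supports only equality and {$gt: n}; real query syntax is far richer.
function matches(doc, query) {
  return Object.entries(query).every(([field, cond]) => {
    if (cond !== null && typeof cond === "object" && "$gt" in cond) {
      return doc[field] > cond.$gt;
    }
    return doc[field] === cond;
  });
}

const match = (docs, query) => docs.filter(d => matches(d, query));

const sample = [
  { state: "New York", areaM: 218, region: "North East" },
  { state: "Oregon", areaM: 245, region: "West" },
  { state: "California", areaM: 300, region: "West" }
];

const west = match(sample, { region: "West" });
const bigWest = match(sample, { region: "West", areaM: { $gt: 250 } });
console.log(west.map(d => d.state));    // [ 'Oregon', 'California' ]
console.log(bigWest.map(d => d.state)); // [ 'California' ]
```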

$match
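Stage:
{ $match:
  { "region" : "West" }
}
Input documents:
{ state: "New York", areaM: 218, region: "North East" }
{ state: "Oregon", areaM: 245, region: "West" }
{ state: "California", areaM: 300, region: "West" }
Output documents:
{ state: "Oregon", areaM: 245, region: "West" }
{ state: "California", areaM: 300, region: "West" }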

Population Delta By State from 1990 to 2010
db.cData.aggregate(
[{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"}}},
{$project : {"_id" : 0,
"name" : "$_id",
"delta" : {"$subtract" :
["$pop2010", "$pop1990"]},
"pop1990" : 1,
"pop2010" : 1}
}]
)

$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior
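A sort followed by skip and limit gives simple pagination. The plain JavaScript sketch below imitates that combination on toy documents; the three helper names are hypothetical.

```javascript
// direction is 1 for ascending and -1 for descending, as in {$sort : {field : 1}}.
const sortStage = (docs, field, direction) =>
  [...docs].sort((a, b) => direction * (a[field] - b[field]));
const skipStage = (docs, n) => docs.slice(n);
const limitStage = (docs, n) => docs.slice(0, n);

const areas = [
  { state: "New York", areaM: 218 },
  { state: "New Jersey", areaM: 90 },
  { state: "California", areaM: 300 }
];

// Largest first, one document per page, second page.
const sorted = sortStage(areas, "areaM", -1);
const secondPage = limitStage(skipStage(sorted, 1), 1);
console.log(secondPage); // [ { state: 'New York', areaM: 218 } ]
```

The sort has to see every input document before it can emit the first one, which is why it waits for the earlier stage to finish.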

$first, $last
• Collection operations like $push and $addToSet
• Must be used in $group
• $first and $last determined by document order
• Typically used with $sort to ensure ordering is known
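Why the preceding sort matters can be seen with the California figures from the document model, where the array is not stored in year order. This is a plain JavaScript sketch; the first and last helpers are hypothetical stand-ins for the accumulators.

```javascript
// Array order as stored in the sample document: 2000, 2010, 1990.
const california = [
  { year: 2000, totalPop: 33871648 },
  { year: 2010, totalPop: 37253956 },
  { year: 1990, totalPop: 29760021 }
];

const first = docs => docs[0];
const last = docs => docs[docs.length - 1];

// Without a sort, "first" is the year 2000 and "last" is 1990.
const unsortedDelta = last(california).totalPop - first(california).totalPop;

// With a sort on year, "first" is 1990 and "last" is 2010.
const byYear = [...california].sort((a, b) => a.year - b.year);
const sortedDelta = last(byYear).totalPop - first(byYear).totalPop;

console.log(unsortedDelta); // -4111627
console.log(sortedDelta);   // 7493935
```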

$project
• Reshape Documents
– Include, exclude or rename fields
– Inject computed fields
– Create sub-document fields
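In plain JavaScript terms, a $project that renames one field and computes another is a per-document mapping. The sketch below uses the South Dakota figures from the slides; projectDelta is a hypothetical helper.

```javascript
// Mirrors a $project that drops _id, renames it to name, and adds a computed delta.
function projectDelta(doc) {
  return {
    name: doc._id,                    // renamed field
    delta: doc.pop2010 - doc.pop1990  // computed field
  };                                  // fields not listed are dropped
}

const projected = projectDelta({ _id: "South Dakota", pop1990: 696004, pop2010: 814180 });
console.log(projected); // { name: 'South Dakota', delta: 118176 }
```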

Including and Excluding Fields
Stage:
{ $project:
  { "_id" : 0,
    "pop1990" : 1,
    "pop2010" : 1 }
}
Input documents:
{ "_id" : "Virginia", "pop1990" : 453588, "pop2010" : 3725789 }
{ "_id" : "South Dakota", "pop1990" : 453588, "pop2010" : 3725789 }
Output documents:
{ "pop1990" : 453588, "pop2010" : 3725789 }
{ "pop1990" : 453588, "pop2010" : 3725789 }

Renaming and Computing Fields
Stage:
{ $project:
  { "_id" : 0,
    "name" : "$_id",
    "delta" : {"$subtract" : ["$pop2010", "$pop1990"]} }
}
Input documents:
{ "_id" : "Virginia", "pop1990" : 6187358, "pop2010" : 8001024 }
{ "_id" : "South Dakota", "pop1990" : 696004, "pop2010" : 814180 }
Output documents:
{ "name" : "Virginia", "delta" : 1813666 }
{ "name" : "South Dakota", "delta" : 118176 }

Compare number of people living within
500KM of Memphis, TN in 1990, 2000, 2010
db.cData.aggregate([
{$geoNear : {
"near" : {"type" : "Point", "coordinates" : [-90, 35]},
"distanceField" : "dist.calculated",
"maxDistance" : 500000,
"includeLocs" : "dist.location",
"spherical" : true }},
{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}}
])

$geoNear
• Order/Filter Documents by Location
– Requires a geospatial index
– Output includes physical distance
– Must be first aggregation stage
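Conceptually, the stage computes a distance for every document, drops those beyond maxDistance, and orders the rest by proximity. The sketch below does this in plain JavaScript with the haversine formula and a mean Earth radius; it is an approximation for illustration rather than MongoDB's own calculation, and the helper names and the two sample points "A" and "B" are hypothetical.

```javascript
const EARTH_RADIUS_M = 6371000; // mean Earth radius in metres (approximation)
const toRad = deg => (deg * Math.PI) / 180;

// Great-circle distance between two [longitude, latitude] points (haversine).
function distanceMeters([lon1, lat1], [lon2, lat2]) {
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Attach the distance, keep documents within maxDistance, nearest first.
function geoNear(docs, near, maxDistance) {
  return docs
    .map(d => ({ ...d, dist: distanceMeters(near, d.center) }))
    .filter(d => d.dist <= maxDistance)
    .sort((a, b) => a.dist - b.dist);
}

const nearby = geoNear(
  [
    { name: "A", center: [-86.6, 35.8] }, // a few hundred km east of the query point
    { name: "B", center: [-78.6, 37.5] }  // roughly 1,000 km away
  ],
  [-90, 35], // longitude, latitude
  500000     // metres
);
console.log(nearby.map(d => d.name)); // [ 'A' ]
```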

$geoNear
Stage:
{$geoNear : {
    "near" : {"type" : "Point", "coordinates" : [-90, 35]},
    maxDistance : 500000,
    spherical : true }}
Input documents:
{ "_id" : "Tennessee", "pop1990" : 4877185, "pop2010" : 6346105,
  "center" : {"type" : "Point", "coordinates" : [-86.6, 37.8]} }
{ "_id" : "Virginia", "pop1990" : 6187358, "pop2010" : 8001024,
  "center" : {"type" : "Point", "coordinates" : [-78.6, 37.5]} }
Output documents:
{ "_id" : "Tennessee", "pop1990" : 4877185, "pop2010" : 6346105,
  "center" : {"type" : "Point", "coordinates" : [-86.6, 37.8]} }

What if I want to save the results to a collection?
db.cData.aggregate([
{$geoNear : {
"near" : {"type" : "Point", "coordinates" : [-90, 35]},
"distanceField" : "dist.calculated",
"maxDistance" : 500000,
"includeLocs" : "dist.location",
"spherical" : true }},
{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}},
{$out : "peopleNearMemphis"}
])

$out
db.cData.aggregate([<pipeline stages>,
{"$out" : "resultsCollection"}])
• Save aggregation results to a new collection
• New aggregation uses:
• Transform documents - ETL

Back To The Original Question
• Which US Division has the fastest growing population density?
– We only want to include data from states with more than 1M people
– We only want to include divisions larger than 100K square miles

Division with Fastest Growing Pop Density
db.cData.aggregate(
[{$match : {"data.totalPop" : {"$gt" : 1000000}}},
{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"},
"areaM" : {"$first" : "$areaM"},
"division" : {"$first" : "$division"}}},
{$group : {"_id" : "$division",
"totalPop1990" : {"$sum" : "$pop1990"},
"totalPop2010" : {"$sum" : "$pop2010"},
"totalAreaM" : {"$sum" : "$areaM"}}},
{$match : {"totalAreaM" : {"$gt" : 100000}}},
{$project : {"_id" : 0,
"division" : "$_id",
"density1990" : {"$divide" : ["$totalPop1990", "$totalAreaM"]},
"density2010" : {"$divide" : ["$totalPop2010", "$totalAreaM"]},
"denDelta" : {"$subtract" : [{"$divide" : ["$totalPop2010",
"$totalAreaM"]},
{"$divide" : ["$totalPop1990",
"$totalAreaM"]}]},
"totalAreaM" : 1,
"totalPop1990" : 1,
"totalPop2010" : 1}},
{$sort : {"denDelta" : -1}}])

Aggregate options
db.cData.aggregate([<pipeline stages>],
{'explain' : false,
'allowDiskUse' : true,
'cursor' : {'batchSize' : 5}})
explain – similar to find().explain()
allowDiskUse – enable use of disk to store intermediate results
cursor – specify the size of the initial result batch

Sharding
• Workload split between shards
– Shards execute pipeline up to a point
– Primary shard merges cursors and continues processing*
– Use explain to analyze pipeline split
– Early $match may excuse shards
– Potential CPU and memory implications for primary shard host
*Prior to v2.6 second stage pipeline processing was done by mongos
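The split can be pictured as two phases: each shard produces partial results, and a single node merges them. The plain JavaScript sketch below shows the idea for an average, which has to travel as a sum and a count. It is a conceptual illustration with hypothetical function names and toy data, not a description of the server's internals.

```javascript
// Phase 1, run on every shard: partial sums and counts per region.
function shardPhase(docs) {
  const partial = {};
  for (const d of docs) {
    if (!partial[d.region]) partial[d.region] = { sum: 0, count: 0 };
    partial[d.region].sum += d.areaM;
    partial[d.region].count += 1;
  }
  return partial;
}

// Phase 2, run once on the merging node: combine partials, finish the average.
function mergePhase(partials) {
  const merged = {};
  for (const partial of partials) {
    for (const [region, p] of Object.entries(partial)) {
      if (!merged[region]) merged[region] = { sum: 0, count: 0 };
      merged[region].sum += p.sum;
      merged[region].count += p.count;
    }
  }
  return Object.entries(merged)
    .map(([region, m]) => ({ _id: region, avgAreaM: m.sum / m.count }));
}

const shardA = [{ region: "North East", areaM: 218 }, { region: "West", areaM: 300 }];
const shardB = [{ region: "North East", areaM: 90 }];
const mergedResult = mergePhase([shardPhase(shardA), shardPhase(shardB)]);
console.log(mergedResult);
// [ { _id: 'North East', avgAreaM: 154 }, { _id: 'West', avgAreaM: 300 } ]
```

All merge work lands on one node, which is the source of the CPU and memory implications noted above.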

Analytics in MongoDB?
Create
Read
Update
Delete
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort
YES!

Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing and reshaping data

Questions?
jay.runkel@mongodb.com
@jayrunkel
