Exploring the Aggregation Framework
Jay Runkel
Solutions Architect
jay.runkel@mongodb.com
@jayrunkel
Agenda
1. Analytics in MongoDB?
2. Aggregation Framework
3. Aggregation Framework in Action
– US Census Data
4. Aggregation Framework Options
Analytics in MongoDB?
Create
Read
Update
Delete
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort
For Example: US Census Data
• Census data from 1990, 2000, 2010
• Question:
Which US Division has the fastest growing population density?
– We only want to include data for states with more than 1M people
– We only want to include divisions larger than 100K square miles
Division = a group of US States
Population density = # of people/Area of division
Data is provided at the state level
US Regions and Divisions
How would we solve this in SQL?
• SELECT GROUP BY HAVING
What About MongoDB?
Aggregation Framework
What is an Aggregation Pipeline?
• A Series of Document Transformations
– Executed in stages
– Original input is a collection
– Output as a cursor or a collection
• Rich Library of Functions
– Filter, compute, group, and summarize data
– Output of one stage sent to input of next
– Operations executed in sequential order
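As a rough mental model (plain JavaScript, not MongoDB code), a pipeline can be pictured as a list of functions, each taking an array of documents and returning a new array. The helper names runPipeline, match, and limit are hypothetical and exist only for this sketch.

```javascript
// Conceptual sketch only: each "stage" maps an array of documents to a new array.
// runPipeline, match and limit are hypothetical helpers, not MongoDB APIs.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

const match = (predicate) => (docs) => docs.filter(predicate);
const limit = (n) => (docs) => docs.slice(0, n);

const input = [
  { state: "New York", region: "North East" },
  { state: "Oregon", region: "West" },
  { state: "California", region: "West" },
];

// Stages run in order: the output of match becomes the input of limit.
const result = runPipeline(input, [
  match((d) => d.region === "West"),
  limit(1),
]);
console.log(result); // [ { state: 'Oregon', region: 'West' } ]
```

Swapping the order of the two stages would change the result, which is why stage order matters.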
Aggregation Pipeline
Pipeline Operators
• $match
Filter documents
• $project
Reshape documents
• $group
Summarize documents
• $unwind
Expand documents
• $sort
Order documents
• $limit/$skip
Paginate documents
• $redact
Restrict documents
• $geoNear
Proximity sort documents
• $let,$map
Define variables
Aggregation Framework in Action
(let’s play with the census data)
MongoDB State Collection
• Document For Each State
• Name
• Region
• Division
• Census Data For 1990, 2000, 2010
– Population
– Housing Units
– Occupied Housing Units
• Census Data is an array with three subdocuments
Document Model
{ "_id" : ObjectId("54e23c7b28099359f5661525"),
"name" : "California",
"region" : "West",
"data" : [
{"totalPop" : 33871648,
"totalHouse" : 12214549,
"occHouse" : 11502870,
"year" : 2000},
{"totalPop" : 37253956,
"totalHouse" : 13680081,
"occHouse" : 12577498,
"year" : 2010},
{"totalPop" : 29760021,
"totalHouse" : 11182882,
"occHouse" : 29008161,
"year" : 1990}
],
…
}
Count, Distinct
Total US Area
db.cData.aggregate([
{"$group" : {"_id" : null,
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"}}}])
$group
• Group documents by value
– Field reference, object, constant
– Other output fields are computed
• $max, $min, $avg, $sum
• $addToSet, $push
• $first, $last
– Processes all data in memory by default
Area By Region
db.cData.aggregate([
{"$group" : {"_id" : "$region",
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"},
"numStates" : {$sum : 1},
"states" : {$push : "$name"}}}
])
Calculating Average State Area By Region
{ $group: {
    _id: "$region",
    avgAreaM: { $avg: "$areaM" }
}}
Input documents:
{ state: "New York", areaM: 218, region: "North East" }
{ state: "New Jersey", areaM: 90, region: "North East" }
{ state: "California", areaM: 300, region: "West" }
Output documents:
{ _id: "North East", avgAreaM: 154 }
{ _id: "West", avgAreaM: 300 }
Calculating Total Area and State Count
{ $group: {
    _id: "$region",
    totArea: { $sum: "$areaM" },
    sCount: { $sum: 1 }
}}
Input documents:
{ state: "New York", areaM: 218, region: "North East" }
{ state: "New Jersey", areaM: 90, region: "North East" }
{ state: "California", areaM: 300, region: "West" }
Output documents:
{ _id: "North East", totArea: 308, sCount: 2 }
{ _id: "West", totArea: 300, sCount: 1 }
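The arithmetic behind these $group examples can be checked with a small in-memory sketch. This is plain JavaScript, not MongoDB code; groupByRegion is a hypothetical helper that mimics $sum, a counting {$sum : 1}, and $avg on the three sample documents.

```javascript
// Plain-JavaScript sketch of what {$group: {_id: "$region", ...}} computes.
// groupByRegion is a hypothetical helper, not part of MongoDB.
function groupByRegion(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.region)) {
      groups.set(doc.region, { _id: doc.region, totArea: 0, sCount: 0 });
    }
    const g = groups.get(doc.region);
    g.totArea += doc.areaM; // {$sum: "$areaM"}
    g.sCount += 1;          // {$sum: 1}
  }
  // {$avg: "$areaM"} is the sum divided by the count.
  return [...groups.values()].map((g) => ({
    ...g,
    avgAreaM: g.totArea / g.sCount,
  }));
}

const stateDocs = [
  { state: "New York", areaM: 218, region: "North East" },
  { state: "New Jersey", areaM: 90, region: "North East" },
  { state: "California", areaM: 300, region: "West" },
];

const grouped = groupByRegion(stateDocs);
console.log(grouped);
```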
Total US Population By Year
db.cData.aggregate(
[{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {$sum : "$data.totalPop"}}},
{$sort : {"totalPop" : 1}}
])
$unwind
• Operate on an array field
– Create documents from array elements
• Array replaced by element value
• Missing/empty fields → no output
• Non-array fields → error
– Pipe to $group to aggregate
$unwind
{ $unwind: "$census" }
Input documents:
{ state: "New York", census: [1990, 2000, 2010] }
{ state: "New Jersey", census: [1990, 2000] }
{ state: "California", census: [1980, 1990, 2000, 2010] }
{ state: "Delaware", census: [1990, 2000] }
Output documents:
{ state: "New York", census: 1990 }
{ state: "New York", census: 2000 }
{ state: "New York", census: 2010 }
{ state: "New Jersey", census: 1990 }
{ state: "New Jersey", census: 2000 }
…
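A minimal in-memory sketch of the $unwind behaviour described above, in plain JavaScript. The unwind helper is hypothetical and follows the rules listed on the earlier $unwind slide (missing or empty arrays give no output, non-array values raise an error).

```javascript
// unwind is a hypothetical helper that mimics {$unwind: "$census"} on in-memory data.
function unwind(docs, field) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    // Missing fields and empty arrays produce no output documents.
    if (value === undefined || (Array.isArray(value) && value.length === 0)) {
      return [];
    }
    // Non-array values are treated as an error, as the slide states.
    if (!Array.isArray(value)) {
      throw new Error(`${field} is not an array`);
    }
    // One copy of the document per element, with the array replaced by that element.
    return value.map((element) => ({ ...doc, [field]: element }));
  });
}

const censusDocs = [
  { state: "New York", census: [1990, 2000, 2010] },
  { state: "New Jersey", census: [1990, 2000] },
];

const unwound = unwind(censusDocs, "census");
console.log(unwound.length); // 5
```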
Southern State Population By Year
db.cData.aggregate(
[{$match : {"region" : "South"}},
{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"}}}])
$match
• Filter documents
– Uses existing query syntax
– No $where (server-side JavaScript)
$match
{ $match:
    { "region" : "West" }
}
Input documents:
{ state: "New York", areaM: 218, region: "North East" }
{ state: "Oregon", areaM: 245, region: "West" }
{ state: "California", areaM: 300, region: "West" }
Output documents:
{ state: "Oregon", areaM: 245, region: "West" }
{ state: "California", areaM: 300, region: "West" }
Population Delta By State from 1990 to 2010
db.cData.aggregate(
[{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"}}},
{$project : {"_id" : 0,
"name" : "$_id",
"delta" : {"$subtract" :
["$pop2010", "$pop1990"]},
"pop1990" : 1,
"pop2010" : 1}
}]
)
$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior
$first, $last
• Group accumulators, like $push and $addToSet
• Must be used in $group
• $first and $last determined by document order
• Typically used with $sort to ensure ordering is known
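Why the $sort matters can be seen with the California document from the Document Model slide, where the data array is not stored in year order. The sketch below is plain JavaScript on in-memory data rather than a MongoDB call; the variable names are illustrative.

```javascript
// Sketch of $unwind -> $sort -> $group with $first/$last, on in-memory data.
// Figures are taken from the California document shown earlier in the deck.
const california = {
  name: "California",
  data: [
    { totalPop: 33871648, year: 2000 },
    { totalPop: 37253956, year: 2010 },
    { totalPop: 29760021, year: 1990 },
  ],
};

// Without sorting, "first" is whatever happens to be stored first (the 2000 entry).
const unsortedFirst = california.data[0].totalPop;

// Equivalent of {$sort: {"data.year": 1}} before taking $first and $last.
const byYear = [...california.data].sort((a, b) => a.year - b.year);
const pop1990 = byYear[0].totalPop;                 // $first
const pop2010 = byYear[byYear.length - 1].totalPop; // $last
const delta = pop2010 - pop1990;                    // $subtract

console.log({ name: california.name, pop1990, pop2010, delta });
```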
$project
• Reshape Documents
– Include, exclude or rename fields
– Inject computed fields
– Create sub-document fields
Including and Excluding Fields
{ $project:
    { "_id" : 0,
      "pop1990" : 1,
      "pop2010" : 1 }
}
Input documents:
{ "_id" : "Virginia", "pop1990" : 453588, "pop2010" : 3725789 }
{ "_id" : "South Dakota", "pop1990" : 453588, "pop2010" : 3725789 }
Output documents:
{ "pop1990" : 453588, "pop2010" : 3725789 }
{ "pop1990" : 453588, "pop2010" : 3725789 }
Renaming and Computing Fields
{ $project:
    { "_id" : 0,
      "name" : "$_id",
      "delta" : {"$subtract" : ["$pop2010", "$pop1990"]} }
}
Input documents:
{ "_id" : "Virginia", "pop1990" : 6187358, "pop2010" : 8001024 }
{ "_id" : "South Dakota", "pop1990" : 696004, "pop2010" : 814180 }
Output documents:
{ "name" : "Virginia", "delta" : 1813666 }
{ "name" : "South Dakota", "delta" : 118176 }
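The rename and the computed delta can be verified with a few lines of plain JavaScript. The project function here is a hypothetical stand-in for the $project stage, applied to the two sample documents.

```javascript
// project is a hypothetical stand-in for the $project stage:
// it drops _id, exposes it as "name", and computes delta = pop2010 - pop1990.
const project = (doc) => ({
  name: doc._id,
  delta: doc.pop2010 - doc.pop1990,
});

const groupedStates = [
  { _id: "Virginia", pop1990: 6187358, pop2010: 8001024 },
  { _id: "South Dakota", pop1990: 696004, pop2010: 814180 },
];

const projected = groupedStates.map(project);
console.log(projected);
```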
Compare number of people living within
500KM of Memphis, TN in 1990, 2000, 2010
db.cData.aggregate([
{$geoNear : {
"near" : {"type" : "Point", "coordinates" : [-90, 35]},
"distanceField" : "dist.calculated",
"maxDistance" : 500000,
"includeLocs" : "dist.location",
"spherical" : true }},
{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}}
])
$geoNear
• Order/Filter Documents by Location
– Requires a geospatial index
– Output includes physical distance
– Must be first aggregation stage
$geoNear
{$geoNear : {
    "near" : {"type" : "Point", "coordinates" : [-90, 35]},
    maxDistance : 500000,
    spherical : true }}
Input documents:
{ "_id" : "Tennessee", "pop1990" : 4877185, "pop2010" : 6346105,
  "center" : {"type" : "Point", "coordinates" : [-86.6, 37.8]} }
{ "_id" : "Virginia", "pop1990" : 6187358, "pop2010" : 8001024,
  "center" : {"type" : "Point", "coordinates" : [-78.6, 37.5]} }
Output documents:
{ "_id" : "Tennessee", "pop1990" : 4877185, "pop2010" : 6346105,
  "center" : {"type" : "Point", "coordinates" : [-86.6, 37.8]} }
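To see why only Tennessee survives the 500,000 m limit, the distances can be estimated with a haversine calculation in plain JavaScript. This is a sketch under assumptions: haversineMeters is a hypothetical helper, the Earth radius of 6,371,000 m is an assumed mean value (the server may use a slightly different one), and longitudes are written as negative because GeoJSON points are [longitude, latitude] with west expressed as negative.

```javascript
// Great-circle distance in metres between two [longitude, latitude] points.
// haversineMeters is a hypothetical helper; 6371000 m is an assumed mean Earth radius.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const R = 6371000;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

const memphis = [-90, 35];        // approximate query point from the slide
const tennessee = [-86.6, 37.8];  // "center" of the Tennessee sample document
const virginia = [-78.6, 37.5];   // "center" of the Virginia sample document

const maxDistance = 500000;
const tnDistance = haversineMeters(memphis, tennessee);
const vaDistance = haversineMeters(memphis, virginia);

console.log("Tennessee within range:", tnDistance <= maxDistance);
console.log("Virginia within range:", vaDistance <= maxDistance);
```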
What if I want to save the results to a collection?
db.cData.aggregate([
{$geoNear : {
"near" : {"type" : "Point", "coordinates" : [-90, 35]},
"distanceField" : "dist.calculated",
"maxDistance" : 500000,
"includeLocs" : "dist.location",
"spherical" : true }},
{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}},
{$out : "peopleNearMemphis"}
])
$out
db.cData.aggregate([<pipeline stages>,
{"$out" : "resultsCollection"}])
• Save aggregation results to a new collection
• Enables new aggregation uses:
– Transform documents (ETL)
Back To The Original Question
• Which US Division has the fastest growing population density?
– We only want to include data for states with more than 1M people
– We only want to include divisions larger than 100K square miles
Division with Fastest Growing Pop Density
db.cData.aggregate(
[{$match : {"data.totalPop" : {"$gt" : 1000000}}},
{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"},
"areaM" : {"$first" : "$areaM"},
"division" : {"$first" : "$division"}}},
{$group : {"_id" : "$division",
"totalPop1990" : {"$sum" : "$pop1990"},
"totalPop2010" : {"$sum" : "$pop2010"},
"totalAreaM" : {"$sum" : "$areaM"}}},
{$match : {"totalAreaM" : {"$gt" : 100000}}},
{$project : {"_id" : 0,
"division" : "$_id",
"density1990" : {"$divide" : ["$totalPop1990", "$totalAreaM"]},
"density2010" : {"$divide" : ["$totalPop2010", "$totalAreaM"]},
"denDelta" : {"$subtract" : [{"$divide" : ["$totalPop2010", "$totalAreaM"]},
{"$divide" : ["$totalPop1990", "$totalAreaM"]}]},
"totalAreaM" : 1,
"totalPop1990" : 1,
"totalPop2010" : 1}},
{$sort : {"denDelta" : -1}}])
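The logic of this pipeline can be traced on a handful of documents with plain JavaScript. Everything below is a sketch: the state names, divisions, areas, and populations are made-up values, and fastestGrowingDivisions is a hypothetical function that mirrors the stages in order rather than calling MongoDB.

```javascript
// In-memory sketch of the pipeline's logic. All figures are made up for illustration.
const sampleStates = [
  { name: "A", division: "D1", areaM: 70000,
    data: [{ year: 1990, totalPop: 2000000 }, { year: 2010, totalPop: 3000000 }] },
  { name: "B", division: "D1", areaM: 50000,
    data: [{ year: 2010, totalPop: 4600000 }, { year: 1990, totalPop: 4000000 }] },
  { name: "C", division: "D2", areaM: 150000,
    data: [{ year: 1990, totalPop: 1500000 }, { year: 2010, totalPop: 1800000 }] },
  { name: "D", division: "D2", areaM: 90000,
    data: [{ year: 1990, totalPop: 400000 }, { year: 2010, totalPop: 500000 }] },
  { name: "E", division: "D3", areaM: 60000,
    data: [{ year: 1990, totalPop: 5000000 }, { year: 2010, totalPop: 9000000 }] },
];

function fastestGrowingDivisions(states) {
  const divisions = new Map();
  for (const s of states) {
    // First $match: keep states where some census entry exceeds 1M people.
    if (!s.data.some((d) => d.totalPop > 1000000)) continue;
    // $unwind + $sort + first $group: earliest and latest population per state.
    const byYear = [...s.data].sort((a, b) => a.year - b.year);
    const pop1990 = byYear[0].totalPop;
    const pop2010 = byYear[byYear.length - 1].totalPop;
    // Second $group: totals per division.
    const d = divisions.get(s.division) ||
      { division: s.division, totalPop1990: 0, totalPop2010: 0, totalAreaM: 0 };
    d.totalPop1990 += pop1990;
    d.totalPop2010 += pop2010;
    d.totalAreaM += s.areaM;
    divisions.set(s.division, d);
  }
  return [...divisions.values()]
    .filter((d) => d.totalAreaM > 100000) // second $match
    .map((d) => ({                         // $project
      ...d,
      density1990: d.totalPop1990 / d.totalAreaM,
      density2010: d.totalPop2010 / d.totalAreaM,
      denDelta: d.totalPop2010 / d.totalAreaM - d.totalPop1990 / d.totalAreaM,
    }))
    .sort((a, b) => b.denDelta - a.denDelta); // $sort descending
}

const ranking = fastestGrowingDivisions(sampleStates);
console.log(ranking.map((d) => [d.division, d.denDelta]));
```

In this made-up data, state D is dropped by the population filter before its area is summed, and division D3 is dropped by the area filter, so only D1 and D2 are ranked.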
Aggregate Options
Aggregate options
db.cData.aggregate([<pipeline stages>],
{'explain' : false,
'allowDiskUse' : true,
'cursor' : {'batchSize' : 5}})
explain – similar to find().explain()
allowDiskUse – enable use of disk to store intermediate results
cursor – specify the batch size of the initial result
Aggregation and Sharding
Sharding
• Workload split between shards
– Shards execute pipeline up to a point
– Primary shard merges cursors and continues processing*
– Use explain to analyze pipeline split
– Early $match may exclude shards
– Potential CPU and memory implications for primary shard host
*Prior to v2.6, second-stage pipeline processing was done by mongos
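The split between per-shard work and the merge step can be pictured with a plain JavaScript sketch. This is only a conceptual illustration, not MongoDB's internal implementation: partialGroup and mergeGroups are hypothetical helpers, and the two arrays stand in for documents held on two shards.

```javascript
// Conceptual sketch: each shard computes partial results, then one node merges them.
// partialGroup and mergeGroups are hypothetical helpers, not MongoDB internals.
function partialGroup(docs) {
  const partial = {};
  for (const { region, areaM } of docs) {
    partial[region] = partial[region] || { sum: 0, count: 0 };
    partial[region].sum += areaM;
    partial[region].count += 1;
  }
  return partial;
}

function mergeGroups(partials) {
  const merged = {};
  for (const partial of partials) {
    for (const [region, { sum, count }] of Object.entries(partial)) {
      merged[region] = merged[region] || { sum: 0, count: 0 };
      merged[region].sum += sum;
      merged[region].count += count;
    }
  }
  // An average cannot be merged directly, so sum and count travel separately.
  return Object.entries(merged).map(([region, { sum, count }]) => ({
    _id: region,
    totalArea: sum,
    avgArea: sum / count,
  }));
}

const shardA = [
  { region: "North East", areaM: 218 },
  { region: "West", areaM: 300 },
];
const shardB = [{ region: "North East", areaM: 90 }];

const mergedResult = mergeGroups([partialGroup(shardA), partialGroup(shardB)]);
console.log(mergedResult);
```

The merge step touches every group from every shard, which is one way to picture the CPU and memory cost on the merging host.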
Summary
Analytics in MongoDB?
Create
Read
Update
Delete
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort
YES!
Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing and reshaping data
Questions?
jay.runkel@mongodb.com
@jayrunkel
