Mongo Analytics – 
Learn aggregation by example 
Exploratory Analytics and 
Visualization using Flight Data 
www.jsonstudio.com
Analyzing Flight Data 
• JSON data imported from CSV downloaded from www.transtats.bts.gov 
• Sample document for a flight: 
{ 
"_id": { "$oid": "534205f61c479f6149a92709" }, 
"YEAR": 2013, "QUARTER": 1, 
"MONTH": 1, 
"DAY_OF_MONTH": 18, 
"DAY_OF_WEEK": 5, 
"FL_DATE": "2013-01-18", 
"UNIQUE_CARRIER": "DL", 
"AIRLINE_ID": 19790, 
"CARRIER": "DL", 
"TAIL_NUM": "N325US", 
"FL_NUM": 1497, 
"ORIGIN_AIRPORT_ID": 14100, 
"ORIGIN_AIRPORT_SEQ_ID": 1410002, 
"ORIGIN_CITY_MARKET_ID": 34100, 
"ORIGIN": "PHL", 
"ORIGIN_CITY_NAME": "Philadelphia, PA", 
"ORIGIN_STATE_ABR": "PA", 
"ORIGIN_STATE_FIPS": 42, 
"DEST_AIRPORT_ID": 13487, 
"DEST_AIRPORT_SEQ_ID": 1348702, 
"DEST_CITY_MARKET_ID": 31650, 
"DEST": "MSP", 
"DEST_CITY_NAME": "Minneapolis, MN", 
"DEST_STATE_ABR": "MN", 
"DEST_STATE_FIPS": 27, 
"DEST_STATE_NM": "Minnesota", 
"DEST_WAC": 63, 
"CRS_DEP_TIME": 805, 
"DEP_TIME": 758, 
"DEP_DELAY": -7, 
"DEP_DELAY_NEW": 0, 
"DEP_DEL15": 0, 
"DEP_DELAY_GROUP": -1, 
"DEP_TIME_BLK": "0800-0859", 
"TAXI_OUT": 24, 
"WHEELS_OFF": 822, 
"WHEELS_ON": 958, 
"TAXI_IN": 4, 
"CRS_ARR_TIME": 1015, 
"ARR_TIME": 1002, 
"ARR_DELAY": -13, 
"ARR_DELAY_NEW": 0, 
"ARR_DEL15": 0, 
"ARR_DELAY_GROUP": -1, 
"ARR_TIME_BLK": "1000-1059", 
"CANCELLED": 0, 
"CANCELLATION_CODE": "", 
"DIVERTED": 0, 
"CRS_ELAPSED_TIME": 190, 
"ACTUAL_ELAPSED_TIME": 184, 
"AIR_TIME": 156, 
"FLIGHTS": 1, 
"DISTANCE": 980, 
"DISTANCE_GROUP": 4, 
"CARRIER_DELAY": "", 
"WEATHER_DELAY": "", 
"NAS_DELAY": "", 
"SECURITY_DELAY": "", 
"LATE_AIRCRAFT_DELAY": "", 
"FIRST_DEP_TIME": "", 
"TOTAL_ADD_GTIME": "", 
"LONGEST_ADD_GTIME": "", 
"": "" 
} 
• We will build aggregation pipelines and visualize data using JSON Studio (www.jsonstudio.com) 
• You will fall madly in love with the Aggregation Framework and its capabilities
MongoDB aggregation steps/stages 
• Grouping 
• Matching/filtering 
• Projection 
• Sorting 
• Unwind 
• Limit, skip 
• Added in 2.6 
– Out 
– Redact
Who are the largest carriers?
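The slide answers this question with a chart only. A minimal sketch of a pipeline that could produce such a ranking is shown here, assuming "largest" means most flights and that the documents live in a collection named flights (both are assumptions, as are the limit of 10 and the made-up sample rows). The plain-JavaScript function repeats the same logic on an in-memory array so the result can be checked without a server.

```javascript
// Pipeline sketch: count flights per carrier, largest first.
// Would be passed to db.flights.aggregate(largestCarriers);
// the collection name "flights" and the limit of 10 are assumptions.
const largestCarriers = [
  { $group: { _id: { CARRIER: "$CARRIER" }, count: { $sum: 1 } } },
  { $sort: { count: -1 } },
  { $limit: 10 }
];

// Plain-JavaScript equivalent of $group + $sort + $limit for a small array.
function countByCarrier(docs, limit) {
  const counts = new Map();
  for (const d of docs) {
    counts.set(d.CARRIER, (counts.get(d.CARRIER) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([carrier, count]) => ({ _id: { CARRIER: carrier }, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

// Made-up sample rows, for illustration only.
const sampleFlights = [
  { CARRIER: "DL" }, { CARRIER: "WN" }, { CARRIER: "DL" },
  { CARRIER: "9E" }, { CARRIER: "DL" }, { CARRIER: "WN" }
];
const topCarriers = countByCarrier(sampleFlights, 2);
console.log(topCarriers);
```

Because every sample document carries "FLIGHTS": 1, summing "$FLIGHTS" instead of the constant 1 should give the same counts.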
Some Carrier Stats { 
"$group": { 
"_id": { 
"CARRIER": "$CARRIER" 
}, 
"_avg_DEP_DELAY": { 
"$avg": "$DEP_DELAY" 
}, 
"_avg_ARR_DELAY": { 
"$avg": "$ARR_DELAY" 
}, 
"_avg_DISTANCE_GROUP": { 
"$avg": "$DISTANCE_GROUP" 
}, 
"_avg_TAXI_IN": { 
"$avg": "$TAXI_IN" 
}, 
"_avg_TAXI_OUT": { 
"$avg": "$TAXI_OUT" 
} 
} 
} 
{ 
"_id": { 
"CARRIER": "9E" 
}, 
"_avg_DEP_DELAY": 8.45451754385965, 
"_avg_ARR_DELAY": 3.3237368838726744, 
"_avg_DISTANCE_GROUP": 2.2188688815622624, 
"_avg_TAXI_IN": 7.082464246424642, 
"_avg_TAXI_OUT": 20.558167120639663 
}
Which airports have the most cancellations?
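One way to approach this question, sketched under assumptions: since CANCELLED is 0 or 1 in the sample document, $sum over it counts cancellations and $avg over it gives the cancelled fraction per airport. Grouping by ORIGIN and the made-up sample rows are illustrative choices, not the pipeline behind the original chart.

```javascript
// Pipeline sketch: cancellations and cancellation rate per origin airport.
// CANCELLED is 0 or 1, so $sum counts cancellations and $avg is the rate.
const cancellationsByAirport = [
  { $group: {
      _id: { ORIGIN: "$ORIGIN" },
      flights: { $sum: 1 },
      cancelled: { $sum: "$CANCELLED" },
      cancel_rate: { $avg: "$CANCELLED" } } },
  { $sort: { cancelled: -1 } }
];

// Plain-JavaScript equivalent for a small in-memory array.
function cancellationsPerOrigin(docs) {
  const byOrigin = new Map();
  for (const d of docs) {
    const g = byOrigin.get(d.ORIGIN) || { flights: 0, cancelled: 0 };
    g.flights += 1;
    g.cancelled += d.CANCELLED;
    byOrigin.set(d.ORIGIN, g);
  }
  return [...byOrigin.entries()]
    .map(([origin, g]) => ({
      _id: { ORIGIN: origin },
      flights: g.flights,
      cancelled: g.cancelled,
      cancel_rate: g.cancelled / g.flights
    }))
    .sort((a, b) => b.cancelled - a.cancelled);
}

// Made-up sample rows, for illustration only.
const cancelSample = [
  { ORIGIN: "PHL", CANCELLED: 0 },
  { ORIGIN: "ORD", CANCELLED: 1 },
  { ORIGIN: "ORD", CANCELLED: 1 },
  { ORIGIN: "PHL", CANCELLED: 1 },
  { ORIGIN: "ORD", CANCELLED: 0 },
  { ORIGIN: "ORD", CANCELLED: 0 }
];
const cancelStats = cancellationsPerOrigin(cancelSample);
console.log(cancelStats);
```

Reporting the rate next to the raw count matters here: a busy airport can top the list by count while having an unremarkable rate.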
Which carriers are most at fault for cancellations?
Arrival delays by distance
Delays by distance by carrier
Delays by distance by carrier – long haul only
Words of caution (courtesy of David Weisman)
What to do? 
“Touch” the data – e.g. Histograms
Order Does Matter 
http://docs.mongodb.org/manual/core/aggregation-pipeline-optimization/
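A small sketch of why stage order matters, assuming a filter on a single carrier ("DL" is an arbitrary example value and the rows are made up). Both orderings yield the same counts, but filtering first means fewer documents reach the grouping step, and a leading $match is the stage that can make use of an index.

```javascript
// Filter first: only matching documents are grouped.
const filterFirst = [
  { $match: { CARRIER: "DL" } },
  { $group: { _id: { ORIGIN: "$ORIGIN" }, count: { $sum: 1 } } }
];
// Filter last: every document is grouped, then most groups are thrown away.
const filterLast = [
  { $group: { _id: { CARRIER: "$CARRIER", ORIGIN: "$ORIGIN" }, count: { $sum: 1 } } },
  { $match: { "_id.CARRIER": "DL" } }
];

// Plain-JavaScript counting helper used to mimic both orderings.
function groupCount(docs, keyFn) {
  const counts = new Map();
  for (const d of docs) {
    const k = keyFn(d);
    counts.set(k, (counts.get(k) || 0) + 1);
  }
  return counts;
}

// Made-up sample rows, for illustration only.
const rows = [
  { CARRIER: "DL", ORIGIN: "ATL" }, { CARRIER: "DL", ORIGIN: "ATL" },
  { CARRIER: "WN", ORIGIN: "ATL" }, { CARRIER: "UA", ORIGIN: "ORD" },
  { CARRIER: "DL", ORIGIN: "MSP" }
];

// Mimics filterFirst: 3 of 5 rows reach the grouping step.
const dlRows = rows.filter(d => d.CARRIER === "DL");
const early = groupCount(dlRows, d => d.ORIGIN);

// Mimics filterLast: all 5 rows are grouped, then non-DL groups are dropped.
const allGroups = groupCount(rows, d => d.CARRIER + "|" + d.ORIGIN);
const late = new Map(
  [...allGroups.entries()]
    .filter(([k]) => k.startsWith("DL|"))
    .map(([k, n]) => [k.slice(3), n])
);
console.log(early, late);
```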
An example for $unwind 
Count how many airports each carrier lands in 
{ 
"_id": { 
"$oid": "5383623b7bfb8767e2e9ca1f" 
}, 
"iata": "00M", 
"airport": "Thigpen ", 
"city": "Bay Springs", 
"state": "MS", 
"country": "USA", 
"lat": 31.95376472, 
"long": -89.23450472, 
"carriers": [ 
"AA", 
"UA", 
"DL", 
"BA" 
] 
} 
… 
[ 
{ 
"_id": { 
"$oid": "5383623b7bfb8767e2e9ca1f" 
}, 
"iata": "00M", 
"airport": "Thigpen ", 
"city": "Bay Springs", 
"state": "MS", 
"country": "USA", 
"lat": 31.95376472, 
"long": -89.23450472, 
"carriers": "AA" 
}, 
{ 
"_id": { 
"$oid": "542217ffc026b858b47a6640" 
}, 
"iata": "00M", 
"airport": "Thigpen ", 
"city": "Bay Springs", 
"state": "MS", 
"country": "USA", 
"lat": 31.95376472, 
"long": -89.23450472, 
"carriers": "UA" 
} 
… 
] 
[ 
{ 
"_id": { 
"carriers": "BA" 
}, 
"count": 10 
}, 
{ 
"_id": { 
"carriers": "DL" 
}, 
"count": 10 
} 
… 
] 
airports2 → $unwind → $group
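The two stages in this example can be sketched as follows. The pipeline mirrors the output shown above (an _id with a carriers key and a count); the plain-JavaScript helpers and the extra airports X01 to X03 are hypothetical and exist only to make the behaviour checkable in memory.

```javascript
// Pipeline sketch for the airports2 collection: one output document per
// carrier, with the number of airports that list it.
const airportsPerCarrier = [
  { $unwind: "$carriers" },
  { $group: { _id: { carriers: "$carriers" }, count: { $sum: 1 } } }
];

// Plain-JavaScript $unwind: emit one copy of the document per array element.
// Assumes the field is an array or missing; documents with an empty or
// missing array produce no output, as with $unwind's default behaviour.
function unwind(docs, field) {
  const out = [];
  for (const d of docs) {
    const values = Array.isArray(d[field]) ? d[field] : [];
    for (const v of values) {
      out.push({ ...d, [field]: v });
    }
  }
  return out;
}

// Plain-JavaScript $group with a count accumulator.
function countBy(docs, field) {
  const counts = {};
  for (const d of docs) {
    counts[d[field]] = (counts[d[field]] || 0) + 1;
  }
  return counts;
}

// First row is taken from the slide; X01..X03 are made-up airports.
const airports = [
  { iata: "00M", carriers: ["AA", "UA", "DL", "BA"] },
  { iata: "X01", carriers: ["AA", "DL"] },
  { iata: "X02", carriers: ["DL"] },
  { iata: "X03", carriers: [] }
];
const unwound = unwind(airports, "carriers");
const carrierCounts = countBy(unwound, "carriers");
console.log(unwound.length, carrierCounts);
```

Each unwound document keeps every other field of its parent, which is why the grouping stage can still see the airport attributes if it needs them.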
Hub airports – try 1
Hub airports – try 2
Hub airports – try 3 
{ $group: { _id: { ORIGIN: "$ORIGIN", CARRIER: "$CARRIER" }, count: { $sum: 1 } } }, 
{ $project: { airport: "$_id.ORIGIN", carrier: "$_id.CARRIER", "count": 1 } }, 
{ $match: { "count": { $gte: "$$hub_threshold" } } }, 
{ $group: { 
_id: { airport: "$airport" }, 
airlines: { $sum: 1 }, 
flights: { $sum: "$count" }, 
avg_airline: { $avg: "$count" }, 
max_airline: { $max: "$count" } } }, 
{ $project: { 
"airlines": 1, 
"flights": 1, 
"avg_airline": 1, 
"max_airline": 1, 
"avg_no_max": { $divide: [ { $subtract: [ "$flights", "$max_airline" ] }, "$airlines" ] } } }, 
{ $sort: { "flights": -1 } }
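The "$$hub_threshold" value in the $match stage looks like a placeholder that the tool fills in before the pipeline is sent to the server; in a $match query, MongoDB itself would compare count against that literal string. The helper below is a hypothetical sketch of such a substitution step (bindVariables is not a MongoDB or JSON Studio function, and the threshold of 1000 is an arbitrary example value).

```javascript
// Hypothetical helper: replace "$$name" placeholders with concrete values.
// Strings whose name is not in `vars` (for example "$$ROOT") are left alone.
function bindVariables(node, vars) {
  if (typeof node === "string" && node.startsWith("$$")) {
    const name = node.slice(2);
    return Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : node;
  }
  if (Array.isArray(node)) {
    return node.map(n => bindVariables(n, vars));
  }
  if (node !== null && typeof node === "object") {
    const out = {};
    for (const [key, value] of Object.entries(node)) {
      out[key] = bindVariables(value, vars);
    }
    return out;
  }
  return node; // numbers, booleans, null, ordinary strings
}

// First three stages of the slide's pipeline, with the placeholder intact.
const hubTemplate = [
  { $group: { _id: { ORIGIN: "$ORIGIN", CARRIER: "$CARRIER" }, count: { $sum: 1 } } },
  { $project: { airport: "$_id.ORIGIN", carrier: "$_id.CARRIER", count: 1 } },
  { $match: { count: { $gte: "$$hub_threshold" } } }
];

// 1000 is an arbitrary illustrative threshold.
const hubPipeline = bindVariables(hubTemplate, { hub_threshold: 1000 });
console.log(JSON.stringify(hubPipeline[2]));
```

The function returns a new structure and leaves the template untouched, so the same template can be bound again with a different threshold.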
Hub airports
From-to Insensitive 
{ $group: { _id: { UNIQUE_CARRIER: "$UNIQUE_CARRIER", ORIGIN: "$ORIGIN", 
DEST: "$DEST" }, count: { $sum: 1 } } }, 
{ $match: { "count": { $gt: "$$count_threshold" } } }, 
{ $project: { _id_UNIQUE_CARRIER: "$_id.UNIQUE_CARRIER", "count": 1, 
rroute: { 
$cond: [ 
{ $lt: [ { $cmp: [ "$_id.ORIGIN", "$_id.DEST" ] }, 0 ] }, 
{ $concat: [ "$_id.ORIGIN", "$_id.DEST" ] }, 
{ $concat: [ "$_id.DEST", "$_id.ORIGIN" ] } 
] } } 
}, 
{ $group: { _id: { _id_UNIQUE_CARRIER: "$_id_UNIQUE_CARRIER", rroute: "$rroute" }, 
_sum_count: { $sum: "$count" } } }
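The $cond expression above builds a direction-independent route key by always placing the alphabetically smaller airport code first. A plain-JavaScript sketch of the same idea follows; routeKey and sumByRoute are hypothetical helper names and the per-direction counts are made up.

```javascript
// Mirrors the $cond / $cmp / $concat expression: smaller code goes first.
function routeKey(origin, dest) {
  return origin < dest ? origin + dest : dest + origin;
}

// Mirrors the final $group: sum per-direction counts per carrier and route.
function sumByRoute(rowsIn) {
  const totals = {};
  for (const r of rowsIn) {
    const key = r.UNIQUE_CARRIER + "|" + routeKey(r.ORIGIN, r.DEST);
    totals[key] = (totals[key] || 0) + r.count;
  }
  return totals;
}

// Made-up output of the first $group stage (one row per direction).
const directional = [
  { UNIQUE_CARRIER: "DL", ORIGIN: "PHL", DEST: "MSP", count: 40 },
  { UNIQUE_CARRIER: "DL", ORIGIN: "MSP", DEST: "PHL", count: 38 },
  { UNIQUE_CARRIER: "DL", ORIGIN: "ATL", DEST: "MSP", count: 25 }
];
const routeTotals = sumByRoute(directional);
console.log(routeTotals);
```

Note that in the slide's pipeline the $match on count runs before the two directions are merged, so the threshold applies to each direction separately.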
Hub visualization (using routes – from/to, $$count=1, origin treemap)
Using “R” for Advanced Analytics 
• Using a MongoDB driver for “R” 
• Using the JSON Studio Gateway (including using aggregation output) 
install.packages("jSonarR") 
library('jSonarR') 
con2 <- sonarR::new.SonarConnection('https://localhost:8443', 'localhost', 'flights', port=47017, username="ron", 
pwd="<pwd>") 
nyc_by_day <- sonarR::sonarAgg(con2, 'delays_by_day', 'NYCFlights', 
colClasses=c(X_avg_AirTime='numeric', X_avg_ArrDelay='numeric',X_avg_DepDelay='numeric')) 
lm.out = lm(nyc_by_day$X_sum_ArrDelay ~ nyc_by_day$X_sum_AirTime) 
Recommendation engine example: jsonstudio.com
NYC Flights – Quiz Questions 
• Of the three airports, who has the most flights? 
– Nyc1 
• Who has the most cancellations and highest cancellation ratio? 
– Nyc2 
• Taxi in/out times? 
– Nyc3 
• What about delays? 
– Nyc4 
• How do delays differ by month? 
– Nyc5 + nyc5 
– (summer vs. winter / bubble size vs. y-axis) 
• What about weather delays only? Which months are worse? Are the three airports 
equivalent? 
– Nyc7 + nyc7 
• Where can I fly to if I work for Boeing and am very loyal (and on which aircraft)? 
– Nyc8 + map
www.jsonstudio.com 
(download – presentation and eval copy) 
Discount code: MUGTX* 
(* Good for 1 month after event) 
ron@jsonar.com