#MDBW17
Using Aggregation for Analytics
JUMPSTART SESSION
Senior Solutions Architect, MongoDB
RUBÉN TERCEÑO
@rubenTerceno
AGENDA
• My Problem (And Maybe Yours).
• In the Search for a Solution.
• MongoDB Aggregation Framework.
• Let’s Crunch Some Numbers!
MY PROBLEM
(AND MAYBE YOURS)
ONCE UPON A TIME…
THAT MANAGER
• The CRM system is the main source of revenue information.
• Up-to-date information is critical for our business owners.
• Grouped data is much more valuable when making decisions.
• Graphs are a powerful means of presenting grouped information.
IN THE SEARCH
FOR A SOLUTION
STEP 1: ANALYTICS ON THE OPERATIONAL DB
• Running analytics on your operational database.
‒ Analytical workload affects operational users.
o Lots of table scans and heavy counts and groups.
STEP 2: ETL AND OLAP
• ETLing your data into a dedicated analytical database.
‒ Longer time to react to business requests.
o Every change affects four systems.
‒ Lack of accuracy in real-time reports.
o Data synchronization happened overnight, so today’s report is based on yesterday’s data.
STEP 3: DEDICATED NICHE PRODUCTS
• Real-time data replication (CDC), embedded BI capabilities, dedicated hardware.
‒ New skills required in the team.
o Hardware, CDC, middleware, Java, UI.
‒ The reliability of the solution was low.
o Too many moving parts.
o Monitoring and debugging were complex.
‒ Cost was very high.
o More expensive than the CRM itself!
SO… WHAT DO WE NEED?
• Analytical capabilities.
• Simple Architecture.
• Workload isolation.
• Real time data.
• High Availability.
• Cost aligned with provided value.
AGGREGATION
FRAMEWORK
MONGODB AGGREGATION FRAMEWORK
• A Series of Document Transformations.
‒ Executed in stages.
o Original input is a collection.
o Output of one stage sent as input of next.
o Output as a cursor or a collection.
• Rich Library of Functions.
‒ Filter, manipulate, group, join and summarize data.
• Optimized for performance.
‒ Full index support.
‒ Operations executed in sequential order, performing stage optimization, if possible.
EXAMPLE
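As a rough illustration of the staged model, the sketch below imitates a pipeline in plain JavaScript: each stage is a function from an array of documents to a new array, and the output of one stage becomes the input of the next. The runPipeline helper and the sample documents are invented for this example; this is not how the server executes a pipeline.

```javascript
// Hypothetical helper: feed the documents through each stage in order.
function runPipeline(docs, stages) {
  return stages.reduce((current, stage) => stage(current), docs);
}

// Invented sample data standing in for a collection.
const sales = [
  { region: "EMEA", amount: 10 },
  { region: "APAC", amount: 5 },
  { region: "EMEA", amount: 7 },
];

// Stage 1: keep only the documents we care about (like $match).
const matchStage = (docs) => docs.filter((d) => d.amount > 5);

// Stage 2: summarize per region (like $group with $sum).
const groupStage = (docs) => {
  const totals = new Map();
  for (const d of docs) {
    totals.set(d.region, (totals.get(d.region) || 0) + d.amount);
  }
  return [...totals].map(([region, total]) => ({ _id: region, total }));
};

const summary = runPipeline(sales, [matchStage, groupStage]);
console.log(summary);
```

Filtering before grouping keeps the second stage small, which is the same reason a $match usually goes first in a real pipeline.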
ARCHITECTURE – REPLICA SET
[Diagram: a replica set of three mongod processes, each on port 27017: one primary and two secondaries.]
ARCHITECTURE – DATA REPLICATION
[Diagram: replica set with one primary and two secondaries (mongod on port 27017), labeled "Oplog replication".]
ARCHITECTURE – FAILOVER
[Diagram: replica set with one primary and two secondaries (mongod on port 27017), labeled "Oplog replication" and "Heartbeat".]
ARCHITECTURE – ANALYTICAL WORKLOAD
[Diagram: replica set with one primary and two secondaries (mongod on port 27017), labeled "Oplog replication" and "Heartbeat".]
ARCHITECTURE – SECONDARY READS
[Diagram: replica set with one primary and two secondaries (mongod on port 27017), labeled "Oplog replication" and "Heartbeat".]
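One way to route the analytical queries to secondaries is a read preference on the connection string. The sketch below only builds such a URI. The host names, the replica set name and the database name are placeholders; replicaSet and readPreference are standard MongoDB connection string options.

```javascript
// Placeholder hosts and names; replace them with your own deployment.
const hosts = ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"];
const options = new URLSearchParams({
  replicaSet: "rs0",            // hypothetical replica set name
  readPreference: "secondary",  // send reads to secondary members only
});

const analyticsUri = `mongodb://${hosts.join(",")}/crm?${options}`;
console.log(analyticsUri);
```

With readPreference=secondary, reads fail when no secondary is available; secondaryPreferred falls back to the primary instead.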
LET’S CRUNCH
SOME NUMBERS
LIVE AGGREGATION FRAMEWORK DEMO
• Fingers crossed!!
PROBLEM DESCRIPTION
• A database containing the biggest ships out there and, in a different collection, their containers (shipping containers, not Docker).
• Cargo information is stored at the container level, but we need it at the ship level, where information such as the destination sits.
• We want to know the cargo of each ship so that we can answer questions like: which ships are currently in the North Atlantic, bound for the US, with more than 100,000 metric tons of iron on board?
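To make the demo easier to follow, here is a guess at the shape of the two kinds of documents. Only the field names (Name, location, route.destination.Country, shipName, cargo, Tons) come from the pipeline stages in the slides; every value below is invented, and storing location as a GeoJSON point is an assumption.

```javascript
// A document from the ships collection (values are made up).
const sampleShip = {
  Name: "Example Carrier",
  location: { type: "Point", coordinates: [-40.0, 45.0] }, // [longitude, latitude]
  route: { destination: { Country: "United States" } },
};

// A document from the containers collection (values are made up).
const sampleContainer = {
  shipName: "Example Carrier", // matches the ship's Name field
  cargo: "Iron",
  Tons: 25,
};

console.log(sampleContainer.shipName === sampleShip.Name);
```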
BUILDING THE AGGREGATION STEP BY STEP
• We’ll create one variable for every stage of the aggregation pipeline, so we can easily build and test our pipeline.
// Placeholders: each variable holds one stage document.
var myMatch = {$match: { /* filter conditions */ }};
var myGroup = {$group: { /* grouping key and accumulators */ }};
var mySort = {$sort: { /* sort keys */ }};
db.ships.aggregate([myMatch, myGroup, mySort])
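One convenient way to test a pipeline as it grows is to run every prefix of it. The sketch below uses a hypothetical pipelinePrefixes helper and illustrative stages. The call to db.ships.aggregate is guarded so the snippet also runs outside the mongo shell.

```javascript
// Illustrative stages; the grouping and sorting choices are made up.
const matchUs = { $match: { "route.destination.Country": "United States" } };
const groupByCountry = { $group: { _id: "$route.destination.Country", ships: { $sum: 1 } } };
const sortByCount = { $sort: { ships: -1 } };

// Hypothetical helper: [a, b, c] -> [[a], [a, b], [a, b, c]].
function pipelinePrefixes(stages) {
  return stages.map((_, i) => stages.slice(0, i + 1));
}

for (const partial of pipelinePrefixes([matchUs, groupByCountry, sortByCount])) {
  // `db` only exists inside the mongo shell, so guard the call.
  if (typeof db !== "undefined") {
    printjson(db.ships.aggregate(partial).toArray());
  } else {
    console.log(`would run ${partial.length} stage(s)`);
  }
}
```

Inspecting the output after each added stage makes it easy to spot which stage produces an unexpected shape.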
ALL SHIPS WITHIN THE NORTH ATLANTIC
• Our first stage is a $match. It allows us to filter the vessels. Let’s find all ships in the North Atlantic going to US ports.
// atlantic is a GeoJSON polygon covering the North Atlantic, defined beforehand.
var match = {$match: {
    location: {$geoWithin: {$geometry: atlantic}},
    "route.destination.Country": "United States"}}
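The slide assumes a variable named atlantic that already holds the area of interest. A minimal sketch of what it could look like follows, assuming location is stored as GeoJSON. The rectangle below is a crude, made-up approximation of the North Atlantic, not the polygon used in the talk.

```javascript
// Rough bounding box as a GeoJSON Polygon; coordinates are [longitude, latitude]
// and the ring must end on the point it starts from.
const atlanticSketch = {
  type: "Polygon",
  coordinates: [[
    [-75, 10], [-10, 10], [-10, 60], [-75, 60], [-75, 10],
  ]],
};

const matchSketch = {
  $match: {
    location: { $geoWithin: { $geometry: atlanticSketch } },
    "route.destination.Country": "United States",
  },
};

console.log(JSON.stringify(matchSketch, null, 2));
```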
FINDING THE CONTAINERS OF EACH SHIP
• The containers are in a different collection. To find the containers of each ship, let’s join both collections. The $lookup stage allows us to do this.
var lookup = {$lookup: {
    from: "containers",         // collection to join with
    localField: "Name",         // field in ships
    foreignField: "shipName",   // field in containers
    as: "cargo"}}               // array field that receives the matches
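To show what the stage produces, the sketch below imitates the equality-match form of $lookup in plain JavaScript on invented data. Every ship keeps its own fields and gains a cargo array holding the containers whose shipName equals the ship's Name. The lookupSketch function is illustrative only, not the server's implementation.

```javascript
// Illustrative stand-in for $lookup with localField/foreignField.
function lookupSketch(localDocs, foreignDocs, { localField, foreignField, as }) {
  return localDocs.map((doc) => ({
    ...doc,
    [as]: foreignDocs.filter((f) => f[foreignField] === doc[localField]),
  }));
}

// Invented sample data.
const shipsForLookup = [{ Name: "Alpha" }, { Name: "Beta" }];
const containersForLookup = [
  { shipName: "Alpha", cargo: "Iron", Tons: 20 },
  { shipName: "Alpha", cargo: "Grain", Tons: 15 },
];

const joined = lookupSketch(shipsForLookup, containersForLookup, {
  localField: "Name",
  foreignField: "shipName",
  as: "cargo",
});
console.log(JSON.stringify(joined));
```

Like a left outer join, a ship with no matching containers is kept and simply gets an empty cargo array.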
MANIPULATING THE ARRAY
• That huge array is not going to be usable, so let’s transform it into something easier to handle. The $unwind stage will help us: it outputs one document per array element.
var unwind = {$unwind: "$cargo"}
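The effect of unwinding can be imitated in plain JavaScript. The sketch below, on invented data, copies each ship once per element of its cargo array; unwindSketch is a hypothetical helper that assumes the field is an array or missing.

```javascript
// Illustrative stand-in for {$unwind: "$cargo"}: one output document per array element.
function unwindSketch(docs, field) {
  return docs.flatMap((doc) =>
    (doc[field] || []).map((item) => ({ ...doc, [field]: item }))
  );
}

const joinedShips = [
  { Name: "Alpha", cargo: [{ cargo: "Iron", Tons: 20 }, { cargo: "Grain", Tons: 15 }] },
  { Name: "Beta", cargo: [] }, // empty array: dropped, as $unwind does by default
];

const unwound = unwindSketch(joinedShips, "cargo");
console.log(unwound.length);
```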
GROUPING BY SHIP AND CARGO TYPE
• This stage groups the individual documents by ship and cargo type, then counts the containers and adds up the tons for each ship and cargo type.
var group = {$group: {
    _id: {ship: "$Name",
          cargo: "$cargo.cargo",
          route: "$route",
          location: "$location"},
    sum: {$sum: "$cargo.Tons"},   // total tons per ship and cargo type
    count: {$sum: 1}}}            // number of containers
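A plain JavaScript imitation of this grouping, on invented unwound documents, may help to see what the two accumulators do. For brevity the hypothetical groupSketch helper keys only on ship and cargo type and leaves route and location out of the _id.

```javascript
// Illustrative stand-in for the $group stage: key on ship and cargo type,
// then accumulate total tons and a container count per key.
function groupSketch(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const key = JSON.stringify([doc.Name, doc.cargo.cargo]);
    const g = groups.get(key) ||
      { _id: { ship: doc.Name, cargo: doc.cargo.cargo }, sum: 0, count: 0 };
    g.sum += doc.cargo.Tons;
    g.count += 1;
    groups.set(key, g);
  }
  return [...groups.values()];
}

const unwoundDocs = [
  { Name: "Alpha", cargo: { cargo: "Iron", Tons: 20 } },
  { Name: "Alpha", cargo: { cargo: "Iron", Tons: 30 } },
  { Name: "Alpha", cargo: { cargo: "Grain", Tons: 15 } },
];

const grouped = groupSketch(unwoundDocs);
console.log(JSON.stringify(grouped));
```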
MANIPULATING THE FIELD NAMES
• It’s possible to change the shape of our documents at any point thanks to the $project stage. Let’s put the cargo info in a subdocument.
var project = {$project: {
    _id: {ship: "$_id.ship", route: "$_id.route",
          location: "$_id.location"},
    cargo: {type: "$_id.cargo",
            tons: "$sum",
            count: "$count"}}}
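The reshaping is easier to picture on a single document. The sketch below applies the same field mapping in plain JavaScript to one invented output document of the previous grouping stage; projectSketch is a hypothetical helper.

```javascript
// Illustrative reshaping with the same field mapping as the $project stage.
function projectSketch(doc) {
  return {
    _id: { ship: doc._id.ship, route: doc._id.route, location: doc._id.location },
    cargo: { type: doc._id.cargo, tons: doc.sum, count: doc.count },
  };
}

// Invented output of the previous $group stage.
const groupedDoc = {
  _id: {
    ship: "Alpha",
    cargo: "Iron",
    route: { destination: { Country: "United States" } },
    location: { type: "Point", coordinates: [-40, 45] },
  },
  sum: 50,
  count: 2,
};

const reshaped = projectSketch(groupedDoc);
console.log(JSON.stringify(reshaped.cargo));
```

The cargo type moves out of the _id, so the _id now identifies the ship alone, which is what the next grouping relies on.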
GROUPING BY SHIP
• And now let’s group again, this time only by ship. The different cargos of each ship will be pushed into a newly created array of documents.
var group2 = {$group: {
    _id: "$_id",
    cargo: {$push: "$cargo"}}}
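The $push accumulator can be imitated as follows. The sketch groups invented per-cargo documents by their _id and appends each cargo subdocument to an array; groupPushSketch is a hypothetical helper, not the server's implementation.

```javascript
// Illustrative stand-in for {$group: {_id: "$_id", cargo: {$push: "$cargo"}}}.
function groupPushSketch(docs) {
  const byShip = new Map();
  for (const doc of docs) {
    const key = JSON.stringify(doc._id);
    if (!byShip.has(key)) byShip.set(key, { _id: doc._id, cargo: [] });
    byShip.get(key).cargo.push(doc.cargo);
  }
  return [...byShip.values()];
}

const perCargo = [
  { _id: { ship: "Alpha" }, cargo: { type: "Iron", tons: 50, count: 2 } },
  { _id: { ship: "Alpha" }, cargo: { type: "Grain", tons: 15, count: 1 } },
  { _id: { ship: "Beta" }, cargo: { type: "Iron", tons: 40, count: 1 } },
];

const perShip = groupPushSketch(perCargo);
console.log(JSON.stringify(perShip));
```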
FINAL POLISHING
• Finally, let’s reshape our fields again with another $project stage.
var project2 = {$project: {
    _id: 0,
    ship: "$_id.ship",
    route: "$_id.route",
    location: "$_id.location",
    cargo: 1}}
SAVING THE RESULTS
• We can store the results in a new collection using the $out stage.
var out = {$out: "result"}
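Putting the pieces together, the whole demo is a single aggregate call with the stages in the order they were introduced. The sketch below repeats the stage definitions from the slides with straight quotes. The atlantic polygon is a placeholder, and the final find with the cargo type "Iron" is an assumed follow-up query suggested by the problem description, not something shown in the slides.

```javascript
// Placeholder area; the talk's real polygon is not shown in the slides.
const atlantic = {
  type: "Polygon",
  coordinates: [[[-75, 10], [-10, 10], [-10, 60], [-75, 60], [-75, 10]]],
};

const pipeline = [
  { $match: { location: { $geoWithin: { $geometry: atlantic } },
              "route.destination.Country": "United States" } },
  { $lookup: { from: "containers", as: "cargo",
               localField: "Name", foreignField: "shipName" } },
  { $unwind: "$cargo" },
  { $group: { _id: { ship: "$Name", cargo: "$cargo.cargo",
                     route: "$route", location: "$location" },
              sum: { $sum: "$cargo.Tons" }, count: { $sum: 1 } } },
  { $project: { _id: { ship: "$_id.ship", route: "$_id.route",
                       location: "$_id.location" },
                cargo: { type: "$_id.cargo", tons: "$sum", count: "$count" } } },
  { $group: { _id: "$_id", cargo: { $push: "$cargo" } } },
  { $project: { _id: 0, ship: "$_id.ship", route: "$_id.route",
                location: "$_id.location", cargo: 1 } },
  { $out: "result" },
];

// Assumed follow-up: ships carrying more than 100000 tons of iron.
const heavyIron = { cargo: { $elemMatch: { type: "Iron", tons: { $gt: 100000 } } } };

// `db` only exists inside the mongo shell, so guard the calls.
if (typeof db !== "undefined") {
  db.ships.aggregate(pipeline);
  printjson(db.result.find(heavyIron).toArray());
} else {
  console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
}
```

Note that $out replaces the target collection if it already exists, so the name result should not clash with a collection you want to keep.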
SHOW ME THE VOLUME!!
• Will it perform with a much larger volume? Let’s try with 5,000 ships and 21 million containers.
• Thanks to our step-by-step approach, we only need to build a new lookup stage.
var lookup2 = {"$lookup": {
    "from": "containers2",
    "as": "cargo",
    "localField": "Name",
    "foreignField": "shipName"}}
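With millions of containers, the join is likely to benefit from an index on the field that $lookup matches on. The slides do not show how the larger collection was indexed, so the following is an assumption rather than the setup used in the demo; createIndex is the standard shell helper.

```javascript
// Assumed index on the foreign field used by the $lookup stage.
const lookupIndex = { shipName: 1 };

// `db` only exists inside the mongo shell, so guard the call.
if (typeof db !== "undefined") {
  db.containers2.createIndex(lookupIndex);
} else {
  console.log("index to create on containers2:", JSON.stringify(lookupIndex));
}
```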
COMMON PIPELINE OPERATORS
• $match
‒ Filter documents
• $project
‒ Reshape documents
• $group
‒ Summarize documents
• $lookup
‒ Join two collections together
• $unwind
‒ Expand an array
• $out
‒ Write the results to a collection
• $sort
‒ Order documents
• $limit/$skip
‒ Paginate documents
• $facet
‒ Run multiple sub-pipelines on the same input
• $sample
‒ Sample random documents
• $bucket
‒ Group documents by range
• $redact
‒ Restrict document content
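As a small example of combining these operators, pagination is usually a $sort followed by $skip and $limit. The pageStages helper below is hypothetical, and the sort key is only an illustration.

```javascript
// Hypothetical helper: stages for page `page` (1-based) of `size` documents,
// ordered by `sortSpec`. $sort comes first so the pages are stable.
function pageStages(sortSpec, page, size) {
  return [
    { $sort: sortSpec },
    { $skip: (page - 1) * size },
    { $limit: size },
  ];
}

const thirdPage = pageStages({ Name: 1 }, 3, 20);
console.log(JSON.stringify(thirdPage));
```

The returned stages can be appended to any pipeline, for example after the last $project of the ships demo.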
SO… WHAT DO WE HAVE?
• Analytical capabilities: native, rich and performant.
• Simple architecture: no extra products, no data transfer.
• Workload isolation: secondary reads.
• Real-time data: replication lag typically under 1 second.
• High availability: native MongoDB replication and failover.
• Cost aligned with provided value: no extra servers or licenses.
AGGREGATION
FRAMEWORK
ANALYTICAL
SUPERPOWERS
QUESTIONS?
SAVE THE DATE
• Powerful Analysis with the Aggregation Pipeline. Speaker: Asya Kamsky.
• Doing Joins in MongoDB: Best Practices for Using $lookup. Speaker: Austin Zellner.