Using Aggregation for analytics

#MDBW17
Using Aggregation for Analytics
JUMPSTART SESSION

Senior Solutions Architect, MongoDB
RUBÉN TERCEÑO
@rubenTerceno

AGENDA
• My Problem (And Maybe Yours).
• In Search of a Solution.
• MongoDB Aggregation Framework.
• Let’s Crunch Some Numbers!

THAT MANAGER
• The CRM system is the main source of revenue information.
• Up-to-date information is critical for our business owners.
• Grouped data is much more valuable when making decisions.
• Graphs are a powerful means of presenting grouped information.

STEP 1: ANALYTICS ON THE OPERATIONAL DB
• Running analytics on your operational database.
‒ Analytical workload affects operational users.
o Lots of table scans and heavy counts and groups.

STEP 2: ETL AND OLAP
• ETLing your data into a dedicated analytical database.
‒ Longer time to react to business requests.
o Every change affects four systems.
‒ Lack of accuracy in real-time reports.
o Data synchronization happened overnight, so today’s report is based on yesterday’s data.

STEP 3: DEDICATED NICHE PRODUCTS
• Real-time data replication (CDC), embedded BI capabilities, dedicated hardware.
‒ New skills required in the team.
o Hardware, CDC, middleware, Java, UI.
‒ The solution’s reliability was low.
o Too many moving parts.
o Monitoring and debugging were complex.
‒ Cost was very high.
o More expensive than the CRM itself!

SO… WHAT DO WE NEED?
• Analytical capabilities.
• Simple Architecture.
• Workload isolation.
• Real time data.
• High Availability.
• Cost aligned with provided value.

MONGODB AGGREGATION FRAMEWORK
• A Series of Document Transformations.
‒ Executed in stages.
o Original input is a collection.
o Output of one stage sent as input of next.
o Output as a cursor or a collection.
• Rich Library of Functions.
‒ Filter, manipulate, group, join and summarize data.
• Optimized for performance.
‒ Full index support.
‒ Stages run in sequential order, and the optimizer reorders or combines stages where possible.
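
The idea that the output of one stage becomes the input of the next can be sketched in plain JavaScript, with no database involved. The stage functions (`matchStage`, `groupStage`) and the sample documents below are hypothetical stand-ins, not MongoDB APIs; they only mimic how a pipeline threads documents through its stages.

```javascript
// Hypothetical sample documents standing in for a collection.
const docs = [
  { ship: "A", tons: 10 },
  { ship: "B", tons: 5 },
  { ship: "A", tons: 7 },
];

// Each "stage" is a function from an array of documents to a new array.
const matchStage = (input) => input.filter((d) => d.tons > 6);
const groupStage = (input) => {
  const totals = new Map();
  for (const d of input) {
    totals.set(d.ship, (totals.get(d.ship) || 0) + d.tons);
  }
  return [...totals].map(([ship, tons]) => ({ _id: ship, tons }));
};

// Run the stages in order, feeding each output into the next stage.
const runPipeline = (input, stages) =>
  stages.reduce((current, stage) => stage(current), input);

const result = runPipeline(docs, [matchStage, groupStage]);
console.log(result); // [ { _id: 'A', tons: 17 } ]
```

Filtering first keeps the grouping step small, which is the same reason a $match usually goes at the start of a real pipeline.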

ARCHITECTURE – REPLICA SET
[Diagram: three mongod instances (port 27017) forming a replica set with one primary and two secondaries.]

ARCHITECTURE – DATA REPLICATION
[Diagram: replica set with one primary and two secondaries (mongod, port 27017), showing oplog replication.]

ARCHITECTURE – FAILOVER
[Diagram: replica set with one primary and two secondaries (mongod, port 27017), showing oplog replication and heartbeats.]

ARCHITECTURE – ANALYTICAL WORKLOAD
[Diagram: replica set with one primary and two secondaries (mongod, port 27017), showing oplog replication and heartbeats.]

ARCHITECTURE – SECONDARY READS
[Diagram: replica set with one primary and two secondaries (mongod, port 27017), showing oplog replication and heartbeats.]
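
One way to send the analytical workload to a secondary is to set a read preference on the connection used for reporting. The sketch below only builds a connection string; the host names, the replica set name `rs0` and the database name `shipping` are hypothetical, while `replicaSet` and `readPreference` are standard connection string options.

```javascript
// Hypothetical replica set members; replace with your own hosts.
const hosts = [
  "node1.example.net:27017",
  "node2.example.net:27017",
  "node3.example.net:27017",
];

const options = new URLSearchParams({
  replicaSet: "rs0",           // assumed replica set name
  readPreference: "secondary", // route reads, including aggregations, to secondaries
});

// "shipping" is an assumed database name.
const analyticsUri = `mongodb://${hosts.join(",")}/shipping?${options}`;
console.log(analyticsUri);
```

Operational clients keep the default read preference (primary), so the two workloads land on different members of the replica set.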

LIVE AGGREGATION FRAMEWORK DEMO
• Fingers crossed!!

PROBLEM DESCRIPTION
• A database containing the biggest ships out there and, in a different collection, their containers (shipping containers, not Docker).
• Cargo information is stored at container level, but we need it at ship level, where information such as the destination sits.
• We want to know the cargo of each ship so that we can find things like all ships currently in the North Atlantic, arriving in the US, with more than 100,000 metric tons of iron.

BUILDING THE AGGREGATION STEP BY STEP
• We’ll create one variable for every stage of the aggregation pipeline, so we can easily build and test it.
var myMatch = {/* some JSON */};
var myGroup = {/* other JSON */};
var mySort = {/* more JSON */};
db.ships.aggregate([myMatch, myGroup, mySort])
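
Keeping each stage in its own variable also makes it easy to run the pipeline one stage longer at a time. The sketch below is one possible way to do that; the stage contents and the `type` field are placeholders, and the `db` branch only runs inside the mongo shell, where `db` and `printjson` exist.

```javascript
// Placeholder stages; the real ones are built step by step in the demo.
const myMatch = { $match: { type: "demo" } };
const myGroup = { $group: { _id: "$type", count: { $sum: 1 } } };
const mySort = { $sort: { count: -1 } };

const stages = [myMatch, myGroup, mySort];

// Build [[s1], [s1, s2], [s1, s2, s3]] so each prefix can be run on its own.
const prefixes = stages.map((_, i) => stages.slice(0, i + 1));

prefixes.forEach((pipeline) => {
  if (typeof db !== "undefined") {
    // mongo shell: execute the partial pipeline and inspect its output.
    printjson(db.ships.aggregate(pipeline).toArray());
  } else {
    // Outside the shell: just show which stages would run.
    console.log(JSON.stringify(pipeline));
  }
});
```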

ALL SHIPS WITHIN THE NORTH ATLANTIC
• Our first stage is a $match, which lets us filter the vessels. Let’s find all ships in the North Atlantic going to US ports.
// atlantic holds a GeoJSON polygon (not shown on the slide)
var match = {$match: {
  location: {$geoWithin: {$geometry: atlantic}},
  "route.destination.Country": "United States"}}

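The `atlantic` variable used in the $match stage is not defined on the slide. A minimal sketch, assuming a very rough rectangular outline of the North Atlantic (the coordinates are illustrative, not the ones used in the demo), could look like this:

```javascript
// Hypothetical, very rough outline of the North Atlantic as a GeoJSON Polygon.
// Coordinates are [longitude, latitude]; the ring must end where it starts.
const atlantic = {
  type: "Polygon",
  coordinates: [[
    [-80, 0], [-10, 0], [-10, 65], [-80, 65], [-80, 0],
  ]],
};

// The $match stage from the slide, using the polygon above.
const match = {
  $match: {
    location: { $geoWithin: { $geometry: atlantic } },
    "route.destination.Country": "United States",
  },
};

console.log(JSON.stringify(match, null, 2));
```

$geoWithin works without a geospatial index, although an index on `location` can speed up the filter.
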
FINDING THE CONTAINERS OF EACH SHIP
• The containers are in a different collection. To find the containers of each ship, let’s join both collections. The $lookup stage allows us to do this.
var lookup = {$lookup: {
  from: "containers",
  as: "cargo",
  localField: "Name",
  foreignField: "shipName"}}
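
To see what this $lookup produces, here is a plain JavaScript emulation of the same equality join. It is only an illustration: the sample ships and containers are made up, and only the field names (`Name`, `shipName`, `cargo`, `Tons`) come from the slides.

```javascript
// Hypothetical sample data using the field names from the slides.
const ships = [{ Name: "Ship A" }, { Name: "Ship B" }, { Name: "Ship C" }];
const containers = [
  { shipName: "Ship A", cargo: "Iron", Tons: 20 },
  { shipName: "Ship A", cargo: "Coal", Tons: 15 },
  { shipName: "Ship B", cargo: "Iron", Tons: 30 },
];

// Attach to each ship the containers whose shipName equals the ship's Name,
// under a new array field called "cargo" (the "as" option of $lookup).
const joined = ships.map((ship) => ({
  ...ship,
  cargo: containers.filter((c) => c.shipName === ship.Name),
}));

console.log(JSON.stringify(joined, null, 2));
```

Like $lookup, this behaves as a left outer join: "Ship C" has no containers, so it is kept with an empty `cargo` array.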

MANIPULATING THE ARRAY
• That huge array is not going to be usable, so let’s transform it into something easier to handle. The $unwind stage outputs one document per array element.
var unwind = {$unwind: "$cargo"}
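
The effect of $unwind on the joined documents can be emulated in plain JavaScript as follows. The input documents are hypothetical and the code only mirrors the stage's default behaviour.

```javascript
// Hypothetical output of the previous $lookup stage.
const joined = [
  {
    Name: "Ship A",
    cargo: [
      { cargo: "Iron", Tons: 20 },
      { cargo: "Coal", Tons: 15 },
    ],
  },
  { Name: "Ship B", cargo: [] },
];

// One output document per array element; the array field is replaced
// by the single element it came from.
const unwound = joined.flatMap((ship) =>
  ship.cargo.map((item) => ({ ...ship, cargo: item }))
);

console.log(JSON.stringify(unwound, null, 2));
```

"Ship B" disappears from the output because its array is empty, which is also what $unwind does by default (the `preserveNullAndEmptyArrays` option changes that).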

GROUPING BY SHIP AND CARGO TYPE
• This stage groups the individual documents by ship and cargo type, counting the containers and adding up the tons for each combination.
var group = {$group: {
  _id: {ship: "$Name",
        cargo: "$cargo.cargo",
        route: "$route",
        location: "$location"},
  sum: {$sum: "$cargo.Tons"},
  count: {$sum: 1}}}
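
As a rough illustration of what this $group computes, the sketch below performs the same grouping in plain JavaScript. The sample documents are hypothetical, and `route` and `location` are left out of the key for brevity.

```javascript
// Hypothetical output of the $unwind stage: one document per container.
const unwound = [
  { Name: "Ship A", cargo: { cargo: "Iron", Tons: 20 } },
  { Name: "Ship A", cargo: { cargo: "Iron", Tons: 5 } },
  { Name: "Ship A", cargo: { cargo: "Coal", Tons: 15 } },
];

const groups = new Map();
for (const doc of unwound) {
  const id = { ship: doc.Name, cargo: doc.cargo.cargo };
  const key = JSON.stringify(id); // compound grouping key
  const g = groups.get(key) || { _id: id, sum: 0, count: 0 };
  g.sum += doc.cargo.Tons; // mirrors {$sum: "$cargo.Tons"}
  g.count += 1;            // mirrors {$sum: 1}
  groups.set(key, g);
}

const grouped = [...groups.values()];
console.log(JSON.stringify(grouped, null, 2));
```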

MANIPULATING THE FIELD NAMES
• It’s possible to change the shape of our documents at any moment thanks to the $project stage. Let’s put the cargo info in a subdocument.
var project = {$project: {
  _id: {ship: "$_id.ship", route: "$_id.route",
        location: "$_id.location"},
  cargo: {type: "$_id.cargo",
          tons: "$sum",
          count: "$count"}}}

GROUPING BY SHIP
• And now let’s group again, this time only by ship. The different cargos of each ship will be pushed into a newly created array of documents.
var group2 = {$group: {
  _id: "$_id",
  cargo: {$push: "$cargo"}}}

FINAL POLISHING
• Finally, let’s reorder our fields again with another $project stage.
var project2 = {$project: {
  _id: 0,
  ship: "$_id.ship",
  route: "$_id.route",
  location: "$_id.location",
  cargo: 1}}

SAVING THE RESULTS
• We can store the results in a new collection using the $out stage.
var out = {$out: "result"}
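
Once the results are saved, the original question (which ships carry more than 100,000 tons of iron) becomes a simple query on the new collection. The filter below is a sketch that assumes the document shape produced by the previous stages and a cargo type value of "Iron"; the sample documents are made up so the same condition can be checked in plain JavaScript.

```javascript
// Filter for the mongo shell: db.result.find(ironFilter)
const ironFilter = {
  cargo: { $elemMatch: { type: "Iron", tons: { $gt: 100000 } } },
};

// Hypothetical documents shaped like the pipeline output.
const results = [
  { ship: "Ship A", cargo: [{ type: "Iron", tons: 120000, count: 6000 }] },
  {
    ship: "Ship B",
    cargo: [
      { type: "Iron", tons: 40000, count: 2000 },
      { type: "Coal", tons: 150000, count: 7000 },
    ],
  },
];

// The same condition in plain JavaScript: one array element must
// satisfy both the type and the tonnage test.
const heavyIron = results.filter((doc) =>
  doc.cargo.some((c) => c.type === "Iron" && c.tons > 100000)
);

console.log(heavyIron.map((d) => d.ship)); // [ 'Ship A' ]
```

$elemMatch matters here: without it, "Ship B" would match, because its iron and its 150,000 tons of coal sit in different array elements.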

SHOW ME THE VOLUME!!
• Will it perform with a much larger volume? Let’s try with 5,000 ships and 21 million containers.
• Thanks to our step-by-step approach, we only need to build a new $lookup stage.
var lookup2 = {"$lookup": {
  "from": "containers2",
  "as": "cargo",
  "localField": "Name",
  "foreignField": "shipName"}}

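With 21 million containers, every ship's $lookup is an equality match on `shipName`. An index on that field is the usual way to keep such a lookup from scanning the whole collection. The slides do not show whether the demo created one, so the sketch below is an assumption; the `db` branch only runs inside the mongo shell.

```javascript
// Index specification on the join key of the large containers collection.
const lookupIndex = { shipName: 1 };

if (typeof db !== "undefined") {
  // mongo shell: create the index used by the $lookup equality match.
  db.containers2.createIndex(lookupIndex);
} else {
  // Outside the shell: just print the specification.
  console.log("index spec:", JSON.stringify(lookupIndex));
}
```
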
COMMON PIPELINE OPERATORS
• $match
‒ Filter documents
• $project
‒ Reshape documents
• $group
‒ Summarize documents
• $lookup
‒ Join two collections together
• $unwind
‒ Expand an array
• $out
‒ Create new collections
• $sort
‒ Order documents
• $limit/$skip
‒ Paginate documents
• $facet
‒ Run multiple sub-pipelines on the same input
• $sample
‒ Sample random documents
• $bucket
‒ Group documents by range
• $redact
‒ Restrict document content

SO… WHAT DO WE HAVE?
• Analytical capabilities. Native, rich and performant.
• Simple Architecture. No extra products, no data transfer.
• Workload isolation. Secondary reads.
• Real time data. Replication lag typically under 1 sec.
• High Availability. Native MongoDB replication and failover.
• Cost aligned with provided value. No extra servers or licenses.

SAVE THE DATE
• Powerful Analysis with the Aggregation Pipeline. Speaker: Asya Kamsky.
• Doing Joins in MongoDB: Best Practices for Using $lookup. Speaker: Austin Zellner.
