What problem are we solving?
• Map/Reduce can be used for aggregation…
  • Currently being used for totaling, averaging, etc.
• Map/Reduce is a big hammer
  • Simpler tasks should be easier
    • Shouldn’t need to write JavaScript
    • Avoid the overhead of the JavaScript engine
• We’re seeing requests for help in handling
  complex documents
  • Select only matching subdocuments or arrays
How will we solve the problem?
• Our new aggregation framework
  • Declarative framework
    • No JavaScript required
  • Describe a chain of operations to apply
  • Expression evaluation
    • Return computed values
  • Framework: we can add new operations easily
  • C++ implementation
    • Higher performance than JavaScript
Aggregation - Pipelines
• Aggregation requests specify a pipeline
• A pipeline is a series of operations
• Conceptually, the members of a collection
  are passed through a pipeline to produce a
  result
  • Similar to a command-line pipe
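The command-line pipe analogy can be sketched in plain JavaScript. This is only an illustration of the idea, not how the server implements it: the stage functions, the `runPipeline` helper, and the sample documents are all hypothetical.

```javascript
// Illustrative only: each "stage" is a plain function over an array of
// documents, standing in for a real pipeline operation.
const stages = [
  docs => docs.filter(d => d.status === "A"),   // plays the role of a filter
  docs => docs.map(d => ({ name: d.name })),    // plays the role of a reshaping step
];

// Feed the output of each stage into the next, like `a | b | c` in a shell.
function runPipeline(docs, pipeline) {
  return pipeline.reduce((current, stage) => stage(current), docs);
}

// Hypothetical collection members.
const people = [
  { name: "ann", status: "A" },
  { name: "bob", status: "B" },
];

const result = runPipeline(people, stages);
console.log(result);
```

Each stage sees only what the previous stage emitted, which is why the order of operations in a pipeline matters.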
Pipeline Operations
• $match
  • Uses a query predicate (like .find({…})) as a filter
• $project
  • Uses a sample document to determine the shape
    of the result (similar to .find()’s optional argument)
    • This can include computed values
• $unwind
  • Hands out array elements one at a time
• $group
  • Aggregates items into buckets defined by a key
Pipeline Operations (continued)
• $sort
  • Sort documents
• $limit
  • Only allow the specified number of documents to
    pass
• $skip
  • Skip over the specified number of documents
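As a rough sketch of how these operations combine, a pipeline is written as an array of single-key documents, one per operation. The field names (`status`, `score`) and the values are assumptions made up for the example.

```javascript
// A hypothetical pipeline: filter, order, then page through the results.
const pipeline = [
  { $match: { status: "A" } },   // keep only matching documents
  { $sort: { score: -1 } },      // highest score first
  { $skip: 10 },                 // drop the first 10 documents
  { $limit: 5 },                 // let only the next 5 pass
];

// Each stage document has exactly one key: the operation name.
const stageNames = pipeline.map(stage => Object.keys(stage)[0]);
console.log(stageNames.join(" | "));
```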
Projections
• $project can reshape results
  • Include or exclude fields
  • Computed fields
    • Arithmetic expressions, including built-in functions
    • Pull fields from nested documents to the top
    • Push fields from the top down into new virtual
      documents
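A possible $project stage covering these cases might look like the sketch below. The field names are hypothetical, and the `reshape` function is only a hand-written JavaScript equivalent to show the resulting shape; it is not part of the framework.

```javascript
// Hypothetical $project stage.
const projectStage = {
  $project: {
    title: 1,                              // include a field as-is
    city: "$address.city",                 // pull a nested field up to the top
    stats: { views: "$pageViews" },        // push a top-level field into a new subdocument
    total: { $add: ["$price", "$tax"] },   // computed field
  },
};

// Hand-written equivalent of the stage above for a single document.
// _id is carried along because $project includes it unless it is excluded.
function reshape(doc) {
  return {
    _id: doc._id,
    title: doc.title,
    city: doc.address.city,
    stats: { views: doc.pageViews },
    total: doc.price + doc.tax,
  };
}

const reshaped = reshape({
  _id: 1,
  title: "t",
  address: { city: "SF" },
  pageViews: 7,
  price: 10,
  tax: 2,
});
console.log(reshaped);
```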
Unwinding
• $unwind can “stream” arrays
  • Array values are doled out one at a time in the
    context of their surrounding documents
  • Makes it possible to filter out elements before
    returning
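The effect of $unwind can be approximated in plain JavaScript as follows. The `unwind` helper and the sample documents are hypothetical, and treating a missing or non-array field as empty is a simplification chosen for the sketch.

```javascript
// The stage as it would appear in a pipeline (field name is hypothetical).
const unwindStage = { $unwind: "$tags" };

// Emit one copy of the surrounding document per array element.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    // Simplification: anything that is not an array yields no output.
    const values = Array.isArray(doc[field]) ? doc[field] : [];
    for (const value of values) {
      out.push({ ...doc, [field]: value });
    }
  }
  return out;
}

const posts = [
  { _id: 1, title: "a", tags: ["x", "y"] },
  { _id: 2, title: "b", tags: [] },
];

const unwound = unwind(posts, "tags");
console.log(unwound);
```

Once the elements are separate documents, a following $match can filter them individually.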
Grouping
• $group aggregation expressions
  • Define a grouping key as the _id of the result
  • Total grouped column values: $sum
  • Average grouped column values: $avg
  • Collect grouped column values in an array or set:
    $push, $addToSet
  • Other functions
    • $min, $max, $first, $last
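A $group stage using these functions might be written as below. The field names are made up, and `groupByAuthor` is a hypothetical plain-JavaScript stand-in that mimics the bucketing for this one stage; it is not the server's algorithm.

```javascript
// Hypothetical $group stage: one result document per author.
const groupStage = {
  $group: {
    _id: "$author",                    // grouping key
    totalViews: { $sum: "$views" },
    avgViews: { $avg: "$views" },
    titles: { $push: "$title" },
  },
};

// Plain-JavaScript approximation of the stage above.
function groupByAuthor(docs) {
  const buckets = new Map();
  for (const doc of docs) {
    if (!buckets.has(doc.author)) {
      buckets.set(doc.author, { _id: doc.author, totalViews: 0, count: 0, titles: [] });
    }
    const bucket = buckets.get(doc.author);
    bucket.totalViews += doc.views;
    bucket.count += 1;
    bucket.titles.push(doc.title);
  }
  // Drop the helper counter and derive the average from it.
  return [...buckets.values()].map(({ count, ...rest }) => ({
    ...rest,
    avgViews: rest.totalViews / count,
  }));
}

const articles = [
  { author: "ann", title: "t1", views: 10 },
  { author: "ann", title: "t2", views: 30 },
  { author: "bob", title: "t3", views: 5 },
];

const grouped = groupByAuthor(articles);
console.log(grouped);
```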
Sorting
• $sort can sort documents
  • Sort specifications are the same as today, e.g.,
    $sort:{ key1: 1, key2: -1, …}
Computed Expressions
• Available in $project operations
• Prefix expression language
  • Add two fields: $add:["$field1", "$field2"]
  • Provide a value for a missing field:
    $ifNull:["$field1", "$field2"]
  • Nesting: $add:["$field1", {$ifNull:["$field2",
    "$field3"]}]
  • Other functions….
    • And we can easily add more as required
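To make the prefix notation concrete, here is a toy evaluator for just $add and $ifNull. It is a sketch under simplifying assumptions (top-level field names only, no dotted paths), and the `evaluate` function is hypothetical rather than anything the framework exposes.

```javascript
// Evaluate a prefix expression against one document.
function evaluate(expr, doc) {
  // "$name" refers to a field of the current document.
  if (typeof expr === "string" && expr.startsWith("$")) {
    return doc[expr.slice(1)];
  }
  // { $op: [args...] } applies an operator to its evaluated arguments.
  if (expr !== null && typeof expr === "object" && !Array.isArray(expr)) {
    const [op] = Object.keys(expr);
    const args = expr[op].map(arg => evaluate(arg, doc));
    switch (op) {
      case "$add":
        return args.reduce((sum, n) => sum + n, 0);
      case "$ifNull":
        return args[0] === null || args[0] === undefined ? args[1] : args[0];
      default:
        throw new Error("unsupported operator: " + op);
    }
  }
  // Anything else is a literal value.
  return expr;
}

// Nested expression: add field1 to field2, falling back to field3.
const nested = { $add: ["$field1", { $ifNull: ["$field2", "$field3"] }] };

const withFallback = evaluate(nested, { field1: 2, field3: 5 });
const withoutFallback = evaluate(nested, { field1: 2, field2: 3, field3: 5 });
console.log(withFallback, withoutFallback);
```

Because every operator has the same shape, supporting a new function in this sketch means adding one more `case`.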
Computed Expressions (continued)
• String functions
  • $toUpper, $toLower, $substr
• Date field extraction
  • Get year, month, day, hour, etc., from ISODate
• Date arithmetic
• Null value substitution (like MySQL ifnull(),
  Oracle nvl())
• Ternary conditional
  • Return one of two values based on a predicate
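A $project stage combining several of these functions could look like the following sketch. The field names (`name`, `joined`, `nickname`, `score`) and the threshold are assumptions invented for the example.

```javascript
// Hypothetical $project stage using string, date, null-substitution,
// and conditional expressions.
const computedStage = {
  $project: {
    upperName: { $toUpper: "$name" },                 // string function
    yearJoined: { $year: "$joined" },                 // date field extraction
    nickname: { $ifNull: ["$nickname", "$name"] },    // null value substitution
    // Ternary conditional: [predicate, value if true, value if false]
    tier: { $cond: [{ $gte: ["$score", 100] }, "gold", "standard"] },
  },
};

const computedFields = Object.keys(computedStage.$project);
console.log(computedFields);
```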
Demo
Demo files are at https://gist.github.com/1401585
Usage Tips
• Use $match in a pipeline as early as possible
  • The query optimizer can then choose to scan an
    index and avoid scanning the entire collection
• Use $sort in a pipeline as early as possible
  • The query optimizer can then be used to choose
    an index to scan instead of sorting the result
Driver Support
• Initial version is a command
  • For any language, build a JSON database object,
    and execute the command
    • In the shell: db.runCommand({ aggregate :
      "<collection-name>", pipeline : […] });
  • Beware of command result size limit
    • Document size limit is 16MB
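A command document for the shell might be built as in this sketch. The collection name `articles` and the pipeline contents are hypothetical; the `db.runCommand` call is left as a comment because it needs a live server.

```javascript
// Hypothetical command document: the collection name is a string and the
// pipeline is an array of stage documents.
const command = {
  aggregate: "articles",
  pipeline: [
    { $match: { author: "ann" } },
    { $group: { _id: "$author", total: { $sum: "$views" } } },
  ],
};

// In the mongo shell this would be executed with:
//   db.runCommand(command);
// The whole result comes back in a single document, so it is subject to
// the 16MB document size limit.
console.log(JSON.stringify(command, null, 2));
```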
Sharding support
• Initial release will support sharding
• Mongos analyzes pipeline, and forwards
  operations up to $group or $sort to shards;
  combines shard server results and returns
  them
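One way to picture the split is the sketch below. It assumes that "up to" means the stages before the first $group or $sort go to the shards and the rest is handled when results are combined; the real mongos logic may divide the work differently. The `splitForShards` helper is hypothetical.

```javascript
// Split a pipeline at the first $group or $sort stage (assumed boundary).
function splitForShards(pipeline) {
  const isSplitPoint = stage => "$group" in stage || "$sort" in stage;
  const index = pipeline.findIndex(isSplitPoint);
  const cut = index === -1 ? pipeline.length : index;
  return {
    shardPart: pipeline.slice(0, cut),   // forwarded to each shard
    mergePart: pipeline.slice(cut),      // applied to the combined results
  };
}

const shardedPipeline = [
  { $match: { status: "A" } },
  { $unwind: "$tags" },
  { $group: { _id: "$tags", count: { $sum: 1 } } },
  { $limit: 10 },
];

const split = splitForShards(shardedPipeline);
console.log(split.shardPart.length, split.mergePart.length);
```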
When is this being released?
• In final development now
  • Adding an explain facility
• Expect to see this in the near future
Future Plans
• More optimizations
• $out pipeline operation
  • Saves the document stream to a collection
  • Similar to the M/R out option, but with sharded output
  • Functions like a tee, so that intermediate results
    can be saved
mongodb-aggregation-may-2012
