BUILDING A MISSION CRITICAL EVENT SYSTEM ON TOP OF MONGODB
by @shahar_kedar
BIGPANDA
SaaS platform that lets companies aggregate alerts
from all their monitoring systems into one place for
faster incident discovery and response.
HOW IT WORKS
Events
• High CPU on prod-srv-1, 18/06/14 16:05, CRITICAL
• High CPU on prod-srv-1, 18/06/14 16:07, WARNING
• Memory usage on prod-srv-1, 18/06/14 16:08, CRITICAL
Entities
• High CPU on prod-srv-1: WARNING
• Memory usage on prod-srv-1: CRITICAL
Incidents
• 2 Alerts on prod-srv-1
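A minimal in-memory sketch of this flow, assuming (hypothetically) that an entity is keyed by host plus check name and takes the status of its latest event, and that all entities on the same host form one incident. The real correlation rules are not described in the slides; `correlate` and the `host`/`check` fields are illustrative names only.

```javascript
// Hypothetical correlation sketch: events -> entities -> incidents.
// Assumption: an entity is identified by (host, check), and one incident
// groups every entity on the same host. Not BigPanda's actual algorithm.
function correlate(events) {
  const entities = new Map(); // "host|check" -> entity

  // Process events in time order so the latest event sets the entity status.
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
  for (const ev of sorted) {
    const key = `${ev.host}|${ev.check}`;
    let entity = entities.get(key);
    if (!entity) {
      entity = { host: ev.host, check: ev.check, start: ev.timestamp, events: [] };
      entities.set(key, entity);
    }
    entity.events.push(ev); // events are embedded and never modified
    entity.status = ev.status; // current state = status of the latest event
  }

  // Group entities into incidents, copying only the check name and status.
  const incidents = new Map(); // host -> incident
  for (const entity of entities.values()) {
    if (!incidents.has(entity.host)) {
      incidents.set(entity.host, { host: entity.host, entities: [] });
    }
    incidents.get(entity.host).entities.push({ check: entity.check, status: entity.status });
  }
  for (const incident of incidents.values()) {
    incident.description = `${incident.entities.length} Alerts on ${incident.host}`;
  }
  return { entities: [...entities.values()], incidents: [...incidents.values()] };
}

// The three events from the slide (18 June 2014).
const sampleEvents = [
  { host: 'prod-srv-1', check: 'High CPU', status: 'CRITICAL', timestamp: new Date('2014-06-18T16:05:00') },
  { host: 'prod-srv-1', check: 'High CPU', status: 'WARNING', timestamp: new Date('2014-06-18T16:07:00') },
  { host: 'prod-srv-1', check: 'Memory usage', status: 'CRITICAL', timestamp: new Date('2014-06-18T16:08:00') },
];

const result = correlate(sampleEvents);
console.log(result.incidents[0].description); // 2 Alerts on prod-srv-1
```

The two CPU events collapse into one WARNING entity, the memory event becomes a CRITICAL entity, and both land in a single incident, matching the slide.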
PRODUCT REQUIREMENTS
• Events need to be processed into incidents and streamed to the user’s browser as fast as possible
• Incidents need to reliably reflect the current state in the monitoring system
• The service has to be up and running 24/7
MISSION CRITICAL
• It’s not rocket science, it’s not Google, but:
  • It has to be super fast
  • It has to be extremely reliable
  • It has to always be available
OUR #1 COMPETITOR
WHY MONGO?
BECAUSE IT’S WEB SCALE!
WHY MONGO?
At first:
• Node.js shop
• Schemaless
• Easy to master
Later on:
• Reliable
• Easy to evolve
• Partial and atomic updates
• Powerful query language
BECAUSE IT’S WEB SCALE!
SUPER FAST
Hardware
Schema Design
Lean & Stream
HARDWARE
• 03/13: 3 x m1.medium
• 02/14: 1 x i2.xlarge + 2 x m1.medium
• 06/14: 2 x i2.xlarge + 1 x m3.xlarge
• m1.medium: 1 vCPU, 3.75GB RAM, EBS drive
• m3.xlarge: 4 vCPUs, 15GB RAM, EBS drive
• i2.xlarge: 4 vCPUs, 30.5GB RAM, 800GB SSD
x3 reads, x4 writes
“Schema design is … the largest factor when it comes to performance and scalability … more important than hardware, how you shard, or anything else, schema is by far the most important thing.”
–Eliot Horowitz
SCHEMA DESIGN
Event
{
  timestamp: Date,
  status: String,
  description: String
}

Entity
{
  start: Date,
  end: Date,
  status: String,
  description: String,
  events: [ <embedded> ],
  source_system: String
}

Incident
{
  start: Date,
  end: Date,
  is_active: Boolean,
  description: String,
  entities: [
    { entityId: ObjectId, status: String }
  ]
}
DENORMALIZATION
• Go over the checklist (http://bit.ly/1vUdz2T)
• Incidents => Entities: partially embedded (status) + reference (entityId)
  • Cardinality: one-to-few
  • Entities are accessed directly
  • Entities are frequently updated
• Entities => Events: embedded
  • Events are not accessed directly
  • Events are immutable
  • Cardinality: one-to-many ~ one-to-gazillion
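With this layout, recording a new event is a single-document update on the entity, and keeping the incident's embedded copy of the entity status in sync is a second update that uses the positional `$` operator. The sketch below only builds the filter/update documents, so it runs without a database; the helper names are hypothetical, and the optional `maxEvents` cap (via `$each` + `$slice`) is an assumption added to show one way to bound the embedded array, not something the slides describe.

```javascript
// Hypothetical helpers that build MongoDB filter/update documents for the
// Entity and Incident shapes above; they only construct plain objects.

// Append an immutable event to an entity and refresh the entity's status.
// $push appends to the embedded events array; $set touches only `status`.
// Optional maxEvents (assumption, not from the slides) uses $each + $slice
// to keep only the newest N events so the document cannot grow without bound.
function appendEventOps(entityId, event, maxEvents) {
  const capped = Number.isInteger(maxEvents) && maxEvents > 0;
  const push = capped
    ? { events: { $each: [event], $slice: -maxEvents } }
    : { events: event };
  return {
    filter: { _id: entityId },
    update: { $push: push, $set: { status: event.status } },
  };
}

// Sync the incident's embedded copy of one entity's status.
// The filter matches the array element by entityId; the positional
// operator in 'entities.$.status' then updates only that element.
function syncIncidentOps(incidentId, entityId, status) {
  return {
    filter: { _id: incidentId, 'entities.entityId': entityId },
    update: { $set: { 'entities.$.status': status } },
  };
}
```

Each `{ filter, update }` pair would be handed to a call such as the Node.js driver's `collection.updateOne(filter, update)`. That gives two independently atomic single-document updates, not one transaction spanning the entity and the incident. The cap matters because MongoDB limits a document to 16 MB, and a one-to-gazillion embedded array would otherwise keep growing.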
INDEXES
• Optimized indexes, checking query plans with explain():
  db.collection.find({..}).explain()
• Removed redundant indexes
• Truncated the events collections with a TTL index, so old events expire automatically
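A common source of redundant indexes is the compound-index prefix rule: a query that can use `{ a: 1 }` can also use a compound index that starts with `{ a: 1, ... }`. The sketch below flags such prefix indexes from a list shaped like the output of `db.collection.getIndexes()` in the mongo shell. It is deliberately conservative and ignores index options (`unique`, `sparse`, `expireAfterSeconds`, partial filters) that can make a prefix index necessary, so its output is a list of candidates to confirm with `explain()`, not indexes to drop blindly. The function name and the example index names are hypothetical.

```javascript
// Flag indexes whose key pattern is a strict prefix of another index's key
// pattern (same fields, same order, same directions). Conservative sketch:
// index options (unique, sparse, partial, TTL) are not considered.
function findPrefixRedundantIndexes(indexes) {
  const redundant = [];
  for (const idx of indexes) {
    if (idx.name === '_id_') continue; // the _id index cannot be dropped
    const keys = Object.entries(idx.key);
    const coveredBy = indexes.find((other) => {
      if (other === idx) return false;
      const otherKeys = Object.entries(other.key);
      return (
        otherKeys.length > keys.length &&
        keys.every(([field, dir], i) => otherKeys[i][0] === field && otherKeys[i][1] === dir)
      );
    });
    if (coveredBy) redundant.push({ name: idx.name, coveredBy: coveredBy.name });
  }
  return redundant;
}

// Hypothetical index list for an entities collection.
const exampleIndexes = [
  { name: '_id_', key: { _id: 1 } },
  { name: 'status_1', key: { status: 1 } },
  { name: 'status_1_start_-1', key: { status: 1, start: -1 } },
  { name: 'start_-1', key: { start: -1 } },
];

// Flags status_1 (a prefix of status_1_start_-1); start_-1 is kept because
// `start` is not the leading field of the compound index.
console.log(findPrefixRedundantIndexes(exampleIndexes));
```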
LEAN QUERIES
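• Use projections to limit the fields returned by a query:
  Model.find().select('-events')
• Mongoose users: use .lean() where possible (it returns plain JavaScript objects instead of full Mongoose documents) for a performance boost of more than 50%:
  Model.find().lean()
• Stream results instead of loading them all into memory:
  Model.find().stream().on('data', function (doc) { /* handle each document as it arrives */ })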
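The benefit of streaming is that each document can be handled (for example, pushed to the browser) as soon as it arrives, instead of buffering the whole result set. The sketch below shows that consumption pattern with a plain Node.js object-mode stream standing in for the query stream, so it runs without Mongoose or a database. `streamToClient` and the `send` callback are hypothetical names, and a production handler would also need to respect backpressure from the client connection.

```javascript
const { Readable } = require('stream');

// Consume a stream of documents one at a time and forward each to `send`
// (e.g. a WebSocket write). Resolves with the number of documents sent once
// the stream ends, and rejects if the stream emits an error.
function streamToClient(source, send) {
  return new Promise((resolve, reject) => {
    let count = 0;
    source.on('data', (doc) => {
      send(doc);
      count += 1;
    });
    source.on('error', reject);
    source.on('end', () => resolve(count));
  });
}

// Stand-in for a Mongoose query stream such as Model.find().stream():
// Readable.from() emits each array element as one object-mode chunk.
const fakeQueryStream = () =>
  Readable.from([
    { _id: 1, status: 'WARNING' },
    { _id: 2, status: 'CRITICAL' },
  ]);

streamToClient(fakeQueryStream(), (doc) => console.log('sent', doc._id))
  .then((n) => console.log(`streamed ${n} documents`));
```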

RESULTS
• Average latency of all API calls went from 500ms to under 20ms
• Average latency of the full pipeline went from 2s to under 500ms
• Peak-time latency of the full pipeline went down from 5 minutes (!!) to less than 30s
EXTREMELY RELIABLE
Atomic & Partial Updates
ATOMIC & PARTIAL UPDATES
• Several services might try to update the same document at the same time, but:
  • Different systems update different parts of the document
  • Updates to the same document are sharded and ordered at the application level
(read our awesome blog post: http://bit.ly/1nQVcbS)
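One way to get per-document ordering at the application level is to route every update for a given document ID to the same partition and apply each partition's updates strictly one after another, while updates to different documents still run in parallel. The sketch below is an in-process illustration of that idea using a hypothetical `PartitionedUpdater` class and a simple string hash. It is not necessarily how BigPanda implements it (the blog post linked above describes their approach); in a multi-process deployment the partitioning would more likely live in a message queue. Each `apply` callback would typically issue a partial `$set` on only the fields its service owns.

```javascript
// Route each document's updates to a fixed partition and run each partition
// serially, so updates to the same document are applied in submission order.
class PartitionedUpdater {
  constructor(partitions) {
    // One promise chain per partition; each new task waits for the previous.
    this.tails = Array.from({ length: partitions }, () => Promise.resolve());
  }

  // Simple deterministic string hash (base 31, kept in 32-bit range).
  partitionFor(docId) {
    let h = 0;
    for (const ch of String(docId)) h = (h * 31 + ch.charCodeAt(0)) | 0;
    return Math.abs(h) % this.tails.length;
  }

  // `apply` is an async function performing the actual (partial) update.
  submit(docId, apply) {
    const p = this.partitionFor(docId);
    const run = this.tails[p].then(() => apply());
    // Keep the chain alive even if one update fails; the caller still sees
    // the failure through the returned promise.
    this.tails[p] = run.catch(() => {});
    return run;
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const updater = new PartitionedUpdater(4);
const applied = [];
// Three updates to the same entity: the first is the slowest, yet it still
// completes before the second one starts.
const done = [30, 10, 0].map((delay, i) =>
  updater.submit('entity-42', async () => {
    await sleep(delay);
    applied.push(i);
  })
);
Promise.all(done).then(() => console.log(applied)); // [ 0, 1, 2 ]
```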
IMPOSSIBLE TO KILL
Replica Set
Disaster Recovery
REPLICA SET
• 3-node replica set
• Member priorities ensure that the stronger nodes are elected primary
• Members deployed in different availability zones
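A replica set configuration document can express both points: members with a higher `priority` are preferred as primary, and the member `tags` field can record each member's availability zone. The hostnames, zone names and priority values below are hypothetical, chosen to mirror the 2 x i2.xlarge + 1 x m3.xlarge layout from the hardware slide. In the mongo shell, such a document would be passed to `rs.initiate()` when creating the set (or `rs.reconfig()` for an existing one); the helpers here only build and check it, so they run under plain Node.js.

```javascript
// Build a replica set config document. Higher priority => preferred primary.
// Priority 0 would mean "never primary"; here the weaker node keeps a low
// non-zero priority so it can still take over if both strong nodes fail.
function buildReplicaSetConfig(name, nodes) {
  return {
    _id: name,
    members: nodes.map((node, i) => ({
      _id: i,
      host: node.host,
      priority: node.priority,
      tags: { zone: node.zone }, // informational tag; zone names are hypothetical
    })),
  };
}

// True if no two members share an availability zone.
function zonesAreDistinct(config) {
  const zones = config.members.map((m) => m.tags.zone);
  return new Set(zones).size === zones.length;
}

// Hosts that share the highest priority (the preferred primaries).
function preferredPrimaries(config) {
  const max = Math.max(...config.members.map((m) => m.priority));
  return config.members.filter((m) => m.priority === max).map((m) => m.host);
}

const rsConfig = buildReplicaSetConfig('rs0', [
  { host: 'db-1.example.internal:27017', zone: 'us-east-1a', priority: 2 }, // i2.xlarge
  { host: 'db-2.example.internal:27017', zone: 'us-east-1b', priority: 2 }, // i2.xlarge
  { host: 'db-3.example.internal:27017', zone: 'us-east-1c', priority: 1 }, // m3.xlarge
]);
console.log(zonesAreDistinct(rsConfig)); // true
console.log(preferredPrimaries(rsConfig)); // the two i2.xlarge hosts
```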
DISASTER RECOVERY
• Cold backup using MMS Backup
• Full production replica in another EC2 region, continuously synced using MongoDB’s own replication mechanism
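One way to keep a continuously replicated copy in another region with MongoDB's built-in replication is to add a member located there to the existing replica set, configured so it never becomes primary on its own: `priority: 0` prevents it from being elected, and `votes: 0` keeps cross-region network problems from affecting elections in the main region (MongoDB requires non-voting members to have priority 0). The slides do not say BigPanda used this exact member configuration; the helper below is a hypothetical sketch that only builds the updated config document (which would then be applied with `rs.reconfig()` in the mongo shell), and all hostnames and the region name are made up.

```javascript
// Return a copy of a replica set config with a disaster-recovery member added.
// priority 0: never elected primary automatically.
// votes 0: does not take part in elections (requires priority 0).
function withDisasterRecoveryMember(config, host, region) {
  const nextId = Math.max(-1, ...config.members.map((m) => m._id)) + 1;
  return {
    ...config,
    members: [
      ...config.members,
      { _id: nextId, host, priority: 0, votes: 0, tags: { region } },
    ],
  };
}

const primaryRegionConfig = {
  _id: 'rs0',
  members: [
    { _id: 0, host: 'db-1.example.internal:27017', priority: 2 },
    { _id: 1, host: 'db-2.example.internal:27017', priority: 2 },
    { _id: 2, host: 'db-3.example.internal:27017', priority: 1 },
  ],
};

const drConfig = withDisasterRecoveryMember(
  primaryRegionConfig,
  'dr-1.example.internal:27017',
  'us-west-2'
);
console.log(drConfig.members.length); // 4
```

Promoting such a member during a full regional outage would be a manual reconfiguration, which fits a disaster-recovery copy rather than automatic failover.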
THANK YOU!

Editor's Notes

  • #4: For each customer: aggregate alert notifications from multiple monitoring systems; group together alerts that belong to the same monitored appliance; and group alerts that are logically or topologically related into “incidents”.