SlideShare a Scribd company logo
Sr. Solutions Architect, MongoDB
Jake Angerman
Sharding Time Series Data
Let's Pretend We Are DevOps
What my friends
think I do
What society
thinks I do
What my Mom
thinks I do
What my boss
thinks I do What I think I do What I really do
DevOps
Sharding Overview
Primary
Secondary
Secondary
Shard 1
Primary
Secondary
Secondary
Shard 2
Primary
Secondary
Secondary
Shard 3
Primary
Secondary
Secondary
Shard N
…
Query
Router
Query
Router
Query
Router
……
Driver
Application
Why do we need to shard?
• Reaching a limit on some resource
– RAM (working set)
– Disk space
– Disk IO
– Client network latency on writes (tag aware sharding)
– CPU
Do we need to shard right now?
• Two schools of thought:
1. Shard at the outset to avoid technical debt later
2. Shard later to avoid complexity and overhead today
• Either way, shard before you need to!
– 256GB data size threshold published in documentation
– Chunk migrations can cause memory contention and disk
IO
Working
Set
Free
RAM
Things seemed fine…
Working
Set
Chunk
Migration
… then I waited
too long to
shard
Develop Nationwide traffic monitoring
system
Traffic sensors to monitor interstate
conditions
• 16,000 sensors
• Measure
• Speed
• Travel time
• Weather, pavement, and traffic conditions
• Support desktop, mobile, and car navigation
systems
Model After NY State Solution
http://511ny.org
{ _id: “900006:2014031206”,
data: [
{ speed: NaN, time: NaN },
{ speed: NaN, time: NaN },
{ speed: NaN, time: NaN },
...
],
conditions: {
status: "unknown",
pavement: "unknown",
weather: "unknown"
}
}
Sample Document Structure
Pre-allocated,
60 element array of
per-minute data
> db.mdbw.stats()
{
"ns" : "test.mdbw",
"count" : 16000, // one hour's worth of documents
"size" : 65280000, // size of user data, padding included
"avgObjSize" : 4080,
"storageSize" : 93356032, // size of data extents, unused space included
"numExtents" : 11,
"nindexes" : 1,
"lastExtentSize" : 31354880,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 1,
"totalIndexSize" : 801248,
"indexSizes" : { "_id_" : 801248 },
"ok" : 1
}
collection stats
Storage model spreadsheet
sensors 16,000
years to keep data 6
docs per day 384,000
docs per year 140,160,000
docs total across all years 840,960,000
indexes per day 801248 bytes
storage per hour 63 MB
storage per day 1.5 GB
storage per year 539 GB
storage across all years 3,235 GB
Why we need to shard now
539 GB in year one
alone
0
500
1,000
1,500
2,000
2,500
3,000
3,500
1 2 3 4 5 6
Year
Total storage
(GB)
16,000 sensors today…
… 47,000
tomorrow?
What will our sharded cluster look
like?
• We need to model the application to answer this
question
• Model should include:
– application write patterns (sensors)
– application read patterns (clients)
– analytic read patterns
– data storage requirements
• Two main collections
– summary data (fast query times)
– historical data (analysis of environmental conditions)
Option 1: Everything in one sharded
cluster
Primary Primary Primary
Secondary Secondary Secondary
Secondary Secondary Secondary
Shard 2 Shard 3 Shard N
…
Primary
Secondary
Secondary
Shard 1
Primary Shard
Primary
Secondary
Secondary
Shard 4
• Issue: prevent analytics jobs from affecting application
performance
• Summary data is small (16,000 * N bytes) and accessed
frequently
Option 2: Distinct replica set for
summaries
Primary Primary Primary
Secondary Secondary Secondary
Secondary Secondary Secondary
Shard 1 Shard 2 Shard N
…
Primary
Secondary
Secondary
Replica set
Primary
Secondary
Secondary
Shard 3
• Pros: Operational separation between business functions
• Cons: application must write to two different databases
Application read patterns
• Web browsers, mobile phones, and in-car
navigation devices
• Working set will be kept in RAM
• 5M subscribers * 1% active * 50
sensors/query *
1 device query/min = 41,667 reads/sec
• 41,667 reads/sec * 4080 bytes = 162 MB/sec
– and that's without any protocol overhead
• Gigabit Ethernet is ≈ 118 MB/sec
Primary
Secondary
Secondary
Replica set
1 Gbps
Application read patterns
(continued)
• Options
– provision more bandwidth ($$$)
– tune application read pattern
– add a caching layer
– secondary reads from the replica
set
Primary
Secondary
Secondary
Replica set
1 Gbps
1 Gbps
1 Gbps
Secondary Reads from the Replica
Set
• Stale data OK in this use case
• caution: read preference of
secondary could be disastrous in a
3-replica set if a secondary fails!
• app servers with mixed read
preferences of primary and
secondary are operationally
cumbersome
• Use nearest read preference to
access all nodes
Primary
Secondary
Secondary
Replica set
1 Gbps
1 Gbps
1 Gbps
db.collection.find().readPref
( { mode: 'nearest'} )
Replica Set Tags
• app servers in different data centers use
replica set tags plus read preferencenearest
• db.collection.find().readPref( { mode: 'nearest',
tags: [ {'datacenter': 'east'} ] } )
east
Secondary
Secondary
Primary
>rs.conf()
{"_id":"rs0",
"version":2,
"members":[
{"_id":0,
"host":"node0.example.net:27017",
"tags":{"datacenter":"east"}
},
{"_id":1,
"host":"node1.example.net:27017",
"tags":{"datacenter":"east"}
},
{"_id":2,
"host":"node2.example.net:27017",
"tags":{"datacenter":"east"}
},
}
eastcentralwest
Replica Set Tags
• Enables geographic distribution
SecondarySecondary Primary
eastcentralwest
Replica Set Tags
• Enables geographic distribution
• Allows scaling within each data center
Secondary
Secondary
Secondary
Secondary
Secondary
Secondary
Primary
Secondary
Secondary
Analytic read patterns
• How does an analyst look at the data on the sharded cluster?
• 1 Year of data = 539 GB
3, 256
3, 192
5, 128
9, 64
17, 32
0
50
100
150
200
250
300
0 2 4 6 8 10 12 14 16 18
Server RAM
Number of shards
Application write patterns
• 16,000 sensors every minute = 267 writes/sec
• Could we handle 16,000 writes in one second?
– 16,000 writes * 4080 bytes = 62 MB
• Load test the app!
Modeling the Application - summary
• We modeled:
– application write patterns (sensors)
– application read patterns (clients)
– analytic read patterns
– data storage requirements
– the network, a little bit
Shard Key
Shard Key characteristics
• Agood shard key has:
– sufficient cardinality
– distributed writes
– targeted reads ("query isolation")
• Shard key should be in every query if possible
– scatter gather otherwise
• Choosing a good shard key is important!
– affects performance and scalability
– changing it later is expensive
Hashed shard key
• Pros:
– Evenly distributed writes
• Cons:
– Random data (and index) updates can be IO intensive
– Range-based queries turn into scatter gather
Shard 1
mongos
Shard 2 Shard 3 Shard N
Low cardinality shard key
• Induces "jumbo chunks"
• Examples: sensor ID
• Makes sense for some use cases besides this one
Shard 1
mongos
Shard 2 Shard 3 Shard N
[ a, b ) [ b, c ) [ c, d ) [ e, f )
Ascending shard key
• Monotonically increasing shard key values cause
"hot spots" on inserts
• Examples: timestamps, _id
Shard 1
mongos
Shard 2 Shard 3 Shard N
[ ISODate(…), $maxKey )
Choosing a shard key for time series
data
• Consider compound shard key:
{arbitrary value, incrementing value}
• Best of both worlds – multi-hot spotting, targeted
reads
Shard 1
mongos
Shard 2 Shard 3 Shard N
[ {V1, ISODate(A)}, {V1, ISODate(B)} ),
[ {V1, ISODate(B)}, {V1, ISODate(C)} ),
[ {V1, ISODate(C)}, {V1, ISODate(D)} ),
…
[ {V4, ISODate(A)}, {V4, ISODate(B)}
[ {V4, ISODate(B)}, {V4, ISODate(C)}
[ {V4, ISODate(C)}, {V4, ISODate(D)}
…
[ {V2, ISODate(A)}, {V2, ISODate(B)} ),
[ {V2, ISODate(B)}, {V2, ISODate(C)} ),
[ {V2, ISODate(C)}, {V2, ISODate(D)} ),
…
[ {V3, ISODate(A)}, {V3, ISODate(B)} ),
[ {V3, ISODate(B)}, {V3, ISODate(C)} ),
[ {V3, ISODate(C)}, {V3, ISODate(D)} ),
…
What is our shard key?
• Let's choose: linkID, date
– example: { linkID: 9000006, date: 2014031206 }
– example: { _id: "900006:2014031206" }
– this application's _id is in this form already, yay!
Summary
• Model the read/write patterns and storage
• Choose an appropriate shard key
• DevOps influenced the application
– write recent summary data to separate database
– replica set tags for summary database
– avoid synchronous sensor checkins
– consider changing client polling frequency
– consider throttling RESTAPI access to app servers
Sign up for our “Path to Proof” Program
and get free expert advice on
implementation, architecture, and
configuration.
www.mongodb.com/lp/contact/path-proof-program
MongoDB for Time Series Data: Sharding
Sr. Solutions Architect, MongoDB
Jake Angerman
Thank You

More Related Content

What's hot (20)

PDF
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Facultad de Informática UCM
 
PDF
C* Summit 2013: How Not to Use Cassandra by Axel Liljencrantz
DataStax Academy
 
PDF
Building Modern Data Streaming Apps with Python
Timothy Spann
 
PDF
Moving to PCI Express based SSD with NVM Express
Odinot Stanislas
 
PPTX
Cloud & GCP 101
Runcy Oommen
 
PDF
Overview of Postgres Utility Processes
EDB
 
PPTX
Asm bot mitigations v3 final- lior rotkovitch
Lior Rotkovitch
 
PDF
Cassandra
Edureka!
 
PDF
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
SindhuVasireddy1
 
PDF
Postgresql 12 streaming replication hol
Vijay Kumar N
 
PPTX
Map db
Debmalya Jash
 
PDF
Userspace drivers-2016
Chris Simmonds
 
PDF
How Apache Kafka® Works
confluent
 
DOCX
Keepalived+MaxScale+MariaDB_운영매뉴얼_1.0.docx
NeoClova
 
PDF
10分で分かるデータストレージ
Takashi Hoshino
 
PDF
KSQL - Stream Processing simplified!
Guido Schmutz
 
PDF
MariaDB 마이그레이션 - 네오클로바
NeoClova
 
PPTX
Android Binder: Deep Dive
Zafar Shahid, PhD
 
PPSX
10. GPU - Video Card (Display, Graphics, VGA)
Akhila Dakshina
 
PDF
Zen and the Art of Streaming Joins - The What, When and Why (Nick Dearden, Co...
confluent
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Facultad de Informática UCM
 
C* Summit 2013: How Not to Use Cassandra by Axel Liljencrantz
DataStax Academy
 
Building Modern Data Streaming Apps with Python
Timothy Spann
 
Moving to PCI Express based SSD with NVM Express
Odinot Stanislas
 
Cloud & GCP 101
Runcy Oommen
 
Overview of Postgres Utility Processes
EDB
 
Asm bot mitigations v3 final- lior rotkovitch
Lior Rotkovitch
 
Cassandra
Edureka!
 
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
SindhuVasireddy1
 
Postgresql 12 streaming replication hol
Vijay Kumar N
 
Userspace drivers-2016
Chris Simmonds
 
How Apache Kafka® Works
confluent
 
Keepalived+MaxScale+MariaDB_운영매뉴얼_1.0.docx
NeoClova
 
10分で分かるデータストレージ
Takashi Hoshino
 
KSQL - Stream Processing simplified!
Guido Schmutz
 
MariaDB 마이그레이션 - 네오클로바
NeoClova
 
Android Binder: Deep Dive
Zafar Shahid, PhD
 
10. GPU - Video Card (Display, Graphics, VGA)
Akhila Dakshina
 
Zen and the Art of Streaming Joins - The What, When and Why (Nick Dearden, Co...
confluent
 

Viewers also liked (10)

PPTX
MongoDB for Time Series Data Part 3: Sharding
MongoDB
 
PPT
Everything You Need to Know About Sharding
MongoDB
 
PPTX
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB
 
PDF
MongoDB Sharding Fundamentals
Antonios Giannopoulos
 
PDF
MongoDB training for java software engineers
Moshe Kaplan
 
PPT
MongoDB Pros and Cons
johnrjenson
 
PPTX
Sharding Methods for MongoDB
MongoDB
 
PPTX
MySQL High Availability Solutions - Feb 2015 webinar
Andrew Morgan
 
PPTX
MongoDB for Time Series Data Part 1: Setting the Stage for Sensor Management
MongoDB
 
PPTX
Back to Basics Webinar 3: Introduction to Replica Sets
MongoDB
 
MongoDB for Time Series Data Part 3: Sharding
MongoDB
 
Everything You Need to Know About Sharding
MongoDB
 
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB
 
MongoDB Sharding Fundamentals
Antonios Giannopoulos
 
MongoDB training for java software engineers
Moshe Kaplan
 
MongoDB Pros and Cons
johnrjenson
 
Sharding Methods for MongoDB
MongoDB
 
MySQL High Availability Solutions - Feb 2015 webinar
Andrew Morgan
 
MongoDB for Time Series Data Part 1: Setting the Stage for Sensor Management
MongoDB
 
Back to Basics Webinar 3: Introduction to Replica Sets
MongoDB
 
Ad

Similar to MongoDB for Time Series Data: Sharding (20)

PPTX
Ops Jumpstart: MongoDB Administration 101
MongoDB
 
PPTX
Hardware Provisioning
MongoDB
 
PPTX
Webinar: Serie Operazioni per la vostra applicazione - Sessione 6 - Installar...
MongoDB
 
PPTX
Back tobasicswebinar part6-rev.
MongoDB
 
PPTX
MongoDB at Scale
MongoDB
 
PPTX
Sharding Methods for MongoDB
MongoDB
 
PDF
MongoDB: Optimising for Performance, Scale & Analytics
Server Density
 
PPTX
Back to Basics: Build Something Big With MongoDB
MongoDB
 
PPT
MongoDB Sharding Webinar 2014
Dylan Tong
 
PPT
MongoDB Knowledge Shareing
Philip Zhong
 
PPTX
2014 05-07-fr - add dev series - session 6 - deploying your application-2
MongoDB
 
PPTX
Scaling MongoDB
MongoDB
 
PPTX
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 
PDF
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
MongoDB
 
PPT
Building a CRM on top of ElasticSearch
Mark Greene
 
PPTX
Why databases cry at night
Michael Yarichuk
 
PDF
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
DataStax
 
PDF
Understanding and building big data Architectures - NoSQL
Hyderabad Scalability Meetup
 
PPT
NYJavaSIG - Big Data Microservices w/ Speedment
Speedment, Inc.
 
PPTX
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
tsliwowicz
 
Ops Jumpstart: MongoDB Administration 101
MongoDB
 
Hardware Provisioning
MongoDB
 
Webinar: Serie Operazioni per la vostra applicazione - Sessione 6 - Installar...
MongoDB
 
Back tobasicswebinar part6-rev.
MongoDB
 
MongoDB at Scale
MongoDB
 
Sharding Methods for MongoDB
MongoDB
 
MongoDB: Optimising for Performance, Scale & Analytics
Server Density
 
Back to Basics: Build Something Big With MongoDB
MongoDB
 
MongoDB Sharding Webinar 2014
Dylan Tong
 
MongoDB Knowledge Shareing
Philip Zhong
 
2014 05-07-fr - add dev series - session 6 - deploying your application-2
MongoDB
 
Scaling MongoDB
MongoDB
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
MongoDB
 
Building a CRM on top of ElasticSearch
Mark Greene
 
Why databases cry at night
Michael Yarichuk
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
DataStax
 
Understanding and building big data Architectures - NoSQL
Hyderabad Scalability Meetup
 
NYJavaSIG - Big Data Microservices w/ Speedment
Speedment, Inc.
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
tsliwowicz
 
Ad

More from MongoDB (20)

PDF
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB
 
PDF
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB
 
PDF
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB
 
PDF
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB
 
PDF
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB
 
PDF
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB
 
PDF
MongoDB SoCal 2020: MongoDB Atlas Jump Start
MongoDB
 
PDF
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB
 
PDF
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB
 
PDF
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB
 
PDF
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB
 
PDF
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB
 
PDF
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB
 
PDF
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB
 
PDF
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB
 
PDF
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB
 
PDF
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB
 
PDF
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB
 
PDF
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB
 
PDF
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB
 
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB
 
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB
 
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB
 
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB
 
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB
 
MongoDB SoCal 2020: MongoDB Atlas Jump Start
MongoDB
 
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB
 
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB
 
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB
 
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB
 
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB
 
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB
 
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB
 
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB
 
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB
 
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB
 
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB
 
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB
 
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB
 

Recently uploaded (20)

PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
PDF
AI Agents in the Cloud: The Rise of Agentic Cloud Architecture
Lilly Gracia
 
PDF
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PPTX
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
PDF
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PPTX
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
PDF
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PPTX
Digital Circuits, important subject in CS
contactparinay1
 
PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PDF
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
PDF
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
PDF
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
PDF
How do you fast track Agentic automation use cases discovery?
DianaGray10
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PDF
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
AI Agents in the Cloud: The Rise of Agentic Cloud Architecture
Lilly Gracia
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
Digital Circuits, important subject in CS
contactparinay1
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
How do you fast track Agentic automation use cases discovery?
DianaGray10
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 

MongoDB for Time Series Data: Sharding

  • 1. Sr. Solutions Architect, MongoDB Jake Angerman Sharding Time Series Data
  • 2. Let's Pretend We Are DevOps What my friends think I do What society thinks I do What my Mom thinks I do What my boss thinks I do What I think I do What I really do DevOps
  • 3. Sharding Overview Primary Secondary Secondary Shard 1 Primary Secondary Secondary Shard 2 Primary Secondary Secondary Shard 3 Primary Secondary Secondary Shard N … Query Router Query Router Query Router …… Driver Application
  • 4. Why do we need to shard? • Reaching a limit on some resource – RAM (working set) – Disk space – Disk IO – Client network latency on writes (tag aware sharding) – CPU
  • 5. Do we need to shard right now? • Two schools of thought: 1. Shard at the outset to avoid technical debt later 2. Shard later to avoid complexity and overhead today • Either way, shard before you need to! – 256GB data size threshold published in documentation – Chunk migrations can cause memory contention and disk IO Working Set Free RAM Things seemed fine… Working Set Chunk Migration … then I waited too long to shard
  • 6. Develop Nationwide traffic monitoring system
  • 7. Traffic sensors to monitor interstate conditions • 16,000 sensors • Measure • Speed • Travel time • Weather, pavement, and traffic conditions • Support desktop, mobile, and car navigation systems
  • 8. Model After NY State Solution http://511ny.org
  • 9. { _id: “900006:2014031206”, data: [ { speed: NaN, time: NaN }, { speed: NaN, time: NaN }, { speed: NaN, time: NaN }, ... ], conditions: { status: "unknown", pavement: "unknown", weather: "unknown" } } Sample Document Structure Pre-allocated, 60 element array of per-minute data
  • 10. > db.mdbw.stats() { "ns" : "test.mdbw", "count" : 16000, // one hour's worth of documents "size" : 65280000, // size of user data, padding included "avgObjSize" : 4080, "storageSize" : 93356032, // size of data extents, unused space included "numExtents" : 11, "nindexes" : 1, "lastExtentSize" : 31354880, "paddingFactor" : 1, "systemFlags" : 1, "userFlags" : 1, "totalIndexSize" : 801248, "indexSizes" : { "_id_" : 801248 }, "ok" : 1 } collection stats
  • 11. Storage model spreadsheet sensors 16,000 years to keep data 6 docs per day 384,000 docs per year 140,160,000 docs total across all years 840,960,000 indexes per day 801248 bytes storage per hour 63 MB storage per day 1.5 GB storage per year 539 GB storage across all years 3,235 GB
  • 12. Why we need to shard now 539 GB in year one alone 0 500 1,000 1,500 2,000 2,500 3,000 3,500 1 2 3 4 5 6 Year Total storage (GB) 16,000 sensors today… … 47,000 tomorrow?
  • 13. What will our sharded cluster look like? • We need to model the application to answer this question • Model should include: – application write patterns (sensors) – application read patterns (clients) – analytic read patterns – data storage requirements • Two main collections – summary data (fast query times) – historical data (analysis of environmental conditions)
  • 14. Option 1: Everything in one sharded cluster Primary Primary Primary Secondary Secondary Secondary Secondary Secondary Secondary Shard 2 Shard 3 Shard N … Primary Secondary Secondary Shard 1 Primary Shard Primary Secondary Secondary Shard 4 • Issue: prevent analytics jobs from affecting application performance • Summary data is small (16,000 * N bytes) and accessed frequently
  • 15. Option 2: Distinct replica set for summaries Primary Primary Primary Secondary Secondary Secondary Secondary Secondary Secondary Shard 1 Shard 2 Shard N … Primary Secondary Secondary Replica set Primary Secondary Secondary Shard 3 • Pros: Operational separation between business functions • Cons: application must write to two different databases
  • 16. Application read patterns • Web browsers, mobile phones, and in-car navigation devices • Working set will be kept in RAM • 5M subscribers * 1% active * 50 sensors/query * 1 device query/min = 41,667 reads/sec • 41,667 reads/sec * 4080 bytes = 162 MB/sec – and that's without any protocol overhead • Gigabit Ethernet is ≈ 118 MB/sec Primary Secondary Secondary Replica set 1 Gbps
  • 17. Application read patterns (continued) • Options – provision more bandwidth ($$$) – tune application read pattern – add a caching layer – secondary reads from the replica set Primary Secondary Secondary Replica set 1 Gbps 1 Gbps 1 Gbps
  • 18. Secondary Reads from the Replica Set • Stale data OK in this use case • caution: read preference of secondary could be disastrous in a 3-replica set if a secondary fails! • app servers with mixed read preferences of primary and secondary are operationally cumbersome • Use nearest read preference to access all nodes Primary Secondary Secondary Replica set 1 Gbps 1 Gbps 1 Gbps db.collection.find().readPref ( { mode: 'nearest'} )
  • 19. Replica Set Tags • app servers in different data centers use replica set tags plus read preferencenearest • db.collection.find().readPref( { mode: 'nearest', tags: [ {'datacenter': 'east'} ] } ) east Secondary Secondary Primary >rs.conf() {"_id":"rs0", "version":2, "members":[ {"_id":0, "host":"node0.example.net:27017", "tags":{"datacenter":"east"} }, {"_id":1, "host":"node1.example.net:27017", "tags":{"datacenter":"east"} }, {"_id":2, "host":"node2.example.net:27017", "tags":{"datacenter":"east"} }, }
  • 20. eastcentralwest Replica Set Tags • Enables geographic distribution SecondarySecondary Primary
  • 21. eastcentralwest Replica Set Tags • Enables geographic distribution • Allows scaling within each data center Secondary Secondary Secondary Secondary Secondary Secondary Primary Secondary Secondary
  • 22. Analytic read patterns • How does an analyst look at the data on the sharded cluster? • 1 Year of data = 539 GB 3, 256 3, 192 5, 128 9, 64 17, 32 0 50 100 150 200 250 300 0 2 4 6 8 10 12 14 16 18 Server RAM Number of shards
  • 23. Application write patterns • 16,000 sensors every minute = 267 writes/sec • Could we handle 16,000 writes in one second? – 16,000 writes * 4080 bytes = 62 MB • Load test the app!
  • 24. Modeling the Application - summary • We modeled: – application write patterns (sensors) – application read patterns (clients) – analytic read patterns – data storage requirements – the network, a little bit
  • 26. Shard Key characteristics • Agood shard key has: – sufficient cardinality – distributed writes – targeted reads ("query isolation") • Shard key should be in every query if possible – scatter gather otherwise • Choosing a good shard key is important! – affects performance and scalability – changing it later is expensive
  • 27. Hashed shard key • Pros: – Evenly distributed writes • Cons: – Random data (and index) updates can be IO intensive – Range-based queries turn into scatter gather Shard 1 mongos Shard 2 Shard 3 Shard N
  • 28. Low cardinality shard key • Induces "jumbo chunks" • Examples: sensor ID • Makes sense for some use cases besides this one Shard 1 mongos Shard 2 Shard 3 Shard N [ a, b ) [ b, c ) [ c, d ) [ e, f )
  • 29. Ascending shard key • Monotonically increasing shard key values cause "hot spots" on inserts • Examples: timestamps, _id Shard 1 mongos Shard 2 Shard 3 Shard N [ ISODate(…), $maxKey )
  • 30. Choosing a shard key for time series data • Consider compound shard key: {arbitrary value, incrementing value} • Best of both worlds – multi-hot spotting, targeted reads Shard 1 mongos Shard 2 Shard 3 Shard N [ {V1, ISODate(A)}, {V1, ISODate(B)} ), [ {V1, ISODate(B)}, {V1, ISODate(C)} ), [ {V1, ISODate(C)}, {V1, ISODate(D)} ), … [ {V4, ISODate(A)}, {V4, ISODate(B)} [ {V4, ISODate(B)}, {V4, ISODate(C)} [ {V4, ISODate(C)}, {V4, ISODate(D)} … [ {V2, ISODate(A)}, {V2, ISODate(B)} ), [ {V2, ISODate(B)}, {V2, ISODate(C)} ), [ {V2, ISODate(C)}, {V2, ISODate(D)} ), … [ {V3, ISODate(A)}, {V3, ISODate(B)} ), [ {V3, ISODate(B)}, {V3, ISODate(C)} ), [ {V3, ISODate(C)}, {V3, ISODate(D)} ), …
  • 31. What is our shard key? • Let's choose: linkID, date – example: { linkID: 9000006, date: 2014031206 } – example: { _id: "900006:2014031206" } – this application's _id is in this form already, yay!
  • 32. Summary • Model the read/write patterns and storage • Choose an appropriate shard key • DevOps influenced the application – write recent summary data to separate database – replica set tags for summary database – avoid synchronous sensor checkins – consider changing client polling frequency – consider throttling RESTAPI access to app servers
  • 33. Sign up for our “Path to Proof” Program and get free expert advice on implementation, architecture, and configuration. www.mongodb.com/lp/contact/path-proof-program
  • 35. Sr. Solutions Architect, MongoDB Jake Angerman Thank You

Editor's Notes

  • #6: Public Service Announcement need to research the 256 GB limit
  • #10: Compound unique index on linkId & Interval update field used to identify new documents for aggregation
  • #21: you could use the same config file in each data center! 47K is 3X 16K
  • #22: application must read a config file to determine its read pool 47K is 3X 16K
  • #29: may consider hashing _id instead
  • #30: may consider hashing _id instead