MongoDB Indexing
Constraints & Creative
Schemas
Chris Winslett
chris@mongohq.com
Thursday, June 27, 13
My Background
• For the past year, I’ve looked at MongoDB
logs at least once every day.
• We routinely answer the question “how
can I improve performance?”
Who’s this talk for?
• New to MongoDB
• Seeing some slow operations, and need
help debugging
• Running database operations on a sizeable
deploy
• I have a MongoDB deployment, and I’ve hit
a performance wall
What should you learn?
Know where to look on a running MongoDB
to uncover slowness, and discuss solutions.
MongoDB has performance “patterns”.
How to think about improving performance.
And . . .
Schema Design
Design with the end in mind.
MongoDB Indexing
Constraints
• One index per query *
• One range operator per query ($)
• Range operator must be last field in index
• Using RAM well
* except $or, but the sin with $or is appending a sort to the query.
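The constraints above can be turned into a rule of thumb for ordering the fields of a compound index. The sketch below is not from the talk: `buildIndexSpec` is a hypothetical helper, and placing sort fields between the equality fields and the range field follows the commonly cited equality, sort, range ordering, which is an assumption layered on top of the "range last" constraint.

```javascript
// Hypothetical helper (not a MongoDB API): orders index keys so that
// equality fields come first, sort fields next, and the range field last.
function buildIndexSpec(equalityFields, sortSpec, rangeField) {
  const spec = {};
  for (const field of equalityFields) {
    spec[field] = 1; // direction is irrelevant for an equality match
  }
  for (const [field, direction] of Object.entries(sortSpec)) {
    if (!(field in spec)) spec[field] = direction;
  }
  // Only one range field is accepted, mirroring "one range operator per query".
  if (rangeField !== undefined && !(rangeField in spec)) {
    spec[rangeField] = 1;
  }
  return spec;
}

const spec = buildIndexSpec(["publisher"], { publishedAt: -1 });
console.log(JSON.stringify(spec)); // {"publisher":1,"publishedAt":-1}
```

The returned object can be passed as the key pattern when creating an index. JavaScript preserves insertion order for these string keys, which is what makes the field order meaningful.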
The Tools
• `mongostat` for MongoDB Behavior
• `tail` the logs for current operations
• `iostat` for disk util
• `top -c` for CPU usage
First, a Simple One
query  getmore  command  res   faults  locked db   ar|aw  netIn  netOut  conn  time
129    4        7        126m  2       my_db:0.0%  3|0    27k    445k    42    15:36:54
64     4        3        126m  0       my_db:0.0%  5|0    12k    379k    42    15:36:55
65     7        8        126m  0       my_db:0.1%  3|0    15k    230k    42    15:36:56
65     3        3        126m  1       my_db:0.0%  3|0    13k    170k    42    15:36:57
66     1        6        126m  1       my_db:0.0%  0|0    14k    262k    42    15:36:58
32     8        5        126m  0       my_db:0.0%  5|0    5k     445k    42    15:36:59
a truncated mongostat
Alerted due to high CPU
log
[conn73454] query my_db.my_collection query: { $query:
{ publisher: "US Weekly" }, orderby: { publishedAt: -1 } }
ntoreturn:5 ntoskip:0 nscanned:33236 scanAndOrder:1
keyUpdates:0 numYields: 21 locks(micros) r:317266
nreturned:5 reslen:3127 178ms
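The numbers that matter in this log line are nscanned, nreturned, scanAndOrder, and the trailing duration. As a rough sketch, they can be pulled out with a few regular expressions. `parseSlowQuery` and the `scannedPerReturned` ratio are hypothetical names introduced here, and the parser assumes the log format shown on this slide.

```javascript
// Extracts the counters read off a slow-query log line in the slide's format.
function parseSlowQuery(line) {
  const num = (re) => {
    const m = line.match(re);
    return m ? Number(m[1]) : null;
  };
  const nscanned = num(/nscanned:(\d+)/);
  const nreturned = num(/nreturned:(\d+)/);
  return {
    nscanned,
    nreturned,
    scanAndOrder: /scanAndOrder:1/.test(line), // true means an in-memory sort
    millis: num(/(\d+)ms\s*$/),
    // Documents examined per document returned; large values suggest a poor index match.
    scannedPerReturned:
      nscanned !== null && nreturned ? nscanned / nreturned : null,
  };
}

const line =
  '[conn73454] query my_db.my_collection query: { $query: ' +
  '{ publisher: "US Weekly" }, orderby: { publishedAt: -1 } } ' +
  "ntoreturn:5 ntoskip:0 nscanned:33236 scanAndOrder:1 " +
  "keyUpdates:0 numYields: 21 locks(micros) r:317266 " +
  "nreturned:5 reslen:3127 178ms";

console.log(parseSlowQuery(line));
```

For this line the query scanned 33236 entries to return 5 documents and sorted them in memory, which is the signature of a query with no matching index.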
Solution
We are fixing this query:
{ $query: { publisher: "US Weekly" }, orderby: { publishedAt: -1 } }
With this index:
db.my_collection.ensureIndex({"publisher": 1, publishedAt: -1}, {background: true})
I would show you the logs, but now they are silent.
The Pattern
Inefficient read queries that scan and sort
in memory cause high CPU load.
The cause: indexes that do not match the queries.
Example 2
MongoDB Twitter-ish Feed
Customer was building a
network graph of users.
Naive Method
Statuses
{
  creator_id: ObjectId(),
  status: "This is so awesome!"
}
Users
{
  _id: ObjectId(),
  friends: [array-o-friends]
}
Query
db.statuses.find({creator_id: {$in: [array-o-friends]}}).sort({_id: -1})
Solution
Statuses
{
  creator_id: ObjectId(),
  friends_of_creator: [array-of-viewers],
  status: "This is so awesome!"
}
Users
{
  _id: ObjectId(),
  friends: [array-o-friends]
}
Query
db.statuses.find({friends_of_creator: ObjectId()}).sort({_id: -1})
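The schema change moves the work from read time to write time: each status carries the list of users who may see it. A minimal in-memory sketch of that idea follows. `buildStatus` and `visibleTo` are hypothetical helpers, and plain strings stand in for ObjectId values.

```javascript
// Hypothetical write-path helper: copies the creator's friend list onto the
// status so that a read can match a single value in friends_of_creator.
function buildStatus(creator, text) {
  return {
    creator_id: creator._id,
    friends_of_creator: [...creator.friends], // copy, so the user doc is not aliased
    status: text,
  };
}

// In-memory stand-in for find({friends_of_creator: viewerId}).
const visibleTo = (viewerId, statuses) =>
  statuses.filter((s) => s.friends_of_creator.includes(viewerId));

// Strings stand in for ObjectId values in this sketch.
const alice = { _id: "alice", friends: ["bob", "carol"] };
const status = buildStatus(alice, "This is so awesome!");

console.log(visibleTo("bob", [status]).length);  // 1
console.log(visibleTo("dave", [status]).length); // 0
```

The read becomes an equality match on one value instead of an `$in` over the viewer's whole friend list. The slides do not say which index backed this query; an index such as `{friends_of_creator: 1, _id: -1}` would be one assumption to test. The trade-off is that the copied list must be maintained when friendships change.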
The Pattern
With graphs, query on "viewable by": store who can see each document, and query on that.
What worked with a small number of documents did not scale.
Similar Issues - Messages
Naive
{
  sender_id: ObjectId(),
  recipient_id: ObjectId(),
  message: "This is so awesome!"
}
Naive Query
db.messages.find({$or: [{sender_id: ObjectId()}, {recipient_id: ObjectId()}]}).sort({_id: -1})
Solution
{
  sender_id: ObjectId(),
  recipient_id: ObjectId(),
  participants: [ObjectId(), ObjectId()],
  thread_id: ObjectId(),
  message: "This is so awesome!"
}
Query
db.messages.find({participants: ObjectId()}).sort({_id: -1})
Example 3
insert  query  update  delete  getmore  command  faults  locked %  idx miss %  qr|qw  ar|aw
*0      *0     *0      *0      0        1|0      1422    0         0           0|0    50|0
*0      6      *0      *0      0        6|0      575     0         0           0|0    51|0
*0      3      *0      *0      0        1|0      1047    0         0           0|0    50|0
*0      2      *0      *0      0        3|0      1660    0         0           0|0    50|0
a truncated mongostat
Alerted on high CPU
tail
[initandlisten] connection accepted from ....
[conn4229724] authenticate: { authenticate: ....
[initandlisten] connection accepted from ....
[conn4229725] authenticate: { authenticate: .....
[conn4229717] query ..... 102ms
[conn4229725] query ..... 140ms
amazingly quiet
currentOp
> db.currentOp()
{
  "inprog" : [
    {
      "opid" : 66178716,
      "lockType" : "read",
      "secs_running" : 760,
      "op" : "query",
      "ns" : "my_db.my_collection",
      "query" : {
        keywords: { $in: ["keyword1", "keyword2"] },
        tags: { $in: ["tags1", "tags2"] }
      },
      orderby: {
        "created_at": -1
      },
      "numYields" : 21
    }
  ]
}
Solution
> db.currentOp().inprog.filter(function(row) {
    // long-running reads only
    return row.secs_running > 100 && row.op == "query"
  }).forEach(function(row) {
    db.killOp(row.opid)
  })
Return Stability to Database
Disable query, and refactor schema.
Refactoring
I have one word for you: “Schema”
Example 4
A map reduce has gradually run
slower and slower.
Finding Offenders
Find the time of the slowest query of the day:
grep -E '[0-9]{3,100}ms$' $MONGODB_LOG | awk '{print $NF}' | sort -n
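The same filter can be sketched in JavaScript when the full log line should be kept next to each duration. `slowestOps` is a hypothetical helper. The sample lines are abbreviated stand-ins built from durations shown elsewhere in this talk, and the `[conn1]` prefix is made up for illustration.

```javascript
// Returns the n slowest operations (duration >= minMillis) from raw log lines.
function slowestOps(lines, minMillis = 100, n = 5) {
  return lines
    .map((line) => {
      const m = line.match(/(\d+)ms\s*$/); // duration is the last token on the line
      return m ? { millis: Number(m[1]), line } : null;
    })
    .filter((op) => op !== null && op.millis >= minMillis)
    .sort((a, b) => b.millis - a.millis) // slowest first
    .slice(0, n);
}

// Abbreviated, illustrative log lines.
const sample = [
  "[conn4229717] query ..... 102ms",
  "[conn4229725] query ..... 140ms",
  "[initandlisten] connection accepted from ....",
  "[conn1] command ..... 421185ms",
];

console.log(slowestOps(sample).map((op) => op.millis)); // [ 421185, 140, 102 ]
```

Sorting in descending order puts the worst offender first, whereas `sort -n` in the shell pipeline leaves it on the last line.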
Slowest Map Reduce
my_db.$cmd command: {
mapreduce: "my_collection",
map: function() {},
query: { $or: [
{ object.type: "this" },
{ object.type: "that" }
],
time: { $lt: new Date(1359025311290), $gt: new Date(1358420511290) },
object.ver: 1,
origin: "tnh"
},
out: "my_new_collection",
reduce: function(keys, vals) { ....}
} ntoreturn:1 keyUpdates:0 numYields: 32696 locks(micros)
W:143870 r:511858643 w:6279425 reslen:140 421185ms
Solution
Problem
Query is slow because it has multiple multi-value operators: $or, $gt, and $lt
Solution
Change schema to use an “hour_created” attribute:
hour_created: “%Y-%m-%d %H”
Create an index on “hour_created” followed by the fields used in the “$or”
clauses. Query using the new “hour_created”.
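A small sketch of how the “hour_created” value could be derived at write time, and how the hour buckets covering a time window could be listed. `hourCreated` and `hourBuckets` are hypothetical helpers. Using UTC is an assumption, since the slides do not name a time zone, and how the buckets are then used in the query is not spelled out in the talk.

```javascript
// Formats a Date as "%Y-%m-%d %H" in UTC (assumed time zone).
function hourCreated(date) {
  const pad = (n) => String(n).padStart(2, "0");
  return (
    date.getUTCFullYear() + "-" + pad(date.getUTCMonth() + 1) + "-" +
    pad(date.getUTCDate()) + " " + pad(date.getUTCHours())
  );
}

// Lists every hour bucket touched by the half-open window [start, end).
function hourBuckets(start, end) {
  const buckets = [];
  const cursor = new Date(start.getTime());
  cursor.setUTCMinutes(0, 0, 0); // snap to the top of the hour
  while (cursor < end) {
    buckets.push(hourCreated(cursor));
    cursor.setUTCHours(cursor.getUTCHours() + 1); // rolls over days and months
  }
  return buckets;
}

// The window from the slow map reduce shown earlier.
const start = new Date(1358420511290);
const end = new Date(1359025311290);
const buckets = hourBuckets(start, end);

console.log(hourCreated(end));
console.log(buckets.length, buckets[0], buckets[buckets.length - 1]);
```

Because the string sorts lexicographically in time order, each document falls into exactly one bucket, and a single-hour lookup is an equality match rather than a range.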
Words of caution
Two of the four solutions were to add an index.
Adding new indexes as a solution scales poorly.
Sometimes . . .
It is best to do nothing, except add shards / add hardware.
Go back to the drawing board on the design.
Bad things happen to
good databases?
• ORMs
• Manage your indexes and queries.
• Constraints will set you free.
Road Map for
Refactoring
• Measure, measure, measure.
• Find your slowest queries and determine if
they can be indexed
• Rephrase the problem you are solving by
asking “How do I want to query my data?”
Thank you!
• Questions?
• E-mail me: chris@mongohq.com
