Data Modeling for the Real World

Perl Engineer & Evangelist, 10gen
Mike Friedman
#MongoDBdays
Schema Design
Four Real-World Use
Cases

Single Table En
Agenda
• Why is schema design important
• 4 Real World Schemas
– Inbox
– History
– IndexedAttributes
– Multiple Identities
• Conclusions

Why is Schema Design
important?
• Largest factor for a performant system
• Schema design with MongoDB is different
• RDBMS – "What answers do I have?"
• MongoDB – "What question will I have?"

Design Goals
• Efficiently send new messages to recipients
• Efficiently read inbox

3 Approaches (there are
more)
• Fan out on Read
• Fan out on Write
• Fan out on Write with Bucketing

// Shard on "from"
db.shardCollection( "mongodbdays.inbox", { from: 1 } )
// Make sure we have an index to handle inbox reads
db.inbox.ensureIndex( { to: 1, sent: 1 } )
msg = {
from: "Joe",
to: [ "Bob", "Jane" ],
sent: new Date(),
message: "Hi!",
}
// Send a message
db.inbox.save( msg )
// Read my inbox
db.inbox.find( { to: "Joe" } ).sort( { sent: -1 } )
Fan out on read

Fan out on read – I/O
Shard
1 Shard 2
Shard
3
Send
Message

Fan out on read – I/O
Shard
1 Shard 2
Shard
3
Read
Inbox
Send
Message

Considerations
• Write: One document per message sent
• Read: Find all messages with my own name in
the recipient field
• Read: Requires scatter-gather on sharded
cluster
• A lot of random I/O on a shard to find everything

// Shard on “recipient” and “sent”
db.shardCollection( "mongodbdays.inbox", { ”recipient”: 1, ”sent”: 1 } )
msg = {
from: "Joe",
sent: new Date(),
message: "Hi!",
}
// Send a message
for ( recipient in msg.to ) {
msg.recipient = msg.to[recipient]
db.inbox.save( msg );
}
// Read my inbox
db.inbox.find( { recipient: "Joe" } ).sort( { sent: -1 } )
Fan out on write

Fan out on write – I/O
Shard
1
Shard
2
Shard
3
Send
Message

Fan out on write – I/O
Read
Inbox
Send
Message
Shard
1
Shard
2
Shard
3

Considerations
• Write: One document per recipient
• Read: Find all of the messages with me as the
recipient
• Can shard on recipient, so inbox reads hit one
shard
• But still lots of random I/O on the shard

// Shard on "owner / sequence"
db.shardCollection( "mongodbdays.inbox",
{ owner: 1, sequence: 1 } )
db.shardCollection( "mongodbdays.users", { user_name: 1 } )
msg = {
from: "Joe",
sent: new Date(),
message: "Hi!",
}
Fan out on write with buckets

// Send a message
for( recipient in msg.to) {
count = db.users.findAndModify({
query: { user_name: msg.to[recipient] },
update: { "$inc": { "msg_count": 1 } },
upsert: true,
new: true }).msg_count;
sequence = Math.floor(count / 50);
db.inbox.update({
owner: msg.to[recipient], sequence: sequence },
{ $push: { "messages": msg } },
{ upsert: true } );
}
// Read my inbox
db.inbox.find( { owner: "Joe" } )
.sort ( { sequence: -1 } ).limit( 2 )

• Each “inbox” document is an array of messages
• Append a message onto “inbox” of recipient
• Bucket inboxes so there’s not too many
messages per document
• Can shard on recipient, so inbox reads hit one
shard
• 1 or 2 documents to read the whole inbox

Fan out on write with buckets – I/O
Shard
1
Shard
2
Shard
3
Send
Message

Shard
1
Shard
2
Shard
3
Fan out on write with buckets – I/O
Read
Inbox
Send
Message

Data Modeling for the Real World

Design Goals
• Need to retain a limited amount of history e.g.
– Hours, Days, Weeks
– May be legislative requirement (e.g. HIPPA, SOX, DPA)
• Need to query efficiently by
– match
– ranges

more)
• Bucket by Number of messages
• Fixed size array
• Bucket by date + TTL collections

db.inbox.find()
{ owner: "Joe", sequence: 25,
messages: [
{ from: "Joe",
sent: ISODate("2013-03-01T09:59:42.689Z"),
message: "Hi!"
},
…
] }
// Query with a date range
db.inbox.find ({owner: "friend1",
messages: {
$elemMatch: {sent:{$gte: ISODate("…") }}}})
// Remove elements based on a date
db.inbox.update({owner: "friend1" },
{ $pull: { messages: {
sent: { $gte: ISODate("…") } } } } )
Bucket by number of
messages

Considerations
• Shrinking documents, space can be reclaimed
with
– db.runCommand ( { compact: '<collection>' } )
• Removing the document after the last element in
the array as been removed
– { "_id" : …, "messages" : [ ], "owner" : "friend1",
"sequence" : 0 }

msg = {
from: "Your Boss",
to: [ "Bob" ],
sent: new Date(),
message: "CALL ME NOW!"
}
// 2.4 Introduces $each, $sort and $slice for $push
db.messages.update(
{ _id: 1 },
{ $push: { messages: { $each: [ msg ],
$sort: { sent: 1 },
$slice: -50 }
}
}
)
Fixed Size Array

Considerations
• Need to compute the size of the array based on
retention period

// messages: one doc per user per day
db.inbox.findOne()
{
_id: 1,
to: "Joe",
sequence: ISODate("2013-02-04T00:00:00.392Z"),
messages: [ ]
}
// Auto expires data after 31536000 seconds = 1 year
db.messages.ensureIndex( { sequence: 1 },
{ expireAfterSeconds: 31536000 } )
TTL Collections

Design Goal
• Application needs to stored a variable number of
attributes e.g.
– User defined Form
– Meta Data tags
• Queries needed
– Equality
– Range based
• Need to be efficient, regardless of the number of
attributes

more)
• Attributes as Embedded Document
• Attributes as Objects in an Array

db.files.insert( { _id: "local.0",
attr: { type: "text", size: 64,
created: ISODate("..." } } )
attr: { type: "text", size: 128} } )
db.files.insert( { _id: "mongod",
attr: { type: "binary", size: 256,
created: ISODate("...") } } )
// Need to create an index for each item in the sub-document
db.files.ensureIndex( { "attr.type": 1 } )
db.files.find( { "attr.type": "text"} )
// Can perform range queries
db.files.ensureIndex( { "attr.size": 1 } )
db.files.find( { "attr.size": { $gt: 64, $lte: 16384 } } )
Attributes as a Sub-
Document

Considerations
• Each attribute needs an Index
• Each time you extend, you add an index
• Lots and lots of indexes

db.files.insert( {_id: "local.0",
attr: [ { type: "text" },
{ size: 64 },
{ created: ISODate("...") } ] } )
attr: [ { type: "text" },
{ size: 128 } ] } )
db.files.insert( { _id: "mongod",
attr: [ { type: "binary" },
{ size: 256 },
{ created: ISODate("...") } ] } )
db.files.ensureIndex( { attr: 1 } )
Attributes as Objects in Array

Considerations
• Only one index needed on attr
• Can support range queries, etc.
• Index can be used only once per query

Design Goal
• Ability to look up by a number of different
identities e.g.
• Username
• Email address
• FB Handle
• LinkedIn URL

more)
• Identifiers in a single document
• Separate Identifiers from Content

db.users.findOne()
{ _id: "joe",
email: "joe@example.com,
fb: "joe.smith", // facebook
li: "joe.e.smith", // linkedin
other: {…}
}
// Shard collection by _id
db.shardCollection("mongodbdays.users", { _id: 1 } )
// Create indexes on each key
db.users.ensureIndex( { email: 1} )
db.users.ensureIndex( { fb: 1 } )
db.users.ensureIndex( { li: 1 } )
Single Document by User

Read by _id (shard key)
Shard 1 Shard 2 Shard 3
find( { _id: "joe"} )

Read by email (non-shard
key)
find ( { email: joe@example.com }
)

Considerations
• Lookup by shard key is routed to 1 shard
• Lookup by other identifier is scatter gathered
across all shards
• Secondary keys cannot have a unique index

// Create unique index
db.identities.ensureIndex( { identifier : 1} , { unique: true} )
// Create a document for each users document
db.identities.save(
{ identifier : { hndl: "joe" }, user: "1200-42" } )
db.identities.save(
{ identifier : { email: "joe@abc.com" }, user: "1200-42" } )
db.identities.save(
{ identifier : { li: "joe.e.smith" }, user: "1200-42" } )
db.shardCollection( "mydb.identities", { identifier : 1 } )
// Create unique index
db.users.ensureIndex( { _id: 1} , { unique: true} )
db.shardCollection( "mydb.users", { _id: 1 } )
Document per Identity

Read requires 2 reads
db.identities.find({"identifier" : {
"hndl" : "joe" }})
db.users.find( { _id: "1200-42"}
)

Considerations
• Lookup to Identities is a routed query
• Lookup to Users is a routed query
• Unique indexes available
• Must do two queries per lookup

Summary
• Multiple ways to model a domain problem
• Understand the key uses cases of your app
• Balance between ease of query vs. ease of write
• Random I/O should be avoided

Perl Engineer & Evangelist, 10gen
Mike Friedman
#MongoDBdays
Thank You

Next Sessions at 3:40
5th Floor:
West Side Ballroom 3&4:Advanced Replication Internals
West Side Ballroom 1&2: Building a High-Performance Distributed
Task Queue on MongoDB
Juilliard Complex: WhiteBoard Q&A
Lyceum Complex: Ask the Experts
7th Floor:
Empire Complex: Managing a Maturing MongoDB Ecosystem
SoHo Complex: MongoDB Indexing Constraints and Creative
Schemas

Data Modeling for the Real World

More Related Content

What's hot (19)

Viewers also liked (6)

Similar to Data Modeling for the Real World (20)

Recently uploaded (20)

Data Modeling for the Real World