On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform
Swaroop Jagadish
http://www.linkedin.com/in/swaroopjagadish
LinkedIn Confidential ©2013 All Rights Reserved
Outline
LinkedIn Data Ecosystem
Espresso: Design Points
Data Model and API
Architecture
Deep Dive: Fault Tolerance
Deep Dive: Secondary Indexing
Espresso In Production
Future work
The World’s Largest Professional Network
225M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
2M+ Company Pages
Connecting Talent → Opportunity. At scale…
LinkedIn Data Ecosystem
(Diagram: LinkedIn’s data ecosystem)
Espresso: Key Design Points
 Source-of-truth
– Master-Slave, Timeline consistent
– Query-after-write
– Backup/Restore
– High Availability
 Horizontally Scalable
 Rich functionality
– Hierarchical data model
– Document oriented
– Transactions within a hierarchy
– Secondary Indexes
Espresso: Key Design Points
 Agility – no “pause the world” operations
– “On the fly” Schema Evolution
– Elasticity
 Integration with the data ecosystem
– Change stream with freshness in O(seconds)
– ETL to Hadoop
– Bulk import
 Modular and Pluggable
– Off-the-shelf: MySQL, Lucene, Avro
Data Model and API
Application View
(Diagram: documents stored as key/value pairs, addressed via the REST API)
REST API: /mailbox/msg_meta/bob/2
Partitioning
/mailbox/msg_meta/bob/2
MemberId is the partitioning key
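A sketch of how a router could map that key to a partition; the hash choice and partition count here are illustrative assumptions (a 12-partition layout appears later in the talk):

  import hashlib

  NUM_PARTITIONS = 12

  def partition_for(member_id: str) -> int:
      # Stable hash of the partitioning key, mod the partition count
      digest = hashlib.md5(member_id.encode("utf-8")).digest()
      return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

  # /mailbox/msg_meta/bob/2 → every key under "bob" lands in one partition,
  # which is what makes transactions within bob's hierarchy possible
  print(partition_for("bob"))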
Document based data model
Richer than a plain key-value store
Hierarchical keys
Values are rich documents and may contain nested types

Example document:

  from : { name : "Chris", email : "chris@linkedin.com" }
  subject : "Go Giants!"
  body : "World Series 2012! w00t!"
  unread : true

Schema (Messages):

  mailboxID : String
  messageID : long
  from : { name : String, email : String }
  subject : String
  body : String
  unread : boolean
REST based API
• Secondary index query
– GET /MailboxDB/MessageMeta/bob/?query="+isUnread:true +isInbox:true"&start=0&count=15
• Partial updates
  POST /MailboxDB/MessageMeta/bob/1
  Content-Type: application/json
  Content-Length: 21

  {"unread" : "false"}
• Conditional operations
– Get a message, only if recently updated
  GET /MailboxDB/MessageMeta/bob/1
  If-Modified-Since: Wed, 31 Oct 2012 02:54:12 GMT
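A sketch of these calls from a client, using Python’s requests library; the endpoint host and response handling are illustrative assumptions, not part of the talk:

  import requests

  BASE = "http://espresso.example.com"  # hypothetical endpoint

  # Secondary-index query: unread inbox messages, first page of 15
  r = requests.get(
      f"{BASE}/MailboxDB/MessageMeta/bob/",
      params={"query": "+isUnread:true +isInbox:true", "start": 0, "count": 15},
  )

  # Partial update: mark message 1 as read
  requests.post(f"{BASE}/MailboxDB/MessageMeta/bob/1", json={"unread": "false"})

  # Conditional read: only fetch if modified since the given time
  r = requests.get(
      f"{BASE}/MailboxDB/MessageMeta/bob/1",
      headers={"If-Modified-Since": "Wed, 31 Oct 2012 02:54:12 GMT"},
  )
  if r.status_code == 304:
      pass  # cached copy is still fresh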
Transactional writes within a hierarchy

MessageCounter:

  mboxId   value
  George   { "numUnread": 2 }

Message:

  mboxId   msgId   value                      etag
  George   0       {…, "unread": false, …}    7abf8091
  George   1       {…, "unread": true, …}     b648bc5f
  George   2       {…, "unread": true, …}     4fde8701

1. Read, record etags:
   /Message/George/0 → {…, "unread": false, …} (etag 7abf8091)
2. Prepare after-image:
   /Message/George/0        {…, "unread": true, …}
   /MessageCounter/George   {…, "numUnread": "+1", …}
3. Update

MessageCounter after the transaction:

  mboxId   value
  George   { "numUnread": 3 }
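In client terms the three steps are a read that records etags, an after-image, and one conditional multi-write; a minimal sketch where client.get and client.put_multi are hypothetical stand-ins for the real API:

  def mark_unread(client, mbox, msg_id):
      # 1. Read, record etags
      doc, etag = client.get(f"/Message/{mbox}/{msg_id}")
      # 2. Prepare the after-image
      doc["unread"] = True
      writes = [
          # conditional write: the recorded etag must still match
          (f"/Message/{mbox}/{msg_id}", doc, etag),
          # partial update: server-side increment of the unread counter
          (f"/MessageCounter/{mbox}", {"numUnread": "+1"}, None),
      ]
      # 3. Update: both keys share the partition key (mbox), so the server
      #    applies them as one transaction; an etag mismatch aborts it
      client.put_multi(writes)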
Espresso Architecture
(Architecture diagrams)
Cluster Management and Fault Tolerance
Generic Cluster Manager: Apache Helix
 Generic cluster management
– State model + constraints
– Ideal state of distribution of partitions across the cluster
– Migrate cluster from current state to ideal state
• More Info
• SoCC 2012
• http://helix.incubator.apache.org
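Espresso’s replicas follow a MasterSlave state model in Helix: states, legal transitions, and per-partition constraints. A minimal sketch of what such a model encodes (illustrative, not the actual Helix API):

  # States, per-partition constraints, and legal transitions
  STATES = ("MASTER", "SLAVE", "OFFLINE")
  MAX_PER_PARTITION = {"MASTER": 1, "SLAVE": 1}  # 2-way replication: 1 master + 1 slave
  TRANSITIONS = {
      ("OFFLINE", "SLAVE"),   # bootstrap or restore from backup
      ("SLAVE", "MASTER"),    # promotion on failover or rebalance
      ("MASTER", "SLAVE"),    # demotion during rebalance
      ("SLAVE", "OFFLINE"),   # drop a replica
  }

  def is_legal(from_state: str, to_state: str) -> bool:
      return (from_state, to_state) in TRANSITIONS

Helix drives each replica through these transitions to move the cluster from its current state to the ideal state.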
Espresso Partition Layout: Master, Slave
 3 Storage Engine nodes, 2-way replication
(Diagram: Apache Helix ideal state – Partition P1 → Node 1, …, Partition P12 → Node 3 – plus each node’s view, e.g. Node 1: M: P1 – Active, …, S: P5 – Active, …)

Cluster layout (Master / Slave):

  Node 1 – Master: P1 P2 P3 P4      Slave: P5 P6 P9 P10
  Node 2 – Master: P5 P6 P7 P8      Slave: P1 P2 P11 P12
  Node 3 – Master: P9 P10 P11 P12   Slave: P3 P4 P7 P8
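A toy version of computing such an ideal state (round-robin masters, slave offset by one node); the real Helix rebalancer additionally minimizes partition movement and honors the state-model constraints:

  def ideal_state(num_partitions=12, nodes=("Node1", "Node2", "Node3")):
      n = len(nodes)
      assignment = {}
      for p in range(1, num_partitions + 1):
          master = nodes[(p - 1) % n]
          slave = nodes[p % n]  # slave always lands on a different node
          assignment[f"P{p}"] = {"MASTER": master, "SLAVE": slave}
      return assignment

This yields a valid two-way layout, though not byte-for-byte the one shown above.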
Cluster Management
Cluster Expansion
Node Failover
Cluster Expansion
 Initial State with 3 Storage Nodes. Step 1: Compute new ideal state
(Diagram: same three-node layout as above; a new, empty Node 4 has joined the cluster)
Cluster Expansion
 Step 2: Bootstrap new node’s partitions by restoring from backups
(Diagram: Node 4 restores partitions P1, P4, P7, P8, P9, P12 from snapshots)
Cluster Expansion
 Step 3: Catch up from live replication stream
(Diagram: Node 4 replays the live replication stream to catch up on its restored partitions)
Cluster Expansion
 Step 4: Migrate masters and slaves to rebalance
(Diagram: selected masters and slaves migrate from Nodes 1–3 to Node 4)
Cluster Expansion
 Partitions are balanced. Router starts sending traffic to new node
(Diagram: Helix ideal state updated; final balanced layout:)

  Node 1 – Master: P1 P2 P3     Slave: P5 P6 P10
  Node 2 – Master: P5 P6 P7     Slave: P2 P11 P12
  Node 3 – Master: P9 P10 P11   Slave: P3 P4 P8
  Node 4 – Master: P4 P8 P12    Slave: P1 P7 P9
Node Failover
• During failure or planned maintenance
(Diagram: the balanced four-node layout; Helix now maps Partition P12 → Node 4, which is master for P4, P8, P12 and slave for P1, P7, P9)
Node Failover
• Step 1: Detect node failure
(Diagram: Node 4 has failed; partitions P4, P8, P12 have lost their masters, and P1, P7, P9 have lost a slave)
Node Failover
• Step 2: Compute new ideal state for promoting slaves to master
(Diagram: the surviving slaves of P4, P8, P12 are promoted to master on Nodes 2 and 3)
Failover Performance
(Chart: failover latency; ~300ms on average with 1024 partitions – see the Performance slides)
Secondary Indexing
Espresso Secondary Indexing
• Local Secondary Index Requirements
• Read after write
• Consistent with primary data under failure
• Rich query support: match, prefix, range, text search
• Cost-to-serve proportional to working set
• Pluggable Index Implementations (interface sketched after this list)
• MySQL B-Tree
• Inverted index using Apache Lucene with MySQL backing store
• Inverted index using Prefix Index
• FastBit-based bitmap index
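The pluggability implies a common index contract; a minimal sketch of such an interface, with hypothetical names rather than Espresso’s actual internal API:

  from abc import ABC, abstractmethod

  class SecondaryIndex(ABC):
      # Contract each implementation (MySQL B-tree, Lucene, prefix, FastBit) fulfills

      @abstractmethod
      def update(self, partition_key, doc_id, before, after):
          # Apply a document change; committed atomically with the primary data
          ...

      @abstractmethod
      def query(self, partition_key, query, start=0, count=10):
          # Match / prefix / range / text-search query scoped to one partition key
          ...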
Lucene-based implementation
• Requires the entire index to be memory-resident to support low-latency query response times
• For the Mailbox application, we have two options
Optimizations for the Lucene-based implementation
• Concurrent transactions on the same Lucene index lead to inconsistency
• Need to acquire a lock
• Opening an index repeatedly is expensive
• Group commit to amortize the index-opening cost
(Diagram: requests 1–5 queue for the index lock; the lock holder commits the whole pending batch in one index write)
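A sketch of the group-commit pattern, assuming a stand-in index handle with apply() and commit() methods (not Lucene’s real API):

  import threading

  class GroupCommitIndex:
      def __init__(self, index):
          self.index = index            # stand-in for the Lucene index writer
          self.lock = threading.Lock()  # single writer at a time
          self.pending = []             # updates queued while a commit is running

      def write(self, update):
          self.pending.append(update)
          with self.lock:
              # Grab everything queued so far; later arrivals wait their turn
              batch, self.pending = self.pending, []
              if not batch:
                  return  # a previous lock holder already committed our update
              for u in batch:
                  self.index.apply(u)   # open the index once for the whole batch
              self.index.commit()

Every caller returns only after its update is committed, but N queued requests cost one index open/commit instead of N.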
Optimizations for the Lucene-based implementation
 High-value users of the site accumulate large mailboxes
– Query performance degrades with a large index
 Performance shouldn’t get worse with more usage!
 Time-partitioned indexes: partition the index into buckets based on creation time
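A sketch of time-based bucket routing; the bucket width is an illustrative assumption:

  BUCKET_MS = 30 * 24 * 3600 * 1000  # e.g., 30-day buckets

  def bucket_for(created_ms: int) -> str:
      # Route a message to an index bucket by its creation time (epoch millis)
      return f"idx-{created_ms // BUCKET_MS}"

Queries walk buckets newest-first and stop once enough results are found, so latency tracks the recent working set rather than total mailbox size.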
Espresso in Production
Espresso in Production
 Unified Social Content Platform – social activity aggregation
 High Read:Write ratio
Espresso in Production
 InMail – allows members to communicate with each other
 Large storage footprint
 Low-latency requirement for secondary index queries involving text search and relational predicates
Performance
 Average failover latency with 1024 partitions is around 300ms
 Primary data reads and writes
 For a single storage node on SSD
 Average row size = 1KB

  Operation   Average Latency   Average Throughput
  Reads       ~3ms              40,000 per second
  Writes      ~6ms              20,000 per second
Performance
 Partition-key level secondary index using Lucene
 One index per mailbox use-case
 Base data on SAS, indexes on SSDs
 Average throughput per index = ~1000 per second (after the group-commit and partitioned-index optimizations)

  Operation                               Average Latency
  Queries (average of 5 indexed fields)   ~20ms
  Writes (around 30 indexed fields)       ~20ms
Durability and Consistency
 Within a Data Center
 Across Data Centers
Durability and Consistency
 Within a Data Center
– Write latency vs. durability
 Asynchronous replication
– May lead to data loss
– Tooling can mitigate some of this
 Semi-synchronous replication
– Wait for at least one relay to acknowledge
– During failover, slaves wait for catch-up
 Consistency over availability
 Helix selects the slave with the least replication lag to take over mastership
 Failover time is ~300ms in practice
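A sketch of that selection rule, with replication progress represented as the highest applied sequence number (the field names are assumptions):

  def pick_new_master(replicas):
      # Choose the live slave with the least replication lag (highest applied SCN)
      live_slaves = [r for r in replicas if r["state"] == "SLAVE" and r["alive"]]
      return max(live_slaves, key=lambda r: r["applied_scn"])

Promoting the most caught-up slave minimizes the catch-up wait before the partition accepts writes again.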
Durability and Consistency
 Across data centers
– Asynchronous replication
– Stale reads possible
– Active-active: Conflict resolution via last-writer-wins
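Last-writer-wins can be sketched as a per-row timestamp comparison; the timestamp field and origin-data-center tie-break are illustrative assumptions:

  def resolve(local, remote):
      # Keep the write with the later timestamp; break ties deterministically
      # so every data center converges on the same winner
      if remote["ts"] != local["ts"]:
          return remote if remote["ts"] > local["ts"] else local
      return remote if remote["origin_dc"] > local["origin_dc"] else local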
Lessons learned
Dealing with transient failures
Planned upgrades
Slave reads
Storage devices – SSDs vs. SAS disks
Scaling cluster management
Future work
Coprocessors
– Synchronous, Asynchronous
Richer query processing
– Group-by, Aggregation
Key Takeaways
Espresso is a timeline-consistent, document-oriented distributed database
Feature rich: secondary indexing, transactions over related documents, seamless integration with the data ecosystem
In production since June 2012, serving several key use-cases
Questions?