Dandelion Hashtable
A non-blocking resizable hashtable
with fast deletes & memory-awareness
Antonios Katsarakis, Vasilis Gavrielatos, Nikos Ntarmos
Huawei Research, Edinburgh UK
High Performance Distributed Computing (HPDC’24)
Beyond billion requests per second
on a commodity server
Ubiquitous in today’s cloud and HPC
caching, in-memory storage, key-value stores,
transactional and analytical DBs, bioinformatics, …
Store large amounts of data
as <Key, Value> (KV) pairs
Offer thread-safe Gets, Puts, Inserts, Deletes
+ index Resizes to grow capacity vertically
Concurrent growing hashtables
2
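To make the interface concrete, here is a minimal sketch of the API surface such hashtables expose (a hypothetical C interface, not any specific library; it assumes the inlined <8B key, 8B value> pairs discussed later in the talk):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct hashtable ht_t;   /* opaque handle; internals vary per design */

/* All operations must be thread-safe: many threads issue them concurrently. */
ht_t *ht_create(uint64_t initial_bins);
bool  ht_get   (ht_t *ht, uint64_t key, uint64_t *val);   /* read a KV pair  */
bool  ht_put   (ht_t *ht, uint64_t key, uint64_t  val);   /* update existing */
bool  ht_insert(ht_t *ht, uint64_t key, uint64_t  val);   /* add a new key   */
bool  ht_delete(ht_t *ht, uint64_t key);                  /* remove a key    */

/* Resize grows capacity "vertically" (more bins on the same server); it is
 * typically triggered internally by Inserts once the index fills up. */
void  ht_resize(ht_t *ht, uint64_t new_num_bins);
```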
Keep data in-memory
Ensure strong consistency
Exploit concurrency to meet the
high-throughput needs of modern services
Modern hashtables
3
How fast is the state-of-the-art?
Throughput of almost a billion requests per second!
Problem: such high throughput is reached only on cache-resident workloads,
where accesses are served by h/w caches and seldom reach main memory
– e.g., when evaluating with high access skew or small datasets
4
State-of-the-art performance
What about larger (i.e. memory-resident) workloads?
Despite sacrificing functionality for performance, state-of-the-art designs cannot
maximize throughput on memory-resident workloads
Two issues
1. Concurrent but practically blocking
→ Lock-free Gets, but blocking on several other occasions
2. In-memory but not memory-aware
→ Inefficient handling of memory accesses/utilization
Deficiencies of state-of-the-art
5
Can you elaborate?
A closer look at state-of-the-art designs
6
For simplicity, we assume <8B key, 8B value> KV pairs and efficient concurrent designs that can inline them in their index
Open addressing
1. Hash(key) to a slot
2. Probe the following slots until the key or an empty slot is found
- Good occupancy
- Low memory traffic: no pointer chasing
- Deletes = tombstones → cannot free slots unless all ops block & KVs are copied to a new index

Closed addressing
1. Hash(key) to a Bin
2. Search the Bin for the key
→ One-to-one relation between a key and a Bin
Two flavors:
- KV-chaining: easy deletes (remove from the linked list), good occupancy, but a lot of memory traffic: too much pointer chasing!
- Single-cacheline Bins: easy deletes (used slots tracked in the HDR), low memory traffic (no pointer chasing), but bad occupancy: must resize early, after 4 collisions on any Bin!
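To make the two schemes concrete, here is a minimal lookup sketch in C (hypothetical structures and names, assuming the inlined <8B key, 8B value> layout above; not the code of any specific hashtable):

```c
#include <stdint.h>
#include <stdbool.h>

#define EMPTY_KEY     0     /* assumed reserved key value marking a free slot */
#define SLOTS_PER_BIN 4     /* e.g., one 64B cacheline of inlined 16B KVs     */

typedef struct { uint64_t key, val; } kv_t;

/* Toy 64-bit mixer standing in for the hash function. */
static uint64_t hash(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

/* Open addressing: hash to a slot, then probe the following slots. */
bool oa_get(kv_t *slots, uint64_t num_slots, uint64_t key, uint64_t *val) {
    for (uint64_t i = 0; i < num_slots; i++) {
        kv_t *s = &slots[(hash(key) + i) % num_slots];
        if (s->key == key)       { *val = s->val; return true; }
        if (s->key == EMPTY_KEY) return false;    /* empty slot ends the probe */
    }
    return false;
}

/* Closed addressing: hash to a bin, search only that bin (one-to-one key->bin). */
typedef struct { kv_t slot[SLOTS_PER_BIN]; } bin_t;

bool ca_get(bin_t *bins, uint64_t num_bins, uint64_t key, uint64_t *val) {
    bin_t *b = &bins[hash(key) % num_bins];
    for (int i = 0; i < SLOTS_PER_BIN; i++)
        if (b->slot[i].key == key) { *val = b->slot[i].val; return true; }
    return false;
}
```

Deletes are where the schemes diverge: a closed-addressing bin can simply mark a slot free, whereas open addressing must leave a tombstone so later probes do not stop early.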
Challenging tradeoff … is that all impacting performance?
Issue: CPU stalls on each request
7
Regardless of the scheme, most hashtables stall the CPU on every request
(Get, Put, Insert, Delete), waiting for one or more memory accesses
(Figure: Insert(X, 3), Get(C), Delete(D) served one after another, each waiting on memory)
Cannot overlap memory accesses → each request blocks the CPU waiting for memory!
Another issue: resize is also blocking
8
Hashtables that optimize for memory accesses and the common case (i.e., when not resizing)
stall all requests until all KV pairs have been copied to the new index whenever they resize.
(Figure: Insert(X, 3), Get(C), Delete(D) all blocked until the resize finishes)
Memory-resident hashtables = GBs of data → blocking for seconds!
To recap …
Fastest concurrent in-memory hashtables
9
Sacrifice functionality or rely on blocking ops + are not memory-aware
To resolve these issues we designed the Dandelion Hashtable …
Highly concurrent + cache-friendly → scalable to many threads
Non-blocking index operations: Gets, Puts, Inserts, Deletes – that immediately free slots
Non-blocking parallel resizes: concurrent ops complete with the strongest consistency while other threads resize
Memory-aware:
- good occupancy
- min memory accesses: ~1x access per request
- masked memory latency: to avoid CPU stalling
Beyond fully-featured:
transactions, iterators, namespaces, variable-sized KVs, GC, snapshots, etc.
Dandelion Hashtable (DLHT)
10
What’s the secret sauce?
DLHT’s bounded cacheline chaining
11
1. No open addressing, to avoid severe blocking on deletes
2. Rethinks closed addressing:
Bounded cacheline-chaining = best of KV-chaining + Single-cacheline Bins
- Cacheline Bins that can link up to 3 extra buckets from the Link array
→ delays resize for high occupancy
- Link array << Bin array (e.g. 4x or 8x smaller)
Most bins consist of one bucket → most requests = 1 memory access
- If all Link-array buckets are linked (no free buckets left) → non-blocking resize (details next)
(Recap figure: the closed- vs. open-addressing tradeoff between occupancy and memory traffic)
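As an illustration, here is one bin/bucket layout that is consistent with the numbers on these slides (a sketch under assumed field sizes, not the exact DLHT definitions):

```c
#include <stdint.h>
#include <stdatomic.h>

/* Assumed layout: 64B buckets of inlined <8B key, 8B value> pairs, a 16B
 * compact header in the first bucket of each bin, and up to 3 extra buckets
 * borrowed from a smaller Link array -> 3 + 3*4 = 15 slots per bin. */

typedef struct {
    uint64_t key;
    uint64_t val;
} kv_t;                                       /* 16B inlined KV pair */

typedef struct {
    _Atomic uint64_t sync;     /* <bin state, 15 slot states, version> in 8B  */
    uint32_t idx_1;            /* Link-array offset of the 1st linked bucket  */
    uint32_t idx_2_and_3;      /* offset of the 2nd; the 3rd is its neighbor  */
    kv_t     slots[3];         /* remaining 48B of the head cacheline         */
} __attribute__((aligned(64))) head_bucket_t; /* one 64B cacheline            */

typedef struct {
    kv_t slots[4];             /* a full 64B cacheline of inlined KVs         */
} __attribute__((aligned(64))) link_bucket_t;

typedef struct {
    head_bucket_t *bins;       /* Bin array: num_bins entries                 */
    link_bucket_t *links;      /* Link array: several times smaller (4x-8x)   */
    uint64_t       num_bins;
} dlht_index_t;
```

Because the Link array is much smaller than the Bin array, most bins never link an extra bucket, so a typical request touches a single cacheline; a fully linked bin still reaches 15 slots before a resize is needed.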
Compact bin header
12
3. Compact Bin Header (HDR): consists of <8B sync hdr, 8B link meta>
- Link meta: 2x 32bit offsets to Link array
idx_1: 1st linked bucket
idx_2_and_3: 2nd and 3rd linked buckets (consecutive in the Link array)
- Fitting all synchronization state in 8B <bin state, 15 slot states, version>
enables lock-free index operations (Get, Put, Insert, Delete)
and practically non-blocking Resizes
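For illustration, a hypothetical packing of the 8B sync header and a CAS-style update (the field widths, names, and slot-state encoding are assumptions; the actual DLHT encoding may differ):

```c
#include <stdint.h>
#include <stdatomic.h>

/* Assumed bit layout of the 8B sync header:
 *   bits  0-1  : bin state (NoTransfer / InTransfer / TransferDone)
 *   bits  2-31 : 15 slot states, 2 bits each (e.g., free / used / reserved)
 *   bits 32-63 : 32-bit version, bumped on every successful update          */
enum bin_state { NO_TRANSFER = 0, IN_TRANSFER = 1, TRANSFER_DONE = 2 };

static inline unsigned get_bin_state(uint64_t sync)         { return sync & 0x3; }
static inline unsigned get_slot_state(uint64_t sync, int i) { return (sync >> (2 + 2 * i)) & 0x3; }
static inline uint64_t bump_version(uint64_t sync)          { return sync + (1ULL << 32); }

/* Lock-free-style update: read the sync word, prepare a new value with one
 * slot state changed and the version bumped, and publish it with a single
 * CAS. Since all of a bin's synchronization state lives in these 8 bytes,
 * one CAS suffices to make an index operation visible atomically.          */
int try_set_slot_state(_Atomic uint64_t *sync_hdr, int slot, unsigned new_state)
{
    uint64_t old = atomic_load_explicit(sync_hdr, memory_order_acquire);
    if (get_bin_state(old) != NO_TRANSFER)
        return 0;                                 /* bin is being resized     */
    uint64_t mask  = 0x3ULL << (2 + 2 * slot);
    uint64_t fresh = (old & ~mask) | ((uint64_t)new_state << (2 + 2 * slot));
    fresh = bump_version(fresh);
    return atomic_compare_exchange_strong(sync_hdr, &old, fresh);
}
```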
See paper for the lock-free index ops … Let’s look at resizing together!
Index resize: triggered by Inserts if 1) all 15 slots of a bin are filled or 2) no free buckets to link.
Parallel resize: the index is broken into chunks (e.g. 16K bins); resize threads collaborate with minimal synchronization.
Each resizing thread picks a not-yet-transferred chunk to transfer.
Non-blocking ops: instead of blocking all ops for the whole resize (GBs of data), ops wait for at most one (the related) bin transfer.
If another Insert needs a resize, it joins the resizing effort.
Recall that each bin has a bin state in its sync meta → NoTransfer, InTransfer, or TransferDone.
To transfer a chunk a thread transfers all its Bins one-by-one:
1) bin_state = InTransfer 2) transfer all bin’s slots to new index 3) bin_state = TransferDone
Ops always check the bin state: if NoTransfer, they proceed in the current/old index.
If InTransfer, they wait until TransferDone and then perform their op in the new index.
Practically non-blocking resize
13
(Figure: an Insert(X, 3) finds no free link buckets and triggers a resize; concurrent Gets and Inserts consult the bin state, which moves from NoTransfer to TransferDone)
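A minimal sketch of this protocol (building on the hypothetical types and sync-word helpers from the earlier sketches; the helper routines below are assumed, not DLHT's actual functions):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdatomic.h>

/* Assumed from the earlier sketches: dlht_index_t, head_bucket_t, kv_t,
 * hash(), get_bin_state(), and the bin-state constants. Assumed helpers: */
void set_bin_state(_Atomic uint64_t *sync, unsigned state);
bool read_slot(head_bucket_t *bin, int slot, kv_t *out);
void insert_new_index(dlht_index_t *new_idx, uint64_t key, uint64_t val);
bool lookup_bin(dlht_index_t *idx, head_bucket_t *bin, uint64_t key, uint64_t *val);
bool lookup_new_index(dlht_index_t *new_idx, uint64_t key, uint64_t *val);

#define CHUNK_BINS     16384   /* e.g., 16K bins per chunk, as on the slide */
#define DLHT_BIN_SLOTS 15

/* One resize thread transferring the chunk it claimed, bin by bin. */
void transfer_chunk(dlht_index_t *old_idx, dlht_index_t *new_idx, uint64_t first_bin)
{
    for (uint64_t b = first_bin; b < first_bin + CHUNK_BINS; b++) {
        head_bucket_t *bin = &old_idx->bins[b];
        set_bin_state(&bin->sync, IN_TRANSFER);       /* 1) announce the transfer */
        for (int s = 0; s < DLHT_BIN_SLOTS; s++) {    /* 2) move every used slot  */
            kv_t kv;
            if (read_slot(bin, s, &kv))
                insert_new_index(new_idx, kv.key, kv.val);
        }
        set_bin_state(&bin->sync, TRANSFER_DONE);     /* 3) release any waiters   */
    }
}

/* A concurrent Get waits, at most, for its own bin to finish transferring. */
bool dlht_get(dlht_index_t *old_idx, dlht_index_t *new_idx, uint64_t key, uint64_t *val)
{
    head_bucket_t *bin = &old_idx->bins[hash(key) % old_idx->num_bins];
    if (get_bin_state(atomic_load(&bin->sync)) == NO_TRANSFER)
        return lookup_bin(old_idx, bin, key, val);    /* common case: old index   */
    while (get_bin_state(atomic_load(&bin->sync)) != TRANSFER_DONE)
        ;                                             /* spin for just this bin   */
    return lookup_new_index(new_idx, key, val);       /* then serve the new index */
}
```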
Cool! What about the issue of CPU stalling on memory accesses?
14
Recall the issue: the CPU stalls on every request (Insert(X, 3), Get(C), Delete(D)),
waiting for one or more memory accesses
4. Pipelined batched processing
Since DLHT minimizes memory accesses, its batched API lets us easily
pipeline requests + exploit software prefetching
→ overlap memory accesses to bins – while ensuring in-order request completion
(Figure: a batch of Insert(X, 3), Get(C), Delete(D) with overlapping memory accesses and in-order completion)
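A minimal sketch of what a batched, software-prefetched lookup loop could look like (a hypothetical API built on the earlier sketches; __builtin_prefetch is the GCC/Clang prefetch intrinsic):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <assert.h>

#define MAX_BATCH 16                        /* assumed batch size */

typedef struct { uint64_t key, val; bool found; } req_t;

/* Assumed from the earlier sketches: dlht_index_t, head_bucket_t, hash(),
 * and a single-bin lookup helper. */
bool lookup_bin(dlht_index_t *idx, head_bucket_t *bin, uint64_t key, uint64_t *val);

void dlht_get_batch(dlht_index_t *idx, req_t *reqs, size_t n)
{
    head_bucket_t *bins[MAX_BATCH];
    assert(n <= MAX_BATCH);

    /* Stage 1: compute every request's bin and issue a software prefetch,
     * so the batch's memory accesses overlap instead of being serialized. */
    for (size_t i = 0; i < n; i++) {
        bins[i] = &idx->bins[hash(reqs[i].key) % idx->num_bins];
        __builtin_prefetch(bins[i], 0 /* read */, 3 /* keep in cache */);
    }

    /* Stage 2: process the requests in submission order; by now each bin's
     * cacheline is (ideally) already in flight or resident, so the CPU
     * rarely stalls waiting for DRAM. */
    for (size_t i = 0; i < n; i++)
        reqs[i].found = lookup_bin(idx, bins[i], reqs[i].key, &reqs[i].val);
}
```

A deeper pipeline would interleave more independent work between the prefetch and the access to fully hide DRAM latency, but the essence is the same: issue all bin addresses early and complete requests in order.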
Reality check: Do all these translate into performance?
Performance across the board
(Chart: DLHT vs. the fastest closed addressing: 3.5x; vs. the fastest open addressing: 1.7x, with Ins/Del not supported there; vs. open addressing with dels: 12x; vs. the fastest growing hashtable: 3.9x)
Commodity server: 18-core Intel Xeon Gold 6254 (2 sockets), 8x 32GB DDR4-2933
Workload: 4GB memory-resident, random access
See paper for many more results!
To conclude
State-of-the-art concurrent in-memory hashtables
Sacrifice core functionality for performance
→ still practically blocking + not memory-aware
DLHT
1. No open addressing → no severe blocking on deletes
2. Closed addressing via bounded cacheline chaining
→ High occupancy + most requests: 1 memory access
3. Memory-compact bin header
→ Lock-free operations + parallel non-blocking resizes
4. Pipelined batched request processing
→ overlap memory accesses + in-order completion
5. Beyond fully-featured (see paper): hashset, transactions,
namespaces, GC, variable sizes, iterators, snapshots, etc.
Memory-resident workload performance
→ Surpassing 1.6 billion in-memory rps
→ 3.5x Gets (12x Dels) vs. fastest closed (open) addressing
→ 3.9x faster resizes vs. fastest growing hashtable
Thank you!! Questions?
Backup Slides
1
Get performance
18
Commodity server: 18-core Intel Xeon Gold 6254 (2 sockets), 8x 32GB DDR4-2933
Workload: >4GB random memory-resident
(Chart: 1.7x vs. the fastest open-addressing and 3.5x vs. the fastest closed-addressing design)
Memory b/w bound → more DIMMs = higher throughput
Insert-delete performance
19
(Chart: 2.9x and 12x speedups for Inserts/Deletes)
Inserting 800M keys (growing index)
20
(Chart: 3.9x faster than the fastest growing hashtable)
21
DLHT recap
Reality check: Do all these translate into performance?
1. Eschews open addressing → no severe blocking on deletes
2. Improved closed addressing via bounded cacheline chaining
→ Most bins: 1 bucket = most requests: 1 memory access
→ Can link up to 3 extra buckets (15 slots) for high occupancy
3. Memory-compact bin header
→ Lock-free operations: 8B sync header fits all needed state
→ Sync header incl. bin state to enable parallel non-blocking resizes
4. Pipelined batched request processing
→ Exploits software prefetching to overlap memory accesses
→ Guarantees in-order request completion
5. Many extra features (see paper): hashset, transactions, namespaces, GC,
variable-sized keys, single-threaded optimizations, iterators, snapshots, etc.
Hash Joins
(Chart; y-axis: million requests/sec)
See paper for more results!
YCSB – Workload Mix
Multi-key Transactions
Fault-tolerant Dist. Transactions
(TATP, 5 servers, 3-way replication)
Lightning-fast use-cases (single-node and distributed)
- Fast map for single-threaded apps
- Single-node in-memory storage
- DB lock manager
- Single-node OLTP transactions
- In-memory caching
- OLAP (joins, aggregation, etc.)
- Large slower memory: CXL/NVM/far memory
- Remote KVS
- Distributed reliable transactions
- Replicated in-memory storage
Performance
(Chart; y-axis: million requests/sec)
Takeaway: writes can block reads → good write performance is crucial even at a low write ratio!
See paper for more results!