Dandelion Hashtable
A non-blocking resizable hashtable
with fast deletes & memory-awareness
Antonios Katsarakis, Vasilis Gavrielatos, Nikos Ntarmos
Huawei Research, Edinburgh UK
High Performance Distributed Computing (HPDC’24)
Beyond billion requests per second
on a commodity server
Ubiquitous in today’s cloud and HPC
caching, in-memory storage, key-value stores,
transactional and analytical DBs, bioinformatics, …
Store large amounts of data
as <Key, Value> (KV) pairs
Offer thread-safe Gets, Puts, Inserts, Deletes
+ index Resizes to grow capacity vertically
Concurrent growing hashtables
2
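To make the interface concrete, here is a minimal sketch of the API surface such hashtables expose (a hypothetical C interface, not any specific library; it assumes the inlined <8B key, 8B value> pairs discussed later in the talk):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct hashtable ht_t;   /* opaque handle; internals vary per design */

/* All operations must be thread-safe: many threads issue them concurrently. */
ht_t *ht_create(uint64_t initial_bins);
bool  ht_get   (ht_t *ht, uint64_t key, uint64_t *val);   /* read a KV pair  */
bool  ht_put   (ht_t *ht, uint64_t key, uint64_t  val);   /* update existing */
bool  ht_insert(ht_t *ht, uint64_t key, uint64_t  val);   /* add a new key   */
bool  ht_delete(ht_t *ht, uint64_t key);                  /* remove a key    */

/* Resize grows capacity "vertically" (more bins on the same server); it is
 * typically triggered internally by Inserts once the index fills up. */
void  ht_resize(ht_t *ht, uint64_t new_num_bins);
```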
Keep data in-memory
Ensure strong consistency
Exploit concurrency to meet the
high-throughput needs of modern services
Modern hashtables
3
How fast is the state-of-the-art?
Throughput of almost a billion requests per second!
Problem: such high throughput is reached only on cache-resident workloads,
where accesses are served by h/w caches and seldom reach main memory
– e.g., when evaluating with high access skew or small datasets
4
State-of-the-art performance
What about larger (i.e. memory-resident) workloads?
Despite sacrificing functionality for performance, state-of-the-art designs cannot
maximize throughput on memory-resident workloads
Two issues
1. Concurrent but practically blocking
→ Lock-free Gets, but blocking on several other occasions
2. In-memory but not memory-aware
→ Inefficient handling of memory accesses/utilization
Deficiencies of state-of-the-art
5
Can you elaborate?
A closer look at state-of-the-art designs
6
For simplicity, we assume <8B key, 8B value> KV pairs and efficient concurrent designs that can inline them in their index
Open addressing
1. Hash(key) to a slot
2. Probe the following slots until the key or an empty slot is found
- Good occupancy
- Low memory traffic: no pointer chasing
- Deletes = tombstones → cannot free slots unless all ops block & KVs are copied to a new index

Closed addressing
1. Hash(key) to a Bin
2. Search the Bin for the key
→ One-to-one relation between a key and a Bin
Two flavors:
- KV-chaining: easy deletes (remove from the linked list), good occupancy, but a lot of memory traffic: too much pointer chasing!
- Single-cacheline Bins: easy deletes (used slots tracked in the HDR), low memory traffic (no pointer chasing), but bad occupancy: must resize early, after 4 collisions on any Bin!
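To make the two schemes concrete, here is a minimal lookup sketch in C (hypothetical structures and names, assuming the inlined <8B key, 8B value> layout above; not the code of any specific hashtable):

```c
#include <stdint.h>
#include <stdbool.h>

#define EMPTY_KEY     0     /* assumed reserved key value marking a free slot */
#define SLOTS_PER_BIN 4     /* e.g., one 64B cacheline of inlined 16B KVs     */

typedef struct { uint64_t key, val; } kv_t;

/* Toy 64-bit mixer standing in for the hash function. */
static uint64_t hash(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

/* Open addressing: hash to a slot, then probe the following slots. */
bool oa_get(kv_t *slots, uint64_t num_slots, uint64_t key, uint64_t *val) {
    for (uint64_t i = 0; i < num_slots; i++) {
        kv_t *s = &slots[(hash(key) + i) % num_slots];
        if (s->key == key)       { *val = s->val; return true; }
        if (s->key == EMPTY_KEY) return false;    /* empty slot ends the probe */
    }
    return false;
}

/* Closed addressing: hash to a bin, search only that bin (one-to-one key->bin). */
typedef struct { kv_t slot[SLOTS_PER_BIN]; } bin_t;

bool ca_get(bin_t *bins, uint64_t num_bins, uint64_t key, uint64_t *val) {
    bin_t *b = &bins[hash(key) % num_bins];
    for (int i = 0; i < SLOTS_PER_BIN; i++)
        if (b->slot[i].key == key) { *val = b->slot[i].val; return true; }
    return false;
}
```

Deletes are where the schemes diverge: a closed-addressing bin can simply mark a slot free, whereas open addressing must leave a tombstone so later probes do not stop early.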
Challenging tradeoff … is that all impacting performance?
Issue: CPU stalls on each request
7
Regardless of the scheme, most hashtables stall the CPU on every request
(Get, Put, Insert, Delete), waiting for one or more memory accesses
(Figure: Insert(X, 3), Get(C), Delete(D) served one after another, each waiting on memory)
Cannot overlap memory accesses → each request blocks the CPU waiting for memory!
Another issue: resize is also blocking
8
Hashtables that optimize for memory accesses and the common case (i.e., when not resizing)
stall all requests until all KV pairs have been copied to the new index whenever they resize.
(Figure: Insert(X, 3), Get(C), Delete(D) all blocked until the resize finishes)
Memory-resident hashtables = GBs of data → blocking for seconds!
To recap …
Fastest concurrent in-memory hashtables
9
Sacrifice functionality or rely on blocking ops + are not memory-aware
To resolve these issues we designed the Dandelion Hashtable …
Highly concurrent + cache-friendly → scalable to many threads
Non-blocking index operations: Gets, Puts, Inserts, Deletes – that immediately free slots
Non-blocking parallel resizes: concurrent ops complete with the strongest consistency while other threads resize
Memory-aware:
- good occupancy
- min memory accesses: ~1x access per request
- masked memory latency: to avoid CPU stalling
Beyond fully-featured:
transactions, iterators, namespaces, variable-sized KVs, GC, snapshots, etc.
Dandelion Hashtable (DLHT)
10
What’s the secret sauce?
DLHT’s bounded cacheline chaining
11
1. No open addressing, to avoid severe blocking on deletes
2. Rethinks closed addressing:
Bounded cacheline-chaining = best of KV-chaining + Single-cacheline Bins
- Cacheline Bins that can link up to 3 extra buckets from the Link array
→ delays resize for high occupancy
- Link array << Bin array (e.g. 4x or 8x smaller)
Most bins consist of one bucket → most requests = 1 memory access
- If all Link-array buckets are linked (no free buckets left) → non-blocking resize (details next)
(Recap figure: the closed- vs. open-addressing tradeoff between occupancy and memory traffic)
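As an illustration, here is one bin/bucket layout that is consistent with the numbers on these slides (a sketch under assumed field sizes, not the exact DLHT definitions):

```c
#include <stdint.h>
#include <stdatomic.h>

/* Assumed layout: 64B buckets of inlined <8B key, 8B value> pairs, a 16B
 * compact header in the first bucket of each bin, and up to 3 extra buckets
 * borrowed from a smaller Link array -> 3 + 3*4 = 15 slots per bin. */

typedef struct {
    uint64_t key;
    uint64_t val;
} kv_t;                                       /* 16B inlined KV pair */

typedef struct {
    _Atomic uint64_t sync;     /* <bin state, 15 slot states, version> in 8B  */
    uint32_t idx_1;            /* Link-array offset of the 1st linked bucket  */
    uint32_t idx_2_and_3;      /* offset of the 2nd; the 3rd is its neighbor  */
    kv_t     slots[3];         /* remaining 48B of the head cacheline         */
} __attribute__((aligned(64))) head_bucket_t; /* one 64B cacheline            */

typedef struct {
    kv_t slots[4];             /* a full 64B cacheline of inlined KVs         */
} __attribute__((aligned(64))) link_bucket_t;

typedef struct {
    head_bucket_t *bins;       /* Bin array: num_bins entries                 */
    link_bucket_t *links;      /* Link array: several times smaller (4x-8x)   */
    uint64_t       num_bins;
} dlht_index_t;
```

Because the Link array is much smaller than the Bin array, most bins never link an extra bucket, so a typical request touches a single cacheline; a fully linked bin still reaches 15 slots before a resize is needed.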
Compact bin header
12
3. Compact Bin Header (HDR): consists of <8B sync hdr, 8B link meta>
- Link meta: 2x 32bit offsets to Link array
idx_1: 1st linked bucket
idx_2_and_3: 2nd and 3rd linked buckets (consecutive in the Link array)
- Fitting all synchronization state in 8B <bin state, 15 slot states, version>
enables lock-free index operations (Get, Put, Insert, Delete)
and practically non-blocking Resizes
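For illustration, a hypothetical packing of the 8B sync header and a CAS-style update (the field widths, names, and slot-state encoding are assumptions; the actual DLHT encoding may differ):

```c
#include <stdint.h>
#include <stdatomic.h>

/* Assumed bit layout of the 8B sync header:
 *   bits  0-1  : bin state (NoTransfer / InTransfer / TransferDone)
 *   bits  2-31 : 15 slot states, 2 bits each (e.g., free / used / reserved)
 *   bits 32-63 : 32-bit version, bumped on every successful update          */
enum bin_state { NO_TRANSFER = 0, IN_TRANSFER = 1, TRANSFER_DONE = 2 };

static inline unsigned get_bin_state(uint64_t sync)         { return sync & 0x3; }
static inline unsigned get_slot_state(uint64_t sync, int i) { return (sync >> (2 + 2 * i)) & 0x3; }
static inline uint64_t bump_version(uint64_t sync)          { return sync + (1ULL << 32); }

/* Lock-free-style update: read the sync word, prepare a new value with one
 * slot state changed and the version bumped, and publish it with a single
 * CAS. Since all of a bin's synchronization state lives in these 8 bytes,
 * one CAS suffices to make an index operation visible atomically.          */
int try_set_slot_state(_Atomic uint64_t *sync_hdr, int slot, unsigned new_state)
{
    uint64_t old = atomic_load_explicit(sync_hdr, memory_order_acquire);
    if (get_bin_state(old) != NO_TRANSFER)
        return 0;                                 /* bin is being resized     */
    uint64_t mask  = 0x3ULL << (2 + 2 * slot);
    uint64_t fresh = (old & ~mask) | ((uint64_t)new_state << (2 + 2 * slot));
    fresh = bump_version(fresh);
    return atomic_compare_exchange_strong(sync_hdr, &old, fresh);
}
```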
See paper for the lock-free index ops … Let’s look at resizing together!
Index resize: triggered by Inserts if 1) all 15 slots of a bin are filled or 2) no free buckets to link.
Parallel resize: the index is broken into chunks (e.g. 16K bins); resize threads collaborate with minimal synchronization.
Each resizing thread picks a not-yet-transferred chunk to transfer.
Non-blocking ops: instead of blocking all ops for the whole resize (GBs of data), ops wait for at most one (the related) bin transfer.
If another Insert needs a resize, it joins the resizing effort.
Recall that each bin has a bin state in its sync meta → NoTransfer, InTransfer, or TransferDone.
To transfer a chunk a thread transfers all its Bins one-by-one:
1) bin_state = InTransfer 2) transfer all bin’s slots to new index 3) bin_state = TransferDone
Ops always check the bin state: if NoTransfer, they proceed in the current/old index.
If InTransfer, they wait until TransferDone and then perform their op in the new index.
Practically non-blocking resize
13
(Figure: an Insert(X, 3) finds no free link buckets and triggers a resize; concurrent Gets and Inserts consult the bin state, which moves from NoTransfer to TransferDone)
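A minimal sketch of this protocol (building on the hypothetical types and sync-word helpers from the earlier sketches; the helper routines below are assumed, not DLHT's actual functions):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdatomic.h>

/* Assumed from the earlier sketches: dlht_index_t, head_bucket_t, kv_t,
 * hash(), get_bin_state(), and the bin-state constants. Assumed helpers: */
void set_bin_state(_Atomic uint64_t *sync, unsigned state);
bool read_slot(head_bucket_t *bin, int slot, kv_t *out);
void insert_new_index(dlht_index_t *new_idx, uint64_t key, uint64_t val);
bool lookup_bin(dlht_index_t *idx, head_bucket_t *bin, uint64_t key, uint64_t *val);
bool lookup_new_index(dlht_index_t *new_idx, uint64_t key, uint64_t *val);

#define CHUNK_BINS     16384   /* e.g., 16K bins per chunk, as on the slide */
#define DLHT_BIN_SLOTS 15

/* One resize thread transferring the chunk it claimed, bin by bin. */
void transfer_chunk(dlht_index_t *old_idx, dlht_index_t *new_idx, uint64_t first_bin)
{
    for (uint64_t b = first_bin; b < first_bin + CHUNK_BINS; b++) {
        head_bucket_t *bin = &old_idx->bins[b];
        set_bin_state(&bin->sync, IN_TRANSFER);       /* 1) announce the transfer */
        for (int s = 0; s < DLHT_BIN_SLOTS; s++) {    /* 2) move every used slot  */
            kv_t kv;
            if (read_slot(bin, s, &kv))
                insert_new_index(new_idx, kv.key, kv.val);
        }
        set_bin_state(&bin->sync, TRANSFER_DONE);     /* 3) release any waiters   */
    }
}

/* A concurrent Get waits, at most, for its own bin to finish transferring. */
bool dlht_get(dlht_index_t *old_idx, dlht_index_t *new_idx, uint64_t key, uint64_t *val)
{
    head_bucket_t *bin = &old_idx->bins[hash(key) % old_idx->num_bins];
    if (get_bin_state(atomic_load(&bin->sync)) == NO_TRANSFER)
        return lookup_bin(old_idx, bin, key, val);    /* common case: old index   */
    while (get_bin_state(atomic_load(&bin->sync)) != TRANSFER_DONE)
        ;                                             /* spin for just this bin   */
    return lookup_new_index(new_idx, key, val);       /* then serve the new index */
}
```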
Cool! What about the issue of CPU stalling on memory accesses?
14
Recall the issue: the CPU stalls on every request (Insert(X, 3), Get(C), Delete(D)),
waiting for one or more memory accesses
4. Pipelined batched processing
Since DLHT minimizes memory accesses, its batched API lets us easily
pipeline requests + exploit software prefetching
→ overlap memory accesses to bins – while ensuring in-order request completion
(Figure: a batch of Insert(X, 3), Get(C), Delete(D) with overlapping memory accesses and in-order completion)
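A minimal sketch of what a batched, software-prefetched lookup loop could look like (a hypothetical API built on the earlier sketches; __builtin_prefetch is the GCC/Clang prefetch intrinsic):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <assert.h>

#define MAX_BATCH 16                        /* assumed batch size */

typedef struct { uint64_t key, val; bool found; } req_t;

/* Assumed from the earlier sketches: dlht_index_t, head_bucket_t, hash(),
 * and a single-bin lookup helper. */
bool lookup_bin(dlht_index_t *idx, head_bucket_t *bin, uint64_t key, uint64_t *val);

void dlht_get_batch(dlht_index_t *idx, req_t *reqs, size_t n)
{
    head_bucket_t *bins[MAX_BATCH];
    assert(n <= MAX_BATCH);

    /* Stage 1: compute every request's bin and issue a software prefetch,
     * so the batch's memory accesses overlap instead of being serialized. */
    for (size_t i = 0; i < n; i++) {
        bins[i] = &idx->bins[hash(reqs[i].key) % idx->num_bins];
        __builtin_prefetch(bins[i], 0 /* read */, 3 /* keep in cache */);
    }

    /* Stage 2: process the requests in submission order; by now each bin's
     * cacheline is (ideally) already in flight or resident, so the CPU
     * rarely stalls waiting for DRAM. */
    for (size_t i = 0; i < n; i++)
        reqs[i].found = lookup_bin(idx, bins[i], reqs[i].key, &reqs[i].val);
}
```

A deeper pipeline would interleave more independent work between the prefetch and the access to fully hide DRAM latency, but the essence is the same: issue all bin addresses early and complete requests in order.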
Reality check: Do all these translate into performance?
Performance across the board
(Chart: DLHT vs. the fastest closed addressing: 3.5x; vs. the fastest open addressing: 1.7x, with Ins/Del not supported there; vs. open addressing with dels: 12x; vs. the fastest growing hashtable: 3.9x)
Commodity server: 18-core Intel Xeon Gold 6254 (2 sockets), 8x 32GB DDR4-2933
Workload: 4GB memory-resident, random access
See paper for many more results!
To conclude
State-of-the-art concurrent in-memory hashtables
Sacrifice core functionality for performance
→ still practically blocking + not memory-aware
DLHT
1. No open addressing → no severe blocking on deletes
2. Closed addressing via bounded cacheline chaining
→ High occupancy + most requests: 1 memory access
3. Memory-compact bin header
→ Lock-free operations + parallel non-blocking resizes
4. Pipelined batched request processing
→ overlap memory accesses + in-order completion
5. Beyond fully-featured (see paper): hashset, transactions,
namespaces, GC, variable sizes, iterators, snapshots, etc.
Memory-resident workload performance
→ Surpassing 1.6 billion in-memory rps
→ 3.5x Gets (12x Dels) vs. fastest closed (open) addressing
→ 3.9x faster resizes vs. fastest growing hashtable
Thank you!! Questions?
Backup Slides
1
Get performance
18
Commodity server: 18-core Intel Xeon Gold 6254 (2 sockets), 8x 32GB DDR4-2933
Workload: >4GB random memory-resident
(Chart: 1.7x vs. the fastest open-addressing and 3.5x vs. the fastest closed-addressing design)
Memory b/w bound → more DIMMs = higher throughput
Insert-delete performance
19
(Chart: 2.9x and 12x speedups for Inserts/Deletes)
Inserting 800M keys (growing index)
20
(Chart: 3.9x faster than the fastest growing hashtable)
21
DLHT recap
Reality check: Do all these translate into performance?
1. Eschews open addressing → no severe blocking on deletes
2. Improved closed addressing via bounded cacheline chaining
→ Most bins: 1 bucket = most requests: 1 memory access
→ Can link up to 3 extra buckets (15 slots) for high occupancy
3. Memory-compact bin header
→ Lock-free operations: 8B sync header fits all needed state
→ Sync header incl. bin state to enable parallel non-blocking resizes
4. Pipelined batched request processing
→ Exploits software prefetching to overlap memory accesses
→ Guarantees in-order request completion
5. Many extra features (see paper): hashset, transactions, namespaces, GC,
variable-sized keys, single-threaded optimizations, iterators, snapshots, etc.
Hash Joins
(Chart; y-axis: million requests/sec)
See paper for more results!
YCSB – Workload Mix
Multi-key Transactions
Fault-tolerant Dist. Transactions
(TATP, 5 servers, 3-way replication)
Lightning-fast use-cases (single-node and distributed)
- Fast map for single-threaded apps
- Single-node in-memory storage
- DB lock manager
- Single-node OLTP transactions
- In-memory caching
- OLAP (joins, aggregation, etc.)
- Large slower memory: CXL/NVM/far memory
- Remote KVS
- Distributed reliable transactions
- Replicated in-memory storage
Performance
(Chart; y-axis: million requests/sec)
Takeaway: writes can block reads → good write performance is crucial even at a low write ratio!
See paper for more results!