Learn about Agoda's performance tuning strategies for ScyllaDB. Worakarn shares how they optimized disk performance, fine-tuned compaction strategies, and adjusted SSTable settings to match their workload for peak efficiency.
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
1. A ScyllaDB Community
How Agoda Scaled 50x
Throughput with ScyllaDB
Worakarn Isaratham
Lead Software Engineer
2. Worakarn Isaratham (he/him)
■ Lead Software Engineer, Agoda
■ Based in Bangkok, Thailand
■ Experience in distributed computing,
software testing
■ Interested in dependable software systems
3. Presentation Agenda
■ ScyllaDB in Agoda Feature Store
■ Capacity Problem
■ Potential Solutions
5. Online Feature Serving
Components: Client SDK, cache, app servers, ScyllaDB
■ Traffic: 3.5M EPS, 1.7M EPS, and 200K EPS at successive stages
■ P99 latency: 5 ms / 8 ms
■ Average of 5 features per entity
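The serving path on this slide, a client SDK with a cache in front of ScyllaDB, can be sketched as a read-through cache. This is a minimal illustration; all class and function names here are hypothetical, not Agoda's actual SDK.

```python
# Read-through cache sketch for the feature-serving path: the client
# checks an in-process cache first and only falls through to the
# backing store (ScyllaDB in the talk) on a miss.
class FeatureClient:
    def __init__(self, store):
        self.store = store      # backing store, e.g. a ScyllaDB session wrapper
        self.cache = {}         # in-memory cache: entity_id -> features
        self.store_reads = 0    # how many requests fell through to the store

    def get_features(self, entity_id):
        if entity_id in self.cache:
            return self.cache[entity_id]
        self.store_reads += 1
        features = self.store(entity_id)   # cache miss: read from the store
        self.cache[entity_id] = features   # populate cache for later requests
        return features

# Usage: a fake store; repeated reads of the same entity hit the cache.
client = FeatureClient(store=lambda eid: {"f1": 1.0, "f2": 2.0})
client.get_features("user-42")
client.get_features("user-42")
assert client.store_reads == 1   # second read served from cache
```

This is why ScyllaDB sees far fewer events per second than the client SDK: most lookups are absorbed by the cache layer.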
6. Growth
Since the start of 2023:
■ Server traffic: 50x
Peak server traffic, on the busiest DC
7. Growth
Since the start of 2023:
■ Server traffic: 50x
■ ScyllaDB traffic: 10x (10K EPS)
Peak ScyllaDB traffic, on the busiest DC
8. A Capacity Problem
■ A new use case wanted to onboard
■ Problematic usage patterns:
■ Bursty traffic from a cold cache, hitting ScyllaDB at 120K EPS
■ Many duplicated requests in very quick succession
■ Kept retrying any failed requests
That was 12x the load then, and 2x the load now!
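The duplicated-requests part of this pattern is often mitigated client-side by coalescing identical lookups issued in quick succession. A minimal sketch, assuming a short reuse window; the class and names are illustrative, not Agoda's client.

```python
import time

class CoalescingClient:
    """Collapse duplicate requests for the same key issued in quick
    succession, so a cold-cache burst does not multiply load on the
    database. Illustrative sketch only."""
    def __init__(self, fetch, ttl=0.1):
        self.fetch = fetch
        self.ttl = ttl          # reuse window in seconds
        self.recent = {}        # key -> (timestamp, result)
        self.fetches = 0        # how many requests actually hit the backend

    def get(self, key):
        now = time.monotonic()
        hit = self.recent.get(key)
        if hit and now - hit[0] < self.ttl:
            return hit[1]       # duplicate within the window: reuse result
        self.fetches += 1
        result = self.fetch(key)
        self.recent[key] = (now, result)
        return result

# Usage: a burst of 10 identical lookups reaches the backend only once.
client = CoalescingClient(fetch=lambda k: k.upper())
for _ in range(10):
    client.get("entity-1")
assert client.fetches == 1
```

A production version would also coalesce concurrent in-flight requests and cap retries with backoff, both of which matter for the retry storm described above.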
9. A Capacity Problem
■ One DC was able to survive this load without errors.
■ The other DC had serious problems:
■ Very high error rate
■ Took 40 minutes to finish all the retries
■ Metrics pointed to slow reads on ScyllaDB nodes
10. Slow Disks
                 Bad DC            Good DC           Advantage
Disks            SATA SSD, RAID 0  NVMe SSD, RAID 0
Read IOPS        6,868             79,566            11.6x
Read bandwidth   1.5 GB/s          10.1 GB/s         6.7x
Write IOPS       6,615             41,104            6.2x
Write bandwidth  1.9 GB/s          6.3 GB/s          3.3x
11. Just Buy New Disks?
● New disks were ordered.
● Improved user-side caching reduced this load to 7K EPS.
● How long could we survive on the current capacity?
12. Cache-Avoiding Load Test
■ Use artificial, one-time-used load to avoid ScyllaDB caching:
■ Query one-time-used entities with BYPASS CACHE
■ Flush and restart ScyllaDB between runs
■ Result: 25K EPS under normal load (ScyllaDB cache) vs 5K EPS baseline for SATA
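The test above can be sketched as generating reads for freshly minted entity IDs that skip the row cache. `BYPASS CACHE` is real ScyllaDB CQL; the keyspace, table, and column names below are hypothetical.

```python
import uuid

# Build a ScyllaDB read that skips the row cache, as in the talk's
# cache-avoiding load test. Every query targets a one-time-used
# entity ID, so neither the application cache nor ScyllaDB's row
# cache can serve it.
def one_time_query(table="features.feature_store"):
    entity_id = uuid.uuid4().hex    # fresh entity, never queried before
    query = (f"SELECT * FROM {table} "
             f"WHERE entity_id = '{entity_id}' BYPASS CACHE")
    return query, entity_id

q1, e1 = one_time_query()
q2, e2 = one_time_query()
assert q1.endswith("BYPASS CACHE")
assert e1 != e2                     # every query targets a fresh entity
```

A real load generator would use prepared statements with bound parameters rather than string interpolation; the sketch only shows the shape of the query.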
13. Idea 1: Different Data Modeling
Current: one tall table
Alternative: one table per feature set
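The two models can be made concrete with hypothetical CQL schemas; the keyspace, table, and column names are illustrative, not Agoda's actual schema.

```python
# Current model: one tall table, one row per (entity, feature) pair.
# Reading all features for an entity scans one partition's rows.
tall_table = """
CREATE TABLE features.tall (
    entity_id    text,
    feature_name text,
    value        blob,
    PRIMARY KEY (entity_id, feature_name)
)"""

# Alternative model: one table per feature set, one row per entity,
# with each feature as its own column.
per_feature_set = """
CREATE TABLE features.pricing_set (
    entity_id text PRIMARY KEY,
    f1        double,
    f2        double
)"""
```

The trade-off is flexibility (the tall table accepts any feature name) versus read shape (the per-set table returns all of a set's features in a single row).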
15. Idea 2: Change Compaction Strategy
■ Our workload is "read-mostly, many updates", for which Leveled compaction is recommended.
■ Size-tiered compaction (current): slow disk reads, large SSTable files
■ Leveled compaction: prioritizes read latency
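Compaction strategy is a per-table setting changed with standard CQL. A hypothetical example, assuming a table named `features.feature_store`:

```python
# Statement to switch an existing table from Size-tiered to Leveled
# compaction. The ALTER TABLE ... WITH compaction syntax is standard
# CQL; the table name is illustrative.
alter = ("ALTER TABLE features.feature_store WITH compaction = "
         "{'class': 'LeveledCompactionStrategy'}")
assert "LeveledCompactionStrategy" in alter
```

Note that changing the strategy triggers recompaction of existing SSTables, which is itself I/O-intensive on already-slow disks.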
17. Idea 3: Increase Summary File Size
■ ScyllaDB uses summary files to help navigate to index files.
■ summary file size ≈ data file size × summary ratio
■ Higher ratio → larger summary → more efficient index lookups → less disk I/O
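The relation on the slide is simple arithmetic: the summary grows linearly with the ratio. The ratios and file size below are made-up illustrative numbers, not measurements from the talk.

```python
import math

# Estimate summary file size from the slide's relation:
#   summary_size ≈ data_size × summary_ratio
def summary_size(data_size_bytes, summary_ratio):
    return data_size_bytes * summary_ratio

data = 100 * 1024**3                   # a 100 GiB SSTable data file
small = summary_size(data, 0.00005)    # low ratio -> tiny, coarse summary
large = summary_size(data, 0.0005)     # 10x ratio -> 10x larger summary
assert math.isclose(large, 10 * small)
```

The cost is memory: a larger summary stays resident, trading RAM for fewer index-file reads on each lookup.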
20. Rollout
■ Jul 2023: new summary ratio applied
■ Oct 2023: migrated to NVMe disks
■ Leveled compaction: only applied to a new table; existing data needs migration
Focus has since shifted to other components; still trying out new ideas on ScyllaDB.
21. Recent Experiments
● Partitioned by feature set, clustered by entity
○ Disastrous! 400x worse
● All features as a blob in a single row
○ +35% throughput
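The "blob in a single row" experiment can be sketched as serializing all of an entity's features into one value, so a read is a single-row lookup instead of one row per feature. JSON here stands in for whatever binary encoding a real system would use; the names are illustrative.

```python
import json

# Pack all of an entity's features into one serialized blob for
# storage in a single ScyllaDB column, and unpack on read.
def pack(features: dict) -> bytes:
    return json.dumps(features, sort_keys=True).encode()

def unpack(blob: bytes) -> dict:
    return json.loads(blob.decode())

features = {"price": 120.0, "rating": 4.5, "rooms": 2}
blob = pack(features)
assert unpack(blob) == features   # round-trips losslessly
assert isinstance(blob, bytes)    # stored in a single blob column
```

The trade-off is that updating one feature requires rewriting the whole blob, which fits a read-mostly workload like the one described earlier.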
22. Lessons
● Fast disks are essential!
● Benchmark your own load
● Tailor your data model to your needs