Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra

Nadav Har'El, ScyllaDB
The Generalist Engineer meetup, Tel-Aviv
Ides of March, 2016

2
● Israeli but multi-national startup company
– 15 developers cherry-picked from 10 countries.
● Founded 2013 (“Cloudius Systems”)
– by Avi Kivity and Dor Laor of KVM fame.
● Fans of open-source: OSv, Seastar, ScyllaDB.

3
Your mission, should you choose to accept it:
Make Cassandra 10 times faster

4
“Make Cassandra 10 times faster”
● Why 10?
● Why Cassandra?
– Popular NoSQL database (2nd to MongoDB).
– Powerful and widely applicable.
– Example of a wider class of middleware.
● Why “mission impossible”?
– Cassandra is not considered particularly slow.
– It is considered faster than MongoDB, HBase, et al.
– “Disk is the bottleneck” (no longer, with SSD!)

5
Our first attempt: OSv
● New OS design specifically for cloud VMs:
– Run a single application per VM (“unikernel”)
– Run existing Linux applications (Cassandra)
– Run these faster than Linux.

6
OSv
● Some of the many ideas we used in OSv:
– Single address space.
– System call is just a function call.
– Faster context switches.
– No spin locks.
– Smaller code.
– Redesigned network stack (Van Jacobson).

7
OSv
● Writing an entire OS from scratch was a really
fun exercise for our generalist engineers.
● A full description of OSv is beyond the scope of this talk. Check out:
– “OSv—Optimizing the Operating System for Virtual
Machines”, Usenix ATC 2014.

8
Cassandra on OSv
● Cassandra-stress, READ, 4 vcpu:
On OSv, 34% faster than Linux
● Very nice, but not even close to our goal.
What are the remaining bottlenecks?

9
Bottlenecks: API locks
● In one profile, we saw 20% of the run time spent in lock() and unlock() operations, most of them uncontended.
– POSIX APIs allow threads to share
● file descriptors
● sockets
– As many as 20 lock/unlock operations for each network packet!
● Uncontended locks were cheap on uniprocessors (a flag to disable preemption), but atomic operations are slow on many cores.

10
Bottlenecks: API copies
● The write/send system calls copy user data to the kernel
– Even on OSv with no user-kernel separation
– Part of the socket API
● Similar for read

11
Bottlenecks: context switching
● One thread per CPU is optimal; more than one incurs:
– Context switch time
– Stacks that consume memory and pollute the CPU cache
– Thread imbalance
● Requires fully non-blocking APIs
– Cassandra uses mmap() for disk…

12
Bottlenecks:
unscalable applications
● Contended locks ruin scalability to many cores
– Memcache's counter and shared cache
● Solution: per-cpu data.
● Even lock-free atomic algorithms are unscalable
– Cache line bouncing
– Becomes worse as core count grows
● NUMA
● Again, better to shard, not share, data.
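
The shard-versus-share point can be sketched in standard C++ (no Seastar involved). The names count_shared, count_sharded and padded_counter are hypothetical, and the 64-byte cache line size is an assumption. Both functions compute the same total; the difference is that the first makes every core fight over one cache line, while the second gives each thread its own counter and only combines them at the end.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Shared design: all threads increment one atomic counter, so the cache
// line holding it bounces between cores on every increment.
std::uint64_t count_shared(unsigned nthreads, std::uint64_t per_thread) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&counter, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter.load();
}

// Sharded design: one counter per thread, each 64 bytes in size (assumed
// cache line size) so that no two counters sit in the same line.
struct alignas(64) padded_counter {
    std::uint64_t value = 0;
};

std::uint64_t count_sharded(unsigned nthreads, std::uint64_t per_thread) {
    std::vector<padded_counter> shards(nthreads);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&shards, t, per_thread] {
            // Only thread t ever touches shards[t]: no atomics, no locks.
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                ++shards[t].value;
            }
        });
    }
    for (auto& th : threads) th.join();
    // The per-shard values are combined only after all threads finished.
    std::uint64_t total = 0;
    for (const auto& s : shards) total += s.value;
    return total;
}
```

Timing the two functions with a growing thread count is one way to observe the effect on a given machine; no numbers are claimed here.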

13
Therefore
● Need to provide better APIs for server applications
– Not file descriptors, sockets, threads, etc.
● Need to write better applications.

14
Framework
● One thread per CPU
– Event-driven programming
– Everything (network & disk) is non-blocking
– How to write complex applications?

15
Framework
● Sharded (shared-nothing) applications
– Important!

16
Framework
● Language with no runtime overheads or built-in
data sharing

17
Seastar
● C++14 library
● For writing new high-performance server applications
● Share-nothing model, fully asynchronous
● Futures & Continuations based
– Unified API for all asynchronous operations
– Compose complex asynchronous operations
– The key to complex applications
● (Optionally) full zero-copy user-space TCP/IP (over DPDK)
● Open source: http://www.seastar-project.org/

18
Seastar linear scaling in #cores

19
Seastar linear scaling in #cores

20
Brief introduction to Seastar

21
Sharded application design
● One thread per CPU
● Each thread handles one shard of data
– No shared data (“share nothing”)
– Separate memory per CPU (NUMA aware)
– Message-passing between CPUs
– No locks or cache line bounces
● Reactor (event loop) per thread
● User-space network stack also sharded

22
Futures and continuations
● Futures and continuations are the building
blocks of asynchronous programming in
Seastar.
● Can be composed together into a large, complex,
asynchronous program.

23
● A future is a result which may not be available yet:
– Data buffer from the network
– Timer expiration
– Completion of a disk write
– The result of a computation which requires the values
from one or more other futures.
● future<int>
● future<>

24
● An asynchronous function (also “promise”) is
a function returning a future:
– future<> sleep(duration)
– future<temporary_buffer<char>> read()
● The function sets up for the future to be fulfilled
– sleep() sets a timer to fulfill the future it returns

25
● A continuation is a callback, typically a lambda
executed when a future becomes ready
– sleep(1s).then([] {
    std::cerr << "done";
});
● A continuation can hold state (lambda capture)
– future<int> slow_incr(int i) {
    return sleep(10ms).then(
        [i] { return i+1; });
}

26
● Continuations can be nested:
– future<int> get();
future<> put(int);
get().then([] (int value) {
    put(value+1).then([] {
        std::cout << "done";
    });
});
● Or chained:
– get().then([] (int value) {
    return put(value+1);
}).then([] {
    std::cout << "done";
});
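
The chaining rule (a continuation may return either a plain value or another future, and then() yields a single future either way) can be mimicked with a toy model. The sketch below is standard C++ only and is not Seastar's implementation: toy_future, get_value, put_value, last_put and chained_example are hypothetical names, and the toy future is always already resolved, so it shows only how then() composes and flattens, not deferred execution.

```cpp
#include <type_traits>
#include <utility>

template <typename T> class toy_future;

// Trait used to detect continuations that themselves return a toy_future.
template <typename T> struct is_toy_future : std::false_type {};
template <typename T> struct is_toy_future<toy_future<T>> : std::true_type {};

template <typename T>
class toy_future {
public:
    explicit toy_future(T value) : value_(std::move(value)) {}

    // Continuation returns a plain value: wrap the result in a new future.
    template <typename F,
              typename R = decltype(std::declval<F>()(std::declval<T>())),
              typename std::enable_if<!is_toy_future<R>::value, int>::type = 0>
    toy_future<R> then(F&& f) {
        return toy_future<R>(std::forward<F>(f)(std::move(value_)));
    }

    // Continuation returns a future: hand it back as-is instead of nesting
    // it, which is what makes chained .then() calls possible.
    template <typename F,
              typename R = decltype(std::declval<F>()(std::declval<T>())),
              typename std::enable_if<is_toy_future<R>::value, int>::type = 0>
    R then(F&& f) {
        return std::forward<F>(f)(std::move(value_));
    }

    const T& get() const { return value_; }

private:
    T value_;
};

int last_put = 0;  // stands in for whatever put_value() would write to

toy_future<int> get_value() { return toy_future<int>(41); }

toy_future<int> put_value(int v) {
    last_put = v;
    return toy_future<int>(v);
}

// Mirrors the chained style: each step receives the previous step's result.
int chained_example() {
    return get_value()
        .then([](int value) { return put_value(value + 1); })  // returns a future
        .then([](int written) { return written * 2; })         // returns an int
        .get();
}
```

In this toy, chained_example() stores 42 through put_value() and returns 84. A real future would additionally park the continuation until the value arrives.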

27
● Parallelism is easy:
– sleep(100ms).then([] {
    std::cout << "100ms\n";
});
sleep(200ms).then([] {
    std::cout << "200ms\n";
});

28
● In Seastar, every asynchronous operation is a
future:
– Network read or write
– Disk read or write
– Timers
– …
– A complex combination of other futures
● Useful for everything from writing a network stack to
writing a full, complex application.

29
Network zero-copy
● future<temporary_buffer> input_stream::read()
– temporary_buffer points at driver-provided pages, if
possible.
– Automatically discarded after use (C++).
● future<> output_stream::write(temporary_buffer)
– Future becomes ready when TCP window allows further
writes (usually immediately).
– Buffer discarded after data is ACKed.

30
Two TCP/IP implementations
[Diagram: a common Networking API sits on top of two stacks.
– POSIX (hosted) stack: Linux kernel (sockets).
– Seastar (native) stack: user-space TCP/IP over an interface layer (DPDK, Virtio, Xen; igb, ixgb).]

31
Disk I/O
● Asynchronous and zero copy, using AIO and
O_DIRECT.
● Not implemented well by all filesystems
– XFS recommended
● Focusing on SSD
● Future thought:
– Direct NVMe support,
– Implement filesystem in Seastar.

32
More info on Seastar
● http://seastar-project.com
● https://github.com/scylladb/seastar
● http://docs.seastar-project.org/
● http://docs.seastar-project.org/master/md_doc_tutorial.html

33
ScyllaDB
● NoSQL database, implemented in Seastar.
● Fully compatible with Cassandra:
– Same CQL queries
– Copy over a complete Cassandra database
– Use existing drivers
– Use existing cassandra.yaml
– Use same nodetool or JMX console
– Can be clustered (of course...)

34
ScyllaDB vs. Cassandra caching
[Diagram: Cassandra has a key cache and a row cache (on-heap / off-heap), the Linux page cache, and SSTables; ScyllaDB has a unified cache and SSTables.]
● Don't double-cache.
● Don't cache unrelated rows.
● Don't cache unparsed sstables.
● Can fit much more into cache.
● No page faults, threads, etc.

35
Scylla vs. Cassandra
● Single node benchmark:
– 2 x 12-core x 2 hyperthread Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
● cassandra-stress results (operations per second):

Benchmark   ScyllaDB    Cassandra
Write       1,871,556     251,785
Read        1,585,416      95,874
Mixed       1,372,451     108,947

36
● We really got a 7x to 16x speedup!
● Reads sped up more:
– Cassandra writes are simpler
– Row-cache benefits further improve Scylla's read
● Almost 2 million writes per second on single
machine!
– Google reported in their blogs achieving 1 million writes
per second on 330 (!) machines
– (2 years ago, and RF=3… but still impressive).
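
The speedup range quoted above follows directly from the single-node cassandra-stress table. A minimal sketch of the arithmetic, with the figures copied from that table (benchmark_row, rows, speedup and print_speedups are hypothetical names):

```cpp
#include <cstdio>

// One row of the single-node cassandra-stress table.
struct benchmark_row {
    const char* workload;
    double scylla_ops;
    double cassandra_ops;
};

// Figures copied from the benchmark slide.
const benchmark_row rows[] = {
    {"Write", 1871556.0, 251785.0},
    {"Read",  1585416.0,  95874.0},
    {"Mixed", 1372451.0, 108947.0},
};

// Speedup is the ratio of the two throughputs.
double speedup(const benchmark_row& row) {
    return row.scylla_ops / row.cassandra_ops;
}

// Print one line per workload with the ratio rounded to one decimal.
void print_speedups() {
    for (const auto& row : rows) {
        std::printf("%-5s x%.1f\n", row.workload, speedup(row));
    }
}
```

The ratios come out at about 7.4 for writes, 16.5 for reads and 12.6 for the mixed workload, which is where the 7x and 16x endpoints come from.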

37
3 node cluster, 2x12 cores each; RF=3, CL=quorum

38
Better latency, at all load levels

39
What will you do with 10x performance?
● Shrink your cluster by a factor of 10
● Use stronger (but slower) data models
● Run more queries - more value from your data
● Stop using caches in front of databases

41
Do we qualify?
In 3 years, our small team wrote:
● A complete kernel and library (OSv).
● An asynchronous programming framework
(Seastar).
● A complete Cassandra-compatible NoSQL
database (ScyllaDB).

43
This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant
agreement No 645402.
