Avoiding the Pit of Despair: Event Sourcing
with Akka and Cassandra
Luke Tillman (@LukeTillman)
Language Evangelist at DataStax
Promo Codes
50% off Priority Passes: LukeT50
25% off Training and Certification: LukeTCert
Who are you?!
• Evangelist with a focus on Developers
• Long-time Developer on RDBMS (lots of .NET)
3
1 An Intro to Akka and Event Sourcing
2 An Event Journal in Cassandra
3 Accounting for Deletes
4 Lessons Learned
4
An Intro to Akka and Event Sourcing
5
Akka
• An actor framework for building concurrent
and distributed applications
• Originally for the JVM (written in Scala,
includes Java bindings)
• Ported to .NET/CLR (written in C#, includes
F# bindings)
• Both open source, on GitHub
6
http://akka.io
http://getakka.net
Actors in Akka
• Lightweight, isolated processes
• No shared state (so nothing to
lock or synchronize)
• Actors have a mailbox (message
queue)
• Process messages one at a time
– Update state
– Change behavior
– Send messages to other Actors
7
[Diagram: an Actor with its mailbox, state, and behavior; messages arrive asynchronously in the mailbox, and the actor sends messages to other Actors (which could be replies)]
Obligatory E-Commerce Example
8
[Diagram: a ShoppingCartActor with its mailbox, state, and behavior; messages arrive asynchronously, and the actor sends messages to other Actors (which could be replies)]
Example incoming messages:
InitializeCart
AddItemToCart
RemoveItemFromCart
ChangeItemQuantity
ApplyDiscount
GetCartItems
Example cart state:
{
  cart_id: 1345,
  user_id: 4762,
  created_on: "7/10/2015",
  items: [
    { item_id: 7621, quantity: 1, unit_price: 19.99 },
    { item_id: 9134, quantity: 2, unit_price: 16.99 }
  ]
}
Example outgoing messages (could be replies):
CartItems
ItemAdded
GetDiscount
Example behavior change:
if (items.length > 5)
  Become(Discounted)
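A minimal sketch of what an actor like this could look like in classic Akka (Scala). The message types and the 10% discount rule are illustrative stand-ins for the slide's examples, not the talk's actual code.

import akka.actor.{Actor, ActorSystem, Props}

// Illustrative messages (stand-ins for the slide's examples)
case class AddItemToCart(itemId: Int, quantity: Int, unitPrice: BigDecimal)
case object GetCartItems
case class CartItems(items: List[AddItemToCart])

class ShoppingCartActor extends Actor {
  // State is private to the actor: messages are processed one at a
  // time, so nothing needs locking or synchronization
  private var items: List[AddItemToCart] = Nil

  def receive: Receive = normal

  private def normal: Receive = {
    case add: AddItemToCart =>
      items = add :: items
      // Change behavior once the cart is large enough
      if (items.length > 5) context.become(discounted)
    case GetCartItems =>
      sender() ! CartItems(items) // the reply is just another async message
  }

  private def discounted: Receive = {
    case add: AddItemToCart =>
      items = add :: items
    case GetCartItems =>
      // Hypothetical 10% discount applied while in the discounted behavior
      sender() ! CartItems(items.map(i => i.copy(unitPrice = i.unitPrice * 0.9)))
  }
}

// Usage: messages go through the actor's mailbox asynchronously
// val system = ActorSystem("shop")
// val cart   = system.actorOf(Props[ShoppingCartActor], "cart-1345")
// cart ! AddItemToCart(7621, 1, BigDecimal("19.99"))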
Actors in Akka
• Break a complex system down
into lots of smaller pieces
• Can easily scale to millions of
actors on a single machine
– 2.5 million per GB of heap
• Since actors only communicate
via async message passing, they
can also be distributed across
many machines
– Location Transparency
9
[Diagram: many Actor instances spread across machines, sending messages to each other via the network]
Persistent Actors in Akka
• Actor mailbox and state are
transient by default in Akka
– On crash/restart, messages and state
are lost
• We could just write code in the
actor to persist the current state
to storage
• Akka Persistence plugin provides
an API for persisting these to
durable storage using Event
Sourcing
10
[Diagram: a PersistentActor with its mailbox, state, and behavior, plus a persistenceId and sequenceNr; the API provides Persist(event) and SaveSnapshot(payload)]
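A minimal Scala sketch of this API, assuming the akka-persistence module on the classpath; the AddItem/ItemAdded types and the snapshot-every-100-events policy are illustrative, not from the talk.

import akka.persistence.{PersistentActor, SnapshotOffer}

// Illustrative command and event
case class AddItem(itemId: Int, quantity: Int)
case class ItemAdded(itemId: Int, quantity: Int)

class PersistentCartActor extends PersistentActor {
  // persistenceId keys this actor's events in the journal
  override def persistenceId: String = "cart-1345"

  private var items: Map[Int, Int] = Map.empty

  private def applyEvent(evt: ItemAdded): Unit =
    items += evt.itemId -> (items.getOrElse(evt.itemId, 0) + evt.quantity)

  override def receiveCommand: Receive = {
    case AddItem(id, qty) =>
      // Persist the delta; in-memory state is only updated once the
      // journal write has succeeded
      persist(ItemAdded(id, qty)) { evt =>
        applyEvent(evt)
        // Hypothetical policy: snapshot every 100 events to bound replays
        if (lastSequenceNr % 100 == 0) saveSnapshot(items)
      }
  }

  override def receiveRecover: Receive = {
    case evt: ItemAdded                                   => applyEvent(evt)
    case SnapshotOffer(_, snap: Map[Int, Int] @unchecked) => items = snap
  }
}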
Persistence with Event Sourcing
• Instead of keeping the
current state, keep a journal
of all the deltas (events)
• Append only (no UPDATE or
DELETE)
• We can replay our journal of
events to get the current
state
13
Shopping Cart (id = 1345)
Cart Created: user_id= 4762, created_on= 7/10/2015…
Item Added: item_id= 7621, quantity= 1, price= 19.99
Item Added: item_id= 9134, quantity= 2, price= 16.99
Item Removed: item_id= 7621
Qty Changed: item_id= 9134, quantity= 1
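Replaying the journal is just a left fold of the deltas over an empty state. A plain-Scala sketch, with event types mirroring the slide (the names and shapes are illustrative):

// Events mirroring the slide's shopping cart deltas
sealed trait CartEvent
case class CartCreated(userId: Int, createdOn: String)              extends CartEvent
case class ItemAdded(itemId: Int, quantity: Int, price: BigDecimal) extends CartEvent
case class ItemRemoved(itemId: Int)                                 extends CartEvent
case class QtyChanged(itemId: Int, quantity: Int)                   extends CartEvent

case class CartState(userId: Int = 0, items: Map[Int, (Int, BigDecimal)] = Map.empty)

def applyEvent(s: CartState, e: CartEvent): CartState = e match {
  case CartCreated(uid, _)   => s.copy(userId = uid)
  case ItemAdded(id, qty, p) => s.copy(items = s.items + (id -> (qty, p)))
  case ItemRemoved(id)       => s.copy(items = s.items - id)
  case QtyChanged(id, qty)   =>
    s.items.get(id).fold(s) { case (_, p) => s.copy(items = s.items + (id -> (qty, p))) }
}

// Current state = left fold of the journal over an empty state
def replay(journal: Seq[CartEvent]): CartState =
  journal.foldLeft(CartState())(applyEvent)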
Speeding up Replays with Snapshots
12
Shopping Cart (id = 1345)
Cart Created: user_id= 4762, created_on= 7/10/2015…
Item Added: item_id= 7621, quantity= 1, price= 19.99
Item Added: item_id= 9134, quantity= 2, price= 16.99
Take Snapshot:
{
  event_id: 3,
  cart_id: 1345,
  user_id: 4762,
  created_on: "7/10/2015",
  items: [
    { item_id: 7621, quantity: 1, price: 19.99 },
    { item_id: 9134, quantity: 2, price: 16.99 }
  ]
}
Speeding up Replays with Snapshots
12
Shopping Cart (id = 1345)
Cart Created: user_id= 4762, created_on= 7/10/2015…
Item Added: item_id= 7621, quantity= 1, price= 19.99
Item Added: item_id= 9134, quantity= 2, price= 16.99
Item Removed: item_id= 7621
Qty Changed: item_id= 9134, quantity= 1
Load Snapshot (then replay only the events after it):
{
  event_id: 3,
  cart_id: 1345,
  user_id: 4762,
  created_on: "7/10/2015",
  items: [
    { item_id: 7621, quantity: 1, price: 19.99 },
    { item_id: 9134, quantity: 2, price: 16.99 }
  ]
}
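A minimal recovery sketch in Scala, reusing the CartState/CartEvent/applyEvent definitions from the fold sketch above: on restart, Akka Persistence offers the latest snapshot first and then replays only the events written after it (here: Item Removed and Qty Changed).

import akka.actor.Actor
import akka.persistence.{PersistentActor, SnapshotOffer}

class RecoveringCartActor extends PersistentActor {
  override def persistenceId: String = "cart-1345"
  private var state: CartState = CartState()

  override def receiveRecover: Receive = {
    case SnapshotOffer(_, snap: CartState) => state = snap // jump to event 3
    case evt: CartEvent                    => state = applyEvent(state, evt)
  }

  override def receiveCommand: Receive = Actor.emptyBehavior // elided
}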
Event Sourcing in Practice
• Typically two kinds of storage:
– Event Journal Store
– Snapshot Store
• A history of how we got to the
current state can be useful
• We've also got a lot more data
to store than we did before
18
Shopping Cart (id = 1345)
Cart Created: user_id= 4762, created_on= 7/10/2015…
Item Added: item_id= 7621, quantity= 1, price= 19.99
Item Added: item_id= 9134, quantity= 2, price= 16.99
Item Removed: item_id= 7621
Qty Changed: item_id= 9134, quantity= 1
Why Cassandra?
• Lots of Persistence implementations available
– Akka: Cassandra, JDBC, Kafka, MongoDB, etc.
– Akka.NET: Cassandra, MS SQL, Postgres
• Cassandra is really easy to scale out as you
need to store more events for more actors
• Workload and Data Shape are great fits for C*
– Transactional, Write-Heavy workload
– Sequentially written, immutable events (looks a lot
like time series data)
19
The Async Journal API
20
Akka.NET (C#):
Task ReplayMessagesAsync(
string persistenceId,
long fromSequenceNr,
long toSequenceNr,
long max,
Action<IPersistentRepr> replayCallback);
Task<long> ReadHighestSequenceNrAsync(
string persistenceId,
long fromSequenceNr);
Task WriteMessagesAsync(
IEnumerable<IPersistentRepr> messages);
Task DeleteMessagesToAsync(
string persistenceId,
long toSequenceNr,
bool isPermanent);
Akka (Scala):
def asyncReplayMessages(
persistenceId: String,
fromSequenceNr: Long,
toSequenceNr: Long,
max: Long)
(replayCallback: PersistentRepr => Unit)
: Future[Unit]
def asyncReadHighestSequenceNr(
persistenceId: String,
fromSequenceNr: Long)
: Future[Long]
def asyncWriteMessages(
messages: immutable.Seq[PersistentRepr])
: Future[Unit]
def asyncDeleteMessagesTo(
persistenceId: String,
toSequenceNr: Long,
permanent: Boolean)
: Future[Unit]
The Journal API Summary
• Write Method
– For a given actor, write a group
of messages
• Delete Method
– For a given actor, permanently
or logically delete all messages
up to a given sequence number
• Read Methods
– For a given actor, read back all
the messages between two
sequence numbers
– For a given actor, read the
highest sequence number that's
been written
21
An Event Journal in Cassandra
Data Modeling for Reads and Writes
22
A Simple First Attempt
• Use persistence_id as the partition key
– keeps all messages for a given persistence id together
• Use sequence_number as a clustering column
– orders messages by sequence number within the partition
• Read all messages between two
sequence numbers
• Read the highest sequence number
23
CREATE TABLE messages (
persistence_id text,
sequence_number bigint,
message blob,
PRIMARY KEY (
persistence_id, sequence_number)
);
SELECT * FROM messages
WHERE persistence_id = ?
AND sequence_number >= ?
AND sequence_number <= ?;
SELECT sequence_number FROM messages
WHERE persistence_id = ?
ORDER BY sequence_number DESC LIMIT 1;
A Simple First Attempt
• Write a group of messages
• Use a Cassandra Batch statement to
ensure all messages (success) or no
messages (failure) get written
• What's the problem with this data
model (ignoring implementing deletes
for now)?
24
CREATE TABLE messages (
persistence_id text,
sequence_number bigint,
message blob,
PRIMARY KEY (
persistence_id, sequence_number)
);
BEGIN BATCH
INSERT INTO messages ... ;
INSERT INTO messages ... ;
INSERT INTO messages ... ;
APPLY BATCH;
Unbounded Partition Growth
25
Cassandra Data Modeling Anti-Pattern #1
Unbounded Partition Growth
• Cassandra has a hard limit of 2
billion cells in a partition
• But there's also a practical limit
– Depends on row/cell data size, but
likely not more than millions of rows
26
[Diagram: a single journal partition (persistence_id= '57ab...') accumulating rows seq_nr= 1, seq_nr= 2, ... with message blobs, growing without bound (∞?)]
Fixing the Unbounded Partition Growth Problem
• General strategy: add a column to
the partition key
– Compound partition key
• Can be data that's already part of
the model, or a "synthetic" column
• Allow users to configure a partition
size in the plugin
– Partition Size = number of rows per
partition
– This should not be changeable once
messages have been written
• Partition number for a given
sequence number is then easy to
calculate
– (seqNr – 1) / partitionSize
(100 – 1) / 100 = partition 0
(101 – 1) / 100 = partition 1
27
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number)
);
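As a one-liner (integer division), a sketch of that calculation:

// Partition number for a sequence number, given a fixed partition size
// (this must never change once messages have been written)
def partitionNumber(sequenceNr: Long, partitionSize: Long): Long =
  (sequenceNr - 1) / partitionSize

assert(partitionNumber(100, 100) == 0) // last row of partition 0
assert(partitionNumber(101, 100) == 1) // first row of partition 1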
Fixing the Unbounded Partition Growth Problem
• Read all messages between two
sequence numbers
• Read the highest sequence number
28
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number)
);
SELECT * FROM messages
WHERE persistence_id = ?
AND partition_number = ?
AND sequence_number >= ?
AND sequence_number <= ?;
(repeat until we reach sequence number or run out of partitions)
SELECT sequence_number FROM messages
WHERE persistence_id = ?
AND partition_number = ?
ORDER BY sequence_number DESC LIMIT 1;
(repeat until we run out of partitions)
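A sketch of the "repeat until we run out of partitions" loop for the highest sequence number. highestInPartition is a hypothetical stand-in for executing the single-row SELECT above with a driver, and this assumes partitions are filled contiguously (no gaps):

def readHighestSequenceNr(
    persistenceId: String,
    highestInPartition: (String, Long) => Option[Long]): Long = {
  var partition = 0L
  var highest   = 0L
  var row       = highestInPartition(persistenceId, partition)
  while (row.isDefined) { // stop at the first empty partition
    highest = row.get
    partition += 1
    row = highestInPartition(persistenceId, partition)
  }
  highest
}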
Fixing the Unbounded Partition Growth Problem
• Write a group of messages
• A Cassandra Batch statement
might now write to multiple
partitions (if the sequence numbers
cross a partition boundary)
• Is that a problem?
29
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number)
);
BEGIN BATCH
INSERT INTO messages ... ;
INSERT INTO messages ... ;
INSERT INTO messages ... ;
APPLY BATCH;
RTFM: Cassandra Batches Edition
30
"Batches are atomic by default. In the context of a Cassandra batch
operation, atomic means that if any of the batch succeeds, all of it will."
- DataStax CQL Docs
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html
"Although an atomic batch guarantees that if any part of the batch succeeds,
all of it will, no other transactional enforcement is done at the batch level.
For example, there is no batch isolation. Clients are able to read the first
updated rows from the batch, while other rows are still being updated on the
server."
- DataStax CQL Docs
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html
Atomic? That's kind of a loaded word.
Multiple Partition Batch Failure Scenario
• Once written to the
Batch Log successfully,
we know all the writes
in the batch will
succeed eventually
(atomic?)
• Batch has been
partially applied
• Possible to read a
partially applied batch
since there is no batch
isolation
29
[Diagram: a logged batch written to the Journal at CL = QUORUM with RF = 3 is partially applied; the client receives a WriteTimeout with writeType = BATCH]
Reading Partially Applied Batches
37
RTFM: Cassandra Batches Edition Part 2
38
"For example, there is no batch isolation. Clients are able to read the first
updated rows from the batch, while other rows are still being updated on the
server. However, transactional row updates within a partition key are
isolated: clients cannot read a partial update."
- DataStax CQL Docs
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/batch_r.html
What we really need is Isolation.
When writing a group of messages, ensure that
we write the group to a single partition.
Logic Changes to Ensure Batch Isolation
• Still use configurable Partition Size
– not a "hard limit" but a "best attempt"
• On write, see if messages will all fit in the
current partition
• If not, roll over to the next partition early
• Reading is slightly more complicated
– For a given sequence number it might be in
partition n or (n+1)
39
[Diagram: with PartitionSize=100, partition_nr = 1 holds seq_nr = 97 and 98; a group of messages with seq_nr 99, 100, 101 would cross the boundary, so the whole group is written as the start of partition_nr = 2]
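A sketch of the write-side rollover check described above; the names are illustrative, not the plugins' actual internals:

def targetPartition(currentPartition: Long,
                    rowsInCurrentPartition: Long,
                    groupSize: Long,
                    partitionSize: Long): Long =
  if (rowsInCurrentPartition + groupSize <= partitionSize)
    currentPartition      // the whole group fits: single-partition batch
  else
    currentPartition + 1  // roll over early so the batch stays isolated

// e.g. with PartitionSize=100 and rows 97-98 already written, a group of
// three messages (99, 100, 101) rolls over to the next partition:
// targetPartition(1, 98, 3, 100) == 2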
Accounting for Deletes
40
Implementing Logical Deletes, Option 1
• Add an is_deleted column
to our messages table
• Read all messages between
two sequence numbers
41
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
message blob,
is_deleted bool,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number)
);
SELECT * FROM messages
WHERE persistence_id = ?
AND partition_number = ?
AND sequence_number >= ?
AND sequence_number <= ?;
(repeat until we reach sequence number or run out of partitions)
... | sequence_number | message | is_deleted
... | 1               | 0x00    | true
... | 2               | 0x00    | true
... | 3               | 0x00    | false
... | 4               | 0x00    | false
Implementing Logical Deletes, Option 1
• Pros:
– On replay, easy to check if a
message has been deleted (comes
included in message query's data)
• Cons:
– Messages not immutable any
more
– Issue lots of UPDATEs to mark
each message as deleted
– Have to scan through a lot of rows
to find max deleted sequence
number if we want to avoid
issuing unnecessary UPDATEs
42
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
message blob,
is_deleted bool,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number)
);
Implementing Logical Deletes, Option 2
• Add a marker column and
make it a clustering column
– Messages written with 'A'
– Deletes get written with 'D'
• Read all messages between
two sequence numbers
43
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
marker text,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number, marker)
);
SELECT * FROM messages
WHERE persistence_id = ?
AND partition_number = ?
AND sequence_number >= ?
AND sequence_number <= ?;
(repeat until we reach sequence number or run out of partitions)
... | sequence_number | marker | message
... | 1               | A      | 0x00
... | 1               | D      | null
... | 2               | A      | 0x00
... | 3               | A      | 0x00
Implementing Logical Deletes, Option 2
• Pros
– On replay, easy to peek at next
row to check if deleted (comes
included in message query's data)
– Message data stays immutable
• Cons
– Issue lots of INSERTs to mark
each message as deleted
– Have to scan through a lot of rows
to find max deleted sequence
number if we want to avoid
issuing unnecessary INSERTs
– Potentially twice as many rows to
store
44
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
marker text,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number, marker)
);
Looking at Physical Deletes
• Physically delete messages to a
given sequence number
• Still probably want to scan
through rows to see what's
already been deleted first
45
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
marker text,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number, marker)
);
BEGIN BATCH
DELETE FROM messages
WHERE persistence_id = ?
AND partition_number = ?
AND marker = 'A'
AND sequence_number = ?;
...
APPLY BATCH;
• Can't range delete, so we have
to do lots of individual
DELETEs
Looking at Physical Deletes
• Read all messages between
two sequence numbers
• With how DELETEs work in
Cassandra, is there a potential
problem with this query?
46
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
marker text,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number, marker)
);
SELECT * FROM messages
WHERE persistence_id = ?
AND partition_number = ?
AND sequence_number >= ?
AND sequence_number <= ?;
(repeat until we reach sequence number or run out of partitions)
Tombstone Hell: Queue-like Data Sets
47
Cassandra Data Modeling Anti-Pattern #2
Queue-like Data Sets
46
Delete messages to a sequence number:
BEGIN BATCH
DELETE FROM messages
WHERE persistence_id = '57ab...'
AND partition_number = 1
AND marker = 'A'
AND sequence_number = 1;
...
APPLY BATCH;
[Diagram: in the journal partition (persistence_id= '57ab...', partition_number= 1), each deleted row (seq_nr= 1, seq_nr= 2, ...) now sits alongside a tombstone marked "NO DATA HERE"]
Cassandra Data Modeling Anti-Pattern #2
Queue-like Data Sets
• At some point compaction runs and we
don't have two versions any more, but
tombstones don't go away immediately
– Tombstones remain for gc_grace_seconds
– Default is 10 days
46
[Diagram: after compaction, only the tombstones (seq_nr= 1, seq_nr= 2, ...) remain in the partition]
Cassandra Data Modeling Anti-Pattern #2
Queue-like Data Sets
51
Read all messages between 2 sequence numbers:
SELECT * FROM messages
WHERE persistence_id = '57ab...'
AND partition_number = 1
AND sequence_number >= 1
AND sequence_number <= [max value];
[Diagram: the query has to scan past a tombstone for every deleted row (seq_nr= 1 through 4, ...) before it finds any live data]
Avoid Tombstone Hell
52
We need a way to avoid reading
tombstones when replaying messages.
SELECT * FROM messages
WHERE persistence_id = ?
AND partition_number = ?
AND sequence_number >= ?   -- make this lower bound smarter
AND sequence_number <= ?;
If we know what sequence number we've already deleted to
before we query, we could make that lower bound smarter.
A Third Option for Deletes
• Use marker as a clustering
column, but change the
clustering order
– Messages still 'A', Deletes 'D'
• Read all messages between
two sequence numbers
53
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
marker text,
sequence_number bigint,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
marker, sequence_number)
);
SELECT * FROM messages
WHERE persistence_id = ?
AND partition_number = ?
AND marker = 'A'
AND sequence_number >= ?
AND sequence_number <= ?;
(repeat until we reach sequence number or run out of partitions)
... | sequence_number | marker | message
... | 1               | A      | 0x00
... | 2               | A      | 0x00
... | 3               | A      | 0x00
A Third Option for Deletes
• Message data no longer carries
deletion information, so how do we
know what's already been deleted?
• Get max deleted sequence number
• Can avoid tombstones if done
before getting message data
54
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
marker text,
sequence_number bigint,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
marker, sequence_number)
);
SELECT sequence_number FROM messages
WHERE persistence_id = ?
AND partition_number = ?
AND marker = 'D'
ORDER BY marker DESC,
sequence_number DESC
LIMIT 1;
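A sketch of the smarter lower bound: run the single-row 'D' query above first, then start the replay range past everything already deleted, so the scan never touches the tombstones left by physical deletes. highestDeleted stands in for that query's result (None if nothing has been deleted):

def replayLowerBound(fromSequenceNr: Long, highestDeleted: Option[Long]): Long =
  math.max(fromSequenceNr, highestDeleted.getOrElse(0L) + 1)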
A Third Option for Deletes
• Pros
– Message data stays immutable
– Issue a single INSERT when
deleting to a sequence number
– Read a single row to find out
what's been deleted (no more
scanning)
– Can avoid reading tombstones
created by physical deletes
• Cons
– Requires a separate query to find
out what's been deleted before
getting message data
55
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
marker text,
sequence_number bigint,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
marker, sequence_number)
);
Final Schema in Akka and Akka.NET
56
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
marker text,
sequence_number bigint,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
marker, sequence_number)
);
CREATE TABLE messages (
persistence_id text,
partition_number bigint,
sequence_number bigint,
marker text,
message blob,
PRIMARY KEY (
(persistence_id, partition_number),
sequence_number, marker)
);
https://github.com/krasserm/akka-persistence-cassandra
https://github.com/akkadotnet/Akka.Persistence.Cassandra
Lessons Learned
57
Summary
• Seemingly simple data models can
get a lot more complicated
• Avoid unbounded partition growth
– Add data to your partition key
• Be aware of how Cassandra Logged Batches work
– If you need isolation, only write to a single partition
• Avoid queue-like data sets and be aware of how tombstones might
impact your queries
– Try to query with ranges that avoid tombstones
58
Promo Codes
50% off Priority Passes: LukeT50
25% off Training and Certification: LukeTCert