Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your recovery story?"

Failure is not fatal:
what is your recovery story?
Steven Wu @stevenzwu

A streaming job can fail
[Diagram: Source, Flink Streaming Job, Sink, Micro Service (Data Enrichment)]

The application can have a bug
[Diagram: Source, Flink Streaming Job, Sink, Micro Service (Data Enrichment)]

The dependency service may return bad data
[Diagram: Source, Flink Streaming Job, Sink, Micro Service (Data Enrichment)]

The sink can fail
[Diagram: Source, Flink Streaming Job, Sink, Micro Service (Data Enrichment)]

How can we recover?
Source: https://pixabay.com/en/man-working-what-to-do-311326/

Agenda
● Hive Backfill
● Flink Rewind
● Caveats

We are building a stream processing
platform on top of Apache Flink

That integrates with the Netflix ecosystem
Titus and others ….

Geico caveman, https://memegenerator.net

Demo: how to bootstrap a new project


This is the generated skeleton code
createSource("example-kafka-source")
    .addSink(getSink("null-sink"))
    .name("null-sink");

Users can add business logic
createSource("example-kafka-source")
    .keyBy(<key selector>)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .reduce(<window function>)
    .addSink(getSink("hive-sink"))
    .name("hive-sink");

User can override source configuration
Override Kafka cluster VIP
kafka-test:2181 kafka-prod:2181

User can override any job config

Streaming data also go to Hive in
addition to Kafka
Hive
Kafka

Live job continues to run
[Diagram: Kafka, Live Job, Sink; timeline: outage period, Now]

User can start a parallel backfill job
reading from Hive
[Diagram: Kafka, Live Job, Backfill Job, Sink; timeline: outage period, Now]

We implemented a Hive source with
the DataStream API
public class HiveSource<OUT>
extends RichParallelSourceFunction<OUT>
implements CheckpointedFunction,
ResultTypeQueryable<OUT> {
// ...
}

We provide a dynamic source that allows users
to switch from Kafka to Hive


Create a Hive backfill job under the same
application as the live job

Users need to override the selected source

It is NOT a lambda architecture
● Single streaming code base
● Just switch source from Kafka to Hive

Hive backfill is likely not good for stateful jobs
● Warm-up issue
● Ordering issue

For stateful jobs, each input record is evaluated
against application state accumulated over time
Image adapted from Stephan Ewen

Backfill job started with an empty state
[Diagram: Backfill Job, Sink; timeline: outage period]

Why don’t we add a warm-up period
to build up the proper state?
[Diagram: Backfill Job, Sink; timeline: warm-up period, then outage period]

Need to avoid output during
warm-up period
[Diagram: Backfill Job, Sink; timeline: warm-up period (no output), then outage period (emit output)]
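
A minimal sketch of the warm-up idea on this slide, in plain Java rather than Flink code: state is updated for every record, but output is suppressed until the record's event time reaches the start of the outage period. The class name, the per-key count used as "state", and the timestamps are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: count events per key, but stay silent during warm-up. */
final class WarmUpGate {
    private final long emitFromMillis;                       // assumed start of the outage period
    private final Map<String, Long> counts = new HashMap<>(); // stands in for application state
    private final List<String> output = new ArrayList<>();    // stands in for the sink

    WarmUpGate(long emitFromMillis) {
        this.emitFromMillis = emitFromMillis;
    }

    void process(String key, long eventTimeMillis) {
        // State is always updated, including during the warm-up period.
        long updated = counts.merge(key, 1L, Long::sum);
        // Output is emitted only once we are inside the period being backfilled.
        if (eventTimeMillis >= emitFromMillis) {
            output.add(key + "=" + updated);
        }
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        WarmUpGate gate = new WarmUpGate(1_000L);
        gate.process("a", 100L);   // warm-up: state only
        gate.process("a", 500L);   // warm-up: state only
        gate.process("a", 1_200L); // outage period: emitted with warmed-up count
        System.out.println(gate.output()); // prints [a=3]
    }
}
```

The emitted count already includes the two warm-up records, which is the point of warming up before emitting.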

Hive backfill is likely not good for stateful jobs
● Warm-up issue
● Ordering issue

Kafka messages are ordered within a partition
Source: kafka.apache.org

MapReduce data processing is driven by
this concept of input splits
[Diagram: Hive Table; S3 files (physical): file-1, file-2, ..., file-X; Input Splits (logical): 1, 2, 3, ..., Y; Mappers; Reducers]

Job manager does split calculation
[Diagram: Job Manager performs split calculation; files f0 to f3 map to splits s0 to s7]

Job manager broadcasts input splits to all task managers
[Diagram: Job Manager performs split calculation (files f0 to f3, splits s0 to s7) and sends s0 … s7 to each of three Task Managers]

Task managers run the same split assignment algorithm
[Diagram: Job Manager performs split calculation (files f0 to f3, splits s0 to s7) and sends s0 … s7 to three Task Managers; each Task Manager runs split assignment and takes its own subset of s0 to s7]
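
A minimal sketch of what "the same split assignment algorithm" could look like. The slides do not name the algorithm; round-robin by split index is an assumption made here for illustration, and the class and method names are hypothetical. Because every task manager receives the full split list and the function is deterministic, each one can compute its own share without further coordination.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: deterministic round-robin assignment of splits to subtasks. */
final class SplitAssignment {

    /** Returns the split indices owned by one subtask, given the full split count. */
    static List<Integer> splitsFor(int subtaskIndex, int parallelism, int totalSplits) {
        List<Integer> mine = new ArrayList<>();
        // Every subtask walks the same list and keeps every parallelism-th split.
        for (int split = subtaskIndex; split < totalSplits; split += parallelism) {
            mine.add(split);
        }
        return mine;
    }

    public static void main(String[] args) {
        int parallelism = 3;
        int totalSplits = 8; // s0 .. s7, as on the slide
        for (int subtask = 0; subtask < parallelism; subtask++) {
            System.out.println("subtask " + subtask + " -> "
                    + splitsFor(subtask, parallelism, totalSplits));
        }
    }
}
```

With three subtasks and eight splits, the subtasks own disjoint subsets that together cover s0 to s7.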

There is no guarantee of order for data files
[Diagram: files f0 (hour=0), f1, f2 (hour=23), f3 (hour=12) map to splits s0 to s7; one Task Manager reads s0 (hour=0), ..., s3 (hour=23), s6 (hour=12), s9 (hour=3), ...]

Does ordering matter?
● Usually not for stateless jobs
● Probably important for stateful jobs

Late events can be dropped
Source: flink.apache.org
● When watermark is past the end
timestamp of the window
● Allowed lateness can give some extra
time buffer
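
A minimal sketch of the drop rule in the bullets above, assuming event-time windows whose end timestamp is exclusive and assuming a record is dropped once the watermark has reached the window's last timestamp plus the allowed lateness. The names are hypothetical; this models the rule in plain Java and is not Flink's own code.

```java
/** Hypothetical sketch of the "late event" rule for an event-time window. */
final class LateEventCheck {

    /**
     * @param windowEndMillis       exclusive end timestamp of the window the record belongs to
     * @param allowedLatenessMillis extra time buffer granted to late records
     * @param watermarkMillis       current watermark
     * @return true if a record for this window would be dropped as late
     */
    static boolean isDropped(long windowEndMillis, long allowedLatenessMillis, long watermarkMillis) {
        long lastTimestampInWindow = windowEndMillis - 1;
        return lastTimestampInWindow + allowedLatenessMillis <= watermarkMillis;
    }

    public static void main(String[] args) {
        // Window [0, 5000), no allowed lateness, watermark already at 6000: dropped.
        System.out.println(isDropped(5_000L, 0L, 6_000L));     // prints true
        // Same window with 2 seconds of allowed lateness: still accepted.
        System.out.println(isDropped(5_000L, 2_000L, 6_000L)); // prints false
    }
}
```

This is why unordered Hive files matter for stateful jobs: a split from hour=23 can push the watermark far ahead before a split from hour=12 is read.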

Checkpoint snapshots and uploads state to DFS
Source: http://flink.apache.org/
checkpoint store
● HDFS
● S3

Checkpoint achieves fault tolerance
[Timeline: Checkpoint x-1, Checkpoint x, Now]


Enable externalized checkpoints
CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

Rewind job to a checkpoint before outage period
[Timeline: Checkpoint y, Checkpoint x, outage period, Checkpoint x+1, Now]


There are no warmup and ordering
issues with Flink rewind
● Application state is correct after rewind
● Flink job is still reading the same Kafka
source

Choose the external checkpoint option

Kafka retention matters
[Timeline: Kafka retention window ending at Now and covering the outage period; the start of the retention window is as far as we can go back]
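
A minimal sketch of the retention constraint on this slide: a rewind is only possible if the checkpoint was taken within the Kafka retention window. Using the checkpoint time as a stand-in for the age of the offsets stored in it is a simplifying assumption, and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch: is a checkpoint still inside the Kafka retention window? */
final class RewindWindow {

    static boolean canRewindTo(Instant checkpointTime, Instant now, Duration kafkaRetention) {
        // Oldest point in time for which Kafka is assumed to still hold data.
        Instant oldestAvailable = now.minus(kafkaRetention);
        return !checkpointTime.isBefore(oldestAvailable);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2018-09-04T16:00:00Z");        // arbitrary example time
        Instant checkpoint = Instant.parse("2018-09-04T13:00:00Z"); // taken before the outage
        System.out.println(canRewindTo(checkpoint, now, Duration.ofHours(24))); // prints true
        System.out.println(canRewindTo(checkpoint, now, Duration.ofHours(2)));  // prints false
    }
}
```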

Can we have 10 days of
Kafka retention?

Anatomy of data stream
Adapted from “An elastic batch-and stream-processing stack with Pravega and Apache Flink” by Stephan Ewen and Flavio Junqueira
[Diagram: data stream timeline divided into Present, Recent Past, Distant Past]


Can’t keep 10 days of data on local disk
● d2.8xl: 10 Gbps of network, 48 TB of disk
● Assuming 2 Gbps ingestion rate per instance, 10 days of data requires 216 TB of disk
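
The 216 TB figure can be checked with a short calculation. This sketch assumes decimal units (1 GB = 8 Gb, 1 TB = 1,000 GB) and a constant ingestion rate; the class and method names are hypothetical.

```java
/** Hypothetical sketch: disk needed to retain a constant-rate stream. */
final class RetentionSizing {

    static double requiredTerabytes(double ingestGbps, int retentionDays) {
        double gigabytesPerSecond = ingestGbps / 8.0;        // 8 bits per byte
        double secondsRetained = retentionDays * 86_400.0;   // seconds per day
        return gigabytesPerSecond * secondsRetained / 1_000.0; // GB to decimal TB
    }

    public static void main(String[] args) {
        // 2 Gbps = 0.25 GB/s = 21.6 TB per day, so 10 days = 216 TB.
        System.out.println(requiredTerabytes(2.0, 10)); // prints 216.0
    }
}
```

At 216 TB per instance, the requirement is 4.5 times the 48 TB of local disk on a d2.8xl.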

EBS is more expensive than S3
Storage                          Cost (per month)
EBS: throughput optimized HDD    $0.045 per GB
S3 Standard                      $0.021 per GB
S3 Standard-Infrequent Access    $0.013 per GB
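
To make the price gap concrete, here is a small sketch that applies the per-GB prices listed above to the 216 TB per-instance figure from the earlier slide. It assumes 1 TB = 1,000 GB and ignores request, transfer, and retrieval charges, which the slides do not cover; the names are hypothetical.

```java
/** Hypothetical sketch: monthly storage cost at a flat per-GB price. */
final class StorageCost {

    static double monthlyCostUsd(double terabytes, double usdPerGbMonth) {
        return terabytes * 1_000.0 * usdPerGbMonth; // decimal TB to GB, then price
    }

    public static void main(String[] args) {
        double terabytes = 216.0; // 10 days at 2 Gbps, from the earlier slide
        System.out.printf("EBS st1:        $%.0f%n", monthlyCostUsd(terabytes, 0.045));
        System.out.printf("S3 Standard:    $%.0f%n", monthlyCostUsd(terabytes, 0.021));
        System.out.printf("S3 Standard-IA: $%.0f%n", monthlyCostUsd(terabytes, 0.013));
    }
}
```

Under these assumptions the monthly storage cost works out to about $9,720 on EBS, $4,536 on S3 Standard, and $2,808 on S3 Standard-Infrequent Access.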

What if Kafka offloads historical data to
the S3 Infrequent Access tier?
[Diagram: data stream timeline divided into Present, Recent Past, Distant Past; the Distant Past segment is stored in Infrequent Access]

Here are the benefits of tiered storage
● Only deal with Kafka source
● Support 10-day retention cost-efficiently

There are systems that have implemented
tiered storage

Hive Backfill vs. Flink Rewind

                  Hive backfill      Flink rewind
Warm-up issue     Yes                No
Ordering issue    Yes                No
Applicability     Stateless          Stateless and stateful
Data retention    Weeks or months    Hours or days

Pros for Hive backfill
● Long-term storage
● No delay for processing latest events
● Can achieve fast recovery

Here is our recommendation to users
● Stateless: Hive Backfill
● Stateful: Flink Rewind

Caveat 1: Don’t overwhelm external services
[Diagram: Source, Flink Streaming Job, Sink, Micro Service; "10x load" labeled twice]
● Size cluster properly
● Rate limit operator
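
A minimal sketch of the rate-limiting idea in the last bullet, in plain Java. It is not the operator used in the talk: it is a simple pacing limiter that tells the caller how long to wait before the next call to the external service. The clock is passed in explicitly so the behavior is easy to verify; all names are hypothetical.

```java
/** Hypothetical pacing limiter: spaces requests at least 1/rate seconds apart. */
final class PacingRateLimiter {
    private final long intervalNanos; // minimum gap between two granted requests
    private long nextFreeNanos;       // earliest time the next request may run

    PacingRateLimiter(double permitsPerSecond, long startNanos) {
        this.intervalNanos = (long) (1_000_000_000L / permitsPerSecond);
        this.nextFreeNanos = startNanos;
    }

    /** Reserves the next slot and returns how many nanoseconds the caller should wait. */
    long reserve(long nowNanos) {
        long grantedAt = Math.max(nowNanos, nextFreeNanos);
        nextFreeNanos = grantedAt + intervalNanos;
        return grantedAt - nowNanos;
    }

    public static void main(String[] args) {
        PacingRateLimiter limiter = new PacingRateLimiter(10.0, 0L); // 10 calls per second
        System.out.println(limiter.reserve(0L)); // prints 0
        System.out.println(limiter.reserve(0L)); // prints 100000000
        System.out.println(limiter.reserve(0L)); // prints 200000000
    }
}
```

In a backfill or rewind, a limiter like this would sit in front of the call to the micro service so that catching up at 10x speed does not translate into 10x load on the dependency.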

Caveat 2: Your dependency may not participate in rewind
[Diagram: Source, Flink Streaming Job, Sink, A/B Service; timeline built up step by step]
● Process live msg X: lookup "Alice?" returns Cell A
● Allocation change
● Outage period
● Rewind
● Reprocess old msg X: lookup "Alice?" returns Cell B

Solution 1: Support lookup with historical view
[Diagram: Source, Flink Streaming Job, Sink, A/B Service; timeline labeled 1pm, 4pm, allocation change, outage period, rewind]
● Process old msg X: lookup "Alice at 1pm?" returns Cell A
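
A minimal sketch of a lookup with a historical view, assuming the service keeps every allocation together with the time it took effect. Times are plain hour-of-day numbers for readability, and the 2pm allocation change is an invented example value; the slides only say the change happened at some point after message X was first processed. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: versioned allocations that can be queried "as of" a time. */
final class HistoricalAllocations {
    // user -> (effective-from time -> cell), sorted by time
    private final Map<String, TreeMap<Long, String>> history = new HashMap<>();

    void record(String user, long effectiveFrom, String cell) {
        history.computeIfAbsent(user, u -> new TreeMap<>()).put(effectiveFrom, cell);
    }

    /** Returns the cell in effect at the given time, or null if none was recorded yet. */
    String lookup(String user, long asOf) {
        TreeMap<Long, String> versions = history.get(user);
        if (versions == null) {
            return null;
        }
        // Latest version whose effective-from time is not after the requested time.
        Map.Entry<Long, String> entry = versions.floorEntry(asOf);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        HistoricalAllocations service = new HistoricalAllocations();
        service.record("Alice", 0L, "Cell A");
        service.record("Alice", 14L, "Cell B"); // assumed allocation change at 2pm
        System.out.println(service.lookup("Alice", 13L)); // prints Cell A
        System.out.println(service.lookup("Alice", 16L)); // prints Cell B
    }
}
```

A rewound job that passes the message's original time (1pm) gets the same answer it got the first time, regardless of when the reprocessing happens.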

Solution 2: Convert table lookup to a
streaming source
[Diagram, before: Source, Flink Streaming Job, Sink; the job does a table lookup against the A/B Service]
[Diagram, after: A/B is a second Source feeding the Flink Streaming Job; A/B data becomes part of app state]
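
A minimal sketch of why this helps, in plain Java rather than Flink: allocation changes and messages are both treated as timestamped events, and the allocation table is rebuilt as state while replaying them in time order. Replaying the same events again gives the same enrichment, because the state is rewound together with the messages. The event model, names, and hour-based timestamps are hypothetical, and sorting the whole input in memory is a simplification of what a streaming job would do with event time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: A/B allocations arrive as a stream and live in job state. */
final class AllocationAsState {

    /** Either an allocation update (newCell != null) or a message to enrich (newCell == null). */
    static final class Event {
        final long hour;
        final String user;
        final String newCell;

        Event(long hour, String user, String newCell) {
            this.hour = hour;
            this.user = user;
            this.newCell = newCell;
        }
    }

    static Event allocation(long hour, String user, String cell) {
        return new Event(hour, user, cell);
    }

    static Event message(long hour, String user) {
        return new Event(hour, user, null);
    }

    /** Replays events in time order and enriches each message from the state built so far. */
    static List<String> replay(List<Event> events) {
        List<Event> ordered = new ArrayList<>(events);
        ordered.sort(Comparator.comparingLong((Event e) -> e.hour)); // stable for equal times
        Map<String, String> cellByUser = new HashMap<>();            // the "app state"
        List<String> enriched = new ArrayList<>();
        for (Event e : ordered) {
            if (e.newCell != null) {
                cellByUser.put(e.user, e.newCell);
            } else {
                enriched.add(e.user + "@" + e.hour + " -> " + cellByUser.get(e.user));
            }
        }
        return enriched;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                allocation(0, "Alice", "Cell A"),
                message(13, "Alice"),
                allocation(14, "Alice", "Cell B"), // assumed time of the allocation change
                message(16, "Alice"));
        System.out.println(replay(events)); // prints [Alice@13 -> Cell A, Alice@16 -> Cell B]
    }
}
```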

Caveat 3: Watch out for the impact to
downstream consumers
● Idempotent sink
○ ElasticSearch, Cassandra
● Resettable sink
○ Drop Hive partition with bad data
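
A minimal sketch of what makes a sink idempotent: each record is written under a deterministic key, so reprocessing the same record after a rewind overwrites the earlier write instead of adding a duplicate. An in-memory map stands in for a keyed store here; the class name and record IDs are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: keyed upserts make replayed writes harmless. */
final class IdempotentSink {
    private final Map<String, String> store = new HashMap<>(); // stands in for a keyed store

    /** Writes the value under its ID, replacing any earlier write with the same ID. */
    void upsert(String id, String value) {
        store.put(id, value);
    }

    Map<String, String> contents() {
        return store;
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        sink.upsert("msg-1", "bad result");   // written during the outage
        sink.upsert("msg-2", "ok");
        sink.upsert("msg-1", "fixed result"); // same record reprocessed after rewind
        System.out.println(sink.contents().size());       // prints 2
        System.out.println(sink.contents().get("msg-1")); // prints fixed result
    }
}
```

The same property also repairs bad output: the corrected record replaces the bad one because both share the same key.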

Here is a more complicated case
with Kafka sink
Topic1 -> Job1 -> Topic2 -> Job2 -> Topic3

At 4pm, job1 rewound to the checkpoint
taken at 1pm
[Timeline: outage period from 1pm to 4pm]

What should job2 do?
[Diagram: two timelines, each with an outage period, labeled 1pm, 4pm, 1pm, 4pm, 5pm; job2's next step is shown as ???]

Anatomy of topic2
[Diagram: topic2 contents in order: before outage, outage (bad data) from 1pm to 4pm, after job1 rewind starting again from 1pm]

Rewind job2 to 1pm
● Correct app state
● Still reprocesses bad data
[Timeline: outage period (bad data); job2 rewound to 1pm]

Rewind job2 to 4pm
● Still bad app state
● Skips bad data
[Timeline: outage period (bad data) from 1pm; job2 rewound to 4pm]

Stop job1 and job2 first
[Timeline: bad data between 1pm and 4pm]

Wipe out all messages from topic2
All bad data are gone!

Rewind job2 to 1pm checkpoint
● Correct app state
[Timeline: outage period; 1pm]

Rewind job1 to 1pm checkpoint
● Correct app state
● Skip bad data
[Timeline: outage period; 1pm]

It is difficult to execute
● Very involved process
● Need coordination between job1 and job2

Caveats recap
● Don’t overwhelm external services
● Your dependency may not participate in
rewind
● Watch out for the impact to downstream
consumers

Steven Wu @stevenzwu
What defines us is
how well we rise
after falling
