Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join datasets for Personalization"

Shriya Arora
Data Engineering & Infrastructure
Taming large state for Personalization

What is this talk about?
● Deriving signal from high volume real-time events
● Using Flink state management to achieve a real-time join
● Operations and path to production
● Challenges and Learnings

What is a Play-Impression Takerate?
● Number of merched impressions per user play
● Attributes of impressions leading to the play
● Attributes of the play coming from different impressions
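
The definition above can be made concrete with a small sketch. It assumes a take-rate record is built per play by collecting the impressions of the same user and video that occurred at or before the play. The event shapes and field names (`userId`, `videoId`, `row`, `ts`) are hypothetical and not taken from the talk.

```scala
// Hypothetical event shapes; field names are illustrative assumptions.
final case class Impression(userId: String, videoId: String, row: String, ts: Long)
final case class Play(userId: String, videoId: String, ts: Long)
final case class TakeRate(play: Play, impressionCount: Int, rows: Set[String])

object TakeRateExample {
  // For one play, collect the impressions of the same user and video that
  // happened at or before the play, then summarize their count and attributes.
  def takeRate(play: Play, impressions: Seq[Impression]): TakeRate = {
    val leading = impressions.filter { i =>
      i.userId == play.userId && i.videoId == play.videoId && i.ts <= play.ts
    }
    TakeRate(play, leading.size, leading.map(_.row).toSet)
  }
}
```

In this sketch an impression that arrives after the play, or that belongs to another video, does not count toward the play.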

What do we use it for?
● Ranking Videos
● Targeting and Reach
● Content Promotion
● Asset Personalization

Volume and Scale:
● 130M members
● ~10B Impressions
● ~2.5B Play Events
● 140M Play hours/day

Why do we need a streaming solution for take-rate?
● Model Training on fresher data
○ Reduce time delay between event generation and signal
○ Faster feedback around launches
○ Event relevance is temporal in nature
● Long turnaround time on error correction
○ Long running batch jobs have all-or-none failure modes
○ Lack of real-time auditing delays error detection

What are the challenges we will need to solve?
● High-volume input streams
● Out-of-order and late-arriving events
● Large State
○ ~1 TB of state per region

Approaches:
#1 Window Joins
○ Events are delayed independently of each other
#2 Aggregation over Windows followed by Join
○ Streams can be reduced as they are held in state
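
One way to read the drawback of approach #1: a window join pairs only events that land in the same window, while the event times of an impression and its play are unrelated to window boundaries. The sketch below assumes tumbling event-time windows with zero offset and non-negative timestamps. The object and method names are hypothetical.

```scala
object WindowJoinGap {
  // Start of the tumbling window (zero offset) that contains ts,
  // valid for non-negative timestamps.
  def windowStart(ts: Long, sizeMs: Long): Long = ts - (ts % sizeMs)

  // A tumbling-window join can only pair two events assigned to the same window.
  def sameWindow(tsA: Long, tsB: Long, sizeMs: Long): Boolean =
    windowStart(tsA, sizeMs) == windowStart(tsB, sizeMs)
}
```

With one-minute windows, an impression at 58 s and a play at 62 s fall into different windows even though they are four seconds apart, so that pair would not be joined.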

Approaches:
#3 CoProcess Function with Single MapState
○ High variance in stream volumes and logic
#4 CoProcess Function with two Value states
○ Each stream gets its own value state

A tale of two states
● CoProcess Function
● Save each Keyed stream into its own ValueState
● For each event in a stream, reduce state on duplicates
● For each event in either stream, cross query across states
● Use timerService to expire events from State
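
The steps above can be sketched without Flink, using two in-memory maps as stand-ins for the two keyed ValueStates. This is a minimal illustration under assumptions. The reduce logic (counting impressions, keeping the latest play) and all type names are hypothetical, and timer-based expiry is left out.

```scala
import scala.collection.mutable

final case class ImpressionEvent(key: String, ts: Long)
final case class PlayEvent(key: String, ts: Long)
final case class JoinedRecord(key: String, impressionCount: Int, playTs: Long)

// Stand-in for two keyed ValueStates: one reduced value per key and per stream.
class TwoStateJoin {
  private val impressionState = mutable.Map.empty[String, Int]  // key -> impression count
  private val playState = mutable.Map.empty[String, PlayEvent]  // key -> latest play

  // Impression arrives: fold it into its own state, then query the play state.
  def onImpression(e: ImpressionEvent): Option[JoinedRecord] = {
    val count = impressionState.getOrElse(e.key, 0) + 1
    impressionState(e.key) = count
    playState.get(e.key).map(p => JoinedRecord(e.key, count, p.ts))
  }

  // Play arrives: keep the latest play per key, then query the impression state.
  def onPlay(e: PlayEvent): Option[JoinedRecord] = {
    val latest = playState.get(e.key).filter(_.ts >= e.ts).getOrElse(e)
    playState(e.key) = latest
    impressionState.get(e.key).map(c => JoinedRecord(e.key, c, latest.ts))
  }
}
```

Because each side writes its own state before querying the other, a join result is produced whichever stream arrives first for a key.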

Data Flow Architecture
[Diagram: Play stream and Impressions stream → keyBy → Co-process Fn, holding State 1 (F(x) + Ts) and State 2 (F(y) + Ts) → Output]

Anatomy of CoProcess Function
def processElement1(value: T, ctx: Context, ...)
    Access elements of the first stream, update and reduce state,
    look up state 2 for out-of-order joins, apply timer.
def processElement2(value: K, ctx: Context, ...)
    Access elements of the second stream, look up and join to state 1,
    apply timer.
def onTimer(ts: Long, ...)
    Clear up state based on event time ts.
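
The timer part of this anatomy can be illustrated with a toy model of state plus an event-time timer service. It is a sketch under assumptions, not Flink code. `ExpiringState`, `ttlMs` and `advanceWatermark` are hypothetical names, and the model keeps one timer per key, whereas a real timer service can hold several timers for the same key.

```scala
import scala.collection.mutable

// Toy stand-in for keyed state plus an event-time timer service.
class ExpiringState[V](ttlMs: Long) {
  private val state = mutable.Map.empty[String, V]
  private val timers = mutable.Map.empty[String, Long]  // key -> expiry timestamp

  // Store or overwrite a value and (re)register its expiry timer,
  // as processElement1/processElement2 would after updating state.
  def put(key: String, value: V, eventTs: Long): Unit = {
    state(key) = value
    timers(key) = eventTs + ttlMs
  }

  def get(key: String): Option[V] = state.get(key)

  // Called when the watermark advances: fire every timer at or before it and
  // clear the matching state, mirroring onTimer. Returns the expired keys.
  def advanceWatermark(watermark: Long): Set[String] = {
    val expired = timers.collect { case (k, ts) if ts <= watermark => k }.toSet
    expired.foreach { k =>
      state.remove(k)
      timers.remove(k)
    }
    expired
  }
}
```

Expiry here is driven by event time rather than wall-clock time, so state is held until the watermark passes the timer, however late the events arrive.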

Challenges with Operations
● Visibility into application event time progression
○ Flink UI bug: FLINK-8949

Challenges with Operations (cont.)
● Visibility into State size
○ RocksDB Statistics have to be logged manually

Future Work
● State migration
● Data restatement and recovery

Questions?
Follow us!
@netflixdata
@shriyarora
