Shriya Arora
Data Engineering & Infrastructure
Taming large state for Personalization
What is this talk about?
● Deriving signal from high volume real-time events
● Using Flink state management to achieve a real-time join
● Operations and path to production
● Challenges and Learnings
What are Merched Impressions?
What is a Play-Impression Takerate?
● Number of merched impressions per user play
● Attributes of impressions leading to the play
● Attributes of the play coming from different impressions
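As a rough illustration of the first bullet, the sketch below computes the average number of impressions per play for each video. The `JoinedPlay` record and its fields are hypothetical, and the slides do not specify how the production metric is actually aggregated.

```scala
// Hypothetical record: one play, already joined with the impressions that preceded it.
final case class JoinedPlay(videoId: String, impressionCount: Int)

object TakeRate {
  /** Average number of impressions per play, grouped by video. */
  def impressionsPerPlay(plays: Seq[JoinedPlay]): Map[String, Double] =
    plays.groupBy(_.videoId).map { case (video, ps) =>
      video -> ps.map(_.impressionCount).sum.toDouble / ps.size
    }
}
```

Calling `TakeRate.impressionsPerPlay` on two plays of one video preceded by 4 and 2 impressions gives 3.0 for that video.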
What do we use it for?
● Ranking Videos
● Targeting and Reach
● Content Promotion
● Asset Personalization
Volume and Scale:
● 130M members
● ~10B Impressions
● ~2.5B Play Events
● 140M Play hours/day
Why do we need a streaming solution for take-rate?
● Model training on fresher data
○ Reduce time delay between event generation and signal
○ Faster feedback around launches
○ Event relevance is temporal in nature
● Long turnaround time on error correction in batch
○ Long-running batch jobs have all-or-none failure modes
○ Lack of real-time auditing delays error detection
What are the challenges we will need to solve?
● High-volume input streams
● Out-of-order and late-arriving events
● Large state
○ ~1 TB of state per region
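To make "out-of-order and late-arriving" concrete, here is a simplified plain-Scala model of a watermark that trails the largest event time seen so far by a fixed bound. The object name, the bound, and the exact lateness rule are assumptions for illustration. This is not the Flink watermark implementation.

```scala
object Lateness {
  /**
   * Walks event timestamps in arrival order and returns those that arrive
   * behind the watermark, where watermark = max event time seen so far - bound.
   */
  def lateEvents(eventTimes: Seq[Long], maxOutOfOrderness: Long): Seq[Long] = {
    var maxSeen: Option[Long] = None
    val late = Seq.newBuilder[Long]
    for (ts <- eventTimes) {
      // An event is late if it is older than the current watermark.
      if (maxSeen.exists(m => ts < m - maxOutOfOrderness)) late += ts
      maxSeen = Some(maxSeen.fold(ts)(m => math.max(m, ts)))
    }
    late.result()
  }
}
```

With a bound of 5 and arrivals `100, 105, 98, 120, 101`, the events 98 and 101 fall behind the watermark. Whatever join is used has to decide how long to wait for such events, and that waiting is what makes the state grow.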
Approaches:
#1 Window Joins
○ Events are delayed independently of each other
#2 Aggregation over Windows followed by Join
○ Streams can be reduced as they are held in state
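One way to read the drawback noted under approach #1: because the two streams are delayed independently, an impression and the play it led to can land in different windows and never meet. The sketch below models a tumbling-window join in plain Scala. The `Event` type and the window size are hypothetical, and this is not the Flink window-join API.

```scala
// Hypothetical event: a join key plus an event-time timestamp in milliseconds.
final case class Event(key: String, ts: Long)

object WindowJoin {
  /** Pairs events that share a key and fall into the same tumbling window. */
  def tumblingJoin(left: Seq[Event], right: Seq[Event], sizeMs: Long): Seq[(Event, Event)] =
    for {
      l <- left
      r <- right
      if l.key == r.key && l.ts / sizeMs == r.ts / sizeMs
    } yield (l, r)
}
```

With a 60-second window, an impression at 58 s and a play at 62 s for the same key produce no output, even though they are only four seconds apart.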
Approaches (cont.):
#3 CoProcess Function with Single MapState
○ High variance in stream volumes and logic
#4 CoProcess Function with two ValueStates
○ Each stream gets its own ValueState
A tale of two states
● CoProcess Function
● Save each keyed stream into its own ValueState
● For each event in a stream, reduce state on duplicates
● For each event in either stream, cross query across states
● Use timerService to expire events from State
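The "reduce state on duplicates" step can be sketched as folding each incoming impression into a summary instead of storing every raw event. The talk does not say what counts as a duplicate, so the `(videoId, row)` key and the `ImpressionSummary` fields below are assumptions.

```scala
// Hypothetical impression event and the reduced form kept in state.
final case class Impression(videoId: String, row: Int, ts: Long)
final case class ImpressionSummary(count: Int, lastTs: Long)

object ReduceOnDuplicates {
  type State = Map[(String, Int), ImpressionSummary]

  /** Folds one impression into state, collapsing repeats of the same (videoId, row). */
  def add(state: State, imp: Impression): State = {
    val key = (imp.videoId, imp.row)
    val summary = state.get(key) match {
      case Some(s) => ImpressionSummary(s.count + 1, math.max(s.lastTs, imp.ts))
      case None    => ImpressionSummary(1, imp.ts)
    }
    state.updated(key, summary)
  }
}
```

Folding three impressions, two of which share a video and row, leaves two entries in state rather than three. At high impression volume this keeps state growth tied to distinct items rather than raw events.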
Data Flow Architecture
[Diagram labels: Play stream, Impressions stream, keyBy, Co-process Fn, State 1, State 2, F(x) + Ts, F(y) + Ts, Output]
Anatomy of CoProcess Function
def processElement1(value: T, ctx: Context, ...)
  Access elements of the first stream,
  update and reduce state, look up state 2
  for out-of-order joins, apply timer
def processElement2(value: K, ctx: Context, ...)
  Access elements of the second stream,
  look up and join to state 1, apply timer
def onTimer(ts: Long, ...)
  Clear up state based on event time ts.
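The three callbacks can be modeled in plain Scala for a single key. This is a stand-in to show the control flow, not the Flink CoProcessFunction API: the event types, the `ttl` parameter, and the `advanceWatermark` driver are all hypothetical, and real code would hold the two collections in Flink keyed state and register timers through the timer service.

```scala
import scala.collection.mutable

// Hypothetical event types for one key (for example, one user).
final case class Impression(videoId: String, ts: Long)
final case class Play(videoId: String, ts: Long)
final case class Joined(play: Play, impression: Impression)

/** Single-key model of a two-input function with one state per stream. */
class TwoStateJoin(ttl: Long) {
  private var impressions = Vector.empty[Impression] // stands in for state 1
  private var plays       = Vector.empty[Play]       // stands in for state 2
  private val timers      = mutable.TreeSet.empty[Long]

  /** First stream: store the event, look up state 2 for out-of-order joins, set a timer. */
  def processElement1(imp: Impression): Seq[Joined] = {
    impressions = impressions :+ imp
    timers += imp.ts + ttl
    plays.filter(_.videoId == imp.videoId).map(p => Joined(p, imp))
  }

  /** Second stream: store the event, join against state 1, set a timer. */
  def processElement2(play: Play): Seq[Joined] = {
    plays = plays :+ play
    timers += play.ts + ttl
    impressions.filter(_.videoId == play.videoId).map(i => Joined(play, i))
  }

  /** Driver for the model: fires every timer at or before the watermark. */
  def advanceWatermark(watermark: Long): Unit = {
    val due = timers.takeWhile(_ <= watermark).toList
    due.foreach { t => timers -= t; onTimer(t) }
  }

  /** Drops every stored event whose expiry time has been reached. */
  private def onTimer(ts: Long): Unit = {
    impressions = impressions.filter(_.ts + ttl > ts)
    plays = plays.filter(_.ts + ttl > ts)
  }

  def stateSize: Int = impressions.size + plays.size
}
```

Each impression/play pair is emitted once, by whichever event arrives second, so a late impression still joins against a play that is already in state.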
State management
A tale of two states
Challenges with Operations
● Visibility into application event time progression
○ Flink UI bug: FLINK-8949
Challenges with Operations (cont.)
● Visibility into state size
○ RocksDB statistics have to be logged manually
Future Work
● State migration
● Data restatement and recovery
Questions?
Follow us!
@netflixdata
@shriyarora

Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join datasets for Personalization"
