Real-time Processing with Flink for Machine Learning at Netflix
Elliot Chow
Agenda
Recommendations @ Netflix
Data For Machine Learning
Processing with Flink
State/Join
Event-Time & Watermarks
Checkpointing
Monitoring and Understanding The Job
Recommendations
Scale
139 million+ members
190+ countries
450 billion+ unique events/day
700+ Kafka topics
Impressions
Member Activity
Log-in
Click
Play
Search
...
Recommendations Data
Context
Features - inputs to recommendation algorithms
...
Sessionization
Join with Recommendations Data
Output Data Format
Processing with Flink
Historically...
Spark + Spark Streaming
Some Challenges
Processing-time
Checkpointing performance and compatibility
Switching to Flink
Event-time Processing
Incremental Checkpointing
Custom Serializers
Internal Netflix Support
High-level Data Flow
Challenges and Considerations
Many microservices involved
Different join keys
Different expiration policies
Scale
Join / Window Implementation
Attempt I
class Event // ...
class State // ...
class Output // ...
def insert(input: Event, state: State): State = // ...
def emit(time: Timestamp): (State, List[Output]) = // ...
Store State in ValueState for each member
Call insert in processElement
Call emit in onTimer
Use custom Protobuf TypeSerializer
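A minimal plain-Scala model of this first attempt, with no Flink dependency, might look like the sketch below. The field layouts of Event, State, and Output, the extra state parameter on emit, and the mutable map standing in for per-member ValueState are all assumptions made for illustration; the slides elide the real definitions.

```scala
import scala.collection.mutable

object AttemptOneSketch {
  // Hypothetical shapes: the slides elide the real definitions.
  final case class Event(memberId: Long, timestamp: Long, payload: String)
  final case class State(events: Vector[Event])
  final case class Output(memberId: Long, events: Vector[Event])

  // Add one event to the member's single state object.
  def insert(input: Event, state: State): State =
    State(state.events :+ input)

  // Emit every event with timestamp <= time and keep the rest.
  // Assumption: emit also receives the current state.
  def emit(time: Long, state: State): (State, List[Output]) = {
    val (ready, pending) = state.events.partition(_.timestamp <= time)
    val outputs =
      if (ready.isEmpty) Nil
      else List(Output(ready.head.memberId, ready))
    (State(pending), outputs)
  }

  // Stand-in for keyed ValueState[State]: one whole State per member.
  private val valueState = mutable.Map.empty[Long, State]

  // Analogue of processElement: read the whole state, insert, write it all back.
  def processElement(event: Event): Unit = {
    val current = valueState.getOrElse(event.memberId, State(Vector.empty))
    valueState(event.memberId) = insert(event, current)
  }

  // Analogue of onTimer for one member.
  def onTimer(memberId: Long, time: Long): List[Output] = {
    val current = valueState.getOrElse(memberId, State(Vector.empty))
    val (next, outputs) = emit(time, current)
    valueState(memberId) = next
    outputs
  }
}
```

Every call to processElement in this model touches the member's entire State, which is the access pattern the issues that follow describe.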
Attempt I - Issues
State object is too large
Out-of-memory, even with rate-limiting outliers
Serialization/deserialization of entire state for inserting events is too costly
All windows get triggered simultaneously
Bursty resource usage
Attempt II
Use Flink's windowing API
Sliding Windows
Attempt II - Issues
Many copies of each event
Difficult to manage expiration for different events
Attempt III
Custom ProcessFunction
Manual window management
Break down state into many state objects
Use MapState, ListState, and ValueState where appropriate
Use a combination of event-time and processing-time timers
Maintain frequently-accessed metadata in ValueState
Minimum/maximum timestamps
Existing timers
Number of events and bytes (rate-limiting)
Optimize for writes (RocksDB backend)
Only read metadata during inserts
Insert (append) events to ListState
Deduplicate events at read time; write back deduplicated events
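The write-optimized pattern above can be sketched in plain Scala, without Flink. The Event and Meta shapes, the maxEvents limit, and the in-memory buffer standing in for ListState are assumptions for illustration, not the job's actual code.

```scala
import scala.collection.mutable

object AppendThenDedupe {
  // Hypothetical event shape; `id` is what identifies a duplicate.
  final case class Event(id: String, timestamp: Long)

  // Stand-in for the small metadata kept in ValueState.
  final case class Meta(count: Long, minTs: Long, maxTs: Long)

  final class MemberState(maxEvents: Long) {
    // Stand-in for ListState: appended to blindly on the write path.
    private val events = mutable.ArrayBuffer.empty[Event]
    private var meta = Meta(0L, Long.MaxValue, Long.MinValue)

    // Write path: consult metadata only, then append without reading the list.
    def insert(e: Event): Boolean =
      if (meta.count >= maxEvents) false // rate-limit outliers
      else {
        events += e
        meta = Meta(
          meta.count + 1,
          math.min(meta.minTs, e.timestamp),
          math.max(meta.maxTs, e.timestamp))
        true
      }

    // Read path: deduplicate by id (first occurrence wins), then write back.
    def read(): Vector[Event] = {
      val seen = mutable.LinkedHashMap.empty[String, Event]
      events.foreach(e => if (!seen.contains(e.id)) seen(e.id) = e)
      val deduped = seen.values.toVector
      events.clear()
      events ++= deduped
      meta = meta.copy(count = deduped.size.toLong)
      deduped
    }

    def metadata: Meta = meta
  }
}
```

The trade-off in this sketch is that duplicates occupy space and count against the limit until the next read.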
Randomly offset the windows
Member  Window Start  Window End
1       __:00         __:09
1       __:10         __:19
2       __:01         __:10
2       __:11         __:20
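One way to compute such staggered windows is sketched below. The slides say the offsets are random; deriving a stable offset from the member ID (offsetFor) is an assumption used here so that a member always maps to the same windows. All names are hypothetical.

```scala
object OffsetWindows {
  // Stable per-member offset in [0, sizeMinutes).
  // Assumption: derived from the member ID rather than drawn at random.
  def offsetFor(memberId: Long, sizeMinutes: Long): Long =
    Math.floorMod(memberId, sizeMinutes)

  // Inclusive [start, end] window, in minutes, that contains `minute`
  // for a member whose windows are shifted by `offsetMinutes`.
  def windowFor(minute: Long, sizeMinutes: Long, offsetMinutes: Long): (Long, Long) = {
    val start =
      Math.floorDiv(minute - offsetMinutes, sizeMinutes) * sizeMinutes + offsetMinutes
    (start, start + sizeMinutes - 1)
  }
}
```

With 10-minute windows, an offset of 0 reproduces the rows for member 1 in the table and an offset of 1 reproduces the rows for member 2, so the two members' windows no longer fire at the same instant.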
Event-Time & Watermarks
Watermarking Crash Course
Event-time: time associated with the actual event
Watermark: a time marker stating that all data prior to this time has been seen
Event-time triggers fire based on the watermark
Example: BoundedOutOfOrdernessTimestampExtractor where outOfOrderness = 10 minutes
Event-Time      10:00  10:08  10:05  10:06  10:15
Max Event-Time  10:00  10:08  10:08  10:08  10:15
Watermark       09:50  09:58  09:58  09:58  10:05
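The arithmetic in this table can be checked with a few lines of plain Scala. This is a simplified model, not Flink's class: it recomputes the watermark after every event, and the names watermarks and minutes are hypothetical helpers.

```scala
object BoundedWatermark {
  // Watermark after each event: running maximum event-time
  // minus the allowed out-of-orderness.
  def watermarks(eventTimes: Seq[Long], outOfOrderness: Long): Seq[Long] =
    eventTimes
      .scanLeft(Long.MinValue)((maxSoFar, t) => math.max(maxSoFar, t))
      .tail
      .map(_ - outOfOrderness)

  // Helper: "HH:MM" to minutes since midnight.
  def minutes(hhmm: String): Long = {
    val parts = hhmm.split(":")
    parts(0).toLong * 60 + parts(1).toLong
  }
}
```

Note that the late events at 10:05 and 10:06 leave the watermark unchanged, because only the maximum event-time seen so far drives it.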
Watermark is maintained per partition
The watermark of an operator is computed as the minimum watermark of its inputs
Partition 1  09:50  09:58  09:58  09:58  10:05
Partition 2  09:53  09:57  09:58  10:03  10:08
Operator     09:50  09:57  09:58  09:58  10:05
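The operator row is the column-wise minimum of the partition rows, which a short sketch makes explicit. It assumes the partitions are sampled at the same aligned steps, as in the table; in a running job they advance independently. Times are minutes since midnight and combine is a hypothetical name.

```scala
object OperatorWatermark {
  // Each inner Seq holds one partition's watermark at successive steps.
  // The operator's watermark at each step is the minimum across partitions.
  def combine(partitions: Seq[Seq[Long]]): Seq[Long] =
    partitions.transpose.map(_.min)
}
```

If one partition's watermark never moves, the minimum never moves either, no matter how far the other partitions advance.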
A Couple Quick Observations
1. Event-time timestamps must be correct
2. If the watermark of any partition stops progressing, time will stop
Why Has Time Stopped?
System is unhealthy
Delays in input data sources
Backpressure
Underprovisioned cluster
Even a single bad TM can drag down the entire job
System appears healthy - somewhere, there is not enough data
Scheduled jobs
Region Failover
Kafka Skip-Partitions Feature
Topic is overprovisioned (ratio of partitions to events/second > 1, so some partitions can sit idle)
(Slightly) Custom Watermark Assigner
Based on BoundedOutOfOrdernessTimestampExtractor
1. Detect inactivity
2. Force time to move forward when inactive
3. Record metrics per partition per source
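The three steps above can be modeled in plain Scala as follows. The slides do not say how inactivity is detected or how far time is forced forward, so the policy here is an assumption: a partition counts as idle after idleTimeoutMs of wall-clock silence, and while idle its event-time is advanced by the length of the silence. The class and method names are hypothetical and this is not Flink's API.

```scala
object IdleAwareWatermarks {
  final class Assigner(outOfOrdernessMs: Long, idleTimeoutMs: Long) {
    private var maxEventTime = Long.MinValue
    private var lastEventWallClock = 0L
    private var lastEmitted = Long.MinValue
    private var forced = 0L // stand-in for a per-partition metric

    // Called for every event read from this partition.
    def onEvent(eventTimeMs: Long, wallClockMs: Long): Unit = {
      maxEventTime = math.max(maxEventTime, eventTimeMs)
      lastEventWallClock = wallClockMs
    }

    // Called periodically; returns the watermark for this partition.
    def currentWatermark(wallClockMs: Long): Long = {
      if (maxEventTime != Long.MinValue) {
        val idleFor = wallClockMs - lastEventWallClock
        val candidate =
          if (idleFor > idleTimeoutMs) {
            // Assumed policy: advance event-time by the length of the silence.
            forced += 1
            maxEventTime + idleFor - outOfOrdernessMs
          } else maxEventTime - outOfOrdernessMs
        // A watermark must never move backwards.
        lastEmitted = math.max(lastEmitted, candidate)
      }
      lastEmitted
    }

    def forcedAdvances: Long = forced
  }
}
```

The cost of this policy is that an event arriving after a forced advance may fall behind the watermark and be treated as late.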
Possible Improvements
More sophisticated inactivity detection
More flexible forced-time-progression
Detect inactivity at the source
Checkpointing Large State
One unresponsive TM can cause slowness or even failure of the entire checkpoint
Resource intensive (2x-3x CPU/Network)
Reduce checkpoint frequency and add a minimum pause between checkpoints
Increases duplicates when restoring job
Large catch-up after restore
An Observation About The State
Large portion of total state is recommendations data
Only ID and timestamp are needed for the join
Move Some State Out Of Flink
Keep only ID and timestamp in Flink
Move data to an external store
Fetching becomes an order of magnitude slower (network call vs. local disk)
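A sketch of the slimmed-down join, in plain Scala: Flink state is modeled as a map from recommendation ID to timestamp, and the payload is fetched from an external store only when the join matches. The ExternalStore trait, the RecData shape, and the maxAgeMs expiration rule are assumptions for illustration.

```scala
object SlimStateJoin {
  // Hypothetical shape of the data moved out of Flink.
  final case class RecData(id: String, payload: String)

  // Hypothetical client for the external store (a network call in practice).
  trait ExternalStore {
    def fetch(id: String): Option[RecData]
  }

  // `state` stands in for what stays in Flink: recommendation ID -> timestamp.
  // The store is consulted only after the ID and timestamp checks pass.
  def join(
      recId: String,
      eventTs: Long,
      state: Map[String, Long],
      maxAgeMs: Long,
      store: ExternalStore): Option[RecData] =
    state
      .get(recId)
      .filter(recTs => recTs <= eventTs && eventTs - recTs <= maxAgeMs)
      .flatMap(_ => store.fetch(recId))
}
```

Checking the ID and timestamp locally first keeps the slow network call off the path for events that do not join.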
Possible Improvements
Checkpoint to/restore from persistent EBS
Incremental savepoint
Clean restart after checkpoint
Monitoring and Understanding The Job
Flink Metrics
numberOfFailedCheckpoints, lastCheckpointDuration
inputQueueLength, outputQueueLength
currentLowWatermark
fullRestarts, downtime
...
Instance-/Container-Level Metrics
CPU, Network, Disk, Memory, GC, ...
Check for unbalanced processing
Time and Watermarks
Event timestamps of inputs
Relative to wall-clock time & watermark
Watermark relative to wall-clock time
At different operators
Break down by task
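The relative measurements listed above reduce to three differences. A minimal sketch, with hypothetical names and all times in epoch milliseconds:

```scala
object TimeLagMetrics {
  final case class Lags(
      eventLagMs: Long,              // wall-clock time minus event timestamp
      watermarkLagMs: Long,          // wall-clock time minus watermark
      eventAheadOfWatermarkMs: Long) // event timestamp minus watermark

  def lags(eventTimeMs: Long, watermarkMs: Long, wallClockMs: Long): Lags =
    Lags(
      wallClockMs - eventTimeMs,
      wallClockMs - watermarkMs,
      eventTimeMs - watermarkMs)
}
```

In this model a negative eventAheadOfWatermarkMs means the event arrived behind the watermark, and a steadily growing watermarkLagMs is the signature of time having stopped.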
Performance
Issues often only appear at scale
Time all parts of application
Look at CPU flamegraphs
Replay from earliest offset (Kafka)
State
Difficult to get insights about entire state at a point in time
Take a savepoint
Manually schedule timer for every key to collect metrics
Wrap-Up
Job has been running well in production, especially after moving to Flink 1.7
Continue to work on robustness, failure recovery, and operational ease
Trade off some consistency for higher availability
Auto-scaling
Thanks!
Questions?

Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine Learning at Netflix - Elliot Chow