Flink’s Pluggable Failure Handling:
deal with streaming errors the smart way!
About me
Panagiotis (Panos) Garefalakis (@pgaref)
Software Engineer - SPA - Confluent
Flink runtime team - Apache Flink contributor
Apache Hive, ORC Committer & PMC Member respectively
PhD in Distributed Systems, Imperial College London 2020
Overview
1. Intro: Flink framework and how users leverage it to implement streaming applications
2. Background: Main components of Flink’s distributed execution runtime and failure handling
3. Implementation: Introduce the Pluggable Failure Enrichers component as part of the JobMaster
4. Demo: Custom Failure Enrichers in just 4 steps and a Confluent Cloud demo!
5. Lessons Learned: Mistakes to avoid when running your own Failure Enrichers
6. Summary: Key points and useful links
Stream Processing with Flink
[Diagram: Sources (Kafka, databases, key/value stores, files, apps) feed real-time stream processing, which writes to Sinks. A Job is a Job Graph whose nodes are Operators connected by Edges.]
Writing Streaming Apps
    INSERT INTO results
    SELECT color, COUNT(*)
    FROM events
    WHERE color <> 'orange'
    GROUP BY color;
[Diagram: the statement as a Job Graph: events, FILTER (WHERE color <> 'orange'), GROUP BY color, COUNT, results.]
Running Streaming Apps
[Diagram, built up over five slides. Components: Client; Job Manager with Dispatcher, Resource Manager, Job Master, Execution Graph, Scheduler, Checkpoint Coordinator, REST Endpoint, and Slot Allocator; two Task Managers, each with a State Backend and Task Slots.]
Interactions added slide by slide: Submit Job; Assign Slot; Submit Task; Data Shuffle; Submit/Stop/Cancel Tasks, Checkpoint; Results.
Local Failures
[Diagram: the same runtime view, with an "Exception!" marker.]
Examples: Permissions Errors, Serialization Errors, ClassCast Errors, etc.
Global Failures
[Diagram: the same runtime view.]
Examples: Checkpoint Errors, Operator Coordinator Errors, etc.
Failure Handling
[Diagram, built up over two slides: the same runtime view, with a Failure Handler added to the Job Manager components.]
Labels on the second slide: "Restart Task", "Permissions Exception?", "Expose OOM Errors to Users?"
Extending Failure Handling
Enrich failures with extra metadata (e.g., type of failure)
Expose failures to downstream consumers (e.g., notification systems)
Support custom logic (pluggable interface)
Pluggable Failure Enrichers (FLIP-304, Flink 1.18)
[Diagram, built up over four slides: the same runtime view, with Failure Enrichers added alongside the Failure Handler. The example enricher is a Type Classifier.]
Example flow: a ClassCastException reaches "Handle Task Failure" and the Type Classifier, and the output shown is
exceptionName: ClassCastException
"failureLabels": {
"type": "USER"
}
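Failure Enricher Implementation (FLIP-304, Flink 1.18)
Step 1: Implement your enricher
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.CompletableFuture;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;
    import org.apache.flink.core.failure.FailureEnricher;
    import org.apache.flink.util.ExceptionUtils;

    public class TypeClassifier implements FailureEnricher {
        private static final String typeKey = "TYPE";

        // Declares the label keys this enricher may emit.
        @Override
        public Set<String> getOutputKeys() {
            return Stream.of(typeKey).collect(Collectors.toSet());
        }

        @Override
        public CompletableFuture<Map<String, String>> processFailure(Throwable cause, final Context ctx) {
            final Map<String, String> labels = new HashMap<>();
            // A ClassCastException anywhere in the cause chain is labeled USER; anything else SYSTEM.
            if (ExceptionUtils.findThrowable(cause, ClassCastException.class).isPresent()) {
                labels.put(typeKey, "USER");
            } else {
                labels.put(typeKey, "SYSTEM");
            }
            return CompletableFuture.completedFuture(labels);
        }
    }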
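The classifier in Step 1 depends on searching the whole cause chain, not only the top-level exception, because task failures usually arrive wrapped. The following is a minimal JDK-only sketch of that idea. `findInChain`, `classify`, and the depth limit of 64 are hypothetical names and values chosen for illustration; they are not Flink APIs and make no claim about how Flink implements its own utility.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class CauseChainSketch {
    /** Hypothetical helper: walks getCause() links looking for the given type. */
    static <T extends Throwable> Optional<T> findInChain(Throwable root, Class<T> type) {
        Throwable current = root;
        // Bound the walk (64 is an arbitrary choice) so a cyclic chain cannot loop forever.
        for (int depth = 0; current != null && depth < 64; depth++) {
            if (type.isInstance(current)) {
                return Optional.of(type.cast(current));
            }
            current = current.getCause();
        }
        return Optional.empty();
    }

    /** Mirrors the slide's rule: ClassCastException means USER, anything else SYSTEM. */
    static Map<String, String> classify(Throwable cause) {
        Map<String, String> labels = new HashMap<>();
        boolean userError = findInChain(cause, ClassCastException.class).isPresent();
        labels.put("TYPE", userError ? "USER" : "SYSTEM");
        return labels;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("task failed", new ClassCastException("bad cast"));
        System.out.println(classify(wrapped));                                   // {TYPE=USER}
        System.out.println(classify(new IllegalStateException("checkpoint")));   // {TYPE=SYSTEM}
    }
}
```

Checking only the outermost exception would label the wrapped failure as SYSTEM, which is why the chain search matters for this kind of classification.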
Step 2: Create an enricher factory
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.failure.FailureEnricher;
    import org.apache.flink.core.failure.FailureEnricherFactory;

    public class TypeClassifierFactory implements FailureEnricherFactory {
        @Override
        public FailureEnricher createFailureEnricher(Configuration conf) {
            return new TypeClassifier();
        }
    }
Step 3: Package jar
    META-INF/services/org.apache.flink.core.failure.FailureEnricherFactory
Step 4: Modify Flink configuration
    jobmanager.failure-enrichers = org.apache.flink.test.plugin.jar.failure.TypeClassifier
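The META-INF/services path follows the standard Java ServiceLoader convention: a file named after the service interface lists the implementation classes to load. The sketch below uses only the JDK to show what happens when that registration is missing. It assumes the plugin lookup behaves like a plain ServiceLoader lookup, which the slides do not confirm; `EnricherFactory` and `discover` are hypothetical stand-ins, not Flink types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class FactoryDiscoverySketch {
    /** Hypothetical stand-in for a factory interface; not the Flink type. */
    interface EnricherFactory {
        String name();
    }

    /** An implementation that exists in code but is not registered anywhere. */
    static class UnregisteredFactory implements EnricherFactory {
        @Override
        public String name() {
            return "unregistered";
        }
    }

    /** Lists the implementations that ServiceLoader can discover. */
    static List<String> discover() {
        List<String> names = new ArrayList<>();
        for (EnricherFactory factory : ServiceLoader.load(EnricherFactory.class)) {
            names.add(factory.name());
        }
        return names;
    }

    public static void main(String[] args) {
        // No META-INF/services entry names UnregisteredFactory,
        // so the loader finds nothing even though the class is compiled in.
        System.out.println("factories found: " + discover().size());
    }
}
```

If a custom enricher never shows up at runtime, a missing or misnamed services file inside the jar is a plausible first thing to check.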
Confluent Cloud demo
[Screenshots: Flink UI]
Lessons Learned
Documentation: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/advanced/failure_enrichers
FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
Apache Flink 1.18: https://www.confluent.io/blog/announcing-apache-flink-1-18
● Failure Enrichers might throw exceptions too, so make sure they are properly handled!
○ There is no way to enforce that no exceptions are thrown (pluggable component), and a thrown exception could result in labels being thrown away
● Bundle Failure Enrichers’ dependencies when you are using third-party libraries!
○ PluginLoader only allows whitelisted classes of the parent / system classloader
● Logs and system tests are your friends!
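The first lesson above can be made concrete with a defensive wrapper. This is a minimal JDK-only sketch, not Flink code: `safely` and the `labeler` function are hypothetical names, and falling back to an empty label map is just one possible policy. It covers both failure modes, a synchronous throw and a future that completes exceptionally.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class SafeEnricherSketch {
    /**
     * Runs hypothetical labeling logic so that neither a synchronous throw
     * nor an exceptionally completed future escapes to the caller.
     */
    static CompletableFuture<Map<String, String>> safely(
            Function<Throwable, CompletableFuture<Map<String, String>>> labeler,
            Throwable cause) {
        try {
            return labeler.apply(cause).exceptionally(error -> {
                System.err.println("enricher future failed: " + error);
                return Collections.<String, String>emptyMap();
            });
        } catch (RuntimeException error) {
            // Also covers a labeler that returns null instead of a future.
            System.err.println("enricher threw: " + error);
            return CompletableFuture.completedFuture(Collections.<String, String>emptyMap());
        }
    }

    public static void main(String[] args) {
        Function<Throwable, CompletableFuture<Map<String, String>>> buggy =
                cause -> { throw new IllegalStateException("bug in enricher"); };
        Function<Throwable, CompletableFuture<Map<String, String>>> healthy =
                cause -> CompletableFuture.completedFuture(
                        Collections.singletonMap("TYPE", "SYSTEM"));

        Throwable failure = new RuntimeException("task failure");
        System.out.println(safely(buggy, failure).join());    // {}
        System.out.println(safely(healthy, failure).join());  // {TYPE=SYSTEM}
    }
}
```

Logging inside the fallback path keeps the enricher's own bugs visible, which ties in with the last lesson about logs.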
Summary
● Flink service providers deal with a plethora of failure types, coming from different sources and followed by a variety of corrective actions
● Pluggable Failure Enrichers allow for
○ custom logic (classification, tagging, alerting, etc.)
○ custom metadata labels
○ asynchronous execution
○ simple implementation and packaging (independent jars)
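The "asynchronous execution" point can be illustrated with a small JDK-only sketch: label computation that might block (for example a remote lookup) is handed to an executor and returned as a future. Everything here is hypothetical, including `lookupOwner`, `enrichAsync`, the `OWNER` key, and the team names; none of it is taken from Flink or from the talk.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncEnricherSketch {
    /** Hypothetical lookup that could be slow in practice, e.g. a remote call. */
    static String lookupOwner(Throwable cause) {
        return cause instanceof SecurityException ? "platform-team" : "job-owner";
    }

    /** Computes labels on the given executor so the calling thread is not blocked. */
    static CompletableFuture<Map<String, String>> enrichAsync(Throwable cause, Executor executor) {
        return CompletableFuture.supplyAsync(
                () -> Collections.singletonMap("OWNER", lookupOwner(cause)), executor);
    }

    public static void main(String[] args) {
        ExecutorService ioPool = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Map<String, String>> labels =
                    enrichAsync(new SecurityException("access denied"), ioPool);
            // The caller could continue other work here; join() is only for the demo output.
            System.out.println(labels.join()); // {OWNER=platform-team}
        } finally {
            ioPool.shutdown();
        }
    }
}
```

Passing the executor in as a parameter keeps the sketch testable with a direct executor and avoids tying slow work to whichever thread reports the failure.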
FailureEnrichers - Flink Meetup Bay Area.pptx

Editor's Notes

  • #1: Hello everybody and thanks for joining! This presentation is about dealing with streaming errors the smart way! It is also a quick introduction to the Pluggable Failure Enrichers framework, a useful feature for our Flink offering at Confluent, or for anyone managing Flink out there. First, a few words about me..
  • #2: I am Panagiotis, you can call me Panos. I am a Software Engineer working for Confluent, where for the past year or so we have been building a cloud-native serverless offering for Apache Flink. My interests are in the broad area of systems, including large-scale distributed systems, big data processing, resource management and streaming. Let me now give you a brief overview of this talk..
  • #3: I will start.. Will share some background about.. Touch on the implementation.. As you are all aware, Apache Flink is a powerful framework that …
  • #4: … connect, enrich, and process data in real time. Users build Flink applications that consume data from one or more event sources, perform real-time stream processing, and finally produce data to one or more sinks. These sources and sinks can be anything from messaging systems such as Kafka to files, databases, or any service or application that produces and consumes data.
  • #5: User applications sit in between the sources and sinks I just described, and are where all the stream processing business logic is implemented. A running user app is called a Job, which is in reality a data streaming pipeline that we call the Job Graph. Nodes represent the processing steps of the pipeline, and each is executed by an operator (transforming streams of events). Operators are connected to one another by the edges of the graph (connections). To implement a Flink streaming application…
  • #6: Users define their business logic using one of Flink’s APIs, such as the Table, DataStream, or Process Function APIs, or SQL (currently the default for Confluent). For example, users can write a SQL statement like the one on the left, which is then converted into a Job Graph like the one on the right.
  • #7: In this particular example, the job is receiving some events that have a color property.
  • #8: In this particular example, the job is receiving some events that have a color property. It first filters the events to remove any orange-colored events. Then it shuffles them…
  • #9: In this particular example, the job is receiving some events that have a color property. It first filters the events to remove any orange-colored events. Then it shuffles them so that they are grouped by color, performing a COUNT operation on the same node. Finally…
  • #10: In this particular example, the job is receiving some events that have a color property. It first filters the events to remove any orange-colored events. Then it shuffles them so that they are grouped by color, performing a COUNT operation on the same node. Finally, the COUNT results are merged in the sink (in this case the results table), where the events are redistributed using rebalancing. To execute those statements or jobs, users submit them to Flink’s distributed execution runtime.
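The filter, group, and count steps described above can be mimicked in a few lines of plain Java. This is only a rough sketch on a bounded list: it is not Flink code, it does no streaming or shuffling across nodes, and the class and method names (`ColorCountSketch`, `countByColor`) are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration only; hypothetical names, no Flink dependency.
final class ColorCountSketch {

    // Drops orange events, groups the rest by color, and counts each group.
    static Map<String, Long> countByColor(List<String> colors) {
        return colors.stream()
                .filter(color -> !"orange".equals(color))      // FILTER step
                .collect(Collectors.groupingBy(                 // GROUP BY color
                        color -> color,
                        TreeMap::new,                           // sorted keys for stable printing
                        Collectors.counting()));                // COUNT per group
    }

    public static void main(String[] args) {
        List<String> events = List.of("blue", "orange", "green", "blue", "orange");
        System.out.println(countByColor(events)); // prints {blue=2, green=1}
    }
}
```

In the real job each of these steps is an operator in the Job Graph, and the grouping is what forces the shuffle between the filter and the count.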
  • #11: The runtime typically consists of the following. A Client process compiles and converts the job into a JobGraph (in Confluent Cloud this is performed by an independent service). A JobManager manages resources and jobs in the cluster and consists of a Dispatcher and a ResourceManager. Within a JobManager there is one JobMaster per application, responsible for managing that job’s lifecycle (task scheduling, execution, error recovery, distributed snapshots, etc.). And finally, there are a number of shared TaskManagers, each consisting of a state backend and a number of slots. A slot is responsible for running a specific task or a chain of tasks. To run the application…
  • #12: First, the client submits the JobGraph to the JobManager; it goes through the Dispatcher, which starts a JobMaster. The JobMaster translates it into an ExecutionGraph (including the parallelism of all operators) that will be used for the actual job execution. To run the ExecutionGraph, the scheduler requests resources from the ResourceManager to start the job’s tasks.
  • #13: The ResourceManager selects an idle slot and asks the corresponding TaskManager to assign that slot to a specific JobMaster. When the TaskManager completes the assignment, it reports back to the JobMaster. The JobMaster then submits tasks for execution using the Scheduler.
  • #14: The TaskManager receives tasks through submit calls. A new thread is started to run each task, where the pre-defined computations are performed. As part of the computations, data can also be exchanged across TaskManagers using the Data Shuffle module, and eventually…
  • #15: Eventually, streaming results are returned to the client. During the job’s execution, the JobMaster remains responsible for job lifecycle management, task scheduling, distributed snapshots, and error recovery. As you can imagine, failures are commonplace in a distributed environment. In the Flink framework we have two types of job failures: global and local.
  • #16: Local failures happen in the context of a locally executing task. For example, a task might throw a Permission, Serialization, or TypeCast exception (an invalid type conversion).
  • #17: Global failures happen in the context of the JobMaster. For example, the checkpoint coordinator might run out of space or memory.
  • #18: These failures are managed by Flink’s Failure Handler. It employs restart and failover strategies that define the behavior of the job when it encounters exceptions, to minimize downtime and disruptions. If the failure is not recoverable or the tasks are not restartable, it fails the job. Otherwise, it defines when and which tasks should be restarted.
  • #19: However, not all recoverable or unrecoverable failures are the same! Restarting a task that failed with a Permissions exception won’t help. As a service provider, we need more context about each failure to take the right action and to decide what we want to expose. We need to differentiate between a system/infrastructure failure (a TaskManager running out of memory) and a user application failure (a Permission or (de)serialization error).
  • #20: That’s why we decided to extend failure handling in Flink with Pluggable Failure Enrichers. They allow enriching failures with custom failure tags and taking action based on those tags, e.g. feeding notification systems. At the same time, Flink is evolving: more dependencies mean more sources of failure. Enrichers allow developers to implement their own custom logic (limitless possibilities) and are loaded through the PluginManager.
  • #21: The Pluggable Failure Enrichers component is part of the JobMaster. Custom Enrichers are loaded at startup time and are triggered on every global or local failure. Each Enricher emits failure labels/tags for its own unique, pre-defined keys (declared at startup time). Enrichers work with both the DefaultScheduler and the AdaptiveScheduler, and they execute asynchronously (on the ioExecutor pool) to avoid blocking the main scheduler thread.
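To make the dispatch idea concrete, here is a minimal sketch of how a framework could run several enrichers off the scheduler thread and merge their labels, rejecting any key an enricher did not declare up front. Everything in it is an assumption for illustration: `Enricher`, `constant`, and `labelFailure` are hypothetical names, not Flink classes, and Flink’s own implementation may well differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; none of these types are Flink classes.
final class EnricherDispatchSketch {

    // Stand-in for an enricher: the keys it declared at startup plus a labelling function.
    interface Enricher {
        Set<String> outputKeys();
        Map<String, String> label(Throwable cause);
    }

    // Convenience enricher that always emits one fixed label.
    static Enricher constant(String key, String value) {
        return new Enricher() {
            @Override public Set<String> outputKeys() { return Set.of(key); }
            @Override public Map<String, String> label(Throwable cause) { return Map.of(key, value); }
        };
    }

    // Runs every enricher on the given executor and merges the labels they emit.
    static CompletableFuture<Map<String, String>> labelFailure(
            Throwable cause, List<Enricher> enrichers, ExecutorService ioExecutor) {
        Map<String, String> empty = new HashMap<>();
        CompletableFuture<Map<String, String>> merged = CompletableFuture.completedFuture(empty);
        for (Enricher enricher : enrichers) {
            // The potentially slow labelling work happens on the executor, not on the caller.
            CompletableFuture<Map<String, String>> labels =
                    CompletableFuture.supplyAsync(() -> enricher.label(cause), ioExecutor);
            merged = merged.thenCombine(labels, (all, fresh) -> {
                for (Map.Entry<String, String> entry : fresh.entrySet()) {
                    if (!enricher.outputKeys().contains(entry.getKey())) {
                        throw new IllegalStateException("Undeclared key: " + entry.getKey());
                    }
                    all.put(entry.getKey(), entry.getValue());
                }
                return all;
            });
        }
        return merged;
    }

    public static void main(String[] args) {
        ExecutorService ioExecutor = Executors.newFixedThreadPool(2);
        try {
            Map<String, String> labels = labelFailure(
                    new ClassCastException("demo"),
                    List.of(constant("type", "USER"), constant("team", "sql")),
                    ioExecutor).join();
            System.out.println(labels); // contains type=USER and team=sql
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

The caller only receives a future, so the thread that reported the failure is free to continue while the enrichers do their work.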
  • #22: Let me now walk you through an example. We are a Flink service provider and want to classify failures by TYPE. There are two types: SYSTEM (infrastructure issue) or USER (application code issue). We implement a TypeClassifier, package it as a jar, and save it in the Flink plugin folder. At JobMaster start time, the Failure Enricher framework loads our jar and instantiates our TypeClassifier.
  • #23: In this particular job, a user writes a SQL statement that reads a String field from a Kafka topic as an Integer, causing a ClassCastException. This is a local exception thrown at the task level and propagated to the Failure Handler as a local failure. The Failure Enricher framework then also triggers the Enricher instances, which include our TypeClassifier.
  • #24: Our TypeClassifier categorizes the ClassCastException as a USER exception, adds the USER type label, and emits the metadata as part of the ExceptionHistory, reachable through the REST endpoint.
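The classification step itself could look roughly like the sketch below. The rule set is purely an assumption for illustration (which exception classes count as USER is a policy decision for each provider), and `TypeClassifierSketch` is a hypothetical name rather than the classifier shown in the demo. It walks the cause chain because the interesting exception is often wrapped by the runtime.

```java
import java.io.NotSerializableException;
import java.util.Map;

// Hypothetical classifier logic; the USER rules below are illustrative assumptions.
final class TypeClassifierSketch {

    enum FailureType { USER, SYSTEM }

    // Walks the cause chain and returns USER if any link looks like an application error.
    static FailureType classify(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof ClassCastException
                    || t instanceof SecurityException
                    || t instanceof NotSerializableException) {
                return FailureType.USER;
            }
        }
        // Anything else (for example an OutOfMemoryError) is treated as infrastructure.
        return FailureType.SYSTEM;
    }

    // Produces the label that would be attached to the failure.
    static Map<String, String> label(Throwable failure) {
        return Map.of("TYPE", classify(failure).name());
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("task failed", new ClassCastException("String to Integer"));
        System.out.println(label(wrapped)); // prints {TYPE=USER}
    }
}
```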
  • #25: To implement: add your own FailureEnricher by implementing the FailureEnricher interface. getOutputKeys returns the Enricher’s unique, pre-defined keys (declared at startup time). Failure processing is asynchronous (async I/O).
  • #26: To implement: add your own factory by implementing the FailureEnricherFactory interface. Add a service entry: create a file under the manifest path that contains the class name of your failure enricher factory class. Finally, add the FailureEnricher implementation to the JobManager configuration.
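As a rough illustration of that shape, the sketch below uses a local stand-in interface so that it compiles without Flink on the classpath. `EnricherShape` is hypothetical: it only mirrors the two responsibilities named in the notes (declaring output keys and processing a failure asynchronously), and the real FailureEnricher interface may differ in its signature, for example by passing additional context. The classification rule is deliberately simplified.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; EnricherShape is a local stand-in, not Flink's interface.
final class TypeEnricherSketch {

    interface EnricherShape {
        // Keys this enricher may emit; fixed for the lifetime of the enricher.
        Set<String> getOutputKeys();

        // Returns the labels for a failure without blocking the caller.
        CompletableFuture<Map<String, String>> processFailure(Throwable cause);
    }

    static final class TypeClassifier implements EnricherShape {
        static final String KEY = "TYPE";

        @Override
        public Set<String> getOutputKeys() {
            return Set.of(KEY);
        }

        @Override
        public CompletableFuture<Map<String, String>> processFailure(Throwable cause) {
            // Simplified rule for illustration: only ClassCastException counts as USER.
            return CompletableFuture.supplyAsync(() -> {
                String type = cause instanceof ClassCastException ? "USER" : "SYSTEM";
                return Map.of(KEY, type);
            });
        }
    }

    public static void main(String[] args) {
        TypeClassifier classifier = new TypeClassifier();
        System.out.println(classifier.processFailure(new ClassCastException("demo")).join()); // prints {TYPE=USER}
    }
}
```

Returning a future rather than a plain map is what lets an enricher call out to slower systems without holding up failure handling.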
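The factory and configuration steps fit together roughly as sketched below: factories are discovered, and only the enrichers named in the configuration are enabled. This is a simplified model under stated assumptions, not Flink’s loading code. `Enricher`, `EnricherFactory`, and `load` are hypothetical stand-ins, and the comma-separated configuration value is assumed for illustration. To my understanding the real service entry is a `META-INF/services` file named after the factory interface and the real option is `jobmanager.failure-enrichers`, but please check the linked documentation for the exact names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of "discover factories, enable what the config names".
final class EnricherLoadingSketch {

    interface Enricher { }

    interface EnricherFactory {
        Enricher create();
    }

    static final class TypeClassifier implements Enricher { }

    static final class TypeClassifierFactory implements EnricherFactory {
        @Override
        public Enricher create() {
            return new TypeClassifier();
        }
    }

    // Keeps only the enrichers whose class name appears in the comma-separated config value.
    static List<Enricher> load(String configuredClassNames, List<EnricherFactory> discovered) {
        Set<String> wanted = Arrays.stream(configuredClassNames.split(","))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toSet());
        List<Enricher> enabled = new ArrayList<>();
        for (EnricherFactory factory : discovered) {
            Enricher enricher = factory.create();
            if (wanted.contains(enricher.getClass().getName())) {
                enabled.add(enricher);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        String config = TypeClassifier.class.getName();
        List<EnricherFactory> discovered = List.of(new TypeClassifierFactory());
        System.out.println(load(config, discovered).size()); // prints 1
    }
}
```

An enricher that is on the plugin path but not listed in the configuration stays disabled, which makes it safe to ship several enrichers in one jar.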
  • #27: Flink is in Public Preview in Confluent Cloud!
  • #31: To sum up, make it less challenging
  • #32: Pluggable Failure Handling was recently released as part of Flink 1.18. Useful links: [Documentation](https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/advanced/failure_enrichers/) and [FLIP-304: Pluggable Failure Enrichers](https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers)