Autoscaling with
Apache Flink
Robert Metzger
Staff Engineer @ decodable, Committer and PMC Chair @ Flink
Why Autoscaling?
[Figure: load vs. number of workers over time; the gap between the two is labeled “Wasted resources”]
Source: https://flink.apache.org/2021/05/06/reactive-mode.html
Reasons for changing loads
- Seasonality:
- day / night
- weekend / weekday
- Product popularity: new feature launches, ad campaigns
- Upstream system outages: load spikes during recovery
Solutions in Flink to Rescale
- Flink 1.2 (2017): Rescalable State
- Flink can restore from a savepoint with a different parallelism, so no data is lost and all computations stay correct
- When used for scaling: requires custom tooling to orchestrate operations, and bookkeeping
- Flink 1.13 (2021): Reactive Mode (beta)
- Flink automatically adjusts when TaskManagers are added or removed
- Requires an outside entity to decide on the number of TaskManagers
- Since Flink 1.15 (2022): Reactive Mode is out of beta
Further reading: https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html
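To make "restore with a different parallelism" more concrete, here is a minimal sketch of how keyed state could be split into key groups and handed out to parallel subtasks as contiguous ranges. It assumes a fixed maximum parallelism of 128; the function names are illustrative, and `zlib.crc32` is only a stand-in for the hash Flink actually uses.

```python
import zlib


def key_group_for(key: str, max_parallelism: int) -> int:
    """Map a key to a key group (crc32 is a stand-in hash, not Flink's)."""
    return zlib.crc32(key.encode("utf-8")) % max_parallelism


def subtask_for_key_group(key_group: int, parallelism: int, max_parallelism: int) -> int:
    """Give each subtask one contiguous range of key groups."""
    return key_group * parallelism // max_parallelism


def assignment(parallelism: int, max_parallelism: int) -> dict[int, list[int]]:
    """Return the key groups owned by each subtask at the given parallelism."""
    owned = {subtask: [] for subtask in range(parallelism)}
    for kg in range(max_parallelism):
        owned[subtask_for_key_group(kg, parallelism, max_parallelism)].append(kg)
    return owned


MAX_PARALLELISM = 128  # assumption for this sketch
before = assignment(2, MAX_PARALLELISM)
after = assignment(4, MAX_PARALLELISM)

# Print the first and last key group owned by each subtask.
print({s: (kgs[0], kgs[-1]) for s, kgs in before.items()})
print({s: (kgs[0], kgs[-1]) for s, kgs in after.items()})
```

Because a key always maps to the same key group, rescaling only changes which subtask owns each range; the state inside a key group never has to be split.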
How to use Reactive Mode?
- Reactive Mode works with all standalone deployments
- E.g. Kubernetes, Docker or via the provided deployment scripts
- Set the configuration:
scheduler-mode=reactive
- Start the JobManager, and add as many TaskManagers as you need
- (optionally) Use a service to determine the number of TaskManagers
- Kubernetes Horizontal Pod Autoscaler
- AWS AutoScaling Groups
- Google Cloud Managed Instance Groups
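As a small illustration of the configuration step, the sketch below rewrites the text of a Flink configuration file so that `scheduler-mode: reactive` appears exactly once. The helper name and the example file content are assumptions made for this sketch; in practice you would usually just edit the file by hand or pass the option when starting the cluster.

```python
def enable_reactive_mode(conf_text: str) -> str:
    """Return config text with `scheduler-mode: reactive` set exactly once."""
    kept = [
        line
        for line in conf_text.splitlines()
        if not line.strip().startswith("scheduler-mode:")
    ]
    kept.append("scheduler-mode: reactive")
    return "\n".join(kept) + "\n"


# Example input: a minimal standalone configuration (illustrative values).
example = "jobmanager.rpc.address: jobmanager\ntaskmanager.numberOfTaskSlots: 1\n"
updated = enable_reactive_mode(example)
print(updated)
```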
Reactive Mode: How does it work?
[Diagram: one JobManager and two TaskManagers; job parallelism = 2]
Flink automatically adjusts when TaskManagers are added or removed
Example: Load is increasing
Reactive Mode: How does it work?
[Diagram: one JobManager and four TaskManagers, two of them marked NEW; job parallelism = 4]
Flink automatically adjusts when TaskManagers are added or removed
Example: Load is increasing → add more TaskManagers
Reactive Mode: How does it work?
- The JobManager adjusts the job parallelism depending on the number of available TaskManagers
- When the number of TaskManagers changes, the Flink job restarts and restores from the latest checkpoint
- Possible metrics for the scaling decision: CPU load / Kafka lag (recommended) / throughput / latency
- Scaling model similar to Kafka Streams
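The relationship between TaskManagers and job parallelism can be sketched as follows. This is a simplified model, assuming that every TaskManager offers the same number of slots and that the job uses every available slot up to its maximum parallelism; the function name is made up for this sketch.

```python
def reactive_parallelism(task_managers: int, slots_per_tm: int, max_parallelism: int) -> int:
    """Parallelism the job would run with, given the registered TaskManagers."""
    available_slots = task_managers * slots_per_tm
    return min(available_slots, max_parallelism)


# The two diagrams above: one slot per TaskManager, 2 and then 4 TaskManagers.
for tms in (2, 4):
    print(tms, "TaskManagers ->", reactive_parallelism(tms, 1, 128))
```

Adding TaskManagers beyond the maximum parallelism would not speed up the job in this model, which is one reason to choose the maximum parallelism generously.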
Reactive Mode example: Kubernetes HPA
- Kubernetes has a built-in component called HorizontalPodAutoscaler
- Automatically adjusts the scale of a deployment based on a metric
[Diagram: the Flink JobManager runs as a Job with one JobManager pod; the Flink TaskManager Deployment runs several TaskManager pods. A HorizontalPodAutoscaler (min=1, max=15, cpu=80%, on=TaskManager deployment) dynamically adjusts the number of TaskManager pods.]
Source: https://flink.apache.org/2021/05/06/reactive-mode.html
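The scaling decision of the HorizontalPodAutoscaler can be approximated with the ratio rule from the Kubernetes documentation: scale the current replica count by the ratio of observed to target metric, round up, and clamp to the configured bounds. The sketch below uses the values from the diagram (min=1, max=15, cpu=80%) and ignores details such as the tolerance band and stabilization windows.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float = 80.0,
    min_replicas: int = 1,
    max_replicas: int = 15,
) -> int:
    """Approximate HPA rule: ceil(current * observed / target), then clamp."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# Three TaskManager pods at 120% average CPU utilization relative to requests.
print(desired_replicas(current_replicas=3, current_cpu_pct=120.0))
```

With Reactive Mode, every change to the replica count that this rule produces leads to a job restart, so a real setup would likely want conservative scale-down settings.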
Reactive Mode and Flink Deployments
→ Reactive Mode only works with “standalone mode”
Passive Deployment
Flink resources managed externally (“Standalone mode”)
→ “a bunch of JVMs”
Deployed on bare metal, Docker, Kubernetes
Pros / Cons:
+ DIY scenarios
+ Fast deployments
- Restart
→ Reactive Scaling (outside entity decides)
Active Deployment
Flink actively manages resources
→ Flink talks to a resource manager
Implementations: Native Kubernetes, YARN
Pros / cons:
+ Automatically restarts failed resources
+ Allocates only required resources
- Requires a lot of K8s permissions
→ Autoscaling (Flink decides)
Autoscaling with Flink? Enter Adaptive Scheduler
- Benefits
- Flink can make better scaling decisions
- Example: rescale only right after a checkpoint completed → avoid reprocessing
- Fewer components required (“batteries included”)
- How?
- Reactive Mode is based on a new (Flink 1.13) internal workload scheduler, called Adaptive Scheduler.
- Currently configured to behave “reactively”, can also be changed to automatic
Internals: Adaptive Scheduler
Source / Further reading:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
https://cwiki.apache.org/confluence/display/FLINK/FLIP-138%3A+Declarative+Resource+management
[Diagram: Adaptive Scheduler → Requirements → SlotManager inside the ResourceManager (Active K8s / YARN); labels: “I need 15 slots”, “I have 8 slots”]
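The "I need 15 slots" / "I have 8 slots" exchange can be illustrated with a toy model of declarative resource management: the scheduler declares what it would like, then runs with whatever it gets. The sketch assumes default slot sharing (one slot can host one subtask of every operator), so the number of slots caps each operator's parallelism. The operator names and the function are hypothetical, and this is not Flink's actual slot allocator.

```python
def negotiate(desired: dict[str, int], available_slots: int) -> dict[str, int]:
    """Cap each operator's desired parallelism at the number of available slots."""
    if available_slots < 1:
        raise ValueError("need at least one slot to run the job")
    return {op: min(parallelism, available_slots) for op, parallelism in desired.items()}


# Hypothetical job: the widest operator wants 15 subtasks, so 15 slots are requested.
desired = {"source": 4, "map": 15, "sink": 2}
requested_slots = max(desired.values())
actual = negotiate(desired, available_slots=8)

print("I need", requested_slots, "slots")
print("running with", actual)
```

The point of the model is that the job starts with reduced parallelism instead of waiting until all 15 slots exist.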
Adaptive Scheduler for Autoscaling (future)
Source / Further reading:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
https://cwiki.apache.org/confluence/display/FLINK/FLIP-138%3A+Declarative+Resource+management
[Diagram: a Pluggable Autoscaler feeds the Adaptive Scheduler → Requirements → SlotManager inside the ResourceManager (Active K8s / YARN); labels: “I need x slots”, “I have 8 slots”]
Ideas for autoscaler implementations
- REST Interface
- Set desired parallelism via REST call to JobManager
- Either for the entire job (and let the JM decide on per-operator parallelism) or per operator
- User Code + provided autoscaling strategies
- User provides Flink with a custom scaling logic with access to metrics
- Problem: we want to avoid user-code on the JobManager
- JobGraph configuration
- Users configure min, target, max parallelism per operator
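The "JobGraph configuration" idea could look roughly like the sketch below: each operator carries a minimum, target and maximum parallelism, and an optional hint from an autoscaler is clamped into that range. Everything here (`ParallelismBounds`, `choose_parallelism`, the hint parameter) is hypothetical and only illustrates the idea; none of it is an existing Flink API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParallelismBounds:
    """Hypothetical per-operator setting: min <= target <= max."""
    minimum: int
    target: int
    maximum: int

    def __post_init__(self) -> None:
        if not (1 <= self.minimum <= self.target <= self.maximum):
            raise ValueError("expected 1 <= minimum <= target <= maximum")


def choose_parallelism(
    bounds: ParallelismBounds,
    available_slots: int,
    autoscaler_hint: Optional[int] = None,
) -> Optional[int]:
    """Pick a parallelism within the bounds, or None if the minimum cannot be met."""
    if available_slots < bounds.minimum:
        return None
    wanted = bounds.target if autoscaler_hint is None else autoscaler_hint
    wanted = max(bounds.minimum, min(bounds.maximum, wanted))
    return min(wanted, available_slots)


bounds = ParallelismBounds(minimum=2, target=8, maximum=32)
print(choose_parallelism(bounds, available_slots=16))
print(choose_parallelism(bounds, available_slots=16, autoscaler_hint=64))
```

Keeping the bounds in the job definition would let Flink itself enforce them, which fits the goal of avoiding user code on the JobManager.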
Closing remarks
- Autoscaling with Flink is possible today, it’s called “Reactive Mode” :-)
- Getting started guide: https://flink.apache.org/2021/05/06/reactive-mode.html
- Limitations of Adaptive Scheduler / Reactive Mode
- Only works with Application Mode
- Task local recovery not yet supported
- Lack of good UI support (history of rescale events)
Questions?
rmetzger@decodable.co / rmetzger@apache.org
@rmetzger_
2022
Build real-time data apps & services. Fast.
decodable.co


Autoscaling Flink with Reactive Mode


Editor's Notes

  • #3: Space between actual load and # of workers == wasted resources. You want your resource allocation to be close to actual load.
  • #5: Rescalable state: stop with savepoint, restore. Good when scaling manually and very rarely. Reactive Mode == Kafka Streams deployment model.
  • #6: Rescalable state: stop with savepoint, restore. Good when scaling manually and very rarely. Reactive Mode == Kafka Streams deployment model.
  • #7: How does Reactive Mode work?
  • #8: “Just add more hardware”
  • #9: Rescaling is the same operation as failure recovery: restore from the latest checkpoint. Can be expensive with large state … only rescale rarely!
  • #10: Example implementation in Kubernetes, the most popular deployment option of Flink at the moment.
  • #11: Relationship of scaling and deployment modes. Passive deployment: manually launch the Flink components (K8s HA also works here!). Active deployment: Flink takes care of launch itself (mostly).
  • #13: Blue line / states: interesting path. Source code:
      hide empty description
      skinparam monochrome false
      skinparam defaultFontSize 15
      [*] -> Created
      Created --> Waiting : Start scheduling
      state "Waiting for resources" as Waiting #lightblue
      state Executing #lightblue
      state Restarting #lightblue
      Waiting --> Waiting : Resources are not stable yet
      Waiting -[#blue,bold]-> Executing : Resources are stable
      Waiting --> Finished : Cancel, suspend or not \nenough resources
      Executing --> Canceling : Cancel
      Executing --> Failing : Unrecoverable fault
      Executing --> Finished : Suspend terminal state
      Executing -[#blue,bold]-> Restarting : Recoverable fault
      Restarting --> Finished : Suspend
      Restarting --> Canceling : Cancel
      Restarting -[#blue,bold]-> Waiting : Cancelation complete
      Canceling --> Finished : Cancelation complete
      Failing --> Finished : Failing complete
      Finished -> [*]
      https://www.planttext.com/?text=RPB1RiCW38RlF8NLOxM-m0wxLEi3h9fsw7PmYTim4OZ0JEtRpoHbB2YdHFYp_zy_zAOZe67aEtGKTJ0Z6--KEcs_OFS2-q38rAd75tPoze66ZRl2CnmP0qFKFNN9of6AB1Hi2d7n0G95duAck06CfLSLOZdlhR20WS1vcSrujWHtuaNBwurqMcsQ6nRmmJWJnQAmUtIQx1F454To7OY_h4BEfsiFd-xFx6ITYeggUddWF6LMd_yRu83cKNwNaTh_K9ZMk62otBBLtR6w-lPdIGvpii0K1kFGmfHkqoxRvqieKRHQ_yhhOYsnibj3rEkQwvWV36W_Z9R4NXsmcdr3bwGQjXnNhjI4awVv2m00
  • #14: Source code: same state diagram as #13.
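The state diagram in note #13 can also be read as a transition table. The sketch below encodes the transitions listed in that source as a Python dictionary and replays the highlighted (blue) path that a rescale or recoverable fault takes. The event names are labels invented for this sketch, derived from the edge descriptions; they are not identifiers from the Flink code base.

```python
# (state, event) -> next state, transcribed from the diagram source in note #13.
TRANSITIONS = {
    ("Created", "start_scheduling"): "Waiting",
    ("Waiting", "resources_not_stable"): "Waiting",
    ("Waiting", "resources_stable"): "Executing",
    ("Waiting", "cancel_suspend_or_not_enough_resources"): "Finished",
    ("Executing", "cancel"): "Canceling",
    ("Executing", "unrecoverable_fault"): "Failing",
    ("Executing", "suspend_or_terminal_state"): "Finished",
    ("Executing", "recoverable_fault"): "Restarting",
    ("Restarting", "suspend"): "Finished",
    ("Restarting", "cancel"): "Canceling",
    ("Restarting", "cancelation_complete"): "Waiting",
    ("Canceling", "cancelation_complete"): "Finished",
    ("Failing", "failing_complete"): "Finished",
}


def step(state: str, event: str) -> str:
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None


def run(events: list[str], state: str = "Created") -> list[str]:
    """Replay a list of events and return every state visited, in order."""
    visited = [state]
    for event in events:
        state = step(state, event)
        visited.append(state)
    return visited


# The blue path: start, run, hit a recoverable fault (or rescale), wait, run again.
blue_path = run([
    "start_scheduling",
    "resources_stable",
    "recoverable_fault",
    "cancelation_complete",
    "resources_stable",
])
print(" -> ".join(blue_path))
```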