Apple logo is a trademark of Apple Inc.
Kristine Guo
Liang-Chi Hsieh
THIS IS NOT A CONTRIBUTION
Structured Streaming Use-cases at Apple
Liang-Chi Hsieh
Apache Spark Committer
Software Engineer @ Apple
https://github.com/viirya
https://www.linkedin.com/in/liang-chi-hsieh-a7904568/
Who am I
Kristine Guo
Software Engineer @ Apple
Focuses on cloud platform technologies
Currently works on developing high-scale backend systems
https://www.linkedin.com/in/kristineguo/
Who am I
Agenda
Revive Previous Structured Streaming Efforts
New Enhancement to Structured Streaming
Use-case at Apple
Features in Structured Streaming that matter to us
New built-in StateStore
Session Window
Stateful task scheduling enhancement
Checkpoint enhancement
Revive Previous Structured Streaming Efforts
StateStore: Current Status
What is StateStore?
A component for state management for stateful operators such as streaming aggregates, joins, etc.
[Diagram: Project and stateful operators; the stateful operators get/put key/value pairs in the StateStore; the StateStore checkpoints to and restores from the FileSystem]
Built-in StateStore
HDFSBackedStateStore
• Stores state in an in-memory map
• Checkpoints to an HDFS-compatible file system
Disadvantages?
• Limited by executor memory, which is an issue for large-state use-cases
• Impacts other memory usage on the executors
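The design described above can be sketched in a few lines. This is only an illustration, not Spark's actual HDFSBackedStateStore code: the class name `InMemoryCheckpointedStore`, its methods, and the JSON snapshot layout are all hypothetical, and a local temporary directory stands in for the HDFS-compatible file system.

```python
import json
import os
import tempfile


class InMemoryCheckpointedStore:
    """Toy state store: an in-memory map plus one snapshot file per version."""

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir
        self.version = 0
        self.state = {}  # all state lives in process memory

    def get(self, key):
        return self.state.get(key)

    def put(self, key, value):
        self.state[key] = value

    def commit(self):
        """Persist the whole map as the next version and return that version."""
        self.version += 1
        path = os.path.join(self.checkpoint_dir, f"{self.version}.snapshot.json")
        with open(path, "w") as f:
            json.dump(self.state, f)
        return self.version

    @classmethod
    def restore(cls, checkpoint_dir, version):
        """Rebuild the in-memory map from a previously committed version."""
        store = cls(checkpoint_dir)
        path = os.path.join(checkpoint_dir, f"{version}.snapshot.json")
        with open(path) as f:
            store.state = json.load(f)
        store.version = version
        return store


checkpoint_dir = tempfile.mkdtemp()  # stand-in for a remote file system path
store = InMemoryCheckpointedStore(checkpoint_dir)
store.put("user-1", 3)
store.put("user-2", 7)
committed = store.commit()

# A task that runs elsewhere has to reload the state from the file system.
restored = InMemoryCheckpointedStore.restore(checkpoint_dir, committed)
print(restored.get("user-1"), restored.get("user-2"))
```

Because the whole map sits in process memory, the amount of state such a store can hold is bounded by what the executor can spare, which is the limitation listed above.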
We need a new built-in StateStore
Reviving RocksDB StateStore as a built-in StateStore in Structured Streaming
Why?
• More and more streaming applications require large state
• Widely used in the industry
• SPARK-34198: Add RocksDB StateStore as external module
• Received all positive responses from the community
Visit Current RocksDB StateStore
OSS implementations
• https://github.com/chermenin/spark-states
• SPARK-28120: RocksDB state storage
• https://github.com/qubole/spark-state-store
OSS in the future
• RocksDB StateStore from Databricks
RocksDB StateStore Benchmark
put/get significant key/value pairs
Implementation   Best Time (ms)   Avg Time (ms)   Relative
Chermenin        32725            34507           1.0X
Qubole           25493            25636           1.3X
Databricks       TBD              TBD             TBD
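The slide does not say how the Relative column is derived. Assuming it is the baseline (Chermenin) time divided by each implementation's time, the short check below reproduces the reported 1.3X from either the best or the average timings. The function name `relative` is made up for this illustration.

```python
# Timings copied from the table above, in milliseconds: (best, average).
timings = {"Chermenin": (32725, 34507), "Qubole": (25493, 25636)}


def relative(baseline_ms, candidate_ms):
    """Speed-up of a candidate over the baseline (assumed definition)."""
    return baseline_ms / candidate_ms


base_best, base_avg = timings["Chermenin"]
ratios = {
    name: (relative(base_best, best), relative(base_avg, avg))
    for name, (best, avg) in timings.items()
}
for name, (by_best, by_avg) in ratios.items():
    print(f"{name}: {by_best:.1f}X by best time, {by_avg:.1f}X by average time")
```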
Window Operations
[Figure: two timelines. Event-time window: events at 12:01, 12:02, 12:12, 12:15, … fall into fixed 10 min windows. Session window: events at 12:01, 12:02, 12:20, 12:25, … are grouped into sessions separated by a session gap.]
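The difference between the two window types can be sketched in plain Python using the event times from the figure. This is an illustration only, not Spark code: the helper names are hypothetical, and the 10 minute session gap is an assumption because the slide does not state the gap length.

```python
from datetime import datetime, timedelta


def parse(hhmm):
    return datetime.strptime(hhmm, "%H:%M")


def tumbling_window(ts, size):
    """Return the [start, end) fixed-size window containing ts, aligned to midnight."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    offset = (ts - midnight) // size  # whole windows elapsed since midnight
    start = midnight + offset * size
    return start, start + size


def session_windows(timestamps, gap):
    """Group event times into sessions; a session ends `gap` after its last event."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            start, _ = sessions[-1]
            sessions[-1] = (start, ts + gap)  # inside the open session: extend it
        else:
            sessions.append((ts, ts + gap))   # gap exceeded: start a new session
    return sessions


size = timedelta(minutes=10)  # the 10 min window from the figure
gap = timedelta(minutes=10)   # assumed session gap

fixed = [tumbling_window(parse(t), size) for t in ["12:01", "12:02", "12:12", "12:15"]]
sessions = session_windows([parse(t) for t in ["12:01", "12:02", "12:20", "12:25"]], gap)

for start, end in sessions:
    print(f"session [{start:%H:%M}, {end:%H:%M})")
```

With these inputs the fixed windows are 12:00 to 12:10 and 12:10 to 12:20 regardless of when events arrive, while the two sessions, 12:01 to 12:12 and 12:20 to 12:35, take their boundaries from the events themselves.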
Reviving session window as a built-in window feature
Session Window
• SPARK-10816: EventTime based sessionization
• Kept inactive for the last few years
• Available in other streaming engines but missing in Structured Streaming
Session Window Internals
Session manipulation
• Session initialization, restoring, merging, saving
Internal StateStore format
• Efficiently retrieve all session states for a specific session key
• Partially update the start time and duration of the affected windows
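The four session manipulation steps can be sketched as follows. This is a simplified illustration, not the Spark implementation: times are plain integers, a dict stands in for the state store, `merge_event` is a hypothetical helper, and treating sessions that merely touch as separate is an assumption of the sketch.

```python
def merge_event(sessions, event_time, gap):
    """Merge one event into a sorted list of disjoint (start, end) sessions."""
    # Initialization: the event starts out as its own session.
    new_start, new_end = event_time, event_time + gap
    merged = []
    for start, end in sessions:
        if end <= new_start or start >= new_end:
            merged.append((start, end))        # no overlap: session is untouched
        else:
            new_start = min(new_start, start)  # overlap: absorb into one session
            new_end = max(new_end, end)
    merged.append((new_start, new_end))
    merged.sort()
    return merged


state = {"user-1": [(0, 10), (20, 30)]}  # stand-in for the state store
gap = 10

restored = state["user-1"]                   # restoring
after_first = merge_event(restored, 12, gap) # merging: event at t=12 reaches session (20, 30)
state["user-1"] = after_first                # saving

after_second = merge_event(state["user-1"], 8, gap)  # event at t=8 bridges both sessions
state["user-1"] = after_second
print(after_first, after_second)
```

The second event changes the start time and the duration of windows that already exist, which is why the state format needs to support updating the affected windows without rewriting everything else.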
Internal StateStore format
• Simple session window list as the value
• Double list approach
• Start times of session windows as key
* Special thanks to Yuanjian Li for the state store format design doc
Session Window List Approach
• Easy to implement
• Memory issue if too many sessions per value
• Does not support partial update
[Diagram: single state store. Session key → a list of rows. Row structure: Row(a1, a2, a3…, session_window(start_time, end_time))]
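A minimal sketch of this layout, with a dict standing in for the single state store and dicts standing in for rows. The function `upsert_session` is hypothetical. It returns how many rows had to be written back, to make the missing partial update visible.

```python
state_store = {}  # single state store: session key -> list of rows


def upsert_session(store, session_key, row):
    """Add or replace one session row; returns the number of rows written back."""
    start = row["session_window"][0]
    rows = list(store.get(session_key, []))  # read the whole list
    rows = [r for r in rows if r["session_window"][0] != start]
    rows.append(row)
    rows.sort(key=lambda r: r["session_window"][0])
    store[session_key] = rows                # write the whole list back
    return len(rows)


upsert_session(state_store, "user-1", {"count": 2, "session_window": (0, 10)})
upsert_session(state_store, "user-1", {"count": 1, "session_window": (20, 30)})
upsert_session(state_store, "user-1", {"count": 4, "session_window": (40, 50)})

# Extending only the middle session still rewrites all three rows.
rewritten = upsert_session(state_store, "user-1", {"count": 2, "session_window": (20, 35)})
print(rewritten, state_store["user-1"])
```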
Double List Approach
• Order of session windows is kept, efficient to traverse
• No complex structure
• Harder to maintain
[Diagram: three state stores.
  Session key → first start time
  (key, start time 1) → (None, start time 2)
  (key, start time 2) → (start time 1, start time 3)
  (key, start time 3) → (start time 2, None)
  (key, start time 1), (key, start time 2), (key, start time 3) → Row(…, session_window(s, e)) each]
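One way to read the diagram is as a doubly linked list of start times per session key, spread over three key/value stores. The sketch below follows that reading. The store names `head`, `links`, and `rows` and both functions are hypothetical, and plain dicts stand in for the state stores.

```python
head = {}   # session key -> first start time
links = {}  # (session key, start time) -> (previous start time, next start time)
rows = {}   # (session key, start time) -> row for that session window


def insert_session(key, start, row):
    """Insert a session, keeping the per-key linked list ordered by start time."""
    rows[(key, start)] = row
    if (key, start) in links:
        return  # existing session: only its own row is rewritten
    prev, nxt = None, head.get(key)
    while nxt is not None and nxt < start:  # walk to the insertion point
        prev, nxt = nxt, links[(key, nxt)][1]
    links[(key, start)] = (prev, nxt)
    if prev is None:
        head[key] = start
    else:
        links[(key, prev)] = (links[(key, prev)][0], start)
    if nxt is not None:
        links[(key, nxt)] = (start, links[(key, nxt)][1])


def traverse(key):
    """Yield (start time, row) in order by following the next pointers."""
    start = head.get(key)
    while start is not None:
        yield start, rows[(key, start)]
        start = links[(key, start)][1]


for start in (20, 10, 30):  # inserted out of order on purpose
    insert_session("key", start, {"session_window": (start, start + 5)})

ordered = [start for start, _ in traverse("key")]
print(ordered, links)
```

Every insert touches up to three entries in `links` plus `head`, which illustrates why this layout is harder to maintain than a single list.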
Start times as Key Approach
• Keep the order in the list of start times
• Store a list of session start times
[Diagram: two state stores.
  key 1, key 2, key 3 → start time 1, start time 2… each
  (key1, start time1) → Row(…, session_window(s, e))
  (key1, start time2) → Row(…, session_window(s, e))
  (key2, start time1) → Row(…, session_window(s, e))]
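A sketch of this layout with two dicts standing in for the two state stores. The names `start_times`, `rows`, `put_session`, and `sessions_for` are hypothetical and only illustrate the idea shown in the diagram.

```python
import bisect

start_times = {}  # session key -> sorted list of session start times
rows = {}         # (session key, start time) -> row for that session window


def put_session(key, start, row):
    """Store one session row; the start-time list changes only for new sessions."""
    times = start_times.setdefault(key, [])
    index = bisect.bisect_left(times, start)
    if index == len(times) or times[index] != start:
        times.insert(index, start)  # new session: keep the list ordered
    rows[(key, start)] = row        # existing session: only this row is rewritten


def sessions_for(key):
    """Return all (start time, row) pairs for a key, in start-time order."""
    return [(start, rows[(key, start)]) for start in start_times.get(key, [])]


put_session("key1", 20, {"session_window": (20, 30)})
put_session("key1", 5, {"session_window": (5, 12)})
put_session("key2", 7, {"session_window": (7, 17)})
put_session("key1", 20, {"session_window": (20, 34)})  # partial update of one session

print(sessions_for("key1"))
```

Compared with the single-list layout, extending one session here rewrites one small row, and the list of start times stays a compact value that is cheap to read when all sessions for a key are needed.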
Agenda
Revive Previous Structured Streaming Efforts
New Enhancement to Structured Streaming
Use-case at Apple
Stateful Task Scheduling Enhancement
• Spark task scheduling is not designed for stateful tasks
• State store location is assigned arbitrarily
- Change of state store location causes frequent reloading from remote FS
- An obstacle to the checkpoint enhancements planned in our future work
Stateful Task Scheduling Enhancement
Possible approaches: Ongoing works
• Leveraging existing data locality preferences as a simple workaround
- SPARK-33814: Provide preferred locations for stateful operations
• Customizing Spark task scheduling behavior
- SPARK-35022: Task Scheduling Plugin in Spark
[Diagram: Spark Task Scheduler and Task Scheduling Plugin, connected by: new resource offers, candidate tasks for scheduling, scheduling preferences]
How Task Scheduling Plugin helps
• Try best to distribute stateful tasks across available executors
• Keep stateful task location stable across batches
[Diagram: Task Scheduling Plugin. Batch N: Exec 1, Exec 2, Exec 3, Exec 4 with State1, State2, State3, State4. Batch N + 1: Exec 1 with State1, Exec 3 with State3, Exec 4 with State4.]
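The two goals above can be sketched as a small assignment function. This is not the SPARK-35022 plugin API, only an illustration of the placement policy. The function name `assign_tasks` is hypothetical, and the scenario in which Exec 2 is unavailable in Batch N + 1 is an assumption made for the example.

```python
def assign_tasks(state_partitions, executors, previous):
    """Return {state partition: executor}, reusing previous locations where possible."""
    assignment = {}
    load = {executor: 0 for executor in executors}
    # Pass 1: keep a partition where its state already is, if that executor is available.
    for partition in state_partitions:
        executor = previous.get(partition)
        if executor in load:
            assignment[partition] = executor
            load[executor] += 1
    # Pass 2: spread the remaining partitions over the least-loaded executors.
    for partition in state_partitions:
        if partition not in assignment:
            executor = min(executors, key=lambda e: load[e])
            assignment[partition] = executor
            load[executor] += 1
    return assignment


states = ["State1", "State2", "State3", "State4"]

# Batch N: nothing is placed yet, so the states are spread evenly.
batch_n = assign_tasks(states, ["Exec 1", "Exec 2", "Exec 3", "Exec 4"], previous={})

# Batch N + 1: Exec 2 is assumed to be unavailable; the other states stay put.
batch_n1 = assign_tasks(states, ["Exec 1", "Exec 3", "Exec 4"], previous=batch_n)
print(batch_n)
print(batch_n1)
```

Only State2 has to be reloaded from the remote file system in Batch N + 1, because it is the only state whose location changed.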
Agenda
Revive Previous Structured Streaming Efforts
New Enhancement to Structured Streaming
Use-case at Apple
Use Case
• Two parallel data streams
• Each stream performs aggregation over the same data source
• Stream aggregation operates on dynamically-sized app-defined batches
• Stream-stream join between the two data streams
- Must account for potential lag between streams
Use Case
Performance Requirements
• Throughput: PBs/day
• RPS: High, O(10k)
• Data size: Varying (1 KB to 1 MB)
• State store: Stream-stream join exerts high memory pressure
Solutions
• Accounting for potential lag: watermarking
• Dynamic batch aggregation: session windows
• Stream-stream join pressure: RocksDB-based State Store
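How watermarking bounds join state when one stream lags can be sketched as below. This is a simplified model, not Spark's implementation: times are plain integers, the watermark is updated per event rather than per micro-batch, the join-wide watermark is taken as the minimum over both streams, and all names (`Watermark`, `global_watermark`, `evict`) are hypothetical.

```python
class Watermark:
    """Maximum event time seen so far minus an allowed lateness."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None

    def observe(self, event_time):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    @property
    def value(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness


def global_watermark(watermarks):
    """The join can only move as fast as its slowest input stream."""
    values = [w.value for w in watermarks]
    if any(v is None for v in values):
        return None
    return min(values)


def evict(buffer, watermark_value):
    """Drop buffered (event_time, row) pairs older than the watermark."""
    if watermark_value is None:
        return buffer
    return [(t, row) for t, row in buffer if t >= watermark_value]


stream_a, stream_b = Watermark(allowed_lateness=30), Watermark(allowed_lateness=30)
buffer_a = [(100, "a1"), (160, "a2")]  # rows of stream A waiting for a match
for t, _ in buffer_a:
    stream_a.observe(t)

stream_b.observe(110)                  # stream B lags behind stream A
lagging = global_watermark([stream_a, stream_b])
kept_while_lagging = evict(buffer_a, lagging)

stream_b.observe(150)                  # stream B catches up
caught_up = global_watermark([stream_a, stream_b])
kept_after_catch_up = evict(buffer_a, caught_up)
print(lagging, kept_while_lagging, caught_up, kept_after_catch_up)
```

While stream B lags, nothing can be evicted and the buffered rows accumulate, which is the memory pressure that the RocksDB-based state store is meant to absorb.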
Thank you!
• Your feedback is important to us
• Don’t forget to rate and review the sessions
TM and © 2021 Apple Inc. All rights reserved.
