Tame the Small Files Problem and Optimize
Data Layout for Streaming Ingestion to Iceberg
Steven Wu, Gang Ye, Haizhou Zhao | Apple
THIS IS NOT A CONTRIBUTION
Apache Iceberg is an open table format for huge analytic datasets
• Time travel
• Advanced filtering
• Serializable isolation
Where does Iceberg fit in the ecosystem
• Compute Engine
• Table Format (Metadata)
• Storage (Data): Cloud Blob Storage
Ingest data to Iceberg data lake in streaming fashion
Kafka (Msg Queue) → Flink Streaming Ingestion → Iceberg Data Lake
Zoom into the Flink Iceberg sink
Records → writer-1, writer-2, …, writer-n → Data Files on DFS; writers send File Metadata to the committer, which commits to the Iceberg Data Lake
Case 1: event-time partitioned tables
hour=2022-08-03-00/
hour=2022-08-03-01/
…
Long tail problem with late arrival data
(Figure: percentage of data per hour shows a long-tail distribution over hours 0, 1, 2, …, N; see https://en.wikipedia.org/wiki/Long_tail)
A data file can’t contain rows across partitions
hour=2022-08-03-00/
|- file-000.parquet
|- file-001.parquet
|- …
hour=2022-08-03-01/
|- …
…
How many data files are generated every hour?
Assuming the table is partitioned hourly and the event time range is capped at 10 days, records span 24 x 10 = 240 partitions. Each of the 500 writers (writer-1 … writer-500) opens 240 files, so every checkpoint commits 240 x 500 = 120K files. With a 10-minute checkpoint interval, that is 720K files every hour.
Long-tail hours lead to small files
Percentile | File Size
P50 | 55 KB
P75 | 77 KB
P90 | 13 MB
P99 | 18 MB
What are the implications of too many small files
• Poor read performance
• Request throttling
• Memory pressure
• Longer checkpoint duration and pipeline pause
• Stress on the metadata system
Why not keyBy shuffle
operator-1 … operator-n —keyBy(hour)→ writer-1 … writer-n → committer → Iceberg
There are two problems
• Traffic is not evenly distributed across event hours
• keyBy on a low-cardinality column won’t be balanced [1]
[1] https://github.com/apache/iceberg/pull/4228
Need smarter shuffling
Case 2: data clustering for non-partition columns
CREATE TABLE db.tbl (
ts timestamp,
data string,
event_type string)
USING iceberg
PARTITIONED BY (hours(ts))
Queries often filter on event_type
SELECT count(1) FROM db.tbl WHERE
ts >= '2022-01-01 08:00:00' AND
ts < '2022-01-01 09:00:00' AND
event_type = 'C'
Iceberg supports file pruning leveraging min-max stats
at column level
|- file-000.parquet (event_type: A-B)
|- file-001.parquet (event_type: C-C)
|- file-002.parquet (event_type: D-F)
…
event_type = 'C'
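The pruning decision above amounts to a per-file min-max comparison. A minimal sketch in Java, assuming a hypothetical DataFile record (not Iceberg's actual API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class MinMaxPruning {
    // Hypothetical stand-in for a data file plus its column-level min-max stats.
    public record DataFile(String name, String min, String max) {}

    // Keep only the files whose [min, max] range can contain the filter value.
    public static List<DataFile> prune(List<DataFile> files, String value) {
        return files.stream()
                .filter(f -> f.min().compareTo(value) <= 0 && f.max().compareTo(value) >= 0)
                .collect(Collectors.toList());
    }
}
```

For the listing above, prune(files, "C") keeps only file-001.parquet (C-C); file-000.parquet (A-B) and file-002.parquet (D-F) are skipped.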
Wide value range would make pruning ineffective
Wide value range
|- file-000.parquet (event_type: A-Z)
|- file-001.parquet (event_type: A-Z)
|- file-002.parquet (event_type: A-Z)
…
event_type = 'C'
Making event_type a partition column can lead to an explosion of the number of partitions
• Before: 8.8K partitions (365 days x 24 hours) [1]
• After: 4.4M partitions (365 days x 24 hours x 500 event_types) [2]
• Can stress the metadata system and lead to small files
[1] Assuming 12 months retention
[2] Assuming 500 event types
Batch engines solve the clustering problem via shuffle
1. Compute a data sketch of the value distribution (e.g. event type A: 2%, B: 7%, C: 22%, …, Z: 0.5%)
2. Shuffle between stages to cluster the data
3. Sort the data before writing to files, so each file covers a tight min-max value range (e.g. A-B, C-C, X-Z)
Shuffle for better data clustering
Why not compact small files or sort files via background batch maintenance jobs
• Remediation is usually more expensive than prevention
• Doesn’t solve the throttling problem in the streaming path
Agenda
• Motivation
• Design
• Evaluation
Introduce a smart shuffling operator in Flink Iceberg sink
Records → shuffle-1, shuffle-2, …, shuffle-n (smart shuffling) → writer-1, writer-2, …, writer-n → committer → Iceberg
Step 1: calculate traffic distribution
The shuffle tasks (shuffle-1 … shuffle-10) learn the traffic weight per hour, e.g.:
Hour | Weight
0 | 33%
1 | 14%
2 | 5%
… | …
240 | 0.001%
Step 2a: shuffle data based on traffic distribution
The hourly weights (hour 0: 33%, 1: 14%, 2: 5%, …, 240: 0.001%) drive the task assignment:
Hour | Assigned tasks
0 | 1, 2, 3, 4
1 | 4, 5
2 | 6
… | …
238 | 10
239 | 10
240 | 10
Step 2b: range shuffle data for non-partition column
The event type weights (A: 2%, B: 7%, C: 28%, …, Z: 0.5%) drive the range assignment:
Event type | Assigned task
A-B | 1
C-C | 2, 3, 4
… | …
P-Z | 10
Range shuffling improves data clustering
Without shuffling, data files are unsorted and span wide value ranges (e.g. Z X A / A C Y / C C B). With range shuffling, each writer receives a tight value range (e.g. A B A / C C C / Z Y X).
Sorting within a file brings additional benefits of row group and page level skipping
In a sorted Parquet file, values are clustered across row groups (e.g. row group 1: all X, row group 2: the X/Y/Z boundary values, row group 3: all Z), so a query such as

SELECT * FROM db.tbl WHERE
ts >= … AND ts < … AND
event_type = 'Y'

only reads the row groups whose min-max stats cover 'Y'.
What if sorting is needed
• Sorting in streaming is possible but expensive
• Use batch sorting jobs
How to calculate traffic distribution
FLIP-27 source interface introduced the operator coordinator component
The Source Coordinator runs in the JobManager, while Source Reader-1 … Source Reader-k run in TaskManager-1 … TaskManager-n
Shuffle tasks calculate local stats and send them to the coordinator
Each shuffle task (shuffle-1 … shuffle-n) keeps a local count per hour (e.g. hour 0: 33, hour 1: 14, hour 2: 5, …) and sends it to the shuffle coordinator in the JobManager. The local views can differ slightly from task to task.
Shuffle coordinator does global aggregation
The coordinator merges the local counts from all shuffle tasks into global weights (hour 0: 33%, 1: 14%, 2: 5%, …, 240: 0.001%). Global aggregation addresses the potential problem of different local views.
Shuffle coordinator broadcasts the globally aggregated stats to tasks
Every shuffle task receives the same weight table, so all shuffle tasks make the same decision based on the same stats.
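The aggregate-then-normalize step can be sketched as follows; aggregate is a hypothetical helper, not the actual coordinator code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsAggregation {
    // Merge per-task hour->count maps into one global map,
    // then normalize counts into weights (fractions summing to 1).
    public static Map<Integer, Double> aggregate(List<Map<Integer, Long>> localStats) {
        Map<Integer, Long> global = new HashMap<>();
        for (Map<Integer, Long> local : localStats) {
            local.forEach((hour, count) -> global.merge(hour, count, Long::sum));
        }
        long total = global.values().stream().mapToLong(Long::longValue).sum();
        Map<Integer, Double> weights = new HashMap<>();
        global.forEach((hour, count) -> weights.put(hour, (double) count / total));
        return weights;
    }
}
```

Because every task receives the same normalized weight map back, the per-task routing decisions are deterministic and identical.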
How to shuffle data
Add a custom partitioner after the shuffle operator

dataStream
  .transform("shuffleOperator", shuffleOperatorOutputType, operatorFactory)
  .partitionCustom(binPackingPartitioner, keySelector);

public class BinPackingPartitioner<K> implements Partitioner<K> {
  @Override
  public int partition(K key, int numPartitions);
}
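A sketch of what a concrete partitioner could look like, delegating to a precomputed key-to-task map derived from the broadcast stats. The MapBackedPartitioner name and its fallback behavior are illustrative assumptions; only the partition(key, numPartitions) shape comes from the slide above (a local stand-in interface keeps the sketch self-contained):

```java
import java.util.Map;

// Minimal stand-in for org.apache.flink.api.common.functions.Partitioner.
interface Partitioner<K> {
    int partition(K key, int numPartitions);
}

public class MapBackedPartitioner implements Partitioner<String> {
    private final Map<String, Integer> keyToTask;

    public MapBackedPartitioner(Map<String, Integer> keyToTask) {
        this.keyToTask = keyToTask;
    }

    @Override
    public int partition(String key, int numPartitions) {
        // Route known keys per the precomputed assignment;
        // fall back to hash partitioning for keys missing from the stats.
        Integer task = keyToTask.get(key);
        return task != null ? task % numPartitions : Math.floorMod(key.hashCode(), numPartitions);
    }
}
```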
There are two shuffling strategies
• Bin packing
• Range distribution
Bin packing can combine multiple small keys into a single task or split a single large key across multiple tasks
Task | Assigned keys
T0 | K0, K2, K4, K6, K8
T1 | K7
T2 | K3
T3 | K3
T4 | K3
T5 | K3
… | …
T9 | K1, K5
• Only focuses on balanced weight distribution
• Ignores ordering when assigning keys
• Works well with shuffling by partition columns
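A minimal sketch of such a greedy bin-packing assignment (a hypothetical GreedyBinPacking helper, not the implementation discussed in the talk): each key goes to the currently least-loaded task, heaviest keys first, and a key heavier than the per-task capacity is split across several tasks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GreedyBinPacking {
    // Assign keys (heaviest first) to the least-loaded task; split a key
    // that exceeds the per-task capacity across as many tasks as it needs.
    // Returns key -> list of assigned task ids.
    public static Map<String, List<Integer>> assign(Map<String, Double> keyWeights, int numTasks) {
        double capacity =
            keyWeights.values().stream().mapToDouble(Double::doubleValue).sum() / numTasks;
        double[] load = new double[numTasks];
        Map<String, List<Integer>> assignment = new HashMap<>();
        keyWeights.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .forEach(e -> {
                List<Integer> tasks = new ArrayList<>();
                double remaining = e.getValue();
                while (remaining > 1e-9) {
                    int t = leastLoaded(load);
                    double room = capacity - load[t];
                    // If every task is at capacity (rounding), dump the rest here.
                    double chunk = room > 1e-9 ? Math.min(remaining, room) : remaining;
                    load[t] += chunk;
                    remaining -= chunk;
                    if (!tasks.contains(t)) {
                        tasks.add(t);
                    }
                }
                assignment.put(e.getKey(), tasks);
            });
        return assignment;
    }

    private static int leastLoaded(double[] load) {
        int best = 0;
        for (int i = 1; i < load.length; i++) {
            if (load[i] < load[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

With weights like K3: 40%, K7: 10% and five 2% keys over six tasks, K3 is split across four tasks while the small keys are combined onto a single one, mirroring the table above.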
Range shuffling splits the sort values into contiguous ranges (e.g. A-B, C-C, …) and assigns each range to one or more tasks (T1, T2, T3, T4, …)
• Balances weight distribution with continuous ranges
• Works well with shuffling by non-partition columns
Optimizing for balanced byte-rate distribution can lead to file count skew, where a task handles many long-tail hours
(Figure: long-tail distribution over hours 0, 1, 2, …, N; see https://en.wikipedia.org/wiki/Long_tail)
Many long-tail hours can be assigned to a single task, which can become a bottleneck.
There are two solutions
• Parallelize file flushing and upload
• Limit the file count skew via close-file-cost (like open-file-cost)
Tune close-file-cost to balance between file count skew and byte rate skew
(Figure: skewness vs. close-file-cost — file count skew falls as close-file-cost grows, while byte rate skew rises)
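One way to read close-file-cost: when weighing keys for assignment, each key's effective weight is its byte weight plus a fixed per-file cost, so tiny long-tail hours are no longer "free" and cannot all pile onto one task. A minimal sketch under that assumption (hypothetical helper, not an actual configuration knob shown here):

```java
import java.util.HashMap;
import java.util.Map;

public class CloseFileCost {
    // Effective weight of a key = byte weight + a fixed close-file-cost,
    // analogous to how open-file-cost charges a floor per file on the read side.
    public static Map<String, Double> effectiveWeights(Map<String, Double> byteWeights,
                                                       double closeFileCost) {
        Map<String, Double> out = new HashMap<>();
        byteWeights.forEach((key, weight) -> out.put(key, weight + closeFileCost));
        return out;
    }
}
```

A larger close-file-cost flattens the long tail (less file count skew) at the price of over-weighting tiny hours (more byte rate skew), which is exactly the tradeoff the tuning curve above describes.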
Agenda
• Motivation
• Design
• Evaluation
A: Simple Iceberg ingestion job without shuffling
source-1 … source-n chained directly to writer-1 … writer-n, plus a committer
• Job parallelism is 60
• Checkpoint interval is 10 min
B: Iceberg ingestion with smart shuffling
source-1 … source-n (chained) → shuffle-1 … shuffle-n (shuffle) → writer-1 … writer-n → committer
• Job parallelism is 60
• Checkpoint interval is 10 min
Test setup
• Sink Iceberg table is partitioned hourly by event time
• Benchmark traffic volume is 250 MB/sec
• Event time range is 192 hours
What are we comparing
• Number of files written in one cycle
• File size distribution
• Checkpoint duration
• CPU utilization
• Shuffling skew
Shuffle reduced the number of files by 20x
• Job parallelism is 60
• Event time range is 192 hours
Without shuffling, one cycle flushed 10K files. With shuffling, one cycle flushed 500 files (~2.5x the minimal number of files).
Shuffling greatly improved file size distribution
Percentile | Without shuffling | With shuffling | Improvement
P50 | 55 KB | 913 KB | 17x
P75 | 77 KB | 7 MB | 90x
P95 | 13 MB | 301 MB | 23x
P99 | 18 MB | 306 MB | 17x
Shuffling tamed the small files problem
During checkpoint, writer tasks flush and upload data files
writer-1, writer-2, …, writer-n flush Data Files to DFS; the committer then commits them
Reduced checkpoint duration by 8x
Without shuffling, a checkpoint takes 64s on average. With shuffling, it takes 8s on average.
Record handover between chained operators is a simple method call
1. Kafka Source (source-1 … source-n) chained with 2. Iceberg Sink (writer-1 … writer-n, committer)
Shuffling involves significant CPU overhead on serdes and network I/O
1. Kafka Source (chained) → 2. Shuffle (shuffle-1 … shuffle-n) → 3. Iceberg Sink (writer-1 … writer-n, committer)
Shuffling increased CPU usage by 62%
Without shuffling, avg CPU util is 35%. With shuffling, avg CPU util is 57%. It is all about tradeoffs!
Without shuffling, the checkpoint pause is longer and the catch-up spike is bigger
(Figure: throughput over time — without shuffling there is a deeper trough caused by the pause, followed by a bigger catch-up spike)
Bin packing shuffling won’t be perfect in weight distribution
In the shuffled job, one writer may process data for partitions a, b, c while another processes data only for partitions y, z.
| Min of writer record rate | Max of writer record rate | Skewness (max-min)/min
No shuffling | 4.36 K | 4.44 K | 1.8%
Bin packing (greedy algo) | 4.02 K | 6.39 K | 59%
Our greedy-algorithm implementation of bin packing introduces higher skew than we hoped for.
Future work
• Implement other algorithms
  • Better bin packing with less skew
  • Range partitioner
• Support sketch statistics for high-cardinality keys
• Contribute it to OSS
References
• Design doc: https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
Q&A
Weight table should be
relatively stable
What about new hours as time moves forward?
Absolute hour | Weight
2022-08-03-00 | 0.4
… | …
2022-08-03-12 | 22
2022-08-03-13 | 27
2022-08-03-14 | 38
2022-08-03-15 | ??
Weight table based on relative hour would be stable
Relative hour | Weight
0 | 38
1 | 27
2 | 22
… | …
14 | 0.4
… | …
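The relative-hour key can be computed by truncating both the event time and the current time to hour boundaries. A minimal sketch, assuming processing time is the reference point:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class RelativeHour {
    // Relative hour = how many whole hours the event lags behind "now".
    // Hour 0 is the current hour, hour 1 the previous one, and so on,
    // so a weight table keyed by relative hour stays stable over time.
    public static long relativeHour(Instant eventTime, Instant now) {
        return ChronoUnit.HOURS.between(
            eventTime.truncatedTo(ChronoUnit.HOURS),
            now.truncatedTo(ChronoUnit.HOURS));
    }
}
```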
What about the cold start problem?
• First-time run
• Restart with empty state
• New subtasks from scale-up
Coping with cold start problems
• No shuffle while learning
• Buffer records until the first stats are learned
• New subtasks (scale-up) request stats from the coordinator