SlideShare a Scribd company logo
Apache Hudi: The Path Forward
Vinoth Chandar, Raymond Xu
PMC, Apache Hudi
Agenda
1) Hudi Intro
2) Table Metadata
3) Caching
4) Community
Hudi Intro
Components, Evolution
Typical Use-Cases
Hudi - the Pioneer
Serverless, transactional layer over
lakes.
Multi-engine, Decoupled storage
from engine/compute
Introduced notions of
Copy-On-Write and
Merge-on-Read
Change capture on lakes
Ideas now heavily borrowed
outside.
The Hudi Stack
Lakes on cheap, scalable Hadoop compatible storage
Built on open file and data formats
Transactional Database Kernel
- Table Format for file layouts, schema, …
- Indexing for faster updates/deletes
- Built-in “daemons” aka table services
- MVCC, OCC Concurrency Control
SQL and Programming APIs
Platform services and operational tools
Universally queryable from popular engines
It’s a platform!
Both streaming + batch style pipelines
- State store for incremental merging intermediate results
- Change events like Apache Kafka topics
For data lake workloads
- Optimized, self-managing data plane
- Large scale data processing
- Lakehouse?
With tightly-integrated components
- Loose coupling => too many to integrate
- Reduce build out time for data lakes
http://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform
Table Format
Avro Schema, Evolution rules
File groups, reduce merge overhead
Timeline => event log, WAL
Internal metadata table
Ongoing
- Schema-on-read i.e
drop,renames (RFC-33)
- Infinite retention
File Formats
Base and Delta Log Files
- Parquet, Orc, HFile Base files
- Avro log files
- Encode changes as blocks
Ongoing
- Parquet log blocks for large
batch writes
- CSV, unstructured formats
- pre-materialization for
masking/data privacy
Indexes
Pluggable, Consistent with txns
For upserts, deletes
- HBase, External index ->
pluggable
- Simple, Bloom/Local vs Global
Ongoing
- RFC-27 Range indexes
- Bucketed Index
- DynamoDB index
- Metadata index
- Record level indexing
Concurrency Control
Hudi did not need multi-writer support
- Treat writers and services differently
- MVCC, non-blocking
- Table services satisfy most needs
Hudi now does Optimistic Concurrency Control
- File level, timeline consistent
- Still MVCC for table services
Future/Ongoing
- Multi-table transactions
- MVCC, fully lock free transactions
Writers
Incremental & Batch write operations
- File sizing, Layout control upon write
- Sorting, compression, Index maintenance
- Spill handling, Multi-threaded write pipeline
Record level merges APIs
- Unique keys, composite,
- key generators, virtual or physical
- partial merges, event-time processing
Record level metadata
- Arrival and event time, watermarks
- Encode source CDC operation
Readers
Hive, Impala, Presto, Spark, Trino, Redshift
Use engine’s native readers
First class support for incremental queries
Flexibility - snapshot vs read-optimized
Future
- Flexible change stream data models.
- Snowflake/BigQuery external tables
Table Services
Self managing database runtime
Table services know each other
- E.g avoid duplicate schedules
- E.g skip compacting files being clustered
Cleaning (committed/uncommitted), archival,
clustering, compaction, ..
Services can be run continuously or scheduled
Platform Services
DeltaStreamer/FlinkStreamer
ingest/ETL utility
Deliver Commit notifications
Kafka Connect Sink
Data Quality checkers
Snapshot, Restore, Export, Import
Table Metadata
Current choices, Ongoing work, Future plans
What qualifies as table metadata?
Schema - Columns names/types, keys, partitioning, evolution/versions
- Typically small, < 1MB per version.
Files/Objects - Length, paths, URIs
- 2M objects => 10s of MBs
Stats - Min, Max, Nulls etc, Per col Per file
- 2M objects => 100+ of MBs
Redo Logs - Changes to metadata => writes, rollbacks, table optimizations.
- Committing (200kb) every minute for a year => ~100 GB
Indexes? - Remember Stats != Index, They can be much bigger.
How’s this stored in Hudi, today?
Schema - Stored within the redo log, consistent with table changes.
- Synced out to different meta-stores, post commit
Files/Objects - Obtained from an internal metadata table partition `files`
- Or just by listing storage - sometimes it’s faster!
Redo Logs - As an event log in the timeline folder “.hoodie”
- Archived out, once transactions/table operations complete/expire.
Stats - We don’t. Yet. Fetch from file footers.
- Again sometimes faster if parallelized, even on cloud storage.
RFC-27 (Ongoing): Flat Files are not cool
Scaling file stats for high scale writing
- 65536 files (1TB data, stored as 16MB
small files)
- 100 columns, 6.5M stat entries
- O(total_cols_tracked_in_table)
- Slow, 10s of seconds.
Range reads to the rescue!
- O(num_cols_in_query) performace
- Interval trees with smart skipping
The Hudi Timeline server
Metadata need efficient serving, caching
- Not just efficient storage
Responsibilities
- Cache file listings across executors
- Amortize access to metadata table
- Performant uncommitted file
cleanup
Incremental sync
- Streaming/continuous writes
- Lazy refreshing of timeline
S3 Baseline: listing p90
- 1sec (10k files),
- 10 sec (100K files)
Timeline Server: 1-10 ms!
File-backed metadata: ~1 second!
Extending the Timeline Server
New APIs
- Serve also stats, redo log information.
- Locking APIs
Let’s make a cluster!
- Shard servers by table/db
- Pluggable backing storage
- Local DB w/ recovery/checkpointing
- Remote DB with
newSQL/transactional storage
Cache
Basic Idea, Design Considerations
Basic Idea
Problems
- Frequent commits => small objects /
blocks => I/O costly
- File System / Block level caching not
very effective
base file b @ t1
base file b’ @ t2
log file 1 for b
log file 2 for b
log file 1 for b’
log file 2 for b’
Time
Hudi FileGroup
log file 3 for b’
Hudi FileGroup fits caching
- Smallest unit to compact
- Size properly to fit cache store
- Cache compacted data for
real-time views => save
computation
Design Considerations
Refresh-Ahead
- Works with Change-Data-Capture
scenario
- Micro-compact FileGroup and save in
cache
Cache
base file b
log file 1 for b
log file 2 for b
compacted
Change-Data-Capture
Refresh-Ahead
Read-Through
- Driven by usage, on-demand
computation
- LRU or LFU
Query I/O
Read-Through
Design Considerations
FileGroup consistent hashing
- Each FileGroup has a unique ID
- Work with distributed cache servers
Cache
Node A
FileGroup
Query I/O
Cache
Node B
FileGroup FileGroup
Coordinator
(Timeline server?)
Query I/O
Lake Storage
Cache (e.g. Alluxio)
Transactionality
- Only committed files can be
cached
- Rollback include cache
invalidation
Pluggable Caching Layer
- Define APIs for pluggable
caching implementations
Community
Adoption, Operating the Apache way, Ongoing work
How we roll?
Friendly and diverse community
- Open and Collaborative
- 20+ PMCs/Committers from 10+
organizations
Developers
- Propose new RFCs (design docs)
- Dev list discussions, JIRA for issue tracking.
Users
- Weekly community on-call rotations
- Issue triage, bug filing process on Github
1200+
Slack
200+
Contributors
1000+
GH Engagers
~10-20
PRs/week
20+
Committers
10+
PMCs
Major Ongoing Works
RFC-26: Z-order indexing, Hilbert curves (PR #3330)
RFC-27: Data skipping/Range indexing (PR #3475)
RFC-29: Hashed Indexing (PR #3173)
RFC-32: Kafka Connect Sink for Hudi (Pre-release; available in 0.10.0)
RFC-33: Full-schema evolution support (PR #3668)
RFC-35: BigQuery integration
Major Ongoing Works
RFC-20: Error tables (PR #3312)
RFC-08: Record level indexing (PR #3508)
RFC-15: Synchronous, Multi table Metadata writes (PR #3590)
Hudi + Dbt (dbt-labs/dbt-spark/pull/210)
PrestoDB/Trino Connectors (Early design)
Hudi is broadly adopted outside
More at : http://hudi.apache.org/powered-by
Engage With Our Community
User Docs : https://hudi.apache.org
Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
Github : https://github.com/apache/hudi/
Twitter : https://twitter.com/apachehudi
Mailing list(s) : dev-subscribe@hudi.apache.org (send an empty email to subscribe)
dev@hudi.apache.org (actual mailing list)
Slack : https://join.slack.com/t/apache-hudi/signup
Thanks!
Questions?

More Related Content

What's hot (20)

PDF
Building large scale transactional data lake using apache hudi
Bill Liu
 
PDF
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
PPTX
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
PDF
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
PDF
Apache Flink internals
Kostas Tzoumas
 
PDF
Deep Dive into the New Features of Apache Spark 3.0
Databricks
 
PDF
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
PDF
Understanding Query Plans and Spark UIs
Databricks
 
PDF
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
PDF
Spark with Delta Lake
Knoldus Inc.
 
PDF
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
PPTX
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward
 
PDF
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
PDF
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
PPTX
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
 
PDF
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
PPTX
Delta lake and the delta architecture
Adam Doyle
 
PDF
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
PPTX
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
PPTX
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
Building large scale transactional data lake using apache hudi
Bill Liu
 
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Apache Flink internals
Kostas Tzoumas
 
Deep Dive into the New Features of Apache Spark 3.0
Databricks
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Understanding Query Plans and Spark UIs
Databricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Spark with Delta Lake
Knoldus Inc.
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
 
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
Delta lake and the delta architecture
Adam Doyle
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 

Similar to Apache Hudi: The Path Forward (20)

PPTX
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Vinoth Chandar
 
PDF
Hoodie - DataEngConf 2017
Vinoth Chandar
 
PPTX
Bringing OLTP woth OLAP: Lumos on Hadoop
DataWorks Summit
 
PDF
Hadoop and object stores can we do it better
gvernik
 
PDF
Hadoop and object stores: Can we do it better?
gvernik
 
PDF
Large-scale Web Apps @ Pinterest
HBaseCon
 
PPTX
Scale your Alfresco Solutions
Alfresco Software
 
PDF
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
PDF
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...
nadine39280
 
PPTX
Hoodie: Incremental processing on hadoop
Prasanna Rajaperumal
 
PPTX
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
alanfgates
 
PPTX
Hive & HBase For Transaction Processing
DataWorks Summit
 
PDF
Data Modeling in Hadoop - Essentials for building data driven applications
Maloy Manna, PMP®
 
PPT
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
PPT
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
PPTX
Optimizing Big Data to run in the Public Cloud
Qubole
 
PDF
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Dibyendu Bhattacharya
 
PDF
Agile data lake? An oxymoron?
samthemonad
 
ODP
Apache Marmotta - Introduction
Sebastian Schaffert
 
PPTX
Overview of MongoDB and Other Non-Relational Databases
Andrew Kandels
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Vinoth Chandar
 
Hoodie - DataEngConf 2017
Vinoth Chandar
 
Bringing OLTP woth OLAP: Lumos on Hadoop
DataWorks Summit
 
Hadoop and object stores can we do it better
gvernik
 
Hadoop and object stores: Can we do it better?
gvernik
 
Large-scale Web Apps @ Pinterest
HBaseCon
 
Scale your Alfresco Solutions
Alfresco Software
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...
nadine39280
 
Hoodie: Incremental processing on hadoop
Prasanna Rajaperumal
 
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
alanfgates
 
Hive & HBase For Transaction Processing
DataWorks Summit
 
Data Modeling in Hadoop - Essentials for building data driven applications
Maloy Manna, PMP®
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
Optimizing Big Data to run in the Public Cloud
Qubole
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Dibyendu Bhattacharya
 
Agile data lake? An oxymoron?
samthemonad
 
Apache Marmotta - Introduction
Sebastian Schaffert
 
Overview of MongoDB and Other Non-Relational Databases
Andrew Kandels
 
Ad

More from Alluxio, Inc. (20)

PDF
Introduction to Apache Iceberg™ & Tableflow
Alluxio, Inc.
 
PDF
Optimizing Tiered Storage for Low-Latency Real-Time Analytics at AI Scale
Alluxio, Inc.
 
PDF
Meet in the Middle: Solving the Low-Latency Challenge for Agentic AI
Alluxio, Inc.
 
PDF
From Data Preparation to Inference: How Alluxio Speeds Up AI
Alluxio, Inc.
 
PDF
Best Practice for LLM Serving in the Cloud
Alluxio, Inc.
 
PDF
Meet You in the Middle: 1000x Performance for Parquet Queries on PB-Scale Dat...
Alluxio, Inc.
 
PDF
How Coupang Leverages Distributed Cache to Accelerate ML Model Training
Alluxio, Inc.
 
PDF
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
Alluxio, Inc.
 
PDF
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
Alluxio, Inc.
 
PDF
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...
Alluxio, Inc.
 
PDF
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
Alluxio, Inc.
 
Introduction to Apache Iceberg™ & Tableflow
Alluxio, Inc.
 
Optimizing Tiered Storage for Low-Latency Real-Time Analytics at AI Scale
Alluxio, Inc.
 
Meet in the Middle: Solving the Low-Latency Challenge for Agentic AI
Alluxio, Inc.
 
From Data Preparation to Inference: How Alluxio Speeds Up AI
Alluxio, Inc.
 
Best Practice for LLM Serving in the Cloud
Alluxio, Inc.
 
Meet You in the Middle: 1000x Performance for Parquet Queries on PB-Scale Dat...
Alluxio, Inc.
 
How Coupang Leverages Distributed Cache to Accelerate ML Model Training
Alluxio, Inc.
 
Alluxio Webinar | Inside Deepseek 3FS: A Deep Dive into AI-Optimized Distribu...
Alluxio, Inc.
 
AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendat...
Alluxio, Inc.
 
AI/ML Infra Meetup | How Uber Optimizes LLM Training and Finetune
Alluxio, Inc.
 
AI/ML Infra Meetup | Optimizing ML Data Access with Alluxio: Preprocessing, ...
Alluxio, Inc.
 
AI/ML Infra Meetup | Deployment, Discovery and Serving of LLMs at Uber Scale
Alluxio, Inc.
 
Alluxio Webinar | What’s New in Alluxio AI: 3X Faster Checkpoint File Creatio...
Alluxio, Inc.
 
AI/ML Infra Meetup | A Faster and More Cost Efficient LLM Inference Stack
Alluxio, Inc.
 
AI/ML Infra Meetup | Balancing Cost, Performance, and Scale - Running GPU/CPU...
Alluxio, Inc.
 
AI/ML Infra Meetup | RAYvolution - The Last Mile: Mastering AI Deployment wit...
Alluxio, Inc.
 
Alluxio Webinar | Accelerate AI: Alluxio 101
Alluxio, Inc.
 
AI/ML Infra Meetup | The power of Ray in the era of LLM and multi-modality AI
Alluxio, Inc.
 
AI/ML Infra Meetup | Exploring Distributed Caching for Faster GPU Training wi...
Alluxio, Inc.
 
AI/ML Infra Meetup | Big Data and AI, Zoom Developers
Alluxio, Inc.
 
Ad

Recently uploaded (20)

PPTX
GALILEO CRS SYSTEM | GALILEO TRAVEL SOFTWARE
philipnathen82
 
PDF
AI Image Enhancer: Revolutionizing Visual Quality”
docmasoom
 
PDF
Step-by-Step Guide to Install SAP HANA Studio | Complete Installation Tutoria...
SAP Vista, an A L T Z E N Company
 
PPTX
Contractor Management Platform and Software Solution for Compliance
SHEQ Network Limited
 
PDF
What companies do with Pharo (ESUG 2025)
ESUG
 
PPTX
TRAVEL APIs | WHITE LABEL TRAVEL API | TOP TRAVEL APIs
philipnathen82
 
PPT
Why Reliable Server Maintenance Service in New York is Crucial for Your Business
Sam Vohra
 
PPTX
slidesgo-unlocking-the-code-the-dynamic-dance-of-variables-and-constants-2024...
kr2589474
 
PPTX
Web Testing.pptx528278vshbuqffqhhqiwnwuq
studylike474
 
PDF
Salesforce Implementation Services Provider.pdf
VALiNTRY360
 
PDF
SAP GUI Installation Guide for Windows | Step-by-Step Setup for SAP Access
SAP Vista, an A L T Z E N Company
 
PDF
Infrastructure planning and resilience - Keith Hastings.pptx.pdf
Safe Software
 
PDF
On Software Engineers' Productivity - Beyond Misleading Metrics
Romén Rodríguez-Gil
 
PPTX
Farrell__10e_ch04_PowerPoint.pptx Programming Logic and Design slides
bashnahara11
 
PDF
AWS_Agentic_AI_in_Indian_BFSI_A_Strategic_Blueprint_for_Customer.pdf
siddharthnetsavvies
 
PDF
Adobe Illustrator Crack Full Download (Latest Version 2025) Pre-Activated
imang66g
 
PPTX
Explanation about Structures in C language.pptx
Veeral Rathod
 
PPT
Brief History of Python by Learning Python in three hours
adanechb21
 
PDF
Enhancing Security in VAST: Towards Static Vulnerability Scanning
ESUG
 
PDF
Protecting the Digital World Cyber Securit
dnthakkar16
 
GALILEO CRS SYSTEM | GALILEO TRAVEL SOFTWARE
philipnathen82
 
AI Image Enhancer: Revolutionizing Visual Quality”
docmasoom
 
Step-by-Step Guide to Install SAP HANA Studio | Complete Installation Tutoria...
SAP Vista, an A L T Z E N Company
 
Contractor Management Platform and Software Solution for Compliance
SHEQ Network Limited
 
What companies do with Pharo (ESUG 2025)
ESUG
 
TRAVEL APIs | WHITE LABEL TRAVEL API | TOP TRAVEL APIs
philipnathen82
 
Why Reliable Server Maintenance Service in New York is Crucial for Your Business
Sam Vohra
 
slidesgo-unlocking-the-code-the-dynamic-dance-of-variables-and-constants-2024...
kr2589474
 
Web Testing.pptx528278vshbuqffqhhqiwnwuq
studylike474
 
Salesforce Implementation Services Provider.pdf
VALiNTRY360
 
SAP GUI Installation Guide for Windows | Step-by-Step Setup for SAP Access
SAP Vista, an A L T Z E N Company
 
Infrastructure planning and resilience - Keith Hastings.pptx.pdf
Safe Software
 
On Software Engineers' Productivity - Beyond Misleading Metrics
Romén Rodríguez-Gil
 
Farrell__10e_ch04_PowerPoint.pptx Programming Logic and Design slides
bashnahara11
 
AWS_Agentic_AI_in_Indian_BFSI_A_Strategic_Blueprint_for_Customer.pdf
siddharthnetsavvies
 
Adobe Illustrator Crack Full Download (Latest Version 2025) Pre-Activated
imang66g
 
Explanation about Structures in C language.pptx
Veeral Rathod
 
Brief History of Python by Learning Python in three hours
adanechb21
 
Enhancing Security in VAST: Towards Static Vulnerability Scanning
ESUG
 
Protecting the Digital World Cyber Securit
dnthakkar16
 

Apache Hudi: The Path Forward

  • 1. Apache Hudi: The Path Forward Vinoth Chandar, Raymond Xu PMC, Apache Hudi
  • 2. Agenda 1) Hudi Intro 2) Table Metadata 3) Caching 4) Community
  • 5. Hudi - the Pioneer Serverless, transactional layer over lakes. Multi-engine, Decoupled storage from engine/compute Introduced notions of Copy-On-Write and Merge-on-Read Change capture on lakes Ideas now heavily borrowed outside.
  • 6. The Hudi Stack Lakes on cheap, scalable Hadoop compatible storage Built on open file and data formats Transactional Database Kernel - Table Format for file layouts, schema, … - Indexing for faster updates/deletes - Built-in “daemons” aka table services - MVCC, OCC Concurrency Control SQL and Programming APIs Platform services and operational tools Universally queryable from popular engines
  • 7. It’s a platform! Both streaming + batch style pipelines - State store for incremental merging intermediate results - Change events like Apache Kafka topics For data lake workloads - Optimized, self-managing data plane - Large scale data processing - Lakehouse? With tightly-integrated components - Loose coupling => too many to integrate - Reduce build out time for data lakes http://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform
  • 8. Table Format Avro Schema, Evolution rules File groups, reduce merge overhead Timeline => event log, WAL Internal metadata table Ongoing - Schema-on-read i.e drop,renames (RFC-33) - Infinite retention
  • 9. File Formats Base and Delta Log Files - Parquet, Orc, HFile Base files - Avro log files - Encode changes as blocks Ongoing - Parquet log blocks for large batch writes - CSV, unstructured formats - pre-materialization for masking/data privacy
  • 10. Indexes Pluggable, Consistent with txns For upserts, deletes - HBase, External index -> pluggable - Simple, Bloom/Local vs Global Ongoing - RFC-27 Range indexes - Bucketed Index - DynamoDB index - Metadata index - Record level indexing
  • 11. Concurrency Control Hudi did not need multi-writer support - Treat writers and services differently - MVCC, non-blocking - Table services satisfy most needs Hudi now does Optimistic Concurrency Control - File level, timeline consistent - Still MVCC for table services Future/Ongoing - Multi-table transactions - MVCC, fully lock free transactions
  • 12. Writers Incremental & Batch write operations - File sizing, Layout control upon write - Sorting, compression, Index maintenance - Spill handling, Multi-threaded write pipeline Record level merges APIs - Unique keys, composite, - key generators, virtual or physical - partial merges, event-time processing Record level metadata - Arrival and event time, watermarks - Encode source CDC operation
  • 13. Readers Hive, Impala, Presto, Spark, Trino, Redshift Use engine’s native readers First class support for incremental queries Flexibility - snapshot vs read-optimized Future - Flexible change stream data models. - Snowflake/BigQuery external tables
  • 14. Table Services Self managing database runtime Table services know each other - E.g avoid duplicate schedules - E.g skip compacting files being clustered Cleaning (committed/uncommitted), archival, clustering, compaction, .. Services can be run continuously or scheduled
  • 15. Platform Services DeltaStreamer/FlinkStreamer ingest/ETL utility Deliver Commit notifications Kafka Connect Sink Data Quality checkers Snapshot, Restore, Export, Import
  • 16. Table Metadata Current choices, Ongoing work, Future plans
  • 17. What qualifies as table metadata? Schema - Columns names/types, keys, partitioning, evolution/versions - Typically small, < 1MB per version. Files/Objects - Length, paths, URIs - 2M objects => 10s of MBs Stats - Min, Max, Nulls etc, Per col Per file - 2M objects => 100+ of MBs Redo Logs - Changes to metadata => writes, rollbacks, table optimizations. - Committing (200kb) every minute for a year => ~100 GB Indexes? - Remember Stats != Index, They can be much bigger.
  • 18. How’s this stored in Hudi, today? Schema - Stored within the redo log, consistent with table changes. - Synced out to different meta-stores, post commit Files/Objects - Obtained from an internal metadata table partition `files` - Or just by listing storage - sometimes it’s faster! Redo Logs - As an event log in the timeline folder “.hoodie” - Archived out, once transactions/table operations complete/expire. Stats - We don’t. Yet. Fetch from file footers. - Again sometimes faster if parallelized, even on cloud storage.
  • 19. RFC-27 (Ongoing): Flat Files are not cool Scaling file stats for high scale writing - 65536 files (1TB data, stored as 16MB small files) - 100 columns, 6.5M stat entries - O(total_cols_tracked_in_table) - Slow, 10s of seconds. Range reads to the rescue! - O(num_cols_in_query) performace - Interval trees with smart skipping
  • 20. The Hudi Timeline server Metadata need efficient serving, caching - Not just efficient storage Responsibilities - Cache file listings across executors - Amortize access to metadata table - Performant uncommitted file cleanup Incremental sync - Streaming/continuous writes - Lazy refreshing of timeline S3 Baseline: listing p90 - 1sec (10k files), - 10 sec (100K files) Timeline Server: 1-10 ms! File-backed metadata: ~1 second!
  • 21. Extending the Timeline Server New APIs - Serve also stats, redo log information. - Locking APIs Let’s make a cluster! - Shard servers by table/db - Pluggable backing storage - Local DB w/ recovery/checkpointing - Remote DB with newSQL/transactional storage
  • 22. Cache Basic Idea, Design Considerations
  • 23. Basic Idea Problems - Frequent commits => small objects / blocks => I/O costly - File System / Block level caching not very effective base file b @ t1 base file b’ @ t2 log file 1 for b log file 2 for b log file 1 for b’ log file 2 for b’ Time Hudi FileGroup log file 3 for b’ Hudi FileGroup fits caching - Smallest unit to compact - Size properly to fit cache store - Cache compacted data for real-time views => save computation
  • 24. Design Considerations Refresh-Ahead - Works with Change-Data-Capture scenario - Micro-compact FileGroup and save in cache Cache base file b log file 1 for b log file 2 for b compacted Change-Data-Capture Refresh-Ahead Read-Through - Driven by usage, on-demand computation - LRU or LFU Query I/O Read-Through
  • 25. Design Considerations FileGroup consistent hashing - Each FileGroup has a unique ID - Work with distributed cache servers Cache Node A FileGroup Query I/O Cache Node B FileGroup FileGroup Coordinator (Timeline server?) Query I/O Lake Storage Cache (e.g. Alluxio) Transactionality - Only committed files can be cached - Rollback include cache invalidation Pluggable Caching Layer - Define APIs for pluggable caching implementations
  • 26. Community Adoption, Operating the Apache way, Ongoing work
  • 27. How we roll? Friendly and diverse community - Open and Collaborative - 20+ PMCs/Committers from 10+ organizations Developers - Propose new RFCs (design docs) - Dev list discussions, JIRA for issue tracking. Users - Weekly community on-call rotations - Issue triage, bug filing process on Github 1200+ Slack 200+ Contributors 1000+ GH Engagers ~10-20 PRs/week 20+ Committers 10+ PMCs
  • 28. Major Ongoing Works RFC-26: Z-order indexing, Hilbert curves (PR #3330) RFC-27: Data skipping/Range indexing (PR #3475) RFC-29: Hashed Indexing (PR #3173) RFC-32: Kafka Connect Sink for Hudi (Pre-release; available in 0.10.0) RFC-33: Full-schema evolution support (PR #3668) RFC-35: BigQuery integration
  • 29. Major Ongoing Works RFC-20: Error tables (PR #3312) RFC-08: Record level indexing (PR #3508) RFC-15: Synchronous, Multi table Metadata writes (PR #3590) Hudi + Dbt (dbt-labs/dbt-spark/pull/210) PrestoDB/Trino Connectors (Early design)
  • 30. Hudi is broadly adopted outside More at : http://hudi.apache.org/powered-by
  • 31. Engage With Our Community User Docs : https://hudi.apache.org Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI Github : https://github.com/apache/hudi/ Twitter : https://twitter.com/apachehudi Mailing list(s) : [email protected] (send an empty email to subscribe) [email protected] (actual mailing list) Slack : https://join.slack.com/t/apache-hudi/signup