Correctly Loading Incremental Data at Scale
April 16, 2024
Copyright © 2024 Tobiko Data, Inc.
About Me
● Co-Founder & CTO at Tobiko Data
● Creator of SQLGlot
● Previously Airbnb, Netflix
● Experience with recommender systems, experimentation (XP), and semantic layers
What kind of data is incremental?
● Facts
○ Clicks, views, etc.
● Previous events are immutable
● Can be quite large
What is not incremental?
● Dimensions
○ Users, billing info, etc.
● Rows can change over time
● Usually smaller
Should I avoid incremental loading?
Incremental loading exists for a reason
How do you read data incrementally?
● Maximal timestamp
● Time partitions
Maximal timestamp
● If the table doesn’t exist, compute the whole history in one shot
● If the table does exist, query it to find the last processed timestamp
and then use that to filter the upstream source data.
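A minimal sketch of these two branches, using Python's built-in sqlite3 module as a stand-in for a warehouse. The table names source_events and target_events, the columns id and ts, and the function load_incremental are hypothetical and not from the talk.

```python
import sqlite3


def load_incremental(conn: sqlite3.Connection) -> int:
    """Copy rows newer than the last loaded timestamp; return rows inserted."""
    cur = conn.cursor()
    exists = cur.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'target_events'"
    ).fetchone()
    if exists is None:
        # Table doesn't exist: compute the whole history in one shot.
        cur.execute("CREATE TABLE target_events (id INTEGER, ts TEXT)")
        cur.execute("INSERT INTO target_events SELECT id, ts FROM source_events")
    else:
        # Table exists: find the last processed timestamp and filter upstream.
        (last_ts,) = cur.execute("SELECT MAX(ts) FROM target_events").fetchone()
        cur.execute(
            "INSERT INTO target_events SELECT id, ts FROM source_events WHERE ts > ?",
            (last_ts or "",),  # empty target: MAX() is NULL, so load everything
        )
    inserted = cur.rowcount
    conn.commit()
    return inserted
```

Note that the query has to branch on whether the target exists, and that a row arriving later with a timestamp at or below the stored maximum is silently skipped.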
Maximal timestamp pros / cons
● Doesn't require extra state
● It assumes that you’re able to load the entire table in one go
● The query is more complicated to write and maintain
● Custom SQL is often needed to handle incremental models
differently in development
● You cannot detect or fix data gaps
Time partitions
● The scheduler tracks which time ranges need to run and passes them to the
query
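A sketch of what such a query might look like, assuming the scheduler hands it a half-open range [start, end). The function load_range and the tables source_events and target_events are hypothetical; sqlite3 stands in for a warehouse.

```python
import sqlite3
from datetime import date


def load_range(conn: sqlite3.Connection, start: date, end: date) -> int:
    """Load only the events with start <= ts < end; return rows inserted."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            """
            INSERT INTO target_events
            SELECT id, ts FROM source_events
            WHERE ts >= ? AND ts < ?
            """,
            (start.isoformat(), end.isoformat()),
        )
        return cur.rowcount
```

The query never inspects the target table; it only sees the range it was given, so the same SQL serves the first load, daily runs, and backfills.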
Time partitions pros / cons
● Requires extra state
● The queries are simpler
● Backfills are more scalable and reliable
● You can easily compute just one day of data or manually fix gaps
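The extra state is what makes gaps visible. As a rough illustration, assuming the scheduler records each processed day in a set (the name missing_days and the daily granularity are assumptions, not from the talk):

```python
from datetime import date, timedelta


def missing_days(start: date, end: date, processed: set) -> list:
    """Return the days in [start, end] that have not been processed yet."""
    total = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(total)]
    return [d for d in days if d not in processed]


# Two days were processed; the other two show up as gaps to backfill.
done = {date(2024, 4, 1), date(2024, 4, 3)}
gaps = missing_days(date(2024, 4, 1), date(2024, 4, 4), done)
```

Each missing day can then be run on its own, which is why backfills can be split into small, retryable pieces.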
Initial load: maximal timestamp vs time partitions
Merge - incremental by unique key
● Requires a primary key
● Simple to ensure consistency and no duplicates
● Can have performance issues
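A small sketch of a merge by unique key. SQLite has no MERGE statement, so its upsert clause (INSERT ... ON CONFLICT DO UPDATE, available in SQLite 3.24 and later) is swapped in here; the table target_events, its columns, and the function merge_rows are hypothetical.

```python
import sqlite3


def merge_rows(conn: sqlite3.Connection, rows: list) -> None:
    """Insert new keys and update existing ones, matching on the primary key."""
    with conn:  # one transaction for the whole batch
        conn.executemany(
            """
            INSERT INTO target_events (id, status) VALUES (?, ?)
            ON CONFLICT(id) DO UPDATE SET status = excluded.status
            """,
            rows,
        )
```

Because every incoming row is matched against the key, reruns cannot create duplicates, but the matching itself is where the cost comes from on large tables.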
Insert overwrite
● Very efficient
● Doesn't handle updates or late arriving data easily
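A sketch of overwriting one day of data. SQLite has no INSERT OVERWRITE, so this emulates it with a DELETE followed by an INSERT in a single transaction; the function overwrite_day and the tables source_events and target_events are hypothetical.

```python
import sqlite3
from datetime import date, timedelta


def overwrite_day(conn: sqlite3.Connection, day: date) -> None:
    """Replace everything stored for `day` with a fresh copy from the source."""
    lo = day.isoformat()
    hi = (day + timedelta(days=1)).isoformat()
    with conn:  # delete and insert commit together or not at all
        conn.execute(
            "DELETE FROM target_events WHERE ts >= ? AND ts < ?", (lo, hi)
        )
        conn.execute(
            """
            INSERT INTO target_events
            SELECT id, ts FROM source_events
            WHERE ts >= ? AND ts < ?
            """,
            (lo, hi),
        )
```

No row matching is needed: the whole range is dropped and rewritten, so rerunning the same day is safe without a primary key.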
Merge vs Insert overwrite
● Merge is expensive without partition pruning
● Merge is more effective when only a few records need to be updated
● Insert overwrite is inefficient when only a few rows have changed
● Insert overwrite doesn't need to match rows so can be more efficient
Reading late arriving data
Writing late arriving data
● Avoiding data leakage with insert overwrite
○ Filter your results to the expected range and insert the complete
range
● Use predicate pushdown with merge
○ Avoid full scans of your data
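One way to read the insert overwrite advice, sketched under assumptions: the source is organized by arrival_date, events carry their own event_date, and late rows arrive at most late_days after the event. All table, column, and function names here are hypothetical, and sqlite3 stands in for a warehouse.

```python
import sqlite3
from datetime import date, timedelta


def overwrite_range(
    conn: sqlite3.Connection, start: date, end: date, late_days: int = 2
) -> None:
    """Rebuild target rows whose event_date falls in [start, end)."""
    lo, hi = start.isoformat(), end.isoformat()
    # Read a wider arrival window so late-arriving rows are picked up.
    read_hi = (end + timedelta(days=late_days)).isoformat()
    with conn:
        conn.execute(
            "DELETE FROM target WHERE event_date >= ? AND event_date < ?", (lo, hi)
        )
        conn.execute(
            """
            INSERT INTO target (id, event_date)
            SELECT id, event_date FROM source
            WHERE arrival_date >= ? AND arrival_date < ?
              AND event_date >= ? AND event_date < ?
            """,
            (lo, read_hi, lo, hi),
        )
```

The second pair of predicates is the important part: without it, rows from the wider read window that belong to other days would leak into a range this run does not own.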
Thank you!
