It's Time To Stop Using Lambda Architecture
Yaroslav Tkachenko
👋 Hi, I’m Yaroslav
Staff Data Engineer @ Shopify
I like moving things from
Batch to Streaming. A lot.
λ Architecture
Classic Lambda Architecture © Databricks
Other Lambda Incarnations
● “Let’s run a batch job to fix the data”
● “Let’s run a batch job to optimize file size”
● “Let’s run a batch job to reprocess everything”
κ Architecture
Classic Kappa Architecture
Kappa Concerns
● Data availability / retention
● Data consistency
● Handling late-arriving data
● Data reprocessing & backfill
Before we continue...
…why do I like streaming so much?
Latency
● It’s actually not the main goal, but it’s a nice one!
● You have no idea how much latency is OK
Handling Late-Arriving Data: Batch
● How much time should we wait?
● How much time is OK to reprocess?
Handling Late-Arriving Data: Streaming
● Stateless transformations and sinks with updates: easy
● Stateful transformations:
○ Small state: easy
○ Large state: doable
● Sinks without updates: it depends
Operations and Observability: Batch
● “Yeah, it occasionally fails, we just wait 6 hours for a
retry run”
● “Oh, I disabled the wrong job and nobody noticed”
● “We don’t really have metrics for this job, but you can
monitor it with this UI”
Operations and Observability: Streaming
● Modern frameworks like Kafka Streams and Apache Flink
can be deployed with orchestration systems like
Kubernetes
● The same SLO and uptime expectations as with
applications serving traffic
● You can fully embrace CI/CD, observability and other
DevOps/SRE practices and mentality
κ Building Blocks
Core areas: The Log (Kafka), Streaming Framework (Flink or Kafka Streams), Sinks (e.g. Iceberg)
Kafka Topic Compaction
● Can be used if intermediate
values (per key) are not
important
● Enables infinite retention
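To make the compaction idea concrete, here is a toy model in plain Java. It is not Kafka code: the class `CompactionSketch` and the `Rec` type are made up for illustration. It keeps only the most recent value per key, which is roughly what a compacted topic converges to over time. In real Kafka, compaction is enabled per topic with `cleanup.policy=compact`, runs in the background on the broker, and treats a null value as a delete marker (tombstone).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CompactionSketch {
    /** Hypothetical record type: a key plus a value; a null value is a tombstone. */
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
    }

    /** Returns the latest surviving value per key, ordered by last write. */
    static Map<String, String> compact(List<Rec> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Rec r : log) {
            latest.remove(r.key);              // drop the older value for this key
            if (r.value != null) {
                latest.put(r.key, r.value);    // keep the newest value
            }                                  // null value: the key stays deleted
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Rec> log = List.of(
            new Rec("user-1", "a"),
            new Rec("user-2", "b"),
            new Rec("user-1", "c"),     // supersedes user-1=a
            new Rec("user-2", null));   // tombstone for user-2
        System.out.println(compact(log)); // prints {user-1=c}
    }
}
```

Intermediate values ("a" for user-1) are lost, which is why compaction only fits topics where the latest value per key is all that matters.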
Kafka Tiered Storage
● WIP (KIP-405)
● Enables infinite retention
for all topics
● The “Topic Archive Pattern”
has been used for years
Kafka Exactly-Once
● Introduced in 0.11, 4+
years ago (!)
● Eliminates duplicates,
ensures consistency
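producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record1);
    producer.send(record2);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    // Fatal: another producer with the same transactional.id took over, so just close.
    producer.close();
} catch (KafkaException e) {
    // Any other error: abort so neither send becomes visible to read_committed consumers.
    producer.abortTransaction();
}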
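The transactional calls only work on a producer that was configured for transactions. A minimal sketch of that configuration with plain `java.util.Properties` follows. The config keys are real Kafka client settings; the class name, broker address, group id and `transactional.id` value are placeholders. The resulting objects would be passed to the `KafkaProducer` and `KafkaConsumer` constructors from the Kafka client library, which is left out here so the block has no external dependency.

```java
import java.util.Properties;

final class TransactionalProducerConfig {
    /** Producer settings needed before initTransactions() can be called. */
    static Properties producer(String bootstrapServers, String transactionalId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // A stable id per producer instance lets the broker fence stale ("zombie") instances.
        props.put("transactional.id", transactionalId);
        props.put("enable.idempotence", "true");
        return props;
    }

    /** Consumer settings so that only committed transactional records are read. */
    static Properties consumer(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        props.put("isolation.level", "read_committed");
        return props;
    }

    public static void main(String[] args) {
        Properties p = producer("localhost:9092", "orders-tx-1"); // placeholder values
        System.out.println(p.getProperty("transactional.id"));
    }
}
```

Without `isolation.level=read_committed` on the consumer side, records from aborted transactions can still be read, so both halves matter for end-to-end consistency.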
Data Integration
● Bringing all types of data
to Kafka
● Solves “but I don’t have
this in Kafka” question
● Kafka Connect works best
with Kafka, but there are
other options
● Just avoid building
one-off integrations
● Makes sense for sinks
too!
Dynamic Kafka Clusters
● Transient Kafka clusters
can be brought up on demand
for large reprocessing
● This requires protocol
changes for producers and
consumers
● Netflix has done it
Reliable and scalable state as a part of
your streaming engine is mandatory for
any complex κ use-case
State Concepts
● Keyed state is a key to
scalable pipelines
● State is used as a building
block in a lot of high-level
components; you don’t
always see it
● State must be
fault-tolerant
○ KStreams: changelog
○ Flink: checkpoints,
savepoints
KStreams Exactly-Once
● Higher-level API on top of
Kafka exactly-once
primitives, enabled with
processing.guarantee =
exactly_once or
exactly_once_beta
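As a sketch of how little is needed to turn this on, the block below builds a Kafka Streams configuration with plain `java.util.Properties`. The three keys are real Kafka Streams settings; the class name and all values are placeholders. To my knowledge the default guarantee is `at_least_once`, `exactly_once_beta` needs brokers at version 2.5 or newer, and Kafka 3.0 later added `exactly_once_v2` while deprecating the two values named on this slide, so check the version you run.

```java
import java.util.Properties;

final class StreamsExactlyOnceConfig {
    /** Builds a minimal Kafka Streams config with the given processing guarantee. */
    static Properties build(String applicationId, String bootstrapServers, String guarantee) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // "exactly_once" or "exactly_once_beta" at the time of the talk.
        props.put("processing.guarantee", guarantee);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("orders-enricher", "localhost:9092", "exactly_once"); // placeholders
        System.out.println(props.getProperty("processing.guarantee")); // prints exactly_once
    }
}
```

The properties object would be handed to the `KafkaStreams` constructor together with the topology; the transactional producer calls are then managed by the framework.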
Flink Exactly-Once
● Leverages state to support
exactly-once semantics
● Has an advanced Kafka
source/sink integration
● Custom sources and sinks
can be created using
standard patterns
abstract class TwoPhaseCommitSinkFunction<IN, TXN, CONTEXT>
    extends RichSinkFunction<IN>
    implements CheckpointedFunction,
               CheckpointListener
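The two-phase commit pattern behind that class can be sketched without Flink. The toy model below is an illustration under simplifying assumptions, not Flink's implementation: writes go into an open transaction, `preCommit` is called when a checkpoint starts, and `commit` only once the checkpoint is confirmed; `abort` drops everything that was not confirmed. The method names mirror hooks that `TwoPhaseCommitSinkFunction` asks you to implement (`invoke`, `preCommit`, `commit`, `abort`); the class itself and its fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoPhaseCommitSketch {
    private final List<String> committed = new ArrayList<>(); // visible to readers
    private final List<String> pending = new ArrayList<>();   // pre-committed, awaiting confirmation
    private final List<String> open = new ArrayList<>();      // current transaction

    /** Called per record: buffer the write in the open transaction. */
    void invoke(String value) {
        open.add(value);
    }

    /** Checkpoint started: freeze the open transaction; new writes start a fresh one. */
    void preCommit() {
        pending.addAll(open);
        open.clear();
    }

    /** Checkpoint confirmed: make the frozen writes visible. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    /** Failure before confirmation: drop unconfirmed writes; the source will replay them. */
    void abort() {
        pending.clear();
        open.clear();
    }

    List<String> committed() {
        return new ArrayList<>(committed);
    }

    public static void main(String[] args) {
        TwoPhaseCommitSketch sink = new TwoPhaseCommitSketch();
        sink.invoke("a");
        sink.invoke("b");
        sink.preCommit();
        sink.commit();       // "a" and "b" become visible
        sink.invoke("c");
        sink.abort();        // "c" is discarded and would be replayed after recovery
        System.out.println(sink.committed()); // prints [a, b]
    }
}
```

The point of the split is that nothing becomes visible downstream until the checkpoint that covers it has succeeded, so a replay after failure does not produce duplicates.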
State Management
● Advanced concepts
○ KStreams: Processor API
with state stores and
punctuators
○ Flink: state variables,
timers and side outputs
● You can use state as a
database
● You could even
repopulate state when
handling late-arriving
messages
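A toy model of keyed state shows why late-arriving messages are manageable when the engine keeps state around. The sketch below is plain Java, not Kafka Streams or Flink code; the class name, the one-minute window size and the `key@windowStart` state layout are assumptions made for illustration. Each event is routed to the counter of the window it belongs to by event time, so a late event simply updates an older counter and yields a corrected result.

```java
import java.util.HashMap;
import java.util.Map;

final class LateDataStateSketch {
    static final long WINDOW_MS = 60_000L; // hypothetical 1-minute tumbling windows

    // Keyed state: one counter per (key, window start).
    private final Map<String, Long> counts = new HashMap<>();

    /** Applies an event and returns the updated count for the window it falls into. */
    long onEvent(String key, long eventTimeMs) {
        long windowStart = eventTimeMs - (eventTimeMs % WINDOW_MS);
        return counts.merge(key + "@" + windowStart, 1L, Long::sum);
    }

    public static void main(String[] args) {
        LateDataStateSketch state = new LateDataStateSketch();
        state.onEvent("shop-1", 5_000L);   // window starting at 0
        state.onEvent("shop-1", 65_000L);  // window starting at 60000
        // A late event for the first window arrives after the second window has started.
        System.out.println(state.onEvent("shop-1", 10_000L)); // prints 2
    }
}
```

A real engine would also expire old windows with timers or punctuators so state does not grow forever; that part is omitted here.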
Flink State Processor API
● Don’t backfill your state by
replaying from the source:
update state directly
● Combines the power of
batch and streaming by
processing state with batch
and then bootstrapping a
streaming application
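// Read list state of an operator from an existing savepoint
val listState = savepoint
  .readListState(
    "my-state",
    "list-state",
    Types.INT
  )
// ...
// Write a new savepoint (max parallelism 128) to bootstrap the streaming job from
Savepoint
  .create(new MemoryStateBackend(), 128)
  .withOperator("my-state", transformation)
  .write(savepointPath)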
Some data stores are better suited to be
used as data sinks for streaming
pipelines than others
Supporting Updates/Upserts
Updates/Upserts can seriously simplify overall design: they can be
used for data correction.
● Good
○ RDBMS, NoSQL (HBase, Cassandra, Elasticsearch), OLAP
(Pinot), Compacted Kafka topics, “Lakehouse” object stores
with Parquet*
● Problematic
○ OLAP (Druid, ClickHouse), Non-compacted Kafka topics,
Regular object stores with Parquet*
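Why upserts simplify corrections can be shown with a toy model in plain Java (no real data store involved; both sink classes are hypothetical). A late event changes an aggregate that was already emitted. With an upsert-capable sink the pipeline just writes the new value under the same key; with an append-only sink both versions remain and every reader has to work out which one is current.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UpsertSinkSketch {
    /** Sink with upsert support: the latest write per key replaces the previous one. */
    static final class UpsertSink {
        final Map<String, Long> rows = new LinkedHashMap<>();
        void write(String key, long value) { rows.put(key, value); }
    }

    /** Append-only sink: every write becomes a new row, corrections included. */
    static final class AppendOnlySink {
        final List<String> rows = new ArrayList<>();
        void write(String key, long value) { rows.add(key + "=" + value); }
    }

    public static void main(String[] args) {
        UpsertSink upsert = new UpsertSink();
        AppendOnlySink append = new AppendOnlySink();

        // First result for a window, then a correction caused by a late event.
        upsert.write("orders@09:00", 41L);
        append.write("orders@09:00", 41L);
        upsert.write("orders@09:00", 42L);
        append.write("orders@09:00", 42L);

        System.out.println(upsert.rows); // one row holding the corrected value
        System.out.println(append.rows); // two rows; readers must deduplicate
    }
}
```

This is the reason stores in the "Problematic" group usually end up needing a separate cleanup or deduplication job, which is the kind of batch step this talk argues against.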
“Lakehouse” Data Sinks
● Iceberg, Delta, Hudi
● Provide a transactional
journal on top of the object
store. Allow updates,
compaction, even time
travel
FlinkSink.forRowData(input)
    .tableLoader(tableLoader)
    .overwrite(true)
    .hadoopConf(hadoopConf)
    .build()
Addressing κ Concerns
Addressing Concerns
● Data availability / retention
○ Data integration, compacted topics and tiered storage
● Data consistency
○ Exactly-once end-to-end delivery semantics
● Handling late-arriving data
○ State management, proper data sinks
● Data reprocessing & backfill
○ Dynamic Kafka clusters, Savepoints, State Processor API
Use-case 1: stateless transformations, routing and integration
1. Tiered storage
2. Exactly-once
3. Kafka Connect
4. Iceberg
5. Upserts
6. Dynamic Kafka clusters
[Diagram: callouts 1 to 6 mark where each item above applies in the pipeline]
Use-case 2: stateful transformations, analytics
1. Tiered storage
2. Compacted topics
3. Kafka Connect / Debezium
4. Exactly-once
5. Savepoints, State Processor API
6. Upserts
[Diagram: callouts 1 to 6 mark where each item above applies in the pipeline]
Questions?
@sap1ens
