ksqlDB: A Stream-Relational Database System
Matthias J. Sax | Software Engineer
@MatthiasJSax
The Vision of a Streaming Database
SQL abstraction over mutable TABLEs and immutable append-only STREAMs
• OLTP?
  • transactions, concurrent updates
• OLAP?
  • complex ad-hoc queries
• Online Stream Processing:
  • Continuous queries over STREAMs and TABLEs
  • Simple state lookups into TABLEs
  • Subscribe to data streams
  • Streaming ETL
  • Materialized view maintenance
Apache Kafka and ksqlDB
[Figure: Broker Cluster (“Storage Layer”) and ksqlDB Cluster (“Compute Layer”), connected via the network]
Apache Kafka: Data Model
Replicated & partitioned, immutable append-only log of messages
Apache Kafka: Data Model (cont.)
Messages:
• plain bytes: key (for partitioning), value
• long timestamp
Topic cleanup policies:
• retention: purge old messages (time or size based)
• compaction: keep the latest message per key
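The two cleanup policies can be illustrated with a small in-memory simulation. This is only a sketch of the semantics described on the slide, not of how brokers implement them; the `Message` type and the function names `apply_retention` and `apply_compaction` are hypothetical and not part of any Kafka API.

```python
from typing import NamedTuple, Optional


class Message(NamedTuple):
    key: str
    value: Optional[str]
    timestamp: int  # milliseconds


def apply_retention(log, now, retention_ms):
    """Time-based retention: keep only messages no older than retention_ms."""
    return [m for m in log if now - m.timestamp <= retention_ms]


def apply_compaction(log):
    """Compaction: keep only the latest message per key, preserving log order."""
    last_offset = {m.key: offset for offset, m in enumerate(log)}
    return [m for offset, m in enumerate(log) if last_offset[m.key] == offset]


log = [
    Message("alice", "v1", 1_000),
    Message("bob", "v1", 2_000),
    Message("alice", "v2", 3_000),
    Message("bob", "v2", 4_000),
    Message("carol", "v1", 5_000),
]

retained = apply_retention(log, now=6_000, retention_ms=2_000)
compacted = apply_compaction(log)
```

With this input, retention keeps the two newest messages regardless of key, while compaction keeps one message per key regardless of age.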
The Stream-Table Duality
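As a minimal sketch of the duality: every update to a table can be recorded as a changelog stream, and replaying that stream rebuilds the table. The `Table` class and the `materialize` function are hypothetical names chosen for illustration, and a `None` value is assumed to mark a delete (a tombstone).

```python
class Table:
    """A mutable table that records every change as a changelog stream."""

    def __init__(self):
        self.rows = {}
        self.changelog = []  # append-only list of (key, value) records

    def upsert(self, key, value):
        self.rows[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.changelog.append((key, None))  # tombstone


def materialize(changelog):
    """Stream -> table: replay the changelog, keeping the latest value per key."""
    rows = {}
    for key, value in changelog:
        if value is None:
            rows.pop(key, None)
        else:
            rows[key] = value
    return rows


users = Table()
users.upsert("alice", "Berlin")
users.upsert("bob", "Lima")
users.upsert("alice", "Rome")
users.delete("bob")

rebuilt = materialize(users.changelog)
```

The table holds only the current state, while the changelog keeps the full history; `rebuilt` equals `users.rows` after the replay.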
ksqlDB: Data Model
Streams and tables of structured records
CREATE STREAM clickstream (
  time BIGINT,
  url VARCHAR,
  status INTEGER,
  bytes INTEGER,
  user_id VARCHAR,
  agent VARCHAR)
WITH (
  kafka_topic = 'cs_topic',
  value_format = 'JSON'
);
CREATE TABLE users (
  user_id INTEGER PRIMARY KEY,
  registered_at BIGINT,
  username VARCHAR,
  name VARCHAR,
  city VARCHAR,
  level VARCHAR)
WITH (
  kafka_topic = 'users_topic',
  value_format = 'AVRO'
);
ksqlDB: Queries
Persistent continuous queries (CQ)
• Take one or more input STREAMs/TABLEs and compute a result STREAM or TABLE
• Deployed in the ksqlDB servers
• Executed in a data-parallel manner
• Scalable
• Fault-tolerant
CREATE STREAM resultStream
AS SELECT...
CREATE TABLE resultTable
AS SELECT...
ksqlDB: Queries (cont.)
Transient client queries
• CLI / Java client / etc.
• Pull queries: simple state lookups against TABLEs
  • “classic” queries
  • Limited in ksqlDB (no aggregations or joins)
• Push queries: subscription to a result STREAM
ksqlDB: Overview
[Figure: upstream producers and downstream consumers connect to the Kafka cluster, which stores STREAMs/TABLEs as partitioned logs; the ksqlDB cluster runs continuous queries (CQ) against it; apps connect to ksqlDB via push (subscription) and pull (query)]
Querying
Query Semantics
Streaming queries have temporal semantics based on event-time
TABLE queries:
• Like regular SQL
• However: TABLEs are treated as “versioned” tables that evolve over time
How to query a STREAM?
Simple Stream Queries
Filter, projection, etc. (stateless)
CREATE STREAM user_clicks AS
  SELECT user_id, status, ucase(agent)
  FROM clickstream
  WHERE user_id = 'mjsax';
Stream Aggregation
Windowed / non-windowed: returns a continuously updating TABLE
CREATE TABLE clicks AS
  SELECT user_id, COUNT(url)
  FROM clickstream
  -- in ksqlDB, the WINDOW clause precedes WHERE
  WINDOW TUMBLING (SIZE 30 SECONDS)
  WHERE bytes > 1024
  GROUP BY user_id
  HAVING COUNT(url) > 20;
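The aggregation above can be approximated in plain Python to show how records map to tumbling windows. This is a batch sketch under simplifying assumptions: it computes the final counts in one pass instead of continuously updating them, it counts rows (assuming `url` is never NULL), and the names `window_start` and `count_clicks` are hypothetical.

```python
from collections import defaultdict

WINDOW_SIZE_MS = 30_000  # 30-second tumbling windows


def window_start(ts_ms, size_ms=WINDOW_SIZE_MS):
    """Each timestamp falls into exactly one window: [start, start + size)."""
    return ts_ms - ts_ms % size_ms


def count_clicks(events, min_bytes=1024, having_min=20):
    """Count events per (user_id, window) after the WHERE filter, then apply HAVING."""
    counts = defaultdict(int)
    for event in events:
        if event["bytes"] > min_bytes:  # WHERE bytes > 1024
            key = (event["user_id"], window_start(event["time"]))
            counts[key] += 1  # COUNT(...) per GROUP BY key and window
    return {key: n for key, n in counts.items() if n > having_min}  # HAVING


events = [
    {"time": 1_000, "user_id": "u1", "bytes": 2_000},
    {"time": 5_000, "user_id": "u1", "bytes": 4_096},
    {"time": 29_999, "user_id": "u1", "bytes": 1_500},
    {"time": 30_000, "user_id": "u1", "bytes": 2_000},  # falls into the next window
    {"time": 6_000, "user_id": "u1", "bytes": 100},     # removed by the filter
]

# A lower HAVING threshold than in the query keeps the sample data small.
result = count_clicks(events, having_min=2)
```

The result is keyed by user and window start, which mirrors why a windowed aggregation yields a TABLE with one row per key and window.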
Stream-Stream Join
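Sliding-window join:
[Figure: left stream records 1 (14:04), 2 (14:16), 3 (14:08); right stream records A (14:01), B (14:11), C (14:23); join results 1⨝A (14:04), 2⨝B (14:16), 3⨝B (14:11)]
Result timestamp: max(l.ts, r.ts)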
CREATE STREAM joinedStream AS
SELECT *
FROM leftStream AS l JOIN rightStream AS r
WITHIN 5 minutes ON l.id = r.id;
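The slide's example can be reproduced with a small simulation. It is a sketch that assumes all records share the same join key and that the window bound is inclusive; `minutes` and `window_join` are hypothetical helper names.

```python
def minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def window_join(left, right, within):
    """Join every left/right pair whose timestamps differ by at most `within` minutes.

    The result record carries the larger of the two input timestamps.
    """
    return [
        (max(l_ts, r_ts), (l_val, r_val))
        for l_ts, l_val in left
        for r_ts, r_val in right
        if abs(l_ts - r_ts) <= within
    ]


left = [(minutes("14:04"), "1"), (minutes("14:16"), "2"), (minutes("14:08"), "3")]
right = [(minutes("14:01"), "A"), (minutes("14:11"), "B"), (minutes("14:23"), "C")]

joined = window_join(left, right, within=5)
```

Record C finds no partner within five minutes, and the out-of-order record 3 (14:08) still joins with B because only the timestamp difference matters.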
N-way Stream-Stream Join
Chaining stream-stream joins is not associative!
• Order matters: ⨝(s1,s2,s3) != (s1 ⨝ s2) ⨝ s3 != (s1 ⨝ s3) ⨝ s2 != (s2 ⨝ s3) ⨝ s1
[Figure, window size = 5 min: input streams s1 = 1 (14:01), 2 (14:16), 3 (14:26); s2 = X (14:06), Y (14:21); s3 = a (14:11), b (14:16)]
• 3-way join ⨝(s1,s2,s3): 2⨝Y⨝b (14:21)
• s1 ⨝ s2: 1⨝X (14:06), 2⨝Y (14:21), 3⨝Y (14:26)
• (s1 ⨝ s2) ⨝ s3: 1⨝X⨝a (14:11), 2⨝Y⨝b (14:21)
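The difference between a true 3-way join and chained binary joins can be checked with a short simulation of the figure's data. The sketch rests on several assumptions: all records share one join key, the stream labels s1, s2, s3 are assigned for illustration, a 3-way match requires all three timestamps to lie within one window, and a binary join result carries the larger input timestamp. The helper names are hypothetical.

```python
from itertools import product


def minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def join(left, right, within):
    """Binary windowed join; the result timestamp is max(l.ts, r.ts)."""
    return [
        (max(l_ts, r_ts), l_val + r_val)
        for l_ts, l_val in left
        for r_ts, r_val in right
        if abs(l_ts - r_ts) <= within
    ]


def nway_join(streams, within):
    """N-way join: all joined records must lie within a single window."""
    out = []
    for combo in product(*streams):
        timestamps = [ts for ts, _ in combo]
        if max(timestamps) - min(timestamps) <= within:
            out.append((max(timestamps), sum((val for _, val in combo), ())))
    return out


s1 = [(minutes("14:01"), ("1",)), (minutes("14:16"), ("2",)), (minutes("14:26"), ("3",))]
s2 = [(minutes("14:06"), ("X",)), (minutes("14:21"), ("Y",))]
s3 = [(minutes("14:11"), ("a",)), (minutes("14:16"), ("b",))]

three_way = nway_join([s1, s2, s3], within=5)
chained_12_3 = join(join(s1, s2, within=5), s3, within=5)
chained_13_2 = join(join(s1, s3, within=5), s2, within=5)
```

In the chained variant, the intermediate result 1⨝X takes the timestamp 14:06, which moves it within five minutes of a (14:11) even though 1 (14:01) and a are ten minutes apart. That timestamp shift is why the three variants return different results.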
Table Queries
As in regular SQL, however…
Do you think that’s a table you are querying?
Filters, projections, and aggregations are executed on changelog streams
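To illustrate what running a filter on a changelog stream can mean, the sketch below applies a predicate to each table update and emits a tombstone when a previously matching row stops matching. It is a simplified model, not the Kafka Streams implementation; `filter_table` is a hypothetical name and `None` is assumed to mark a delete.

```python
def filter_table(changelog, predicate):
    """Apply a WHERE-style filter to a table's changelog.

    Returns the output changelog and the materialized result table.
    """
    result = {}
    output = []
    for key, value in changelog:
        if value is not None and predicate(value):
            result[key] = value
            output.append((key, value))   # row enters or stays in the result
        elif key in result:
            del result[key]
            output.append((key, None))    # row leaves the result: emit tombstone
        # otherwise the row was never in the result, so nothing is emitted
    return output, result


changelog = [("alice", 10), ("bob", 3), ("alice", 2), ("bob", 7)]
output, result = filter_table(changelog, lambda v: v > 5)
```

The update `("alice", 2)` produces a delete in the output even though the input contains no delete, which is the main difference from filtering a STREAM record by record.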
Stream-Table Join
Data enrichment via table lookups
Stream-Table Join (cont.)
Stream-Table join is a temporal join
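[Figure: table changelog with updates a (14:01), b (14:03), c (14:05), b (14:08), a (14:11); stream records at 14:02, 14:04, 14:06, 14:07, 14:10]
Table versions over time:
• 14:01: a (14:01)
• 14:03: a (14:01), b (14:03)
• 14:05: a (14:01), b (14:03), c (14:05)
• 14:08: a (14:01), b (14:08), c (14:05)
• 14:11: a (14:11), b (14:08), c (14:05)
Each stream record joins against the table version valid at its timestamp: 14:02 → 14:01, 14:04 → 14:03, 14:06 and 14:07 → 14:05, 14:10 → 14:08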
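A temporal join can be sketched with a table that keeps every version of each row and answers lookups as of a given timestamp. The table updates below follow the figure; the keys assigned to the stream records are hypothetical, since the slide elides them, and `VersionedTable` is an illustrative name rather than a ksqlDB or Kafka Streams API.

```python
import bisect
from collections import defaultdict


def minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


class VersionedTable:
    """Keeps all versions of each row, ordered by timestamp."""

    def __init__(self):
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def put(self, key, ts, value):
        i = bisect.bisect_right(self._timestamps[key], ts)
        self._timestamps[key].insert(i, ts)
        self._values[key].insert(i, value)

    def get(self, key, ts):
        """Return the latest version with timestamp <= ts, or None."""
        i = bisect.bisect_right(self._timestamps[key], ts)
        return self._values[key][i - 1] if i > 0 else None


table = VersionedTable()
for ts, key in [("14:01", "a"), ("14:03", "b"), ("14:05", "c"), ("14:08", "b"), ("14:11", "a")]:
    table.put(key, minutes(ts), f"{key}@{ts}")

# Stream records: (timestamp, key); the keys are assumed for illustration.
stream = [("14:02", "a"), ("14:04", "b"), ("14:06", "c"), ("14:07", "b"), ("14:10", "b")]

# Inner join: a stream record without a matching table version is dropped.
enriched = [
    (ts, key, version)
    for ts, key in stream
    if (version := table.get(key, minutes(ts))) is not None
]
```

The record at 14:07 still sees the older version of b, because the update at 14:08 is not yet valid at that point in event time.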
Table-Table Join
Table-Table join is a temporal join
[Figure: tables users (versions v1, v5, v6) and details (versions v2, v6) joined into profile (versions v2, v5, v6)]
Runtime: Kafka Streams
Persistent Queries
Executed as Kafka Streams programs
Kafka Streams:
• Java client library (part of the Apache Kafka project)
• High-level DSL to process Kafka topics as KStreams and KTables
• Executes a dataflow program, represented as a directed graph of operators
  • filter()/map()
  • groupBy()/windowedBy()
  • aggregate()
  • join()
  • etc.
• Scalable
• Fault-tolerant
Persistent Queries
Kafka Streams topologies:
Parallel execution:
Kafka Streams Internals
Consumers, Producers, and RocksDB
Kafka Streams Internals (cont.)
Data repartitioning and fault-tolerance
Fault-Tolerance and Hot Standbys
Recovery: replaying the changelog topic
Hot standbys: eagerly replaying the changelog topic
Hot standbys allow for instant fail-over.
Hot standbys allow for HA pull queries.
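The difference between cold recovery and a hot standby can be sketched by tracking how far each store has already replayed the changelog. `StateStore` and its `restore` method are hypothetical names for illustration; real state stores are backed by RocksDB and a Kafka changelog topic.

```python
class StateStore:
    """A key-value store that is rebuilt by replaying a changelog."""

    def __init__(self):
        self.data = {}
        self.offset = 0  # next changelog offset to apply

    def restore(self, changelog):
        """Apply all records not yet seen; return how many were replayed."""
        pending = changelog[self.offset:]
        for key, value in pending:
            if value is None:
                self.data.pop(key, None)  # tombstone
            else:
                self.data[key] = value
        self.offset = len(changelog)
        return len(pending)


changelog = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 4)]

# Hot standby: has eagerly replayed the changelog up to the latest-but-one record.
standby = StateStore()
standby.restore(changelog[:4])

# Fail-over: the standby only replays the tail, a cold store replays everything.
standby_replayed = standby.restore(changelog)
cold = StateStore()
cold_replayed = cold.restore(changelog)
```

Both stores end in the same state, but the standby reaches it after replaying one record instead of five, which is why fail-over to a standby is close to instant.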
Elasticity
Dynamic scaling as special case of store recovery
Challenges
Work in Progress
Streaming SQL
• Concise and more powerful language
• Improved time/operator semantics
Consistency guarantees
• ksqlDB is an async system: vector clock approach for improved time tracking
Applications and transient queries
• More efficient and more scalable pull/push queries
• Improved INSERT/UPDATE/DELETE support?
Future Work
Query optimization
• Current state: rule-based
  • filter push-down
  • merging of repartition topics
  • merging of input/output/changelog topics
• Streaming cost model and cost-based optimization
• Adaptive re-optimization at runtime
• Query merging/splitting
Runtime improvements:
• Internal and optimized (binary) data format to avoid expensive (de)serialization costs
• Task assignment: load balancing vs. stickiness
Richer SQL:
• Sub-query support
References
• KSQL: Streaming SQL Engine for Apache Kafka
Hojjat Jafarpour, Rohan Desai, Damian Guy
EDBT '19: Proceedings of the 22nd International Conference on Extending Database Technology, 2019
• Streams and Tables: Two Sides of the Same Coin
Matthias J. Sax, Guozhang Wang, Matthias Weidlich, Johann-Christoph Freytag
BIRTE '18: Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics, 2018
• https://kafka.apache.org/books-and-papers
• https://ksqldb.io/
Thanks! We are hiring!
@MatthiasJSax
matthias@confluent.io | mjsax@apache.org
