@rmoff robin@confluent.io
Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL
Robin Moffatt, Developer Advocate
@rmoff / Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL 2
$ whoami
• Developer Advocate @ Confluent
• Working in data & analytics since 2001
• Oracle ACE Director & Dev Champion
• Blogging: https://rmoff.net & http://cnfl.io/rmoff
• Twitter: @rmoff
Housekeeping Items
● This session will last about an hour.
● This session will be recorded.
● You can submit your questions by entering them into the
GoToWebinar panel.
● The last 10-15 minutes will consist of Q&A.
● The slides and recording will be available after the talk.
Streaming ETL with Apache Kafka and KSQL
Database offload: Hadoop / Object Storage / Cloud DW for Analytics
[Diagram: RDBMS offloaded to HDFS / S3 / BigQuery etc.]
Streaming ETL with Apache Kafka and KSQL
[Diagram: order items and customer data from the RDBMS pass through stream processing to produce customer orders.]
Real-time Event Stream Enrichment with Apache Kafka and KSQL
[Diagram: order events and customer data from the RDBMS pass through stream processing to produce customer orders, consumed by <y>.]
Transform Once, Use Many
[Diagram: order events and customer data from the RDBMS pass through stream processing to produce customer orders, consumed by <y> and by a New App, <x>.]
Transform Once, Use Many
[Diagram: order events and customer data from the RDBMS pass through stream processing to produce customer orders, consumed by <y>, by a New App <x>, and by HDFS / S3 / etc.]
The Connect API of Apache Kafka®
Reliable and scalable integration of Kafka with other systems – no coding required.
✓ Fault tolerant and automatically load balanced
✓ Extensible API
✓ Single Message Transforms
✓ Part of Apache Kafka, included in Confluent Open Source
✓ Centralized management and configuration
✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3, syslog
✓ Supports CDC ingest of events from RDBMS
✓ Preserves data schema
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
  "table.whitelist": "sales,orders,customers"
}
https://docs.confluent.io/current/connect/
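As a rough sketch of how a configuration like the one on this slide is typically deployed: a Kafka Connect worker exposes a REST API, and a connector is created by POSTing a JSON document with `name` and `config` keys to `/connectors`. The connector name, the worker address in the comment, and the `mode`, `incrementing.column.name` and `topic.prefix` values below are assumptions added for illustration; they are not taken from the slides.

```python
import json


def build_connector_request(name, config):
    """Wrap connector settings in the {"name", "config"} envelope used by the Connect REST API."""
    return json.dumps({"name": name, "config": config}, indent=2)


jdbc_config = {
    # These three settings come from the slide.
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers",
    # Assumed values for illustration only (not from the slides).
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "mysql-",
}

# "jdbc-source-demo" is a hypothetical connector name.
payload = build_connector_request("jdbc-source-demo", jdbc_config)
print(payload)

# The payload could then be saved to a file and submitted, for example with:
#   curl -X POST -H "Content-Type: application/json" \
#        --data @payload.json http://localhost:8083/connectors
# (localhost:8083 is an assumed worker address.)
```

Keeping the configuration as plain data is what makes the "no coding required" claim work: the connector itself is generic, and only the JSON changes between pipelines.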
Kafka Connect
[Diagram: Kafka Connect workers and tasks sit alongside the Kafka brokers, connecting sources and sinks. Systems pictured: Amazon S3, syslog, flat file, CSV, JSON, MQTT.]
Considerations for Integration into Apache Kafka
Photo by Matthew Smith on Unsplash
• Chucking data over the fence into a Kafka topic is not enough
• We need standard ways of building data pipelines in Kafka
  • Schema handling
  • Serialisation formats
Considerations for Integration into Apache Kafka
Photo by Matthew Smith on Unsplash
• Confluent Schema Registry & Avro are a great way to do this
• Downstream users of the data can then easily use it
  • KSQL
  • Kafka Connect
  • Kafka Streams
  • Custom apps
The Confluent Schema Registry
[Diagram: MySQL → Kafka Connect → Avro message → Kafka Connect → Elasticsearch; the Avro schema is held in the Schema Registry.]
The Confluent Schema Registry
Source (MySQL) schema is preserved
Target (Elasticsearch) schema mapping is automagically built
Integrating Databases with Kafka
• CDC (Change Data Capture) is a generic term for capturing changing data, typically from an RDBMS.
• Two general approaches:
  • Query-based CDC
  • Log-based CDC
There are other options, including hacks with triggers, Flashback etc., but these are system- and/or technology-specific.
Read more: http://cnfl.io/kafka-cdc
Query-based CDC
• Use a database query to try to identify new & changed rows
SELECT * FROM my_table
 WHERE col > <value of col last time we polled>
• Implemented with the open source Kafka Connect JDBC connector
• Can import based on table names, schema, or bespoke SQL query
• Incremental ingest driven through incrementing ID column and/or timestamp column
Read more: http://cnfl.io/kafka-cdc
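The polling idea behind query-based CDC can be sketched in a few lines. This is a minimal illustration using Python's built-in `sqlite3` with an in-memory database, not how the JDBC connector is implemented; the table layout, the `poll_new_rows` helper and the sample rows are all assumptions made up for the example.

```python
import sqlite3


def poll_new_rows(conn, last_seen_id):
    """Return rows with an id above the last one seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, item FROM my_table WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seen_id
    return rows, new_last


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO my_table (item) VALUES (?)", [("apple",), ("pear",)])

# First poll picks up everything inserted so far.
first_batch, offset = poll_new_rows(conn, 0)

# One insert and one delete happen before the next poll.
conn.execute("INSERT INTO my_table (item) VALUES ('plum')")
conn.execute("DELETE FROM my_table WHERE id = 1")

# Second poll sees the new row, but nothing tells it that row 1 was deleted.
second_batch, offset = poll_new_rows(conn, offset)

print(first_batch)
print(second_batch)
```

The second poll returns only the newly inserted row. The delete leaves no trace in the query result, which is the limitation that log-based CDC addresses.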
Log-based CDC
• Use the database's transaction log to identify every single change event
• Various CDC tools available that integrate with Apache Kafka (more on this later…)
Read more: http://cnfl.io/kafka-cdc
Query-based vs Log-based CDC
Photo by Matese Fields on Unsplash
• Query-based
  + Usually easier to set up, and requires fewer permissions
  - Needs specific columns in source schema
  - Impact of polling the DB (or the tradeoff of higher latency)
  - Can't track deletes
Read more: http://cnfl.io/kafka-cdc
Query-based vs Log-based CDC
Photo by Sebastian Pociecha on Unsplash
• Log-based
  + Greater data fidelity
  + Lower latency
  + Lower impact on source
  - More setup steps
  - Higher system privileges required
  - For proprietary databases, usually $$$
Read more: http://cnfl.io/kafka-cdc
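To illustrate what "greater data fidelity" means in practice, here is a small sketch that replays a list of change events to rebuild the current state of a table. The event shape, with `op`, `before` and `after` fields, is loosely modelled on the envelope Debezium emits but is heavily simplified; the field values and the `apply_change_event` helper are assumptions for illustration.

```python
def apply_change_event(table, event):
    """Apply one change event to a dict of rows keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):
        # Create or update: the "after" image holds the new row.
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":
        # Delete: only the "before" image is present.
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")
    return table


# Hypothetical change events, in commit order.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ann"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "u", "before": {"id": 1, "name": "Ann"}, "after": {"id": 1, "name": "Anna"}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]

state = {}
for event in events:
    apply_change_event(state, event)

print(state)
```

Because every insert, update and delete arrives as its own event, a consumer can reconstruct the table exactly, including the removal of row 2.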
Which Log-Based CDC Tool?
• Open Source RDBMS, e.g. MySQL, PostgreSQL
  • Debezium
  • (+ paid options)
• Mainframe, e.g. VSAM, IMS
  • Attunity
  • SQData
• Proprietary RDBMS, e.g. Oracle, MS SQL
  • Attunity
  • IBM InfoSphere Data Replication
  • Oracle GoldenGate
  • SQData
  • HVR
All these options integrate with Apache Kafka and Confluent Platform, including support for the Schema Registry
For query-based CDC, use the Confluent Kafka Connect JDBC connector
Read more: http://cnfl.io/kafka-cdc
“But I need to join…aggregate…filter…”
KSQL is a Declarative Stream Processing Language
KSQL is the Streaming SQL Engine for Apache Kafka
KSQL in Development and Production
• Interactive KSQL, for development and testing (via REST): “Hmm, let me try out this idea...”
• Headless KSQL, for production: desired KSQL queries have been identified
KSQL for Streaming ETL
CREATE STREAM vip_actions AS
  SELECT userid, page, action
  FROM clickstream c
  LEFT JOIN users u
    ON c.userid = u.user_id
  WHERE u.level = 'Platinum';
Joining, filtering, and aggregating streams of event data
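For readers less familiar with stream-table joins, the logic of the `vip_actions` statement can be sketched in plain Python. This is only an analogy under simplifying assumptions: the `users` table is modelled as a dict keyed by user ID, the clickstream as a finite list, and all sample values are made up.

```python
def vip_actions(clickstream, users):
    """Yield (userid, page, action) for clicks made by Platinum users."""
    for click in clickstream:
        # LEFT JOIN: look up the user, which may find no match.
        user = users.get(click["userid"])
        # WHERE u.level = 'Platinum': unmatched or non-Platinum rows are dropped.
        if user is not None and user["level"] == "Platinum":
            yield click["userid"], click["page"], click["action"]


# Hypothetical lookup table and event stream.
users = {
    "u1": {"user_id": "u1", "level": "Platinum"},
    "u2": {"user_id": "u2", "level": "Gold"},
}
clickstream = [
    {"userid": "u1", "page": "/home", "action": "view"},
    {"userid": "u2", "page": "/cart", "action": "click"},
    {"userid": "u3", "page": "/home", "action": "view"},
    {"userid": "u1", "page": "/checkout", "action": "buy"},
]

result = list(vip_actions(clickstream, users))
print(result)
```

The sketch processes each event as it arrives and emits a result immediately, which mirrors how a continuous query differs from a batch job that runs once over a complete data set.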
KSQL for Anomaly Detection
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
Identifying patterns or anomalies in real-time data, surfaced in milliseconds
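A tumbling window assigns each event to exactly one fixed-size, non-overlapping time bucket. The sketch below shows that bucketing for the `possible_fraud` query, assuming integer timestamps in seconds and a finite list of events rather than an unbounded stream; the card numbers and timestamps are made up.

```python
from collections import Counter

WINDOW_SIZE_SECONDS = 5  # matches SIZE 5 SECONDS in the query


def possible_fraud(attempts, threshold=3):
    """Count attempts per (window start, card) and keep counts above the threshold."""
    counts = Counter()
    for timestamp, card_number in attempts:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // WINDOW_SIZE_SECONDS) * WINDOW_SIZE_SECONDS
        counts[(window_start, card_number)] += 1
    # HAVING count(*) > 3
    return {key: n for key, n in counts.items() if n > threshold}


# Hypothetical (timestamp_seconds, card_number) events.
attempts = [
    (0, "4111"), (1, "4111"), (2, "4111"), (4, "4111"),  # four in window [0, 5)
    (3, "5500"),
    (6, "4111"), (7, "4111"),                            # two in window [5, 10)
]

flagged = possible_fraud(attempts)
print(flagged)
```

Only card "4111" in the first window exceeds the threshold; its two attempts in the next window are counted separately because tumbling windows do not overlap.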
KSQL for Real-Time Monitoring
• Log data monitoring, tracking and alerting
  • syslog data
  • Sensor / IoT data
CREATE STREAM SYSLOG_INVALID_USERS AS
  SELECT HOST, MESSAGE
  FROM SYSLOG
  WHERE MESSAGE LIKE '%Invalid user%';
http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
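The `LIKE '%Invalid user%'` predicate is a substring match, so the filter amounts to the following sketch. The event structure and the sample log lines are hypothetical, chosen only to show which rows pass.

```python
def invalid_user_events(syslog_events):
    """Keep (host, message) pairs whose message contains 'Invalid user'."""
    return [
        (event["host"], event["message"])
        for event in syslog_events
        if "Invalid user" in event["message"]
    ]


# Hypothetical syslog events.
syslog_events = [
    {"host": "web01", "message": "Invalid user admin from 10.0.0.5"},
    {"host": "web01", "message": "Accepted publickey for deploy"},
    {"host": "db01", "message": "sshd: Invalid user test from 10.0.0.9"},
]

matches = invalid_user_events(syslog_events)
print(matches)
```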
KSQL for Data Transformation
CREATE STREAM views_by_userid
  WITH (PARTITIONS=6, REPLICAS=5,
        VALUE_FORMAT='AVRO',
        TIMESTAMP='view_time') AS
  SELECT * FROM clickstream
  PARTITION BY user_id;
Make simple derivations of existing topics from the command line
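`PARTITION BY user_id` re-keys the stream so that all events for a given user land in the same partition. The sketch below illustrates the idea by hashing the key and taking it modulo the partition count. CRC32 is used here purely as a stand-in so the example is deterministic; it is not the hash Kafka's default partitioner uses, and the sample events are made up.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 6  # matches PARTITIONS=6 in the statement


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition number; CRC32 is an illustrative stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def repartition(events, key_field):
    """Group events by the partition their key hashes to."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_for(event[key_field])].append(event)
    return partitions


# Hypothetical clickstream events.
events = [
    {"user_id": "alice", "page": "/home"},
    {"user_id": "bob", "page": "/cart"},
    {"user_id": "alice", "page": "/checkout"},
]

by_user = repartition(events, "user_id")
for partition, rows in sorted(by_user.items()):
    print(partition, rows)
```

Co-locating events by key is what later allows per-user joins and aggregations to run without shuffling data between partitions.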
DEMO!
[Demo architecture diagram. Components pictured: MySQL, Debezium, Kafka Connect, Producer API, Elasticsearch.]
Questions?
http://confluent.io/ksql
https://slackpass.io/confluentcommunity
