Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL

@rmoff robin@confluent.io
Robin Moffatt, Developer Advocate

$ whoami
• Developer Advocate @ Confluent
• Working in data & analytics since 2001
• Oracle ACE Director & Dev Champion
• Blogging: https://rmoff.net & http://cnfl.io/rmoff
• Twitter: @rmoff

Housekeeping Items
● This session will last about an hour.
● This session will be recorded.
● You can submit your questions by entering them into the
GoToWebinar panel.
● The last 10-15 minutes will consist of Q&A.
● The slides and recording will be available after the talk.

Streaming ETL
with Apache Kafka
and KSQL

Database offload Hadoop/Object Storage/Cloud DW for Analytics
HDFS / S3 /
BigQuery etc
RDBMS

Streaming ETL with Apache Kafka and KSQL
order items
customer
customer orders
Stream
Processing
RDBMS

Real-time Event Stream Enrichment with Apache Kafka and KSQL
order events
customer
Stream
Processing
customer orders
RDBMS
<y>

Transform Once, Use Many
order events
customer
Stream
Processing
customer orders
RDBMS
<y>
New App
<x>

Transform Once, Use Many
order events
customer
Stream
Processing
customer orders
RDBMS
<y>
HDFS / S3 / etc
New App
<x>

The Connect API of Apache Kafka®
✓ Fault tolerant and automatically load balanced
✓ Extensible API
✓ Single Message Transforms
✓ Part of Apache Kafka, included in 
Confluent Open Source
Reliable and scalable integration of Kafka
with other systems – no coding required.
{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
"table.whitelist": "sales,orders,customers"
}
https://docs.confluent.io/current/connect/
✓ Centralized management and configuration
✓ Support for hundreds of technologies including
RDBMS, Elasticsearch, HDFS, S3, syslog
✓ Supports CDC ingest of events from RDBMS
✓ Preserves data schema
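
As a rough illustration of the "no coding required" point: a configuration like the one on this slide is typically submitted to a Connect worker's REST API (`POST /connectors`) as a JSON document with `name` and `config` keys. The sketch below only builds that request body and does not contact a server. The connector name `jdbc-source-demo` and the `mode`, `incrementing.column.name` and `topic.prefix` values are assumptions added for illustration; they are not from the slide.

```python
import json

def build_connector_request(name, config):
    """Return the JSON body for creating a connector via POST /connectors."""
    # Connector config values are conventionally passed as strings.
    payload = {"name": name, "config": {k: str(v) for k, v in config.items()}}
    return json.dumps(payload, indent=2)

# First three settings are from the slide; the rest are illustrative assumptions.
jdbc_config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers",
    "mode": "incrementing",            # assumption: ingest by incrementing ID
    "incrementing.column.name": "id",  # assumption: hypothetical column name
    "topic.prefix": "mysql-",          # assumption: hypothetical topic prefix
}

body = build_connector_request("jdbc-source-demo", jdbc_config)
print(body)
```

The resulting document could then be sent to a running Connect worker with any HTTP client.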

Kafka Connect
Kafka Brokers
Kafka Connect
Tasks Workers
Sources Sinks
Amazon S3
syslog
flat file
CSV
JSON
MQTT
MQTT

Considerations for Integration into Apache Kafka
Photo by Matthew Smith on Unsplash
• Chucking data over the fence into a Kafka topic is
not enough
• We need standard ways of building data pipelines
in Kafka
• Schema handling
• Serialisation formats

Considerations for Integration into Apache Kafka
Photo by Matthew Smith on Unsplash
• Confluent Schema Registry & Avro is a great way to
do this
• Downstream users of the data can then easily use
the data
• KSQL
• Kafka Connect
• Kafka Streams
• Custom apps

The Confluent Schema Registry
[Diagram: MySQL → Kafka Connect → Avro message → Kafka → Avro message → Kafka Connect → Elasticsearch, with the Avro schema held in the Schema Registry]

The Confluent Schema Registry
Source (MySQL) schema
is preserved
Target (Elasticsearch) schema
mapping is automagically built
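
One way to picture how the schema travels with the data: messages written by Confluent's Avro serializers are framed with a magic byte (0) and a 4-byte big-endian schema ID ahead of the Avro-encoded payload, so a consumer can look the schema up in the Schema Registry. The sketch below only packs and unpacks that header. The schema ID 42 and the payload bytes are made-up placeholders, and no real Avro encoding or registry lookup is performed.

```python
import struct

MAGIC_BYTE = 0   # first byte of the Confluent wire format
HEADER_SIZE = 5  # 1 magic byte + 4-byte schema ID

def frame(schema_id, avro_payload):
    """Prefix an (already Avro-encoded) payload with magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message):
    """Split a framed message into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError("message is not in the expected wire format")
    return schema_id, message[HEADER_SIZE:]

# Placeholder values: 42 is a hypothetical schema ID, the bytes are not real Avro.
framed = frame(42, b"placeholder-avro-bytes")
print(unframe(framed))
```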

Integrating Databases with Kafka
• CDC (Change Data Capture) is a generic term
for capturing changing data, typically
from an RDBMS.
• Two general approaches:
• Query-based CDC
• Log-based CDC
There are other options including hacks with
Triggers, Flashback etc but these are system and/or
technology-specific.
Read more: http://cnfl.io/kafka-cdc

Query-based CDC
• Use a database query to try and identify new & changed rows
SELECT * FROM my_table
WHERE col > <value of col last time we polled>
• Implemented with the open source Kafka Connect JDBC connector
• Can import based on table names, schema, or bespoke SQL query
• Incremental ingest driven through incrementing ID column and/or
timestamp column
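
The polling query on this slide can be sketched with an in-memory SQLite database. This is an illustration of the idea only, not how the JDBC connector is implemented; the `id` and `name` columns and the `poll_new_rows` helper are hypothetical. Note that the deleted row never shows up in a later poll.

```python
import sqlite3

def poll_new_rows(conn, last_seen_id):
    """Return rows with an id above the last one seen, plus the new offset."""
    rows = conn.execute(
        "SELECT id, name FROM my_table WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_offset = rows[-1][0] if rows else last_seen_id
    return rows, new_offset

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO my_table (id, name) VALUES (?, ?)",
                 [(1, "a"), (2, "b")])

# First poll picks up everything inserted so far.
first_poll, offset = poll_new_rows(conn, 0)

# A new row arrives and an old row is deleted between polls.
conn.execute("INSERT INTO my_table (id, name) VALUES (3, 'c')")
conn.execute("DELETE FROM my_table WHERE id = 1")

# Second poll sees only the insert; the delete is invisible to the query.
second_poll, offset = poll_new_rows(conn, offset)
print(first_poll)
print(second_poll)
```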

Log-based CDC
• Use the database's
transaction log to identify
every single change event
• Various CDC tools available
that integrate with Apache
Kafka (more of this later…)

Query-based vs Log-based CDC
Photo by Matese Fields on Unsplash
• Query-based
+ Usually easier to set up, and
requires fewer permissions
- Needs specific columns in
source schema
- Impact of polling the DB (or
higher latencies tradeoff)
- Can't track deletes

Query-based vs Log-based CDC
Photo by Sebastian Pociecha on Unsplash
• Log-based
+ Greater data fidelity
+ Lower latency
+ Lower impact on source
- More setup steps
- Higher system privileges required
- For proprietary databases, usually $$$
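
To illustrate the "greater data fidelity" point, the sketch below replays a list of change events, including an update and a delete, against an in-memory copy of a table. The event shape (`op`, `before`, `after`, with `c`/`u`/`d`/`r` operation codes) is loosely modelled on the change-event envelope Debezium emits; the rows and the `apply_change` helper are hypothetical.

```python
def apply_change(table, event):
    """Apply one change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):   # create, update, snapshot read
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":             # delete: only the 'before' image is present
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return table

# Hypothetical events, in the order they were written to the transaction log.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "a"}},
    {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "A"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "b"}},
    {"op": "d", "before": {"id": 2, "name": "b"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
print(replica)
```

Every intermediate state, including the deleted row, is visible to the consumer, which a polling query cannot offer.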

Which Log-Based CDC Tool?
For query-based CDC, use the Confluent Kafka Connect JDBC connector
• Open Source RDBMS,  
e.g. MySQL, PostgreSQL
• Debezium
• (+ paid options)
• Mainframe 
e.g. VSAM, IMS
• Attunity
• SQData
• Proprietary RDBMS,  
e.g. Oracle, MS SQL
• Attunity
• IBM InfoSphere Data Replication
• Oracle GoldenGate
• SQData
• HVR
All these options integrate with Apache Kafka and Confluent
Platform, including support for the Schema Registry

“But I need to
join…aggregate…filter…”

KSQL is a
Declarative
Stream
Processing
Language

KSQL is the
Streaming
SQL Engine
for
Apache Kafka

KSQL in Development and Production
Interactive KSQL 
for development and testing
Headless KSQL 
for Production
Desired KSQL queries
have been identified
REST
“Hmm, let me try 
out this idea...”

KSQL for Streaming ETL
CREATE STREAM vip_actions AS  
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u
ON c.userid = u.user_id  
WHERE u.level = 'Platinum';
Joining, filtering, and aggregating streams of event data
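
For readers less familiar with stream-table joins, the logic of the statement above can be approximated in plain Python: each click event is looked up against a table of users keyed by user ID, and only Platinum users are kept. This is a conceptual sketch, not how KSQL executes the query; the sample events and users are made up.

```python
def vip_actions(clickstream, users):
    """LEFT JOIN each click to the users table, then keep Platinum users."""
    for click in clickstream:
        user = users.get(click["userid"])        # None when there is no match
        level = user["level"] if user else None  # LEFT JOIN yields NULL level
        if level == "Platinum":                  # WHERE u.level = 'Platinum'
            yield {
                "userid": click["userid"],
                "page": click["page"],
                "action": click["action"],
            }

# Hypothetical sample data.
users = {
    "u1": {"user_id": "u1", "level": "Platinum"},
    "u2": {"user_id": "u2", "level": "Gold"},
}
clickstream = [
    {"userid": "u1", "page": "/home", "action": "view"},
    {"userid": "u2", "page": "/cart", "action": "click"},
    {"userid": "u3", "page": "/home", "action": "view"},  # unknown user
]

result = list(vip_actions(clickstream, users))
print(result)
```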

KSQL for Anomaly Detection
CREATE TABLE possible_fraud AS 
SELECT card_number, count(*) 
FROM authorization_attempts  
WINDOW TUMBLING (SIZE 5 SECONDS) 
GROUP BY card_number 
HAVING count(*) > 3;
Identifying patterns or anomalies in real-time data,
surfaced in milliseconds
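
The tumbling-window aggregation above can be approximated in a few lines of Python: each event is assigned to a fixed, non-overlapping 5-second bucket, counts are kept per card and bucket, and only counts above 3 are returned. This is a batch-style sketch of the idea rather than a streaming implementation; timestamps are assumed to be whole seconds and the sample events are made up.

```python
from collections import Counter

def possible_fraud(attempts, window_size_s=5, threshold=3):
    """Count attempts per (card_number, window start); keep counts > threshold."""
    counts = Counter()
    for timestamp_s, card_number in attempts:
        # Tumbling windows: each timestamp falls in exactly one bucket.
        window_start = (timestamp_s // window_size_s) * window_size_s
        counts[(card_number, window_start)] += 1
    return {key: n for key, n in counts.items() if n > threshold}

# Hypothetical (timestamp in seconds, card number) authorization attempts.
attempts = [
    (0, "1111"), (1, "1111"), (2, "1111"), (4, "1111"),  # 4 in window [0, 5)
    (6, "1111"),                                          # 1 in window [5, 10)
    (3, "2222"), (7, "2222"),                             # 1 per window
]

flagged = possible_fraud(attempts)
print(flagged)
```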

KSQL for Real-Time Monitoring
• Log data monitoring, tracking and alerting
• syslog data
• Sensor / IoT data
CREATE STREAM SYSLOG_INVALID_USERS AS
SELECT HOST, MESSAGE
FROM SYSLOG
WHERE MESSAGE LIKE '%Invalid user%';
http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting

KSQL for Data Transformation
CREATE STREAM views_by_userid
WITH (PARTITIONS=6, REPLICAS=5,
VALUE_FORMAT='AVRO',
TIMESTAMP='view_time') AS  
SELECT * FROM clickstream
PARTITION BY user_id;
Make simple derivations of existing topics from the command line
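
What re-keying with `PARTITION BY user_id` buys can be sketched as follows: once the message key is the user ID, every event for a given user lands in the same one of the six partitions. The hash used here (CRC32) is a stand-in chosen for the illustration; Kafka's default partitioner uses a different hash (murmur2), so real partition numbers will differ. The sample events are made up.

```python
import zlib

def partition_for(key, num_partitions=6):
    """Map a key to a partition. Stand-in hash: not Kafka's murmur2."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def repartition(events, key_field, num_partitions=6):
    """Group events into partitions by hashing the chosen key field."""
    partitions = {p: [] for p in range(num_partitions)}
    for event in events:
        key = str(event[key_field])
        partitions[partition_for(key, num_partitions)].append(event)
    return partitions

# Hypothetical clickstream events.
clickstream = [
    {"user_id": "u1", "page": "/home", "view_time": 1},
    {"user_id": "u2", "page": "/cart", "view_time": 2},
    {"user_id": "u1", "page": "/cart", "view_time": 3},
]

views_by_userid = repartition(clickstream, "user_id")
for partition, events in views_by_userid.items():
    print(partition, events)
```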

DEMO!

[Demo architecture diagram: MySQL, Debezium (Kafka Connect), Producer API, Kafka Connect, Elasticsearch]

Questions?
http://confluent.io/ksql
https://slackpass.io/confluentcommunity
