Building a Real-time Streaming
ETL Framework Using ksqlDB
and NoSQL
Hojjat Jafarpour, Software Engineer at Confluent
Maheedhar Gunturu, Solutions Architect at ScyllaDB
Presenters
Hojjat Jafarpour, Software Engineer at Confluent
Hojjat is a software engineer and the creator of KSQL, the Streaming SQL engine for Apache
Kafka, at Confluent. Before joining Confluent he worked at NEC Labs, Informatica, Quantcast
and Tidemark on various big data management projects. He has a Ph.D. in computer science
from UC Irvine, where he worked on scalable stream processing and publish/subscribe
systems.
Maheedhar Gunturu, Solutions Architect at ScyllaDB
Maheedhar has held senior roles in both engineering and sales organizations. He has over a decade
of experience designing & developing server-side applications in the cloud and working on big
data and ETL frameworks in companies such as Samsung, MapR, Apple, VoltDB, Zscaler and
Qualcomm.
Agenda
+ Overview of ScyllaDB
+ Apache Kafka and The Confluent Platform
+ Example Use Cases
+ Q&A
About ScyllaDB
+ The Real-Time Big Data Database
+ Drop-in replacement for Apache Cassandra
and Amazon DynamoDB
+ 10X the performance & low tail latency
+ Open Source, Enterprise and Cloud options
+ Founded by the creators of KVM hypervisor
+ HQs: Palo Alto, CA, USA; Herzelia, Israel;
Warsaw, Poland
Scylla Design Principles
+ C++ instead of Java
+ Shard per Core
+ All Things Async
+ Unified Cache
+ I/O Scheduler
+ Self-Optimizing
+ Seastar Framework
Compatibility
+ CQL native protocol
+ JMX management protocol
+ Management command line
+ SSTable file format
+ Configuration file format
+ CQL language
/REST
Change Data Capture (CDC) from Scylla
+ Helps with
  + Database mirroring/replication/state propagation
  + Direct data into a Kafka stream
+ Configurable subscription options to the change log (per table)
  + Post-image (Changed state)
  + Delta (changes per column)
  + Pre-image (Previous state)
+ Scylla CDC-Kafka source connector coming out soon!
Ref: https://www.scylladb.com/tech-talk/change-data-capture-in-scylla/
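The three change-log views can be illustrated with a minimal Python sketch over plain dictionaries. This is not Scylla's CDC log format: the function name `cdc_views` and the row layout are assumptions made only for illustration.

```python
def cdc_views(previous_row, changed_columns):
    """Return the pre-image, delta, and post-image for one write.

    previous_row: column -> value before the write
    changed_columns: column -> new value (only the columns that were written)
    """
    pre_image = dict(previous_row)                      # previous state
    delta = dict(changed_columns)                       # changes per column
    post_image = {**previous_row, **changed_columns}    # changed state
    return pre_image, delta, post_image


# Hypothetical row: a device switches mode, other columns stay untouched.
pre, delta, post = cdc_views(
    {"device_id": 7, "mode": "home", "battery": 80},
    {"mode": "away"},
)
```

A consumer that only needs to replicate writes can work from the delta alone, while one that must compare old and new values needs the pre-image and post-image as well.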
Apache Kafka and Confluent Platform
Pre-Streaming
New World: Streaming First
Apache Kafka
Kafka Cluster
A Distributed Commit Log. Publish and subscribe to
streams of records. Highly scalable, high throughput.
Supports transactions. Persisted data.
Reads are a single seek & scan
Writes are append only
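As a rough illustration of "append only" writes and "seek & scan" reads, here is a toy in-memory log in Python. It is a sketch of the idea, not of Kafka's storage engine; the `CommitLog` class and its method names are hypothetical.

```python
class CommitLog:
    """Toy append-only log: records are only ever added at the end."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records):
        """Seek to an offset, then scan forward sequentially."""
        return self._records[offset:offset + max_records]


log = CommitLog()
offsets = [log.append(r) for r in ["a", "b", "c", "d"]]
batch = log.read(1, 2)
```

Because existing records are never rewritten, each consumer only has to remember one number, the next offset to read.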
Apache Kafka
Kafka Connect API
Reliable and scalable integration of Kafka with other systems –
no coding required.
Apache Kafka
Kafka Streams API
Write standard Java applications & microservices to
process your data in real-time
Orders
Table
Customers
Kafka Streams API
Stream Processing by Analogy
Kafka Cluster
Connect API Stream Processing Connect API
$ cat < in.txt | grep "ksql" | tr a-z A-Z > out.txt
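The same analogy can be written as a chain of Python generators, where each stage consumes records lazily from the previous one. The function names and sample records are invented for this sketch.

```python
def source(lines):
    """Plays the role of `cat` (or a source connector): emit records."""
    for line in lines:
        yield line


def keep_matching(lines, needle):
    """Plays the role of `grep`: filter records."""
    return (line for line in lines if needle in line)


def to_upper(lines):
    """Plays the role of `tr a-z A-Z`: transform each record."""
    return (line.upper() for line in lines)


records = ["ksql is sql for kafka", "unrelated line", "try ksqldb"]
out = list(to_upper(keep_matching(source(records), "ksql")))
```

Unlike the shell pipeline, which ends when `in.txt` is exhausted, a stream processor keeps running because its input is unbounded.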
Stream Processing in Kafka
Trade-off, from Flexibility to Simplicity:
+ Consumer, Producer: subscribe(), poll(), send(), flush()
+ Kafka Streams: map(), filter(), aggregate(), join()
+ ksqlDB: SELECT … FROM … JOIN … GROUP BY …
ksqlDB
+ The event streaming database purpose-built for stream
processing applications
+ Enables stream processing with zero coding required
+ The simplest way to process streams of data in real
time
+ Powered by Kafka: scalable, distributed, battle-tested
+ All you need is Kafka: no complex deployments of
bespoke systems for stream processing
ksqlDB
+ Streaming ETL
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
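What the statement above computes can be sketched in plain Python, assuming `users` behaves like a table keyed by user id and `clickstream` is a sequence of events. The sample data is hypothetical.

```python
def vip_actions(clickstream, users):
    """Join each click with the users table and keep Platinum users only."""
    for click in clickstream:
        user = users.get(click["userid"])  # table lookup by key (the LEFT JOIN)
        # The WHERE clause also drops clicks without a matching user,
        # because a missing user has no level to compare against.
        if user is not None and user["level"] == "Platinum":
            yield {k: click[k] for k in ("userid", "page", "action")}


users = {"u1": {"level": "Platinum"}, "u2": {"level": "Gold"}}
clicks = [
    {"userid": "u1", "page": "/home", "action": "view"},
    {"userid": "u2", "page": "/cart", "action": "click"},
    {"userid": "u3", "page": "/home", "action": "view"},
]
result = list(vip_actions(clicks, users))
```

Since the filter is on a column from the right-hand side, the LEFT JOIN here ends up returning the same rows an inner join would.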
ksqlDB
+ Real-Time Monitoring
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
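A one-minute tumbling window assigns every event to exactly one fixed, non-overlapping bucket. The Python sketch below mimics the aggregation above on a small list; the event field names (`ts` in epoch milliseconds, `type`, `error_code`) are assumptions for illustration.

```python
from collections import Counter

WINDOW_MS = 60_000  # SIZE 1 MINUTE


def error_counts(events):
    """Count ERROR events per (window start, error_code)."""
    counts = Counter()
    for event in events:
        if event["type"] != "ERROR":          # WHERE type = 'ERROR'
            continue
        window_start = event["ts"] - event["ts"] % WINDOW_MS
        counts[(window_start, event["error_code"])] += 1  # GROUP BY error_code
    return dict(counts)


events = [
    {"ts": 1_000, "type": "ERROR", "error_code": 500},
    {"ts": 59_999, "type": "ERROR", "error_code": 500},
    {"ts": 60_000, "type": "ERROR", "error_code": 500},
    {"ts": 61_000, "type": "INFO", "error_code": 0},
]
counts = error_counts(events)
```

The event at 60,000 ms falls into the second window because each window covers its start time but not its end time.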
ksqlDB
+ Features
+ Aggregation
+ Window
+ Tumbling
+ Hopping
+ Session
+ Join
+ Stream-Stream
+ Stream-Table
+ Table-Table
+ Nested data
+ STRUCT
+ UDF/UDAF/UDTF
+ AVRO, JSON, CSV
+ Protobuf to come soon
+ And many more...
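The difference between tumbling and hopping windows is easiest to see by asking which windows a single timestamp belongs to. The helper below is a sketch, assuming windows start at multiples of the advance interval and cover `[start, start + size)`; the function name is hypothetical.

```python
def hopping_windows(ts, size, advance):
    """Return the start times of every window that contains ts."""
    starts = []
    start = ts - ts % advance            # latest window starting at or before ts
    while start >= 0 and start + size > ts:
        starts.append(start)
        start -= advance                 # step back to the previous window
    return sorted(starts)


overlapping = hopping_windows(70, size=60, advance=30)   # hopping
single = hopping_windows(70, size=60, advance=60)        # tumbling
```

When the advance equals the size, the windows no longer overlap and the hopping window behaves as a tumbling window. Session windows differ from both: their boundaries come from gaps of inactivity in the data rather than from the clock.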
Example Use Cases
Using Syslog to Detect SSH Attacks
[Diagram labels: Syslog, Syslog Data, KSQL, Syslog Data, Sink connector]
ksql> CREATE SINK CONNECTOR SINK_SCYLLA_SYSLOG WITH (
'connector.class' = 'io.connect.scylladb.ScyllaDbSinkConnector',
'connection.url' = 'localhost:9092',
'type.name' = '',
'behavior.on.malformed.documents' = 'warn',
'errors.tolerance' = 'all',
'errors.log.enable' = 'true',
'errors.log.include.messages' = 'true',
'topics' = 'syslog',
'key.ignore' = 'true',
'schema.ignore' = 'true',
'key.converter' = 'org.apache.kafka.connect.storage.StringConverter'
);
ksql> CREATE STREAM SYSLOG WITH (KAFKA_TOPIC='syslog', VALUE_FORMAT='AVRO');
ksql> SELECT TIMESTAMPTOSTRING(S.DATE, 'yyyy-MM-dd HH:mm:ss') AS SYSLOG_TS, S.HOST,
F.DESCRIPTION AS FACILITY, S.MESSAGE, S.REMOTEADDRESS FROM SYSLOG S
LEFT OUTER JOIN FACILITY F ON S.FACILITY=F.ROWKEY WHERE S.HOST='demo' EMIT CHANGES;
ksql> CREATE STREAM SYSLOG_INVALID_USERS AS SELECT * FROM SYSLOG WHERE MESSAGE LIKE
'Invalid user%';
ksql> CREATE STREAM SSH_ATTACKS AS SELECT TIMESTAMPTOSTRING(DATE, 'yyyy-MM-dd HH:mm:ss')
AS SYSLOG_TS, HOST, SPLIT(REPLACE(MESSAGE,'Invalid user ',''),' from ')[0] AS ATTACK_USER,
SPLIT(REPLACE(MESSAGE,'Invalid user ',''),' from ')[1] AS ATTACK_IP FROM
SYSLOG_INVALID_USERS EMIT CHANGES;
ksql> CREATE TABLE SSH_ATTACKS_BY_USER AS SELECT ATTACK_USER, COUNT(*) AS ATTEMPTS FROM
SSH_ATTACKS GROUP BY ATTACK_USER;
ksql> SELECT ATTACK_USER, ATTEMPTS FROM SSH_ATTACKS_BY_USER EMIT CHANGES; -- push query
ksql> SELECT ATTACK_USER, ATTEMPTS FROM SSH_ATTACKS_BY_USER WHERE ROWKEY='oracle'; -- pull query
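The filtering, string parsing, and counting done by the queries above can be traced in a few lines of Python. The sample messages are made up and simplified; real sshd log lines may carry extra text after the address, which this sketch does not handle.

```python
from collections import Counter


def parse_attack(message):
    """Mirror SPLIT(REPLACE(MESSAGE, 'Invalid user ', ''), ' from ')."""
    parts = message.replace("Invalid user ", "").split(" from ")
    return parts[0], parts[1]            # (ATTACK_USER, ATTACK_IP)


messages = [
    "Invalid user oracle from 10.0.0.5",
    "Invalid user admin from 10.0.0.9",
    "Accepted password for bob from 10.0.0.7",
    "Invalid user oracle from 10.0.0.6",
]

# WHERE MESSAGE LIKE 'Invalid user%'
invalid = [m for m in messages if m.startswith("Invalid user")]
attacks = [parse_attack(m) for m in invalid]

# GROUP BY ATTACK_USER with COUNT(*)
attempts_by_user = Counter(user for user, _ip in attacks)
```

The final `Counter` corresponds to the `SSH_ATTACKS_BY_USER` table: a push query would stream every update to it, while a pull query would read the current count for one key.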
Sink
connector
ksql> CREATE SOURCE CONNECTOR SOURCE_SYSLOG_UDP_01 WITH (
'tasks.max' = '1',
'connector.class' = 'io.confluent.connect.syslog.SyslogSourceConnector',
'topic' = 'syslog',
'syslog.port' = '42514',
'syslog.listener' = 'UDP',
'syslog.reverse.dns.remote.ip' = 'true',
'confluent.license' = '',
'confluent.topic.bootstrap.servers' = 'kafka:29092',
'confluent.topic.replication.factor' = '1'
);
ksql> CREATE SINK CONNECTOR SINK_ELASTIC_SYSLOG WITH (
'connector.class' =
'io.confluent.connect.elasticsearch.ElasticsearchSinkConnector',
'connection.url' = 'http://elasticsearch:9200',
'type.name' = '',
'behavior.on.malformed.documents' = 'warn',
'errors.tolerance' = 'all',
'errors.log.enable' = 'true',
'errors.log.include.messages' = 'true',
'topics' = 'SYSLOG_INVALID_USERS',
'key.ignore' = 'true',
'schema.ignore' = 'true',
'key.converter' = 'org.apache.kafka.connect.storage.StringConverter'
);
IOT - Smart Home
[Diagram labels: Hub Mode Service, CTA, Device State, Device health, Streams of real time events, CDC, Hub Info, Lookup info, Device data mgmt, Apps and services, MQTT Proxy]
ksql> CREATE STREAM device_stream_mode WITH
(KAFKA_TOPIC='syslog',VALUE_FORMAT ='AVRO');
ksql> CREATE STREAM device_change_mode AS SELECT D.dev_id,
D.dev_type, H.hub_mode AS device_mode FROM hub_mode H LEFT OUTER
JOIN device_data D ON H.hub_id=D.hub_id EMIT CHANGES;
ksql> CREATE STREAM device_stream_mode AS SELECT DS.dev_id,
DS.dev_type, DS.device_mode, F.state AS dev_state FROM device_change_mode
DS LEFT OUTER JOIN FACILITY F ON DS.dev_type=F.dev_type WHERE
DS.device_mode=<DEVICE_MODE> EMIT CHANGES;
### CONFIGURE THE MQTT SINK
ksql> INSERT INTO hub_mode SELECT * FROM /mqttTopicA/+/sensors
[WITHCONVERTER=`myclass.AvroConverter`]
ksql> CREATE STREAM hub_mode WITH (KAFKA_TOPIC='hub_mode', VALUE_FORMAT='AVRO');
### Create the necessary sink and CDC Source connector to SCYLLA
Source and Sink
connector
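The MQTT topic `/mqttTopicA/+/sensors` in the statement above uses `+`, the MQTT single-level wildcard. A simplified matcher in Python shows which concrete topics such a filter selects; the function name and example topics are hypothetical, and special cases such as `$`-prefixed system topics are ignored.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT topic matches a filter with + and # wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":                  # multi-level wildcard: matches the rest
            return True
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:  # + matches one level
            return False
    return len(filter_levels) == len(topic_levels)


one_hub = topic_matches("/mqttTopicA/+/sensors", "/mqttTopicA/hub42/sensors")
too_deep = topic_matches("/mqttTopicA/+/sensors", "/mqttTopicA/hub42/room1/sensors")
```

With this filter, every hub publishes to its own topic level and a single subscription still collects all of their sensor readings.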
Customer Satisfaction - CES Score
[Diagram labels: CDC, Segmentation, Churn, Customer Loyalty, Support, Customer Service, Number of attempts per issue, Violate SLA, Customer, Customer interaction, Customer Log]
ksql> CREATE STREAM cust_interactions (incident_Id VARCHAR,
timestamp) WITH (VALUE_FORMAT='JSON', PARTITIONS=1,
KAFKA_TOPIC='cust_interaction');
ksql> CREATE TABLE cust_log_aggregate AS SELECT ROWKEY AS
customer_id, COUNT(*) AS touch_points FROM cust_interactions GROUP
BY customer_id;
ksql> CREATE TABLE cust_log_by_issue AS SELECT ROWKEY AS
incident_Id, customer_id, COUNT_DISTINCT(touch_points) AS
UNIQUE_TOUCH_POINTS FROM cust_interactions GROUP BY ROWKEY EMIT
CHANGES;
ksql> SELECT C.incident_id, (C.incident_id_first_touch_point_TS -
CL.incident_id_last_touchpoint_TS)/1000/60/60 AS current_SLA_hours
FROM customer_log C INNER JOIN call_log CL ON C.incident_id =
CL.incident_id WHERE current_SLA_hours > 24;
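The arithmetic behind `current_SLA_hours` divides a millisecond difference by 1000, 60, and 60 to get hours. The Python sketch below works through it on invented data; it assumes epoch-millisecond timestamps and computes last touch minus first touch so that elapsed time is positive, which is an interpretation rather than something the slide states.

```python
MS_PER_HOUR = 1000 * 60 * 60


def sla_violations(first_touch_ms, last_touch_ms, limit_hours=24):
    """Return {incident_id: elapsed hours} for incidents over the limit."""
    violations = {}
    for incident_id, first in first_touch_ms.items():
        last = last_touch_ms.get(incident_id)
        if last is None:        # INNER JOIN: skip incidents missing on one side
            continue
        hours = (last - first) / MS_PER_HOUR
        if hours > limit_hours:
            violations[incident_id] = hours
    return violations


# Hypothetical incidents: INC-3 has no entry in the second input.
first = {"INC-1": 0, "INC-2": 0, "INC-3": 0}
last = {"INC-1": 30 * MS_PER_HOUR, "INC-2": 2 * MS_PER_HOUR}
result = sla_violations(first, last)
```

Only the incident that has been open longer than 24 hours is reported.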
Customer 360
Security - Endpoint Security
[Diagram labels: Syslog, DNS, netflow, Firewall, Streams of real time events]
#JOIN the various streams of DATA using the SOURCE
connector from CASSANDRA.
#BUILD AND DEPLOY THE CUSTOM UDF
ksql> CREATE STREAM entity_risk_score AS SELECT
source_IP, mac_ID, derived_risk_score(priority_errors
, DNS_burstiness, reputation,
firewall_intrusion_attempts) AS risk_score FROM
endpoint_profile WHERE
derived_risk_score(priority_errors , DNS_burstiness,
reputation, firewall_intrusion_attempts) >
<THRESHOLD>;
Ref: https://www.confluent.io/blog/build-udf-udaf-ksql-5-0/
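The deck does not show what `derived_risk_score` computes, so the Python sketch below is purely hypothetical: it assumes an integer weighted sum, with weights invented for illustration, to show how a scoring function and a threshold filter fit together. A real ksqlDB UDF would be written in Java as described in the referenced post.

```python
# Hypothetical weights: the source does not define the scoring formula.
WEIGHTS = {
    "priority_errors": 4,
    "dns_burstiness": 2,
    "reputation": 1,
    "firewall_intrusion_attempts": 3,
}


def derived_risk_score(priority_errors, dns_burstiness, reputation,
                       firewall_intrusion_attempts):
    """Combine the four signals into one score (assumed weighted sum)."""
    return (WEIGHTS["priority_errors"] * priority_errors
            + WEIGHTS["dns_burstiness"] * dns_burstiness
            + WEIGHTS["reputation"] * reputation
            + WEIGHTS["firewall_intrusion_attempts"] * firewall_intrusion_attempts)


def risky_endpoints(profiles, threshold):
    """Keep only endpoints whose score exceeds the threshold."""
    for p in profiles:
        score = derived_risk_score(p["priority_errors"], p["dns_burstiness"],
                                   p["reputation"], p["firewall_intrusion_attempts"])
        if score > threshold:
            yield {"source_ip": p["source_ip"], "mac_id": p["mac_id"],
                   "risk_score": score}


profiles = [
    {"source_ip": "10.0.0.5", "mac_id": "aa:bb", "priority_errors": 3,
     "dns_burstiness": 1, "reputation": 0, "firewall_intrusion_attempts": 2},
    {"source_ip": "10.0.0.6", "mac_id": "cc:dd", "priority_errors": 0,
     "dns_burstiness": 1, "reputation": 1, "firewall_intrusion_attempts": 0},
]
flagged = list(risky_endpoints(profiles, threshold=10))
```

The sketch computes the score once per record and reuses it, whereas the query above calls the function in both the SELECT list and the WHERE clause.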
Takeaways
+ ScyllaDB now supports Change Data Capture (CDC)
+ ksqlDB provides a SQL interface for streaming applications
+ ksqlDB is easily extensible with custom UDFs
+ Scylla has a new sink connector (a CDC source connector is coming soon!)
Resources
+ Scylla Sink Connector
+ Source Connector (Cassandra)
+ Scylla CDC Presentation
+ Debezium
+ Scylla’s 7 Design Principles
+ Scylla Benchmarks
+ Useful Links:
+ Stream Processing Book Bundle
+ Kafka tutorials
+ ksqlDB
+ Confluent - Scylla Partnership Overview
+ Kafka Summits 2020
+ Kafka Summit London: April 27 - 28
+ Kafka Summit Austin: August 24 - 25
Confluent Resources
Q&A
maheedhar@scylladb.com
@vanguard_space
hojjat@confluent.io
@Hojjat
Stay in touch
United States
545 Faber Place
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank you
