1
The State of Stream Processing
Neil Avery, @avery_neil,
March 2018
2
The state of stream processing
The rise of streaming...
business has changed:
customer 360, digital transformation,
IoT, machine learning, data science
3
Before stream processing
Before:
OLTP - RPC/Request-response - Messaging - ESB
Almost:
- OLTP: TRIGGERs + materialized views, stored procedures…
….CEP….Big data…
Acknowledge:
Dr Michael Stonebraker - CEP, VoltDB, Streaming-SQL etc 2005+
… it didn’t happen until now ...
… WHY NOT? ...
4
5
Because Stream Processing is hard
● Scaling
● Semantics
● Correctness
● Accuracy (determinism)
(eos)
● Transactions
● Partitioning
● Fault tolerance
● Re-processing
● Unifying tables and streams
(stream-relational)
● Time
6
Why stream processing is different
We are solving different problems today.
Businesses want to be real-time: correlate events, react and adapt
Turning off batch and moving to real-time, scaling beyond the limits of traditional
storage systems, applying data science or machine learning (fashion, fraud or retailer,
suggestions, advertising)
processing the ‘now’ has great value
7
Who is using it
Companies: Uber, Netflix, Audi, Banks, Slack… everyone
...obvious use-cases…
dumb pipes,
Reactive microservices,
ML, streaming ETL etc
….there are ALSO, ‘data-type’ problems we can solve...
8
Stream processing == data processing for events
● Correlation (user registered and signed up, bad password)
● Scaling out the data flow of a company (connecting data-sources)
● Moving to the Central nervous system (Data flow system)
● Transaction processing: The Database unbundled, AKA, Turning the
database inside out…
…. why is this so important? [2014]
9
Turning the database inside out….
“The database is a cache of a subset of the log”
● Pat Helland, 2007
With Kafka we can do this at scale…
… using the LOG and partitions ..
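A rough illustration of the log-and-partitions idea, under stated assumptions: `PartitionedLog` and `rebuild_cache` are hypothetical in-memory stand-ins, not Kafka's API. Records with the same key always land in the same append-only partition, so per-key order is preserved, and the "database" is just a cache rebuilt by replaying the log.

```python
import zlib


class PartitionedLog:
    """Hypothetical in-memory stand-in for a partitioned, append-only log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash, so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)


def rebuild_cache(log):
    """Replay every partition from offset 0; the result is the 'database'."""
    cache = {}
    for partition in log.partitions:
        for key, value in partition:
            cache[key] = value  # later records for a key overwrite earlier ones
    return cache


log = PartitionedLog(num_partitions=4)
log.append("joe", {"balance": 100})
log.append("mary", {"balance": 50})
log.append("joe", {"balance": 70})

cache = rebuild_cache(log)
```

Because each key lives in exactly one partition, partitions can be replayed independently (and in parallel) without changing the result.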
10
The 12 stage Adoption JOURNEY
● Dumb pipes
● Simple streaming (filtering)
● Streaming ETL
● Analytics
● Monitoring & notifications
● Anomaly detection
● Correlation
● Reactive microservices
● Transaction processing
● Database inside out
● Central nervous system (platform)
● Global, central nervous system
(platform)
11
Turning the database inside out
● Event sourcing
● Source of truth
● Data flow systems
● Central nervous system
● Txns at ‘web-scale’!
12
At scale….
Single Deployments with
● Hundreds of brokers
● Hundreds of thousands of topics
● Millions of partitions
Hundreds of millions of Events per second….
AWS Death Star diagram, circa 2008 as per Werner Vogels tweet
The Streaming platform
14
the streaming platform stack
A new breed of data platform - built on Unix principles (do lots of simple
things)
- Messaging
- Storage (topics + partitions + log based)
- Correctness (EoS, txns)
- Processing (KSQL, Kafka Streams)
- Governance
- Operations
- Elastic scale up/down
Kind of like a modern database but...it can do these clever things, fast and
at scale - unbundled using stream processing
15
the streaming platform in action
KSQL
Stream processing here...
16
State of the union
Stream-relational players
● Kafka Streams
● Flink
● Spark
Stream-only players
● Apex
● Samza
● Storm
● Google Dataflow
A stream-relational processing platform has the following capabilities:
● Relations (or tables) are first-class citizens, i.e. each has an independent identity.
● Relations can be transformed into other relations.
● Relations can be queried in an ad-hoc manner.
Stream processing basics
18
A stream….
..Is a..
- Sequence of facts, records or events
- Captures the order of events
- Time sensitive
But it needs to be stored as a stream => in a LOG
{ Joe pays Mary $1m }
19
A Table….
..Is a..
- ‘Materialized’ view of a stream
- uses Change Data Capture (CDC)
[Figure labels: insert, delete, update, stream]
20
Processing….
[Figure label: Your code here]
21
The next level down
[Figure labels: Event sourcing, Table, Stream, Stream-table join, Your code]
The stream table duality
22
Do you think that’s a
table you are querying?
23
24
Streams and Tables
● STREAM and TABLE as first-class
citizens
● Interpretations of Topic content
● STREAM - data in motion
● TABLE - collected state of a stream
(aggregations)
○ One record per key (per window)
○ Current values (compacted topic)
○ Changelog
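The stream/table relationship above can be sketched in a few lines. This is a minimal illustration, assuming a stream is a list of (key, value) pairs and that a `None` value marks a delete; the function names are hypothetical and not part of KSQL or Kafka Streams.

```python
def stream_to_table(stream):
    """Fold a stream of (key, value) records into a table: one record per key."""
    table = {}
    for key, value in stream:
        if value is None:
            table.pop(key, None)  # assumed convention: None deletes the key
        else:
            table[key] = value    # newer values replace older ones
    return table


def table_to_changelog(table):
    """Emit the table's current rows as a stream of upserts."""
    return [(key, value) for key, value in table.items()]


stream = [("joe", 100), ("mary", 50), ("joe", 70), ("bob", 5), ("bob", None)]
table = stream_to_table(stream)
changelog = table_to_changelog(table)
```

Replaying the changelog yields the same table again, which is the sense in which a compacted topic holds the "current values".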
25
Windows: Aggregations & Tables
Source: https://www.rittmanmead.com/blog/2017/10/ksql-streaming-sql-for-apache-kafka/
26
Code and topology
27
Many Stream patterns….
28
Stream Processing (the hard parts)
● Scaling
● Semantics
● Correctness
● Accuracy (determinism)
(eos)
● Transactions
● Partitioning
● Fault tolerance
● Re-processing
● Unifying tables and streams
(stream-relational)
● Time
29
Quick recap...
30
Or some SQL
CREATE STREAM notifications AS
SELECT code, definition
FROM ticks
LEFT JOIN opens ON opens.id = ticks.code;
CREATE TABLE metrics AS
SELECT * FROM notifications
WINDOW TUMBLING (SIZE 30 SECONDS)
WHERE value = 'Closed'
GROUP BY id, value;
Making it real….
32
Applied Stream processing
1. Data processing: enriching streams of data, filtering, cleaning, joining,
splitting: ETL, Machine learning
2. Analytics: asking questions of data
3. Apps: Build reactive (stateful) microservice apps with materialized views
4. Transaction processing: Stream proc + log-store (Kafka), handle txns
across multiple partitions
(replacing 40 years of slow ACID with Async, fast, correct behaviour)
80% of stream processing is for doing simple stuff, like ETL
33
Stream processing using SQL
80% of stream processing is for doing ‘simple stuff’, like ETL
SQL == winning!
=> Simple, repeatable, easy to deploy & run….
=> Accessible: anyone can do stream processing!
https://github.com/confluentinc/ksql
34
KSQL: Streaming ETL
● Streaming ETL
○ Kafka is popular for data pipelines.
○ KSQL enables easy transformations of data within the pipe
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
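To make the stream-table join in the statement above concrete, here is a small in-memory sketch. The sample rows and the `vip_actions` generator are assumptions made for illustration; only the column names come from the slide.

```python
# Table keyed by user_id (hypothetical sample rows).
users = {
    "u1": {"level": "Platinum"},
    "u2": {"level": "Gold"},
}

# Stream of click events (hypothetical sample rows).
clickstream = [
    {"userid": "u1", "page": "/home", "action": "view"},
    {"userid": "u2", "page": "/cart", "action": "add"},
    {"userid": "u3", "page": "/home", "action": "view"},  # no matching user
    {"userid": "u1", "page": "/pay", "action": "buy"},
]


def vip_actions(clicks, users_table):
    """Left-join each click with the users table, then keep Platinum users."""
    for click in clicks:
        user = users_table.get(click["userid"])  # LEFT JOIN: may be None
        level = user["level"] if user else None
        if level == "Platinum":                  # WHERE u.level = 'Platinum'
            yield {k: click[k] for k in ("userid", "page", "action")}


result = list(vip_actions(clickstream, users))
```

The table side is a lookup per event, which is why the join can run continuously as clicks arrive.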
35
KSQL: Monitoring
● Real-Time Monitoring
○ Infrastructure/data flows/apps monitoring, tracking, and alerting (e.g.
logs, metrics)
○ Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
36
KSQL: Anomaly detection
● Anomaly Detection
○ Identifying patterns or anomalies in real-time data, surfaced in
milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, COUNT(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING COUNT(*) > 3;
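The windowed count in the statement above can be approximated as follows. This is a sketch over a finite list rather than a continuous query, and it assumes integer event timestamps in seconds; `possible_fraud` and its parameters are hypothetical names.

```python
from collections import Counter


def possible_fraud(attempts, window_size_s=5, threshold=3):
    """Count attempts per (window, card) and keep counts above the threshold.

    `attempts` is an iterable of (timestamp_seconds, card_number) pairs.
    """
    counts = Counter()
    for ts, card in attempts:
        # Tumbling windows: each timestamp belongs to exactly one window.
        window_start = (ts // window_size_s) * window_size_s
        counts[(window_start, card)] += 1
    # Equivalent of HAVING COUNT(*) > 3.
    return {key: n for key, n in counts.items() if n > threshold}


attempts = [
    (0, "4111"), (1, "4111"), (2, "4111"), (4, "4111"),  # 4 in window [0, 5)
    (3, "5500"),                                          # 1 in window [0, 5)
    (5, "4111"), (6, "4111"),                             # 2 in window [5, 10)
]
flagged = possible_fraud(attempts)
```

Only the first window for card 4111 crosses the threshold; the same card in the next window starts counting from zero.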
37
Use case: Anomaly detection - fraud
Goal: Detect when ‘new’ transactions occur against suspicious recipients
Real world: once a fraudster has access to your account, they need to take the
money. This is done by making multiple transactions of small amounts against
common recipients. There is usually a group of them,
e.g. taking $100 per month for a broadband or mobile phone provider.
38
Use case: Anomaly detection - fraud
3x Data Sources:
1) realtime txns, 2) set of suspicious recipients, 3) set of valid recipients
Processors:
1. Check ‘previous use’ processor - stream-table join (by account-recipient)
2. Check ‘is suspicious’ processor - stream-table join (by recipient)
3. Check ‘multi-sus-24h’ processor - count-by (by recipient), window - 24h - count(recipient)
Incoming event:
{ type: txn, req_id: 1234, joe -> mary, joe_account: 1234, mary_account: 5678, amount: $1m }
Tables:
{ type: suspicious_account, account: 1234, recipient: verizon }
{ type: valid_account, account: 1234, recipient: verizon }
Output event:
{ type: detected_multi-sus-txns, account: 1234 }
For a complete solution (including data generators) see:
https://github.com/bluemonk3y/ksql-recipe-fraudulent-txns
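One possible reading of the three checks, sketched in memory. Several things here are assumptions: the slide gives no alert threshold (2 is used), the field names `ts`, `account` and `recipient` are invented for the example, and the linked repository may structure the solution differently.

```python
from collections import defaultdict

DAY_S = 24 * 60 * 60
THRESHOLD = 2  # assumed: the slide does not state a threshold


def detect(txns, valid_pairs, suspicious_recipients):
    """Return alerts for recipients with repeated new, suspicious txns.

    txns: iterable of dicts with 'ts' (seconds), 'account', 'recipient'.
    valid_pairs: set of (account, recipient) pairs that were used before.
    suspicious_recipients: set of recipient names.
    """
    counts = defaultdict(int)  # (24h window start, recipient) -> count
    alerts = []
    for txn in txns:
        pair = (txn["account"], txn["recipient"])
        if pair in valid_pairs:  # check 'previous use': not a new txn
            continue
        if txn["recipient"] not in suspicious_recipients:  # 'is suspicious'
            continue
        window = (txn["ts"] // DAY_S) * DAY_S  # check 'multi-sus-24h'
        key = (window, txn["recipient"])
        counts[key] += 1
        if counts[key] == THRESHOLD:  # alert once per window and recipient
            alerts.append({
                "type": "detected_multi-sus-txns",
                "recipient": txn["recipient"],
                "account": txn["account"],
            })
    return alerts


txns = [
    {"ts": 10, "account": "1234", "recipient": "verizon"},  # used before
    {"ts": 20, "account": "5678", "recipient": "acme"},     # new + suspicious
    {"ts": 30, "account": "9999", "recipient": "acme"},     # second hit
    {"ts": 40, "account": "5678", "recipient": "shop"},     # not suspicious
]
alerts = detect(txns, {("1234", "verizon")}, {"acme", "verizon"})
```

The two lookups correspond to the stream-table joins, and the counter corresponds to the 24h windowed count by recipient.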
39
Use case: Web-scale Transaction processing
> Joe pays mary $1,000,000
NB: requires a log-based streaming platform to support determinism
(Kafka!)
A traditional OLTP system would require an atomic commit across all tables
Log-based stream processing scales by using partitioning
(partitions are log-append, atomic and provide the source of truth)
40
Use case: Web-scale Transaction processing
1. Request #1234 Joe -> mary $1m
2. Payer account (joe)
3. Payee account (mary)
Requests processor (by request#)
{ req_id: 1234, joe -> mary, joe_account: 1234, mary_account: 5678, amount: $1m }
Debit processor (by payer id), de-dup req-id
{ req: 1234, type: debit, who: joe, amount: -$1m }
Credit processor (by payee id), de-dup req-id
{ req: 1234, type: credit, who: mary, amount: +$1m }
Payment balance processor (by account number)
{ req: 1234, joe -$1m } { req: 1234, mary +$1m }
Why is this fast? Why is it horizontally scalable?
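A compact sketch of the flow above, under simplifying assumptions: the separate debit and credit processors are folded into one `BalanceProcessor`, everything runs in a single process, and the field names `payer` and `payee` are invented for the example.

```python
from collections import defaultdict


def requests_processor(request):
    """Split one payment request into a debit event and a credit event."""
    req_id, amount = request["req_id"], request["amount"]
    debit = {"req": req_id, "type": "debit",
             "who": request["payer"], "amount": -amount}
    credit = {"req": req_id, "type": "credit",
              "who": request["payee"], "amount": amount}
    return debit, credit


class BalanceProcessor:
    """Applies debit/credit events per account, ignoring duplicate deliveries."""

    def __init__(self):
        self.balances = defaultdict(int)
        self.seen = set()  # (req, type) pairs that were already applied

    def apply(self, event):
        key = (event["req"], event["type"])
        if key in self.seen:  # de-dup on request id
            return False
        self.seen.add(key)
        self.balances[event["who"]] += event["amount"]
        return True


request = {"req_id": 1234, "payer": "joe", "payee": "mary", "amount": 1_000_000}
debit, credit = requests_processor(request)

processor = BalanceProcessor()
for event in (debit, credit, debit):  # the second debit is a replayed duplicate
    processor.apply(event)
```

Each step only needs the state for its own key (request id, payer, payee or account), so in a partitioned log the work for different keys could proceed on different partitions without a cross-table atomic commit.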
41
Use case: Central nervous system
Human resources
B2B txns
Webscale Payments
& ETL
Regulations, compliance and audit
Data science, ETL
42
What (interesting) problems do we still need
to solve?
- SQL (for everyone)
- Improved auto-scaling of analytics (cloud)
- Improved workload optimisation (Calcite)
- Large scale real-time analytics (multi-stage - HyperLogLog-reduce)
- Improved streaming-relational support
- Improved data modelling & constraint handling
- Data lineage
43
The future
- Continued adoption, acceleration of “dataflow systems & central nervous system”.
- In 2022, 60% of businesses will be using event-driven architectures
- We grow up the stack, we grow out the stack
We have a long road ahead of us….
44
45
Questions?
Recommended reading:
Confluent Blog: https://www.confluent.io/blog/
Kafka Summit San Francisco: https://kafka-summit.org/
Free ebook: Making Sense of Stream Processing, Martin Kleppmann
Safari books: “The Log.”
“Designing Data Intensive Applications”,
Google dataflow paper (very old now!)
& lots more….
Neil Avery, @avery_neil
Technologist, Office of the CTO
46
Questions?
Demo applications: https://github.com/confluentinc/examples
• Interactive Queries, Joins, Security, Windowing, Avro
integration, …
Confluent documentation: http://docs.confluent.io/current/streams/
• Quickstart, Concepts, Developer Guide, FAQ
Recorded talks
• Introduction to Kafka’s Streams API:
http://www.youtube.com/watch?v=o7zSLNiTZbA
• Application Development and Data in the Emerging
World of Stream Processing (higher level talk):
https://www.youtube.com/watch?v=JQnNHO5506w
47
Thank You!
48
Bio for Neil
Neil is a Technologist in the Office of the CTO at Confluent, the company founded by the creators of
Apache Kafka. He has over 20 years of experience working on distributed computing, messaging and
stream processing. He has built or redesigned commercial messaging platforms, distributed caching
products as well as developed large scale bespoke systems for tier-1 banks. After a period at
ThoughtWorks, he went on to build some of the first distributed risk engines in financial services. In
2008 he launched a startup that specialised in distributed data analytics and visualization. Prior to
joining Confluent he was the CTO at a fintech consultancy.
49
Synopsis - The State of Stream Processing
This session walks through how stream processing arrived at its current state and
looks towards the future. It explores the use-cases and concepts that drive the
evolution of stream processing and the need for fast data, at scale. This includes
how Apache Kafka has grown to become the go-to streaming platform, the Confluent
Platform, where it supports the concepts that underpin many processing
frameworks. Learn about the current challenges facing stream processing
frameworks, including real-time semantics surrounding windowing, joining, correctness,
elasticity, and accessibility. Finally, learn how KSQL is evolving to shape the future.
50
The State of Stream Processing
Stream processing is now at the forefront of many company strategies. Over the last
couple of years we have seen streaming use cases explode, and they now proliferate
across the landscape of any modern business.
Use-cases including digital transformation, IoT, real-time risk, payments microservices
and machine learning are all built on the same fundamental: they need fast data and they
need it at scale.
Apache Kafka has long been the streaming platform of choice. Its origins as dumb
pipes for big data have long since been left behind, and it is now the go-to streaming
platform.
Stream processing is the vehicle for driving those streams, and it brings with it
a world of real-time semantics surrounding windowing, joining, correctness,
elasticity, and accessibility. ‘The current state of stream processing’ walks through the
origins of stream processing and applicable use cases, then dives into the challenges
currently facing the world of stream processing as it drives the next data revolution.
51

The State of Stream Processing

  • 1. 1Confidential The State of Stream Processing Neil Avery, @avery_neil, March 2018
  • 2. 2 The state of stream processing The rise of streaming... business has changed: customer 360, digital transformation, IoT, machine learning, data science
  • 3. 3 Before stream processing Before: OLTP - RPC/Request-response - Messaging - ESB Almost: - OLTP: TRIGGERs + materialized views, stored procedures… ….CEP….Big data… Acknowledge: Dr Michael Stonebraker - CEP, volt-db, Streaming-SQL etc 2005+
  • 4. … it didn’t happen until now ... … WHY NOT? ... 4
  • 5. 5 Because Stream Processing is hard ● Scaling ● Semantics ● Correctness ● Accuracy (determinism) (eos) ● Transactions ● Partitioning ● Fault tolerance ● Re-processing ● Unifying tables and streams (stream-relational) ● Time
  • 6. 6 Why stream processing is different We are solving different problems today. Businesses want to be real-time, correlated events, react and adapt Turning off batch and moving to real-time, scaling beyond the limits of traditional storage system, applying data science or machine learning (fashion, fraud or retailer, suggestions, advertising) processing the ‘now’ has great value
  • 7. 7 Who is using it Companies: Uber, Netflix, Audi, Banks, Slack… everyone ...obvious use-cases… dumb pipes, Reactive microservices, ML, streaming ETL etc ….there are ALSO, ‘data-type’ problems we can solve...
  • 8. 8 Stream processing == data processing for events ● Correlation (user register and signed up, bad password) ● Scaling out the data flow of a company (connecting data-sources) ● Moving to the Central nervous system (Data flow system) ● Transaction processing: The Database unbundled, AKA, Turning the database inside out… …. why is this so important? [2014]
  • 9. 9 Turning the database inside out…. “The database is a cache of a subset of the log” ● Pat Helland, 2007 With Kafka we can do this at scale… … using the LOG and partitions ..
  • 10. 10 The 12 stage Adoption JOURNEY ● Dumb pipes ● Simple streaming (filtering) ● Streaming ETL ● Analytics ● Monitoring & notifications ● Anomaly detection ● Correlation ● Reactive microservices ● Transaction processing ● Database inside out ● Central nervous system (platform) ● Global, central nervous system (platForm)
  • 11. 11 Turning the database inside out ● Event sourcing ● Source of truth ● Data flow systems ● Central nervous system ● Txns At ‘web-scale’ scale!
  • 12. 12 At scale…. Single Deployments with ● Hundreds of brokers ● Hundreds of thousands of topics ● Millions of partitions Hundreds of millions of Events per second…. AWS Death Star diagram, circa 2008 as per Werner Vogels tweet
  • 14. 14 the streaming platform stack A new breed of data platform - built on unix principles (do lots of simple things) - Messaging - Storage (topics + partitions + log based) - Correctness (EoS, txns) - Processing (KSQL, kafka streams) - Governance - Operations - Elastic scale up/down Kind of a like a modern database but...it can do these clever things, fast and at scale - unbundled using stream processing
  • 15. 15 the streaming platform in action KSQL Stream processing here...
  • 16. 16 State of the union stream-Relational Players ● Kafka Streams ● Flink ● spark Stream Only players ● Apex ● Samza ● Storm ● Google Dataflow A stream-relational processing platform has the following capabilities: ● Relations (or tables) are first-class citizens, i.e. each has an independent identity. ● Relations can be transformed into other relations. ● Relations can be queried in an ad-hoc manner.
  • 18. ..Is a.. - Sequence of facts, records or events - Captures the order of events - Time sensitive But it needs to be stored as a stream => in a LOG A stream…. 18 { Joe pays Mary $1m }
  • 19. ..Is a.. - ‘Materialized’ view of a stream - uses Change Data Capture (CDC) A Table…. 19 insert delete update stream
  • 22. The stream table duality 22
  • 23. Do you think that’s a table you are querying? 23
  • 24. 24 Streams and Tables ● STREAM and TABLE as first-class citizens ● Interpretations of Topic content ● STREAM - data in motion ● TABLE - collected state of a stream (aggregations) ○ One record per key (per window) ○ Current values (compacted topic) ○ Changelog
  • 25. 25 Windows: Aggregations & Tables ● Source: https://www.rittmanmead.com/blog/2017/10/ksql-streaming-sql-for-apache-kafka/
  • 28. 28 ● Scaling ● Semantics ● Correctness ● Accuracy (determinism) (eos) ● Transactions ● Partitioning ● Fault tolerance ● Re-processing ● Unifying tables and streams (stream-relational) ● Time Stream Processing (the hard parts)
  • 30. 30 Or some SQL CREATE STREAM notifications AS SELECT code, definition FROM ticks LEFT JOIN opens ON opens.id = ticks.code; CREATE TABLE metrics AS SELECT * FROM notifications WINDOW TUMBLING (size 30 second) GROUP BY id, value WHERE value = ‘Closed’;
  • 32. 32 Applied Stream processing 1. Data processing: enriching streams of data, filtering, cleaning, joining, splitting: ETL, Machine learning 2. Analytics: asking questions of data 3. Apps: Build reactive (stateful) microservice apps with materialized views 4. Transaction processing: Stream proc + log-store (Kafka), handle txns across multiple partitions (replacing 40 years of slow ACID with Async, fast, correct behaviour) 80% of stream processing is for doing simple stuff, like ETL
  • 33. 33 Stream processing using SQL 80% of stream processing is for doing ‘simple stuff’, like ETL SQL == winning! => Simple, repeatable, easy to deploy & run…. => Accessible: anyone can do stream processing! https://github.com/confluentinc/ksql
  • 34. 34 KSQL: Streaming ETL ● Streaming ETL ○ Kafka is popular for data pipelines. ○ KSQL enables easy transformations of data within the pipe CREATE STREAM vip_actions AS SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = 'Platinum';
  • 35. 35 KSQL: Monitoring ● Real-Time Monitoring ○ Infrastructure/data flows/apps monitoring, tracking, and alerting (e.g. logs, metrics) ○ Sensor / IoT data CREATE TABLE error_counts AS SELECT error_code, count(*) FROM monitoring_stream WINDOW TUMBLING (SIZE 1 MINUTE) WHERE type = 'ERROR’ GROUP BY error_code;
  • 36. 36 KSQL: Anomaly detection ● Anomaly Detection ○ Identifying patterns or anomalies in real-time data, surfaced in milliseconds CREATE TABLE possible_fraud AS SELECT card_number, COUNT(*) FROM authorization_attempts WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY card_number HAVING COUNT(*) > 3;
  • 37. Use case: Anomaly detection - fraud. Goal: detect when ‘new’ transactions occur against suspicious recipients. Real world: once a fraudster has access to your account, they need to take the money out. This is done by making multiple small transactions against common recipients, and there is usually a group of them, e.g. taking $100 per month for a broadband or mobile phone provider.
  • 38. Use case: Anomaly detection - fraud. 3x data sources: 1) realtime txns, 2) set of suspicious recipients, 3) set of valid recipients. Input: { Type: txn, req_id: 1234, joe -> mary, joe_account: 1234, mary_account: 5678, amount: $1m }. Check ‘previous use’ processor: stream-table join (by account-recipient) against the table of { Type: valid_account, account: 1234, recipient: verizon }. Check ‘is suspicious’ processor: stream-table join (by recipient) against the table of { Type: suspicious_account, account: 1234, recipient: verizon }. Check ‘multi-sus-24h’ processor: count-by (by recipient), window 24h, count(recipient). Output: { Type: detected_multi-sus-txns, account: 1234 }. For a complete solution (including data generators) see: https://github.com/bluemonk3y/ksql-recipe-fraudulent-txns
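The three checks on this slide can be read as a small pipeline. The plain-Python sketch below is one possible interpretation, not the linked recipe: the two tables are modelled as in-memory sets, the field names `ts`, `account` and `recipient` are hypothetical, the 24h count is keyed by account (the slide labels it "by recipient"), and the threshold of two suspicious transactions is an assumption, since the slide does not state one.

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000
MIN_SUSPICIOUS_TXNS = 2  # assumed threshold; not given on the slide


def detect_multi_sus_txns(txns, valid_recipients, suspicious_recipients):
    """Run the three checks over a list of txn dicts and return detections.

    valid_recipients: set of (account, recipient) pairs already used legitimately.
    suspicious_recipients: set of recipient names flagged as suspicious.
    """
    counts = defaultdict(int)
    detected = []
    for txn in txns:
        # 1. 'previous use': skip recipients this account has paid before.
        if (txn["account"], txn["recipient"]) in valid_recipients:
            continue
        # 2. 'is suspicious': keep only txns to flagged recipients.
        if txn["recipient"] not in suspicious_recipients:
            continue
        # 3. 'multi-sus-24h': count per account within a 24h tumbling window.
        window = txn["ts"] - (txn["ts"] % DAY_MS)
        counts[(txn["account"], window)] += 1
        if counts[(txn["account"], window)] == MIN_SUSPICIOUS_TXNS:
            detected.append(
                {"type": "detected_multi-sus-txns", "account": txn["account"]}
            )
    return detected


valid = {("1234", "verizon")}
suspicious = {"shady-telco", "fake-isp"}
txns = [
    {"ts": 0, "account": "1234", "recipient": "verizon"},
    {"ts": 1_000, "account": "1234", "recipient": "shady-telco"},
    {"ts": 2_000, "account": "1234", "recipient": "grocer"},
    {"ts": 3_000, "account": "1234", "recipient": "fake-isp"},
    {"ts": 4_000, "account": "5678", "recipient": "shady-telco"},
]
detections = detect_multi_sus_txns(txns, valid, suspicious)
print(detections)
```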
  • 39. Use case: Web-scale Transaction processing > Joe pays mary $1,000,000. NB: requires a log-based streaming platform to support determinism (Kafka!). A traditional OLTP system would require an atomic commit on all tables. Log-based stream processing scales by using partitioning (partitions are log-append, atomic and provide the source of truth).
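The partitioning idea can be illustrated with a toy in-memory log. This is a sketch under stated assumptions: `zlib.crc32` stands in for whatever hash a real client uses, the partition count is arbitrary, and the lists stand in for append-only partition logs.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition deterministically (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append_all(events, num_partitions=NUM_PARTITIONS):
    """Append (key, value) events to per-partition logs, preserving arrival order."""
    logs = defaultdict(list)
    for key, value in events:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


events = [("joe", "debit 100"), ("mary", "credit 100"), ("joe", "debit 50")]
logs = append_all(events)
joe_log = [value for key, value in logs[partition_for("joe")] if key == "joe"]
print(joe_log)
```

Every event for a given key lands in the same partition in arrival order, so one consumer can process that key's history deterministically while other partitions are processed in parallel.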
  • 40. Use case: Web-scale Transaction processing. 1. Request #1234: Joe -> mary $1m 2. Payer account (joe) 3. Payee account (mary). Requests processor (by request#): { req_id: 1234, joe -> mary, joe_account: 1234, mary_account: 5678, amount: $1m }. Debit processor (by payer id), de-dup req-id: { req: 1234, type: debit, who: joe, amount: -$1m }. Credit processor (by payee id), de-dup req-id: { req: 1234, type: credit, who: mary, amount: +$1m }. Payment balance processor (by account number): { req: 1234, joe -$1m }, { req: 1234, mary +$1m }. Why is this fast? Why is it horizontally scalable?
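A minimal plain-Python sketch of this flow follows. It is an interpretation of the slide rather than a reference design: the field names `payer` and `payee` are hypothetical, and the debit/credit de-dup stages and the balance stage are collapsed into a single `LedgerProcessor` class for brevity.

```python
from collections import defaultdict


def requests_processor(request):
    """Split one payment request into a debit event and a credit event."""
    return [
        {"req": request["req_id"], "type": "debit",
         "who": request["payer"], "amount": -request["amount"]},
        {"req": request["req_id"], "type": "credit",
         "who": request["payee"], "amount": request["amount"]},
    ]


class LedgerProcessor:
    """Apply debit/credit events to balances, ignoring redelivered events."""

    def __init__(self):
        self.balances = defaultdict(int)
        self.seen = set()  # (req, type) pairs already applied

    def apply(self, event):
        key = (event["req"], event["type"])
        if key in self.seen:
            return False  # duplicate delivery: de-dup by request id
        self.seen.add(key)
        self.balances[event["who"]] += event["amount"]
        return True


request = {"req_id": 1234, "payer": "joe", "payee": "mary", "amount": 1_000_000}
ledger = LedgerProcessor()
events = requests_processor(request)
# Deliver every event twice to simulate at-least-once delivery.
applied = [ledger.apply(event) for event in events + events]
print(dict(ledger.balances), applied)
```

Each stage only needs the events for its own key (request, payer, payee or account), so no cross-table atomic commit is required, and de-duplication by request id keeps replays from changing the balances.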
  • 41. Use case: Central nervous system: Human resources; B2B txns; Webscale payments & ETL; Regulations, compliance and audit; Data science, ETL
  • 42. What (interesting) problems do we still need to solve? - SQL (for everyone) - Improved auto-scaling of analytics (cloud) - Improved workload optimisation (Calcite) - Large-scale real-time analytics (multi-stage HyperLogLog reduce) - Improved streaming-relational support - Improved data modelling & constraint handling - Data lineage
  • 43. The future - Continued adoption and acceleration of “dataflow systems & central nervous system”. - In 2022, 60% of businesses will be using event-driven architectures. - We grow up the stack, we grow out the stack.
  • 44. We have a long road ahead of us…
  • 45. Questions? Recommended reading: Confluent Blog: https://www.confluent.io/blog/ Kafka Summit San Francisco: https://kafka-summit.org/ Free ebook: Making Sense of Stream Processing, Martin Kleppmann. Safari books: “The Log.” “Designing Data Intensive Applications”, Google dataflow paper (very old now!) & lots more… Neil Avery, @avery_neil, Technologist, Office of the CTO
  • 46. Questions? Demo applications: https://github.com/confluentinc/examples • Interactive Queries, Joins, Security, Windowing, Avro integration, … Confluent documentation: http://docs.confluent.io/current/streams/ • Quickstart, Concepts, Developer Guide, FAQ. Recorded talks • Introduction to Kafka’s Streams API: http://www.youtube.com/watch?v=o7zSLNiTZbA • Application Development and Data in the Emerging World of Stream Processing (higher level talk): https://www.youtube.com/watch?v=JQnNHO5506w
  • 49. Bio for Neil: Neil is a Technologist in the Office of the CTO at Confluent, the company founded by the creators of Apache Kafka. He has over 20 years of experience working on distributed computing, messaging and stream processing. He has built or redesigned commercial messaging platforms and distributed caching products, and has developed large-scale bespoke systems for tier-1 banks. After a period at ThoughtWorks, he went on to build some of the first distributed risk engines in financial services. In 2008 he launched a startup that specialised in distributed data analytics and visualization. Prior to joining Confluent he was the CTO at a fintech consultancy.
  • 50. Synopsis - The State of Stream Processing: This session walks through the journey of how stream processing arrived at its current state and takes a look towards the future. It explores the use cases and concepts that drive the evolution of stream processing and the need for fast data, at scale. This includes how Apache Kafka has grown to become the go-to streaming platform that is the Confluent Platform, where it supports the concepts that underpin many processing frameworks. Learn about the current challenges facing stream processing frameworks, including real-time semantics surrounding windowing, joining, correctness, elasticity, and accessibility. Finally, learn how KSQL is evolving to shape the future of what is to come.
  • 51. The State of Stream Processing: Stream processing is now at the forefront of many company strategies. Over the last couple of years we have seen streaming use cases explode, and they now proliferate across the landscape of any modern business. Use cases including digital transformation, IoT, real-time risk, payments microservices and machine learning all rest on the same fundamental need: fast data, at scale. Apache Kafka has long been the streaming platform of choice; its origins as dumb pipes for big data have long since been left behind. Stream processing beckons as the vehicle for driving those streams, and it brings with it a world of real-time semantics surrounding windowing, joining, correctness, elasticity, and accessibility. The ‘current state of stream processing’ walks through the origins of stream processing and its applicable use cases, then dives into the challenges currently facing the world of stream processing as it drives the next data revolution.