A Practical Guide to Selecting a
Stream Processing Technology
Michael G. Noll
Product Manager, Confluent
Kafka Talk Series
Date    Title
Sep 27  Introduction To Streaming Data and Stream Processing with Apache Kafka
Oct 06  Deep Dive into Apache Kafka
Oct 27  Data Integration with Apache Kafka
Nov 17  Demystifying Stream Processing with Apache Kafka
Dec 01  A Practical Guide to Selecting a Stream Processing Technology
Dec 15  Streaming in Practice: Putting Apache Kafka in Production
https://www.confluent.io/apache-kafka-talk-series
Agenda
• Recap: What is Stream Processing?
• The Three Pillars of Stream Processing in Practice
• Key Selection Criteria
  • Organizational/Non-Technical Dimensions
  • Technical Dimensions
• Summary
Powered by Kafka (thousands more)
Spark Streaming API (2.0)
Kafka’s Streams API (0.10)
Example: Streams and Tables in Kafka
Word    Count
hello   2
kafka   1
world   1
…       …
Streams & Databases
• A stream processing technology must have first-class support for Streams and Tables
  • With scalability, fault tolerance, …
• Why? Because most use cases require not just one, but both!
• Support – or lack thereof – strongly impacts the resulting technical architecture and development efforts
• No support means:
  • Painful Do-It-Yourself
  • Increased complexity, more moving pieces to juggle
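To make the stream/table relationship concrete, here is a minimal sketch in plain Python (not the Kafka API). It treats the word-count table as the running aggregation of a stream of words, and shows that replaying the resulting update stream rebuilds the table. The helper names `words` and `table_updates` and the two input lines are assumptions made for illustration.

```python
from collections import Counter

def words(lines):
    """Turn a stream of text lines into a stream of lowercase words."""
    for line in lines:
        for word in line.lower().split():
            yield word

def table_updates(word_stream):
    """Aggregate a stream into a table, emitting one (key, new_count) update per record."""
    table = Counter()
    for word in word_stream:
        table[word] += 1
        yield word, table[word]

lines = ["hello kafka", "hello world"]            # hypothetical input stream
changelog = list(table_updates(words(lines)))     # stream of table updates
table = dict(changelog)                           # latest value per key wins

print(changelog)
print(table)
```

The changelog is itself a stream, and keeping only the latest value per key turns it back into the table. A technology with first-class tables does this bookkeeping for you.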
Organizational/Non-Tech Dimensions
• Can your org understand and leverage the technology?
  • Familiarity with languages; intuitive concepts and APIs; training
• Are you permitted to use it in your organization?
  • Security features, licensing, open source vs. proprietary
• Can you continue to use it in the future?
  • Longevity of technology, licensing, vendor strength
• Do you believe in the long-term vision?
  • Switching technologies in an organization is often expensive/slow: legacy migration, re-training, resistance to change, etc.
• What is the path and time to success?
  • Can you move smoothly and quickly from proof-of-concept to production?
• Areas and range of applicability in your organization
  • General-purpose vs. niche technology
  • Viable for S/M/L/XL use cases vs. for XL use cases only
  • Building core business apps vs. doing backend analytics
Organizational/Non-Tech Dimensions
• Licensing
• Vision/Roadmap
• ROI
• Impact on Organization
• Broad vs. Niche Applicability
• Time to Market
• Professional Services
• Documentation
• Examples
• User Community
• Learning Curve
• Impact on Tools, Infrastructure, …
Technical Dimensions
• Reprocessing
• Scalability & Elasticity
• Fault Tolerance
• API
• Dev/Ops Lifecycle
• Security
• Processing Model
• Out of Order Data
• Abstractions
• Time Model
• Windowing
• State
State
• Stateful processing of any kind requires…state
• Many (most?) use cases for stream processing are stateful
  • Joins, aggregations, windowing, counting, ...
• Is state performant? Local vs. remote state?
• Is state fault-tolerant? How fast is recovery/failover?
• Is state interactively queryable?
  • Kafka: ready for use (GA)
  • Spark, Flink: under development (alpha)
  • Storm, Samza, and others: not available
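The three questions about state (local, fault-tolerant, queryable) can be illustrated with a small sketch. It is plain Python, not any framework's API: a hypothetical `CountStore` keeps its state in a local dict, appends every update to a changelog, and recovers by replaying that changelog. The class name, its methods, and the in-memory list standing in for a durable log are all assumptions.

```python
class CountStore:
    """Local key-value state; every update is also appended to a changelog."""

    def __init__(self, changelog=None):
        # The changelog stands in for a durable, replicated log (assumption).
        self.changelog = changelog if changelog is not None else []
        self.state = {}
        # Recovery: rebuild local state by replaying the changelog.
        for key, value in self.changelog:
            self.state[key] = value

    def increment(self, key):
        value = self.state.get(key, 0) + 1
        self.state[key] = value                 # fast local write
        self.changelog.append((key, value))     # makes the update recoverable
        return value

    def get(self, key):
        """Interactive query: read the current value directly from local state."""
        return self.state.get(key)

durable_log = []
store = CountStore(durable_log)
for user in ["alice", "bob", "alice"]:
    store.increment(user)

# Simulate failover: a fresh instance restores its state from the changelog.
recovered = CountStore(durable_log)
print(recovered.get("alice"), recovered.get("bob"))
```

Recovery time in this sketch grows with the length of the changelog, which is one reason to ask how fast recovery and failover are in practice.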
Abstractions
• What are the data model and the available abstractions?
• Most common abstraction: stream of records, events
  • Kafka, Spark, Storm, Samza, Flink, Apex, ...
• New, very powerful: table of records
  • Currently unique to Kafka
  • Represents latest state and materialized views
• State must have a first-class abstraction because, as we just saw in the previous section, state is crucial for stream processing!
Time Model
• Different use cases require different time semantics
  • Great majority of use cases require event-time semantics
  • Other use cases may require processing-time (e.g. real-time monitoring) or special variants like ingestion-time
• A stream processing technology should, at a minimum, support event-time to cover most use cases in practice
  • Examples: Kafka, Beam, Flink
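The difference between the time semantics shows up as soon as records arrive with a delay. The sketch below is plain Python with made-up timestamps. It counts the same four records per one-minute bucket twice: once by when each event happened (event-time) and once by when it was processed (processing-time). The record layout and the `count_by` helper are assumptions for illustration.

```python
from collections import defaultdict

WINDOW = 60  # bucket size in seconds

# (event_time, processing_time, value); the last record arrives long after it happened
records = [(5, 6, 1), (50, 61, 1), (70, 72, 1), (58, 130, 1)]

def count_by(records, time_of):
    """Sum values per time bucket, using time_of(record) as the record's timestamp."""
    counts = defaultdict(int)
    for record in records:
        bucket_start = time_of(record) // WINDOW * WINDOW
        counts[bucket_start] += record[2]
    return dict(counts)

by_event_time = count_by(records, lambda r: r[0])
by_processing_time = count_by(records, lambda r: r[1])

print(by_event_time)
print(by_processing_time)
```

With event-time, three events fall into the first minute because that is when they occurred. With processing-time, the same data is spread over three buckets depending on arrival delay.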
Windowing
• Windowing is an operation that groups events
[Figure: event-time windowing. Input data plotted by processing-time and event-time, where colors represent different users' events (alice, bob, dave) and rectangles denote different event-time windows.]
• Most commonly needed: time windows, session windows
• Examples:
  • Real-time monitoring: 5-minute averages
  • Reader behavior on a website: user browsing sessions
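The two window types named above can be sketched in a few lines of plain Python. `tumbling_windows` groups timestamps into fixed 5-minute buckets, and `session_windows` groups them into sessions separated by a period of inactivity. The function names, the click timestamps, and the 10-minute inactivity gap are assumptions for illustration.

```python
def tumbling_windows(timestamps, size):
    """Group timestamps into fixed-size, non-overlapping windows keyed by window start."""
    windows = {}
    for ts in sorted(timestamps):
        start = ts // size * size
        windows.setdefault(start, []).append(ts)
    return windows

def session_windows(timestamps, gap):
    """Group timestamps into sessions; a new session starts after more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

clicks = [0, 120, 290, 310, 2000, 2100]      # event-time in seconds for one user

tumbling = tumbling_windows(clicks, 5 * 60)  # 5-minute time windows
sessions = session_windows(clicks, 10 * 60)  # sessions with a 10-minute inactivity gap

print(tumbling)
print(sessions)
```

Time windows have boundaries fixed by the clock, while session windows have boundaries determined by the data, which is why the two need separate support.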
Out-of-order and late-arriving data
• Is very common in practice, not a rare corner case
• Related to time model discussion
Example:
• Users with mobile phones enter airplane, lose Internet connectivity
• Emails are being written during the 10h flight
• Internet connectivity is restored, phones will send queued emails now
• We want control over how out-of-order data is handled
• Example:
  • We process data in 5-minute windows, e.g. compute statistics
  • When event arrives 1 minute late: update the original result!
  • When event arrives 2 hours late: discard it!
• Handling must be efficient because it happens so often
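One way to get this kind of control is a grace period per window. The sketch below is plain Python and shows one possible policy, not any specific framework's behavior. It counts events in 5-minute event-time windows, measures lateness against the highest event time seen so far, updates a window that is still within its grace period, and discards events for windows past it. The `WindowedCounter` class and the 10-minute grace period are assumptions.

```python
WINDOW = 5 * 60   # 5-minute windows, in seconds
GRACE = 10 * 60   # assumption: a window accepts updates up to 10 minutes after it ends

class WindowedCounter:
    def __init__(self):
        self.counts = {}        # window start -> event count
        self.stream_time = 0    # highest event time observed so far
        self.dropped = 0

    def process(self, event_time):
        self.stream_time = max(self.stream_time, event_time)
        start = event_time // WINDOW * WINDOW
        if self.stream_time > start + WINDOW + GRACE:
            self.dropped += 1   # too late: the window is closed, discard the event
            return None
        self.counts[start] = self.counts.get(start, 0) + 1
        return start, self.counts[start]   # new or updated result for that window

counter = WindowedCounter()
# 290 arrives about a minute late (after 360); 280 arrives about two hours late (after 7600).
results = [counter.process(t) for t in [10, 20, 360, 290, 7600, 280]]

print(results)
print(counter.counts, counter.dropped)
```

Each late event costs one dictionary lookup and one comparison here. That matters because, as noted above, late data is the common case rather than the exception.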
Reprocessing
• Re-process data by rewinding a stream back in time
• Use cases in practice include
  • Correcting output data after fixing a bug
  • Facilitating iterative and explorative development
  • A/B testing
  • Processing historical data
  • Walking through "What If?" scenarios
• Also: often used behind-the-scenes for fault tolerance
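Rewinding is possible when the input is kept as a replayable log and consumers address it by offset. The following plain-Python sketch makes that assumption explicit. A list plays the role of the retained log, and a hypothetical `run` helper processes it from a chosen offset, so that a fixed version of the logic can be re-run over the same data.

```python
log = [3, 4, 5, 6]   # retained, append-only input records (assumption: nothing has expired)

def run(process, log, from_offset=0):
    """Process every record from `from_offset` onward and return the outputs."""
    return [process(record) for record in log[from_offset:]]

def buggy_double(x):
    return x * 2 + 1   # bug: off by one

def fixed_double(x):
    return x * 2

first_run = run(buggy_double, log)                 # original, incorrect output
corrected = run(fixed_double, log, from_offset=0)  # rewind to the beginning and re-process
resumed = run(fixed_double, log, from_offset=2)    # or resume from a later position

print(first_run, corrected, resumed)
```

The same mechanism serves fault tolerance: after a crash, a consumer restarts from its last committed offset instead of offset 0.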
Scalability, Elasticity, Fault Tolerance
• Can the technology scale according to your needs?
  • Desired latency, throughput?
  • Able to process millions of messages per second?
  • What is the minimum footprint?
• Expand/shrink capacity dynamically during operations?
  • Helps with resource utilization because most stream apps run continuously
• Resilience and fault tolerance
  • Which guarantees for data delivery and for state? "At-least-once", "exactly-once", "effectively-once", etc.
  • Failover behavior and recovery time? Automated or manual?
  • Any negative impact of fault tolerance features on performance?
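The delivery guarantees differ in what happens to state when records are redelivered after a failure. As a rough sketch in plain Python, with a made-up failure scenario, the code below replays records after a simulated crash (at-least-once delivery) and compares a naive aggregation with one that skips offsets it has already applied. The second gives an "effectively-once" result. The record layout and function names are assumptions.

```python
# (offset, key, amount)
records = [(0, "a", 5), (1, "b", 7), (2, "a", 1)]

# At-least-once: a crash after two records, before the offset was committed,
# causes a restart from offset 0, so the first two records are delivered twice.
deliveries = records[:2] + records

def naive_totals(deliveries):
    """Apply every delivery, including duplicates."""
    totals = {}
    for _, key, amount in deliveries:
        totals[key] = totals.get(key, 0) + amount
    return totals

def idempotent_totals(deliveries):
    """Skip records whose offset was already applied to the state."""
    totals, last_applied = {}, -1
    for offset, key, amount in deliveries:
        if offset <= last_applied:
            continue               # duplicate from the replay, ignore it
        totals[key] = totals.get(key, 0) + amount
        last_applied = offset
    return totals

print(naive_totals(deliveries))
print(idempotent_totals(deliveries))
```

The deduplication only holds if the totals and `last_applied` are persisted together. This is the kind of detail to check when a technology advertises a guarantee for state.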
Security
• To meet internal security policies, legal compliance, etc.
• Typical base requirements for stream processing applications:
  • Encrypt data-in-transit (e.g. from/to Kafka)
  • Authentication: "only some applications may talk to production"
  • Authorization: "access to sensitive data such as PII is restricted"
• The easier it is to use security features, the more likely they are actually being used in practice
Processing Model
• True stream processing is record-at-a-time processing
  • Benefits include low latency (milliseconds), dealing efficiently with out-of-order data
  • Can provide both low latency and high throughput via internal optimizations
  • Examples: Kafka, Storm, Samza, Flink, Beam
• Some processing technologies opt for (micro)batching
  • Micro-batching has no true benefits: consider it a technical workaround to shoehorn stream-like functionality into a tool
  • Suffers from significant overhead when dealing with e.g. out-of-order/late-arriving data, when performing windowed analyses (e.g. session windows)
  • Typically a strong blocker for use cases such as fraud detection or anything where "a few seconds" of latency is prohibitive
  • Examples: Spark, Storm (Trident), Hadoop*
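The latency argument can be seen with back-of-the-envelope arithmetic. The plain-Python sketch below uses a simplified model: it ignores actual processing cost and assumes a micro-batch is only processed at the end of its interval. It computes how long each record waits before processing can start. The arrival times and the 1-second batch interval are made-up values.

```python
BATCH_INTERVAL_MS = 1000                    # assumption: 1-second micro-batches
arrivals_ms = [100, 400, 1200, 1900, 2050]  # hypothetical record arrival times

def batch_wait_ms(arrival, interval):
    """Time a record waits until the end of the batch interval it arrived in."""
    batch_end = -(-arrival // interval) * interval   # ceiling to the next multiple of interval
    return batch_end - arrival

micro_batch_waits = [batch_wait_ms(t, BATCH_INTERVAL_MS) for t in arrivals_ms]
record_at_a_time_waits = [0 for _ in arrivals_ms]    # processing can start on arrival

print(micro_batch_waits)
print(max(micro_batch_waits), max(record_at_a_time_waits))
```

Under this model, the added wait is bounded by the batch interval and averages about half of it. That is why the batch interval sets a floor on end-to-end latency for micro-batching.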
API
• Choice of API is a subjective matter – skills, preference, …
• Typical options
  • Declarative, expressive API: operations like map(), filter()
  • Imperative, lower-level API: callbacks like process(event)
  • Streaming SQL: STREAM SELECT … FROM … WHERE …
• In the best case you get not just one, but all three
  • "Abstractions are great!"
  • "Abstractions considered harmful!"
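To compare the first two API styles, here is the same small job written both ways in plain Python: select the users whose payment is at least 10. The event fields, the threshold, and the `LargePaymentProcessor` class are hypothetical, and neither variant uses a real streaming library.

```python
events = [
    {"user": "alice", "amount": 30},
    {"user": "bob", "amount": 5},
    {"user": "dave", "amount": 12},
]

# Declarative style: describe the transformation with operations like filter() and map().
declarative = list(
    map(lambda e: e["user"], filter(lambda e: e["amount"] >= 10, events))
)

# Imperative style: a callback invoked once per event, with full control over what happens.
class LargePaymentProcessor:
    def __init__(self):
        self.output = []

    def process(self, event):
        if event["amount"] >= 10:
            self.output.append(event["user"])

processor = LargePaymentProcessor()
for event in events:
    processor.process(event)

print(declarative)
print(processor.output)
```

The declarative version is shorter and leaves room for the engine to optimize. The imperative version makes it easier to add custom logic such as per-event side effects or manual state handling.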
Developer/Operations Lifecycle
• What should your daily work look and feel like?
  • "I like to do quick, iterative development" (modify/test/repeat)
  • "I want to decouple team roadmaps, project schedules"
• Big difference between App Model <-> Cluster Model
  • Testing, packaging, deployment, monitoring, operations
  • "Do I need to know Java (app) or YARN (cluster) for this?"
  • "I want reactive processing in containers that run on Mesos!"
• Rolling, no-downtime upgrades?
• Integration with existing Ops infra, tools, processes?
Summary
• What we covered is a good starting point
• But, no free lunch!
  • Understand what you need, and weigh criteria appropriately
  • Think end-to-end: idea, development, operations, troubleshooting
  • Think big-picture: future use cases, architecture, security, training, …
• Do your own internal hackathons, proof-of-concepts
• Do your own benchmarks
• If in doubt: simplicity beats complexity
  • Faster to learn, easier to understand, less likely to fail, …
Q&A Session
Coming Up Next
Date    Title                                                      Speaker
Dec 15  Streaming in Practice: Putting Apache Kafka in Production  Roger Hoover
https://www.confluent.io/apache-kafka-talk-series

More Related Content

What's hot (20)

PDF
Capital One Delivers Risk Insights in Real Time with Stream Processing
confluent
 
PPTX
Kafka Streams for Java enthusiasts
Slim Baltagi
 
PDF
Event Driven Architectures with Apache Kafka on Heroku
Heroku
 
PDF
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
confluent
 
PDF
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
confluent
 
PDF
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
confluent
 
PDF
What's new in Confluent 3.2 and Apache Kafka 0.10.2
confluent
 
PPTX
Apache Kafka 0.8 basic training - Verisign
Michael Noll
 
PDF
How to over-engineer things and have fun? | Oto Brglez, OPALAB
HostedbyConfluent
 
PPTX
Data Streaming with Apache Kafka & MongoDB
confluent
 
PDF
Monitoring Apache Kafka with Confluent Control Center
confluent
 
PPTX
Confluent building a real-time streaming platform using kafka streams and k...
Thomas Alex
 
PDF
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
confluent
 
PDF
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Guozhang Wang
 
PDF
Using Apache Kafka to Analyze Session Windows
confluent
 
PDF
Evolving from Messaging to Event Streaming
confluent
 
PDF
Introduction to Apache Kafka and Confluent... and why they matter
confluent
 
PPTX
Streaming Data Integration - For Women in Big Data Meetup
Gwen (Chen) Shapira
 
PPTX
Kafka at scale facebook israel
Gwen (Chen) Shapira
 
PDF
Kafka Streams: What it is, and how to use it?
confluent
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
confluent
 
Kafka Streams for Java enthusiasts
Slim Baltagi
 
Event Driven Architectures with Apache Kafka on Heroku
Heroku
 
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
confluent
 
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications
confluent
 
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
confluent
 
What's new in Confluent 3.2 and Apache Kafka 0.10.2
confluent
 
Apache Kafka 0.8 basic training - Verisign
Michael Noll
 
How to over-engineer things and have fun? | Oto Brglez, OPALAB
HostedbyConfluent
 
Data Streaming with Apache Kafka & MongoDB
confluent
 
Monitoring Apache Kafka with Confluent Control Center
confluent
 
Confluent building a real-time streaming platform using kafka streams and k...
Thomas Alex
 
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
confluent
 
Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Guozhang Wang
 
Using Apache Kafka to Analyze Session Windows
confluent
 
Evolving from Messaging to Event Streaming
confluent
 
Introduction to Apache Kafka and Confluent... and why they matter
confluent
 
Streaming Data Integration - For Women in Big Data Meetup
Gwen (Chen) Shapira
 
Kafka at scale facebook israel
Gwen (Chen) Shapira
 
Kafka Streams: What it is, and how to use it?
confluent
 

Viewers also liked (20)

PPTX
Streaming in Practice - Putting Apache Kafka in Production
confluent
 
PPTX
Deep Dive into Apache Kafka
confluent
 
PDF
Demystifying Stream Processing with Apache Kafka
confluent
 
PDF
Leveraging Mainframe Data for Modern Analytics
confluent
 
PDF
The Data Dichotomy- Rethinking the Way We Treat Data and Services
confluent
 
PDF
Power of the Log: LSM & Append Only Data Structures
confluent
 
PPTX
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
confluent
 
PDF
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
confluent
 
PPTX
Introduction To Streaming Data and Stream Processing with Apache Kafka
confluent
 
PDF
Distributed stream processing with Apache Kafka
confluent
 
PDF
Data Pipelines Made Simple with Apache Kafka
confluent
 
PPTX
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
confluent
 
PPTX
Protecting your data at rest with Apache Kafka by Confluent and Vormetric
confluent
 
PDF
Stream Processing with Kafka in Uber, Danny Yuan
confluent
 
PDF
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
confluent
 
PDF
Building Event-Driven Services with Apache Kafka
confluent
 
PDF
Partner Development Guide for Kafka Connect
confluent
 
PPTX
Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...
confluent
 
PDF
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
confluent
 
PDF
Confluent & Attunity: Mainframe Data Modern Analytics
confluent
 
Streaming in Practice - Putting Apache Kafka in Production
confluent
 
Deep Dive into Apache Kafka
confluent
 
Demystifying Stream Processing with Apache Kafka
confluent
 
Leveraging Mainframe Data for Modern Analytics
confluent
 
The Data Dichotomy- Rethinking the Way We Treat Data and Services
confluent
 
Power of the Log: LSM & Append Only Data Structures
confluent
 
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
confluent
 
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str...
confluent
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
confluent
 
Distributed stream processing with Apache Kafka
confluent
 
Data Pipelines Made Simple with Apache Kafka
confluent
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
confluent
 
Protecting your data at rest with Apache Kafka by Confluent and Vormetric
confluent
 
Stream Processing with Kafka in Uber, Danny Yuan
confluent
 
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
confluent
 
Building Event-Driven Services with Apache Kafka
confluent
 
Partner Development Guide for Kafka Connect
confluent
 
Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...
confluent
 
Introducing Kafka Streams: Large-scale Stream Processing with Kafka, Neha Nar...
confluent
 
Confluent & Attunity: Mainframe Data Modern Analytics
confluent
 
Ad

Similar to A Practical Guide to Selecting a Stream Processing Technology (20)

PPTX
Azure architecture design patterns - proven solutions to common challenges
Ivo Andreev
 
PDF
Integration strategies best practices- Mulesoft meetup April 2018
Rohan Rasane
 
PDF
Introduction to the Typesafe Reactive Platform
BoldRadius Solutions
 
PDF
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
PPTX
Modern DevOps across Technologies on premises and clouds with Oracle Manageme...
Lucas Jellema
 
PDF
Spark Development Lifecycle at Workday - ApacheCon 2020
Pavel Hardak
 
PDF
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Eren Avşaroğulları
 
PDF
From Traction to Production Maturing your LLMOps step by step
Maxim Salnikov
 
PPTX
Lessons learned from embedding Cassandra in xPatterns
Claudiu Barbura
 
PPTX
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Lucas Jellema
 
PDF
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022
HostedbyConfluent
 
PDF
6 GigaSpaces Principles to Survive Black Friday
Ali Hodroj
 
PDF
Ultra-scale e-Commerce Transaction Services with Lean Middleware
WSO2
 
PDF
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Value Association
 
PPTX
Oracle Forms Modernization Roadmap
Kai-Uwe Möller
 
PPTX
Oracle Sistemas Convergentes
Fran Navarro
 
PPTX
SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!
Richard Robinson
 
PDF
Top Down Network Design - ebrahma.com
Pawan Sharma
 
PDF
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
GlobalLogic Ukraine
 
PPTX
Open, Secure & Transparent AI Pipelines
Nick Pentreath
 
Azure architecture design patterns - proven solutions to common challenges
Ivo Andreev
 
Integration strategies best practices- Mulesoft meetup April 2018
Rohan Rasane
 
Introduction to the Typesafe Reactive Platform
BoldRadius Solutions
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
Modern DevOps across Technologies on premises and clouds with Oracle Manageme...
Lucas Jellema
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Pavel Hardak
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Eren Avşaroğulları
 
From Traction to Production Maturing your LLMOps step by step
Maxim Salnikov
 
Lessons learned from embedding Cassandra in xPatterns
Claudiu Barbura
 
Oracle Management Cloud - introduction, overview and getting started (AMIS, 2...
Lucas Jellema
 
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022
HostedbyConfluent
 
6 GigaSpaces Principles to Survive Black Friday
Ali Hodroj
 
Ultra-scale e-Commerce Transaction Services with Lean Middleware
WSO2
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Value Association
 
Oracle Forms Modernization Roadmap
Kai-Uwe Möller
 
Oracle Sistemas Convergentes
Fran Navarro
 
SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!
Richard Robinson
 
Top Down Network Design - ebrahma.com
Pawan Sharma
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
GlobalLogic Ukraine
 
Open, Secure & Transparent AI Pipelines
Nick Pentreath
 
Ad

More from confluent (20)

PDF
Stream Processing Handson Workshop - Flink SQL Hands-on Workshop (Korean)
confluent
 
PPTX
Webinar Think Right - Shift Left - 19-03-2025.pptx
confluent
 
PDF
Migration, backup and restore made easy using Kannika
confluent
 
PDF
Five Things You Need to Know About Data Streaming in 2025
confluent
 
PDF
Data in Motion Tour Seoul 2024 - Keynote
confluent
 
PDF
Data in Motion Tour Seoul 2024 - Roadmap Demo
confluent
 
PDF
From Stream to Screen: Real-Time Data Streaming to Web Frontends with Conflue...
confluent
 
PDF
Confluent per il settore FSI: Accelerare l'Innovazione con il Data Streaming...
confluent
 
PDF
Data in Motion Tour 2024 Riyadh, Saudi Arabia
confluent
 
PDF
Build a Real-Time Decision Support Application for Financial Market Traders w...
confluent
 
PDF
Strumenti e Strategie di Stream Governance con Confluent Platform
confluent
 
PDF
Compose Gen-AI Apps With Real-Time Data - In Minutes, Not Weeks
confluent
 
PDF
Building Real-Time Gen AI Applications with SingleStore and Confluent
confluent
 
PDF
Unlocking value with event-driven architecture by Confluent
confluent
 
PDF
Il Data Streaming per un’AI real-time di nuova generazione
confluent
 
PDF
Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...
confluent
 
PDF
Break data silos with real-time connectivity using Confluent Cloud Connectors
confluent
 
PDF
Building API data products on top of your real-time data infrastructure
confluent
 
PDF
Speed Wins: From Kafka to APIs in Minutes
confluent
 
PDF
Evolving Data Governance for the Real-time Streaming and AI Era
confluent
 
Stream Processing Handson Workshop - Flink SQL Hands-on Workshop (Korean)
confluent
 
Webinar Think Right - Shift Left - 19-03-2025.pptx
confluent
 
Migration, backup and restore made easy using Kannika
confluent
 
Five Things You Need to Know About Data Streaming in 2025
confluent
 
Data in Motion Tour Seoul 2024 - Keynote
confluent
 
Data in Motion Tour Seoul 2024 - Roadmap Demo
confluent
 
From Stream to Screen: Real-Time Data Streaming to Web Frontends with Conflue...
confluent
 
Confluent per il settore FSI: Accelerare l'Innovazione con il Data Streaming...
confluent
 
Data in Motion Tour 2024 Riyadh, Saudi Arabia
confluent
 
Build a Real-Time Decision Support Application for Financial Market Traders w...
confluent
 
Strumenti e Strategie di Stream Governance con Confluent Platform
confluent
 
Compose Gen-AI Apps With Real-Time Data - In Minutes, Not Weeks
confluent
 
Building Real-Time Gen AI Applications with SingleStore and Confluent
confluent
 
Unlocking value with event-driven architecture by Confluent
confluent
 
Il Data Streaming per un’AI real-time di nuova generazione
confluent
 
Unleashing the Future: Building a Scalable and Up-to-Date GenAI Chatbot with ...
confluent
 
Break data silos with real-time connectivity using Confluent Cloud Connectors
confluent
 
Building API data products on top of your real-time data infrastructure
confluent
 
Speed Wins: From Kafka to APIs in Minutes
confluent
 
Evolving Data Governance for the Real-time Streaming and AI Era
confluent
 

Recently uploaded (20)

PDF
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit
 
PDF
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
PDF
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PDF
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
PDF
ICONIQ State of AI Report 2025 - The Builder's Playbook
Razin Mustafiz
 
PPTX
MuleSoft MCP Support (Model Context Protocol) and Use Case Demo
shyamraj55
 
PDF
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
PDF
LOOPS in C Programming Language - Technology
RishabhDwivedi43
 
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
PPTX
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
PDF
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 
PDF
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
PPTX
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit
 
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
ICONIQ State of AI Report 2025 - The Builder's Playbook
Razin Mustafiz
 
MuleSoft MCP Support (Model Context Protocol) and Use Case Demo
shyamraj55
 
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
LOOPS in C Programming Language - Technology
RishabhDwivedi43
 
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 

A Practical Guide to Selecting a Stream Processing Technology

  • 1. A Practical Guide to Selecting a Stream Processing Technology Michael  G.  Noll Product  Manager,  Confluent
  • 2. Kafka Talk Series Date Title Sep 27 Introduction  To  Streaming  Data  and  Stream  Processing  with  Apache  Kafka Oct  06 Deep  Dive  into  Apache  Kafka Oct  27 Data  Integration  with  Apache  Kafka Nov  17 Demystifying  Stream  Processing  with  Apache  Kafka Dec  01 A  Practical  Guide  to  Selecting  a  Stream  Processing  Technology Dec  15 Streaming  in  Practice:  Putting  Apache  Kafka  in  Production https://www.confluent.io/apache-­‐kafka-­‐talk-­‐series
  • 3. Agenda
      • Recap: What is Stream Processing?
      • The Three Pillars of Stream Processing in Practice
      • Key Selection Criteria
          • Organizational/Non-Technical Dimensions
          • Technical Dimensions
      • Summary
  • 14. Powered by Kafka (thousands more)
  • 20. Spark Streaming API (2.0)
  • 21. Kafka’s Streams API (0.10)
  • 37. Example: Streams and Tables in Kafka
      Word   Count
      hello  2
      kafka  1
      world  1
      …      …
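The word-count table on this slide can be read as the latest state of a stream of updates. The following is a minimal sketch of that stream/table relationship in plain Python, not Kafka's API: the input words and the function names `word_count_changelog` and `materialize` are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def word_count_changelog(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Consume a stream of words and emit one table update (word, new_count) per record."""
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def materialize(changelog: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Replay a changelog stream into a table that keeps only the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        table[key] = value
    return table


# Hypothetical input stream, chosen so the result matches the table on the slide.
stream: List[str] = ["hello", "kafka", "world", "hello"]
changelog = list(word_count_changelog(stream))
table = materialize(changelog)

for word, count in table.items():
    print(word, count)
```

Reading the changelog record by record gives the stream view; replaying it into a dictionary gives the table view of the same data.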
  • 42. Streams & Databases
      • A stream processing technology must have first-class support for Streams and Tables
          • With scalability, fault tolerance, …
      • Why? Because most use cases require not just one, but both!
      • Support – or lack thereof – strongly impacts the resulting technical architecture and development efforts
      • No support means:
          • Painful Do-It-Yourself
          • Increased complexity, more moving pieces to juggle
  • 45. Organizational/Non-Tech Dimensions
      • Can your org understand and leverage the technology?
          • Familiarity with languages; intuitive concepts and APIs; training
      • Are you permitted to use it in your organization?
          • Security features, licensing, open source vs. proprietary
      • Can you continue to use it in the future?
          • Longevity of technology, licensing, vendor strength
  • 46. Organizational/Non-Tech Dimensions
      • Do you believe in the long-term vision?
          • Switching technologies in an organization is often expensive/slow: legacy migration, re-training, resistance to change, etc.
      • What is the path and time to success?
          • Can you move smoothly and quickly from proof-of-concept to production?
      • Areas and range of applicability in your organization
          • General-purpose vs. niche technology
          • Viable for S/M/L/XL use cases vs. for XL use cases only
          • Building core business apps vs. doing backend analytics
  • 47. Organizational/Non-Tech Dimensions: Licensing; Vision/Roadmap; ROI; Impact on Organization; Broad vs. Niche Applicability; Time to Market; Professional Services; Documentation; Examples; User Community; Learning Curve; Impact on Tools, Infrastructure, …
  • 49. Technical Dimensions: State; Abstractions; Time Model; Windowing; Out of Order Data; Reprocessing; Scalability & Elasticity; Fault Tolerance; Security; Processing Model; API; Dev/Ops Lifecycle
  • 55. State
      • Stateful processing of any kind requires…state
      • Many (most?) use cases for stream processing are stateful
          • Joins, aggregations, windowing, counting, ...
      • Is state performant? Local vs. remote state?
      • Is state fault-tolerant? How fast is recovery/failover?
      • Is state interactively queryable?
          • Kafka: ready for use (GA)
          • Spark, Flink: under development (alpha)
          • Storm, Samza, and others: not available
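To make the three questions on this slide concrete (local, fault-tolerant, queryable), here is a minimal sketch of one common design: an in-process store whose writes are also appended to a changelog, so that a replacement instance can rebuild the state by replaying the log. The class `LocalStateStore` and its methods are hypothetical and do not describe any specific product's implementation.

```python
from typing import Any, List, Optional, Tuple


class LocalStateStore:
    """In-process key-value state, backed by an append-only changelog for recovery."""

    def __init__(self, changelog: Optional[List[Tuple[str, Any]]] = None) -> None:
        self._data = {}
        # In a real system the changelog would live in durable, replicated storage.
        self.changelog = changelog if changelog is not None else []

    def put(self, key: str, value: Any) -> None:
        self.changelog.append((key, value))  # record the change before applying it
        self._data[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        """Local lookup: this is what an interactive query would call."""
        return self._data.get(key, default)

    @classmethod
    def restore(cls, changelog: List[Tuple[str, Any]]) -> "LocalStateStore":
        """Rebuild the state on a fresh instance by replaying the changelog."""
        store = cls(changelog=list(changelog))
        for key, value in store.changelog:
            store._data[key] = value
        return store


store = LocalStateStore()
store.put("alice", 1)
store.put("bob", 1)
store.put("alice", 3)

# Simulate a crash: only the changelog survives, then a new instance recovers from it.
durable_log = list(store.changelog)
recovered = LocalStateStore.restore(durable_log)
print(recovered.get("alice"))
```

Recovery time in this design grows with the length of the changelog, which is one reason the slide asks how fast recovery and failover are.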
  • 58. Abstractions
      • What are the data model and the available abstractions?
      • Most common abstraction: stream of records, events
          • Kafka, Spark, Storm, Samza, Flink, Apex, ...
      • New, very powerful: table of records
          • Currently unique to Kafka
          • Represents latest state and materialized views
          • State must have a first-class abstraction because, as we just saw in the previous section, state is crucial for stream processing!
  • 60. Time model
      • Different use cases require different time semantics
          • Great majority of use cases require event-time semantics
          • Other use cases may require processing-time (e.g. real-time monitoring) or special variants like ingestion-time
      • A stream processing technology should, at a minimum, support event-time to cover most use cases in practice
          • Examples: Kafka, Beam, Flink
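The difference between event-time and processing-time shows up as soon as records are delayed. The sketch below counts the same three records per minute under both semantics; the `Record` type, the field names, and the timestamps are assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    name: str
    event_time: int       # seconds: when the event actually happened
    processing_time: int  # seconds: when the processor received it


def minute_bucket(timestamp: int) -> int:
    """Start of the one-minute bucket that contains the timestamp."""
    return timestamp - timestamp % 60


def count_per_minute(records: List[Record], time_attr: str) -> Dict[int, int]:
    """Count records per minute, using either 'event_time' or 'processing_time'."""
    return dict(Counter(minute_bucket(getattr(r, time_attr)) for r in records))


# Hypothetical records: "click" happened in the first minute but arrived 80 seconds later.
records = [
    Record("login", event_time=10, processing_time=12),
    Record("click", event_time=50, processing_time=130),
    Record("logout", event_time=70, processing_time=75),
]

by_event_time = count_per_minute(records, "event_time")
by_processing_time = count_per_minute(records, "processing_time")
print(by_event_time)
print(by_processing_time)
```

Under event-time the delayed click is still attributed to the minute in which it happened; under processing-time it lands in whichever minute it happened to arrive.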
  • 64. Windowing [Figure: input data, where colors represent different users' events (alice, bob, dave); rectangles denote different event-time windows; axes: processing-time and event-time]
  • 65. Windowing
      • Windowing is an operation that groups events
      • Most commonly needed: time windows, session windows
      • Examples:
          • Real-time monitoring: 5-minute averages
          • Reader behavior on a website: user browsing sessions
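Both window types named on this slide can be sketched in a few lines. The example below computes 5-minute tumbling-window averages and groups timestamps into sessions separated by an inactivity gap. The function names, the 30-minute gap, and the sample data are assumptions, and the sketch works on a finite list rather than an unbounded stream.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tumbling_window_averages(
    events: Iterable[Tuple[int, float]], size_s: int = 300
) -> Dict[int, float]:
    """Average the values of (event_time_s, value) pairs per fixed-size window."""
    sums = defaultdict(lambda: [0.0, 0])  # window_start -> [running total, count]
    for event_time, value in events:
        window_start = event_time - event_time % size_s
        sums[window_start][0] += value
        sums[window_start][1] += 1
    return {start: total / n for start, (total, n) in sorted(sums.items())}


def session_windows(timestamps: Iterable[int], gap_s: int = 1800) -> List[Tuple[int, int]]:
    """Group timestamps into sessions; a new session starts after gap_s of inactivity."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_s:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return [(s[0], s[-1]) for s in sessions]


# Real-time monitoring: 5-minute averages of a metric.
averages = tumbling_window_averages([(0, 10.0), (120, 20.0), (300, 40.0)])

# Reader behavior: page views of one user, in seconds since some start time.
sessions = session_windows([0, 600, 5000, 5100])

print(averages)
print(sessions)
```

Time windows have boundaries fixed by the clock, while session boundaries depend on the data itself, which is why session windows are harder to support efficiently.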
  • 69. Out-of-order and late-arriving data
      • Users with mobile phones enter airplane, lose Internet connectivity
      • Emails are being written during the 10h flight
      • Internet connectivity is restored, phones will send queued emails now
  • 70. Out-of-order and late-arriving data
      • Is very common in practice, not a rare corner case
          • Related to time model discussion
      • We want control over how out-of-order data is handled
      • Example:
          • We process data in 5-minute windows, e.g. compute statistics
          • When event arrives 1 minute late: update the original result!
          • When event arrives 2 hours late: discard it!
      • Handling must be efficient because it happens so often
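The update-or-discard policy in the example above can be sketched with a grace period: a window keeps accepting late events until the highest event time seen so far has moved past the window end plus the grace period. The class `WindowedCounter`, the 10-minute grace period, and the sample timestamps are assumptions for illustration.

```python
from typing import Dict


class WindowedCounter:
    """Counts events per event-time window and tolerates lateness up to a grace period."""

    def __init__(self, window_s: int = 300, grace_s: int = 600) -> None:
        self.window_s = window_s
        self.grace_s = grace_s
        self.counts: Dict[int, int] = {}
        self.stream_time = 0  # highest event time observed so far
        self.dropped = 0

    def process(self, event_time: int) -> bool:
        """Return True if the event updated a window, False if it was discarded."""
        self.stream_time = max(self.stream_time, event_time)
        window_start = event_time - event_time % self.window_s
        window_end = window_start + self.window_s
        if self.stream_time >= window_end + self.grace_s:
            self.dropped += 1  # too late: the window is closed for good
            return False
        self.counts[window_start] = self.counts.get(window_start, 0) + 1
        return True


counter = WindowedCounter(window_s=300, grace_s=600)
# Event times in seconds, in arrival order: 250 arrives 1 minute after 310 was seen,
# 290 arrives roughly 2 hours after its window ended.
results = [counter.process(t) for t in [100, 310, 250, 7500, 290]]

print(counter.counts)
print(counter.dropped)
```

The event at 250 updates the already-reported window starting at 0, while the event at 290 is dropped because stream time has advanced far beyond that window's grace period.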
  • 72. Reprocessing
      • Re-process data by rewinding a stream back in time
      • Use cases in practice include
          • Correcting output data after fixing a bug
          • Facilitating iterative and explorative development
          • A/B testing
          • Processing historical data
          • Walking through "What If?" scenarios
      • Also: often used behind-the-scenes for fault tolerance
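Rewinding is possible when the input is kept as a replayable log and each reader tracks its own position. The sketch below shows the "correct output after fixing a bug" use case; `LogConsumer`, its methods, and the two conversion functions are hypothetical names, not a real client API.

```python
from typing import List


class LogConsumer:
    """Reads an append-only log from a position that can be moved backwards."""

    def __init__(self, log: List[int]) -> None:
        self._log = log
        self.position = 0

    def poll(self) -> List[int]:
        """Return all records from the current position and advance to the end."""
        records = self._log[self.position:]
        self.position = len(self._log)
        return records

    def seek(self, offset: int) -> None:
        """Move the read position, e.g. back to 0 to reprocess everything."""
        if not 0 <= offset <= len(self._log):
            raise ValueError(f"offset {offset} is outside the log")
        self.position = offset


def cents_to_dollars_buggy(cents: int) -> float:
    return cents / 10  # bug: wrong divisor


def cents_to_dollars_fixed(cents: int) -> float:
    return cents / 100


log = [100, 250]  # hypothetical input records: amounts in cents
consumer = LogConsumer(log)

first_run = [cents_to_dollars_buggy(r) for r in consumer.poll()]

# After deploying the fix, rewind to the beginning and recompute the output.
consumer.seek(0)
corrected = [cents_to_dollars_fixed(r) for r in consumer.poll()]

print(first_run)
print(corrected)
```

The input log is never modified; only the reader's position and the processing logic change between the two runs.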
  • 75. Scalability, Elasticity, Fault Tolerance
      • Can the technology scale according to your needs?
          • Desired latency, throughput?
          • Able to process millions of messages per second?
          • What is the minimum footprint?
      • Expand/shrink capacity dynamically during operations?
          • Helps with resource utilization because most stream apps run continuously
      • Resilience and fault tolerance
          • Which guarantees for data delivery and for state? "At-least-once", "exactly-once", "effectively-once", etc.
          • Failover behavior and recovery time? Automated or manual?
          • Any negative impact of fault tolerance features on performance?
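The delivery guarantees quoted on this slide differ in how redelivered records affect results. As a minimal sketch, assume at-least-once delivery where a retry can deliver the same record twice, and assume each record carries a unique id (both are assumptions for illustration). Deduplicating by id makes the result "effectively-once" even though delivery is not.

```python
from typing import Iterable, List, Set, Tuple


def naive_sum(deliveries: Iterable[Tuple[int, int]]) -> int:
    """Sum amounts of (record_id, amount) pairs, counting redelivered records again."""
    return sum(amount for _, amount in deliveries)


def effectively_once_sum(deliveries: Iterable[Tuple[int, int]]) -> int:
    """Sum amounts, but apply each record id at most once."""
    seen: Set[int] = set()
    total = 0
    for record_id, amount in deliveries:
        if record_id in seen:
            continue  # duplicate caused by a retry: skip it
        seen.add(record_id)
        total += amount
    return total


# Hypothetical deliveries: record 2 was redelivered after a failure.
deliveries: List[Tuple[int, int]] = [(1, 10), (2, 5), (2, 5), (3, 1)]

naive_total = naive_sum(deliveries)
deduplicated_total = effectively_once_sum(deliveries)
print(naive_total, deduplicated_total)
```

The set of seen ids is itself state that must survive failures, which ties the delivery guarantee back to the question of fault-tolerant state.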
  • 80. Security
      • To meet internal security policies, legal compliance, etc.
      • Typical base requirements for stream processing applications:
          • Encrypt data-in-transit (e.g. from/to Kafka)
          • Authentication: "only some applications may talk to production"
          • Authorization: "access to sensitive data such as PII is restricted"
      • The easier it is to use security features, the more likely they are actually being used in practice
  • 82. Processing Model
      • True stream processing is record-at-a-time processing
          • Benefits include low latency (milliseconds), dealing efficiently with out-of-order data
          • Can provide both low latency and high throughput via internal optimizations
          • Examples: Kafka, Storm, Samza, Flink, Beam
      • Some processing technologies opt for (micro)batching
          • Micro-batching has no true benefits: consider it a technical workaround to shoehorn stream-like functionality into a tool
          • Suffers from significant overhead when dealing with e.g. out-of-order/late-arriving data, when performing windowed analyses (e.g. session windows)
          • Typically a strong blocker for use cases such as fraud detection or anything where "a few seconds" of latency is prohibitive
          • Examples: Spark, Storm (Trident), Hadoop*
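The latency argument can be illustrated with a deliberately simplified model. Assume a micro-batch engine only emits results at fixed batch boundaries, and a record-at-a-time engine emits each result after a constant per-record cost; queueing, scheduling, and network effects are ignored. The function names, the 2-second interval, and the 1 ms cost are all assumptions, not measurements of any system.

```python
from typing import List


def record_at_a_time_latencies_ms(arrivals_ms: List[int], cost_ms: int = 1) -> List[int]:
    """Each record is processed on arrival, so latency is just the per-record cost."""
    return [cost_ms for _ in arrivals_ms]


def micro_batch_latencies_ms(arrivals_ms: List[int], batch_interval_ms: int) -> List[int]:
    """Each record waits until the end of the batch interval it arrived in."""
    return [
        (arrival // batch_interval_ms + 1) * batch_interval_ms - arrival
        for arrival in arrivals_ms
    ]


arrivals_ms = [100, 500, 1900, 2000]  # hypothetical arrival times in milliseconds

streaming = record_at_a_time_latencies_ms(arrivals_ms)
batched = micro_batch_latencies_ms(arrivals_ms, batch_interval_ms=2000)

print(streaming)
print(batched)
```

In this model the waiting time of a micro-batched record depends on where in the interval it arrives and can approach the full batch interval, which is the "a few seconds" the slide refers to.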
  • 84. API
      • Choice of API is a subjective matter – skills, preference, …
      • Typical options
          • Declarative, expressive API: operations like map(), filter()
          • Imperative, lower-level API: callbacks like process(event)
          • Streaming SQL: STREAM SELECT … FROM … WHERE …
      • In the best case you get not just one, but all three
          • "Abstractions are great!"
          • "Abstractions considered harmful!"
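The first two API styles can be contrasted on the same task: keep words longer than one character and uppercase them. The toy `Stream` class and the `UppercaseLongWords` processor below are hypothetical and only mimic the shape of such APIs; they are not taken from any framework.

```python
from typing import Callable, Iterable, List


class Stream:
    """Toy declarative API: chain operations that describe what to compute."""

    def __init__(self, records: Iterable[str]) -> None:
        self._records = records

    def map(self, fn: Callable[[str], str]) -> "Stream":
        return Stream(fn(r) for r in self._records)

    def filter(self, predicate: Callable[[str], bool]) -> "Stream":
        return Stream(r for r in self._records if predicate(r))

    def to_list(self) -> List[str]:
        return list(self._records)


class UppercaseLongWords:
    """Toy imperative API: a callback invoked once per event, with explicit control flow."""

    def __init__(self) -> None:
        self.output: List[str] = []

    def process(self, event: str) -> None:
        if len(event) > 1:
            self.output.append(event.upper())


words = ["a", "bb", "ccc"]

declarative_result = Stream(words).filter(lambda w: len(w) > 1).map(str.upper).to_list()

processor = UppercaseLongWords()
for word in words:
    processor.process(word)
imperative_result = processor.output

print(declarative_result, imperative_result)
```

The declarative version is shorter and leaves execution details to the library, while the callback version gives full control over state and side effects for every event.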
  • 86. Developer/Operations Lifecycle
      • What should your daily work look and feel like?
          • "I like to do quick, iterative development" (modify/test/repeat)
          • "I want to decouple team roadmaps, project schedules"
      • Big difference between App Model <-> Cluster Model
          • Testing, packaging, deployment, monitoring, operations
          • "Do I need to know Java (app) or YARN (cluster) for this?"
          • "I want reactive processing in containers that run on Mesos!"
      • Rolling, no-downtime upgrades?
      • Integration with existing Ops infra, tools, processes?
  • 88. Summary
      • What we covered is a good starting point
      • But, no free lunch!
          • Understand what you need, and weigh criteria appropriately
          • Think end-to-end: idea, development, operations, troubleshooting
          • Think big-picture: future use cases, architecture, security, training, …
      • Do your own internal hackathons, proof-of-concepts
      • Do your own benchmarks
      • If in doubt: simplicity beats complexity
          • Faster to learn, easier to understand, less likely to fail, …
  • 90. Coming Up Next
      Date    Title                                                      Speaker
      Dec 15  Streaming in Practice: Putting Apache Kafka in Production  Roger Hoover
      https://www.confluent.io/apache-kafka-talk-series