WINDOWING DATA IN BIG DATA STREAMS
ADAM WARSKI, WOLVESSUMMIT
ADAM WARSKI, SOFTWAREMILL, @ADAMWARSKI
BIG DATA? FAST DATA?
▸ What is big data?
▸ Shift of focus
▸ Processing speed
▸ Fast data -> streaming
WHAT IS STREAMING?
“A TYPE OF DATA PROCESSING ENGINE THAT IS DESIGNED WITH INFINITE DATA SETS IN MIND”
Tyler Akidau, Google
WINDOWING
▸ Time becomes the focus point
▸ How many invalid password errors were there in the last 5 minutes?
▸ During which 30-minute window did we get the most traffic?
▸ What’s the average 5-minute speed on a section of a highway throughout the day?
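The first question above can be sketched as a sliding count over event timestamps. This is a minimal illustration in plain Python, not tied to any of the frameworks covered later; the class name `RecentCounter`, the sample timestamps and the choice of seconds as the time unit are assumptions made for the example.

```python
from collections import deque


class RecentCounter:
    """Counts events whose timestamp falls within the last `span_seconds`."""

    def __init__(self, span_seconds):
        self.span = span_seconds
        self.timestamps = deque()  # timestamps of events still inside the window

    def record(self, ts):
        """Register an event at time `ts` (assumed to arrive in non-decreasing order)."""
        self.timestamps.append(ts)

    def count(self, now):
        """Return how many recorded events lie in the interval (now - span, now]."""
        while self.timestamps and self.timestamps[0] <= now - self.span:
            self.timestamps.popleft()  # too old: evict from the window
        return len(self.timestamps)


counter = RecentCounter(span_seconds=300)  # 5 minutes
for ts in [10, 100, 250, 400]:             # hypothetical invalid-password events
    counter.record(ts)

print(counter.count(now=400))
```

The sketch assumes events arrive in timestamp order; the later slides on out-of-order data explain why real streams rarely allow that assumption.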
HOW TO DO STREAMING? WITH WINDOWS?
▸ Many possibilities:
▸ Spark Streaming
▸ Spark Structured Streaming
▸ Kafka Streams
▸ Flink
▸ Akka Streams
▸ …
WHICH ONE TO CHOOSE?
LET’S FIND OUT
/ME
▸ coder @ SoftwareMill
▸ Lightbend, Confluent, Datastax consulting partner
▸ mainly Scala
▸ open-source: MacWire, ElasticMQ, Quicklens, …
▸ http://www.warski.org / @adamwarski
WHAT’S THE TIME?
▸ How to associate time with an event:
▸ event time: “logical”, data-dependent
▸ ingestion time: when the event entered the system
▸ processing time: when the event is being processed
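To make the three notions of time concrete, here is a small sketch in plain Python. The `Event` class, its field names and the sample values are hypothetical; none of them come from a specific framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    payload: str
    event_time: float       # set by the producer; part of the data itself
    ingestion_time: float   # stamped when the event enters the streaming system
    processing_time: float  # clock reading when an operator handles the event


def timestamp_of(event, characteristic):
    """Pick the timestamp used for windowing under the given notion of time."""
    return {
        "event": event.event_time,
        "ingestion": event.ingestion_time,
        "processing": event.processing_time,
    }[characteristic]


# A reading produced at t=100 that entered the system 30 seconds later.
e = Event("speed=87", event_time=100.0, ingestion_time=130.0, processing_time=131.5)

# With 60-second tumbling windows, the chosen notion of time decides
# which window the event is assigned to.
for c in ("event", "ingestion", "processing"):
    print(c, int(timestamp_of(e, c) // 60) * 60)
```

In this example the event lands in the window starting at 60 under event time, but in the window starting at 120 under ingestion or processing time.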
TYPES OF WINDOWS
▸ Time-based
▸ fixed/tumbling
▸ sliding
▸ Session-based
[Figure: window types shown along a time axis]
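The window types listed on this slide can be sketched as plain Python functions that map timestamps to windows. This is an illustration under simple assumptions (integer timestamps, half-open `[start, end)` windows); the function names are made up for the example.

```python
def tumbling(ts, size):
    """Return the single [start, end) window of length `size` containing `ts`."""
    start = ts - ts % size
    return [(start, start + size)]


def sliding(ts, size, slide):
    """Return every [start, end) window of length `size`, advancing by `slide`,
    that contains `ts`."""
    last_start = ts - ts % slide                    # latest window start not after ts
    starts = range(last_start, ts - size, -slide)   # keep starts with start + size > ts
    return sorted((s, s + size) for s in starts)


def sessions(timestamps, gap):
    """Group timestamps into sessions; a pause longer than `gap` starts a new one."""
    result = []
    for ts in sorted(timestamps):
        if result and ts - result[-1][-1] <= gap:
            result[-1].append(ts)   # close enough to the previous event: same session
        else:
            result.append([ts])     # gap exceeded (or first event): new session
    return result


print(tumbling(7, size=10))
print(sliding(7, size=10, slide=5))
print(sessions([1, 2, 3, 10, 11, 30], gap=5))
```

A tumbling window assigns each event to exactly one window, a sliding window to `size / slide` windows, and a session window has no fixed length at all: its bounds depend on the data.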
OUT-OF-ORDER: WATERMARKS, LATENESS
▸ Windows GC
▸ At some point, enough is enough
▸ Watermark:
▸ all events before X have been observed
▸ heuristics
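One common heuristic is to assume that events arrive at most a fixed delay out of order, so the watermark trails the highest event time seen so far. The sketch below shows that idea in plain Python; the class name `BoundedDelayWatermark` and the sample arrival times are assumptions for illustration, not an API of any framework discussed here.

```python
class BoundedDelayWatermark:
    """Heuristic watermark: assume events arrive at most `max_delay` late."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_seen = float("-inf")  # highest event time observed so far

    def current(self):
        """Claim: all events with an event time before this value have been observed."""
        return self.max_seen - self.max_delay

    def observe(self, event_time):
        """Advance the watermark; return True if the event is on time, False if late."""
        on_time = event_time >= self.current()
        self.max_seen = max(self.max_seen, event_time)
        return on_time


wm = BoundedDelayWatermark(max_delay=10)
arrivals = [100, 105, 98, 120, 109, 111]      # event times in arrival order
flags = [wm.observe(t) for t in arrivals]
print(list(zip(arrivals, flags)))
print("watermark:", wm.current())
```

The event with time 98 is out of order but still on time, while 109 arrives after the watermark has moved to 110 and is flagged as late. Because the watermark is only a heuristic, such late events can always occur; what to do with them is a policy decision.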
TRIGGERS
▸ When to emit window results
▸ Watermark progress
▸ Event time progress
▸ Processing time progress
▸ Punctuations
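As one possible reading of a watermark-progress trigger, the sketch below sums values per tumbling window and emits a window's result once the watermark passes the window's end, discarding the window's state at the same time. It is plain Python under stated assumptions: integer event times, a bounded-delay watermark, and a function name (`windowed_sums`) invented for the example.

```python
from collections import defaultdict


def windowed_sums(events, size, max_delay):
    """Yield (window_start, sum) once the watermark passes the window's end.

    `events` is an iterable of (event_time, value) pairs in arrival order.
    """
    open_windows = defaultdict(int)   # window start -> running sum
    max_seen = float("-inf")
    for event_time, value in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - max_delay
        start = event_time - event_time % size
        if start + size > watermark:          # window still open: accept the event
            open_windows[start] += value      # (otherwise the late event is dropped)
        # Trigger: emit and garbage-collect every window the watermark has passed.
        for s in sorted(w for w in open_windows if w + size <= watermark):
            yield s, open_windows.pop(s)
    # End of the (finite) sample input: flush whatever is still open.
    for s in sorted(open_windows):
        yield s, open_windows[s]


sample = [(1, 1), (4, 1), (11, 1), (3, 1), (16, 1), (2, 1), (25, 1)]
print(list(windowed_sums(sample, size=10, max_delay=5)))
```

In the sample, the event at time 3 arrives out of order but is still counted, whereas the event at time 2 arrives after window `[0, 10)` has been emitted and is dropped.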
ACCUMULATION OF RESULTS
▸ If we trigger many times …
▸ discard
▸ accumulate
▸ retract & accumulate
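The three modes differ in what a downstream consumer sees when the same window fires more than once. A minimal sketch in plain Python, assuming the window computes a sum and fires once per "pane" (the values that arrived since the previous firing); the function name `emit` and the mode strings are hypothetical.

```python
def emit(panes, mode):
    """Produce the outputs for one window that is triggered once per pane."""
    outputs, total = [], 0
    for pane in panes:
        pane_sum = sum(pane)
        if mode == "discard":
            outputs.append(pane_sum)                 # only what is new since last firing
        elif mode == "accumulate":
            outputs.append(total + pane_sum)         # full result so far
        elif mode == "retract_and_accumulate":
            if outputs:
                outputs.append(("retract", total))   # withdraw the previous result
            outputs.append(("result", total + pane_sum))
        else:
            raise ValueError(mode)
        total += pane_sum
    return outputs


panes = [[3, 4], [5], [1, 2]]
for mode in ("discard", "accumulate", "retract_and_accumulate"):
    print(mode, emit(panes, mode))
```

With discarding, the consumer has to add the outputs up itself; with accumulating, each output replaces the previous one; with retractions, a consumer that re-aggregates the results can first subtract the stale value.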
FINALLY … HOW TO MANIPULATE THE DATA
▸ map, flatMap, filter …
▸ stateful computation
▸ fold, reduce
▸ past-dependent operations
▸ where to store the state
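A stateful, past-dependent operation can be sketched as a per-key fold whose accumulator lives in an explicit store. The example is plain Python; `fold_by_key`, the dictionary standing in for the state store and the highway-section sample data are assumptions for illustration.

```python
def fold_by_key(events, zero, step, store=None):
    """Fold `events` ((key, value) pairs) per key, keeping the state in `store`.

    `store` stands in for wherever the state lives: an in-memory dict here,
    but it could equally be a local key-value store or a remote database.
    """
    store = {} if store is None else store
    for key, value in events:
        store[key] = step(store.get(key, zero), value)
        yield key, store[key]   # emit the updated state after every element


# Running (count, sum) of measured speeds per highway section.
readings = [("A", 80), ("B", 100), ("A", 90)]
state = {}
updates = list(fold_by_key(readings, (0, 0), lambda acc, v: (acc[0] + 1, acc[1] + v), state))
print(updates)
print(state)
```

Passing the store in explicitly makes the last bullet visible: the fold logic stays the same while the question of where the state is kept, and how it survives a failure, is answered separately.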
SUMMING UP
▸ Event/ingestion/processing time
▸ Tumbling/sliding/session windows
▸ Watermarks
▸ Triggers
▸ Accumulation of results
▸ State management
SPARK STREAMING
▸ Micro-batches (DStream)
▸ .window() API:
▸ tumbling/sliding windows
▸ only processing time
▸ no watermarks
▸ triggers at the end of the window
▸ state persisted in cluster (e.g. updateStateByKey())
SPARK STREAMING - WHY BOTHER?
▸ Popular
▸ Not only streaming
▸ ML
▸ SQL
▸ GraphX
▸ but …
SPARK STRUCTURED STREAMING
▸ Alpha in Spark 2.0
▸ Micro-batches not exposed
▸ groupBy(window(…))
▸ Event-time support
▸ No watermarks, session windows (2.1?)
▸ Trigger: processing time; outputs changed windows
▸ Exactly-once processing*
FLINK
▸ Mostly with keyed streams (parallelism)
▸ TimeCharacteristic: event/ingestion/processing
▸ TimestampAssigner: also generates watermarks
▸ WindowAssigner: arbitrary, built-in tumbling, sliding, session
▸ Trigger: event/processing time, count, single/continuous
▸ Window function: fold/reduce/with-kv-state
▸ Exactly-once* / at-least-once
KAFKA STREAMS
▸ State: Kafka topics/local key-value backed by a topic for resiliency
▸ Watermarks: no, but windows are retained for 1 day
▸ Time: event/ingestion/processing; TimestampExtractor
▸ Tumbling/sliding windows
▸ Trigger: after every element
▸ aggregate by key&window into an ever-updating KTable
▸ At-least-once
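The model on this slide (aggregate by key and window, trigger after every element, results kept in an ever-updating table) can be imitated in a few lines of plain Python. This is a conceptual sketch only and does not use the Kafka Streams API; the function name and the sample records are hypothetical.

```python
def aggregate_into_table(records, window_size):
    """Count records per (key, window start), emitting an update after every record."""
    table = {}       # (key, window_start) -> count: the "ever-updating" table
    changelog = []   # one update per input record, in arrival order
    for key, ts in records:
        slot = (key, ts - ts % window_size)
        table[slot] = table.get(slot, 0) + 1
        changelog.append((slot, table[slot]))
    return table, changelog


records = [("alice", 1), ("bob", 2), ("alice", 7), ("alice", 12), ("alice", 3)]
table, changelog = aggregate_into_table(records, window_size=10)
print(table)
print(changelog)
```

The last record, with timestamp 3, arrives after a record for the next window, yet it still updates the earlier window's count. That mirrors the slide's point that, without watermarks, old windows stay updatable for as long as they are retained.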
AKKA STREAMS
▸ Single-node, no clustering
▸ No OOTB support, but quite easy to implement:
▸ Windows: arbitrary, assign windows to each element
▸ Trigger: only window-close
▸ State: local
▸ Watermarks: can be implemented
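In the spirit of "assign windows to each element" with a trigger only on window close, a hand-rolled implementation might first turn each element into window commands and let a later stage aggregate them. The sketch below is plain Python rather than Akka Streams code; the command names, the bounded-delay watermark and the function name are assumptions for illustration.

```python
def window_commands(event_times, size, max_delay):
    """Translate event times into open/add/close commands for tumbling windows."""
    open_starts = set()               # local state: windows currently open
    watermark = float("-inf")
    for ts in event_times:
        watermark = max(watermark, ts - max_delay)
        start = ts - ts % size
        if start + size <= watermark:
            continue                  # window already closed: drop the late element
        # Close every open window that the watermark has passed.
        for s in sorted(s for s in open_starts if s + size <= watermark):
            open_starts.remove(s)
            yield ("close", s)
        if start not in open_starts:
            open_starts.add(start)
            yield ("open", start)
        yield ("add", start, ts)


commands = list(window_commands([1, 4, 12, 3, 17, 2, 26], size=10, max_delay=5))
for command in commands:
    print(command)
```

A downstream stage could then group the commands by window start and emit one aggregate per `close` command. The window starting at 20 is never closed in the sample, because no later element advances the watermark past its end.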
SUMMING UP
▸ Spark: widely used, some features missing
▸ Flink: versatile
▸ Kafka: simple model
▸ Akka: single-node
SUMMING UP
▸ Windowing is just one of the aspects
▸ Other:
▸ State management
▸ Work distribution
▸ Processing guarantees
SUMMING UP
▸ Other stream processing systems out there!
▸ Apache Storm
▸ Google Cloud Dataflow
▸ Amazon Kinesis
▸ Apache Beam
▸ …
LINKS
▸ Streaming 101 & 102:
▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
▸ https://softwaremill.com/windowing-data-in-akka-streams/
THANKS!
ADAM WARSKI
@ADAMWARSKI / ADAM.WARSKI@SOFTWAREMILL.COM
