BruJUG - Introduction to data streaming

@nicolas_frankel
Introduction to Data
Streaming

@nicolas_frankel
• Former developer, team lead,
architect, blah-blah
• Developer Advocate
• Curious about Kubernetes
Me, myself and I

@nicolas_frankel
Hazelcast
HAZELCAST IMDG is an operational,
in-memory, distributed computing
platform that manages data using
in-memory storage, and performs
parallel execution for breakthrough
application speed and scale.
HAZELCAST JET is the ultra fast,
application embeddable, 3rd
generation stream processing
engine for low latency batch
and stream processing.

@nicolas_frankel
• Why streaming?
• Streaming approaches
• Hazelcast Jet
• Open Data
• General Transit Feed Specification
• The demo!
• Q&A
Schedule

@nicolas_frankel
Data was neatly stored in SQL
databases
In a time before our time…

@nicolas_frankel
• Analytics
• Supermarket sales in the last hour?
• Reporting
• Banking account annual closing
The need for Extract Transform Load

@nicolas_frankel
• Constraints
• Joins
• Normal forms
What SQL really means

@nicolas_frankel
• Normalized vs. denormalized
• Correct vs. fast
Writes vs. reads

@nicolas_frankel
• Different actors
• With different needs
• Using the same database?
The need for ETL

@nicolas_frankel
The batch model
1. Extract
2. Transform
3. Load
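
The three steps above can be sketched as a tiny batch job. This is only an illustration: the class name, the "product,amount" line layout and the in-memory source and target are all assumptions, not something shown in the talk.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical batch job: every name and the CSV layout are illustrative assumptions.
class BatchEtlSketch {

    // 1. Extract: read raw "product,amount" lines from the source (here, an in-memory list).
    static List<String> extract() {
        return List.of("apples,3", "pears,2", "apples,4");
    }

    // 2. Transform: parse each line and sum the amounts per product.
    static Map<String, Long> transform(List<String> lines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            totals.merge(fields[0], Long.parseLong(fields[1]), Long::sum);
        }
        return totals;
    }

    // 3. Load: write the result into the target store (here, another map).
    static void load(Map<String, Long> totals, Map<String, Long> reportingStore) {
        reportingStore.putAll(totals);
    }

    public static void main(String[] args) {
        Map<String, Long> reportingStore = new TreeMap<>();
        load(transform(extract()), reportingStore);
        System.out.println(reportingStore);
    }
}
```

Note that the whole dataset is read before anything is loaded, which is exactly the property that becomes a problem when the data grows.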

@nicolas_frankel
Batches are everywhere!

@nicolas_frankel
• Scheduled at regular intervals
• Daily
• Weekly
• Monthly
• Yearly
• Run in a specific amount of time
Properties of batches

@nicolas_frankel
• When the execution time overlaps
the next execution schedule
• When the space taken by the data
exceeds the storage capacity
• When the batch fails mid-execution
• etc.
Oops

@nicolas_frankel
• Parallelize everything
• Map - Reduce
• Hadoop
• NoSQL
• Schema on Read vs. Schema on Write
Big data!

@nicolas_frankel
• Keep a cursor
• And only manage “chunks” of data
• What about new data coming in?
Or chunking?

@nicolas_frankel
Event-Driven Programming
“Programming paradigm in which the flow of the
program is determined by events such as user
actions (mouse clicks, key presses), sensor outputs, or
messages from other programs or threads”
-- Wikipedia

@nicolas_frankel
Event Sourcing
“Event sourcing persists the state of a business entity
such as an Order or a Customer as a sequence of
state-changing events. Whenever the state of a business
entity changes, a new event is appended to the list of
events. Since saving an event is a single operation, it is
inherently atomic. The application reconstructs an
entity’s current state by replaying the events.”
-- https://microservices.io/patterns/data/event-sourcing.html
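
A minimal sketch of the idea in the quote, assuming a hypothetical bank account whose only events are signed amounts: state is never stored directly, it is recomputed by replaying the append-only list. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event-sourced account: the balance is derived, never stored.
class AccountEventLog {

    // Append-only list of state-changing events (positive = deposit, negative = withdrawal).
    private final List<Long> events = new ArrayList<>();

    // Saving an event is a single append.
    void append(long delta) {
        events.add(delta);
    }

    // The current state is reconstructed by replaying every event in order.
    long currentBalance() {
        long balance = 0;
        for (long delta : events) {
            balance += delta;
        }
        return balance;
    }
}
```

Calling `append(100)`, `append(-30)` and then `currentBalance()` replays both events to obtain the balance.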

@nicolas_frankel
• Ordered append-only log
• e.g. MySQL binlog
Database internals

@nicolas_frankel
Make everything event-based!

@nicolas_frankel
• Memory-friendly
• Easily processed
• Pull vs. push
• Very close to real-time
• Keeps derived data in-sync
Benefits

@nicolas_frankel
From finite datasets to infinite

@nicolas_frankel
Streaming is smart ETL
Processing
• Ingest: In-Memory Operational Storage
• Combine: Join, Enrich, Group, Aggregate
• Stream: Windowing, Event-Time Processing
• Compute: Distributed and Parallel Computation
• Transform: Filter, Clean, Convert
• Publish: In-Memory, Subscriber Notifications
Example: notify if response time is 10% over the 24-hour average, second by second
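
The response-time example can be sketched with a sliding window in plain Java. This is an assumption-laden illustration, not how any particular engine implements windowing: the window is counted in samples (86,400 would cover 24 hours at one sample per second), and the class name is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window alert: flags samples more than 10% above the window average.
class ResponseTimeAlertSketch {

    private final int windowSize;                      // number of samples kept
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0;                            // running sum of the samples in the window

    ResponseTimeAlertSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    // Compares the new sample with the average of the samples already in the window,
    // then adds it and evicts the oldest sample if the window is full.
    boolean record(double responseTimeMs) {
        boolean alert = !window.isEmpty()
                && responseTimeMs > 1.10 * (sum / window.size());
        window.addLast(responseTimeMs);
        sum += responseTimeMs;
        if (window.size() > windowSize) {
            sum -= window.removeFirst();
        }
        return alert;
    }
}
```

Each sample is handled as it arrives and only the window is kept in memory, in contrast to a batch that re-reads a full day of data.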

@nicolas_frankel
• Real-time dashboards
• Decision making
• Recommendations
• Stats (gaming, infrastructure
monitoring)
• Prediction - often based on
algorithmic prediction
• Push stream through ML model
• Complex Event Processing
Use Case: Analytics and Decision Making

@nicolas_frankel
• Kafka
• Pulsar
Persistent event-storage systems

@nicolas_frankel
• Distributed
• On-disk storage
• Messages sent and read from a
topic
• Publish-subscribe
• Queue
• Consumer can keep track of the
offset
Kafka
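
To make the topic and offset bullets concrete, here is a toy in-memory log in plain Java. It deliberately does not use the Kafka client API, it is neither distributed nor persistent, and every name in it is hypothetical; it only shows why independent offsets let several consumers read the same messages.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy single-topic log: NOT the Kafka API, just the offset idea.
class TopicLogSketch {

    private final List<String> log = new ArrayList<>();             // ordered, append-only
    private final Map<String, Integer> offsets = new HashMap<>();   // next offset per consumer

    void publish(String message) {
        log.add(message);
    }

    // Returns the messages this consumer has not read yet and advances its offset.
    List<String> poll(String consumerId) {
        int offset = offsets.getOrDefault(consumerId, 0);
        List<String> unread = new ArrayList<>(log.subList(offset, log.size()));
        offsets.put(consumerId, log.size());
        return unread;
    }
}
```

Because messages stay in the log after being read, a consumer that joins late still receives everything from offset 0.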

@nicolas_frankel
• Apache Flink
• Amazon Kinesis
• IBM Streams
• Hazelcast Jet
• Apache Beam
• Abstraction over some of the above
• …
In-memory stream processing engines

@nicolas_frankel
• Apache 2 Open Source
• Single JAR
• Leverages Hazelcast IMDG
• Unified batch/streaming API
• (Hazelcast Jet Enterprise)
Hazelcast Jet

@nicolas_frankel
Hazelcast Jet

@nicolas_frankel
• Declaration (code) that defines and
links sources, transforms, and
sinks
• Platform-specific SDK (Pipeline API
in Jet)
• Client submits pipeline to the
Stream Processing Engine (SPE)
Concept: Pipeline

@nicolas_frankel
• Running instance of pipeline in SPE
• SPE executes the pipeline
• Code execution
• Data routing
• Flow control
• Parallel and distributed execution
Concept: Job

@nicolas_frankel
Imperative model
final String text = "...";
final Map<String, Long> counts = new HashMap<>();
// Split on runs of non-word characters
for (String word : text.split("\\W+")) {
    Long count = counts.get(word);
    counts.put(word, count == null ? 1L : count + 1);
}

@nicolas_frankel
Declarative model
Map<String, Long> counts = lines.stream()
    .map(String::toLowerCase)
    .flatMap(line -> Arrays.stream(line.split("\\W+")))
    .filter(word -> !word.isEmpty())
    .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

@nicolas_frankel
• Multiple nodes
• Scalable storage and performance
• Elasticity
• Data stored, partitioned and
replicated
• No single point of failure
What Distributed Means to Hazelcast

@nicolas_frankel
Distributed Parallel Processing
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Long, String>map(BOOK_LINES))
 .flatMap(line -> traverseArray(line.getValue().split("\\W+")))
 .filter(word -> !word.isEmpty())
 .groupingKey(wholeItem())
 .aggregate(counting())
 .drainTo(Sinks.map(COUNTS));
[Diagram: Data Source → from → map → filter → aggr → to → Data Sink]
Translate declarative code to a Directed Acyclic Graph

@nicolas_frankel
Distributed Parallel Processing
[Diagram: Node 1 and Node 2 each run several parallel instances of the read, map + filter, acc, cmb and sink stages between the Data Source and the Data Sink]

@nicolas_frankel
« Open data is the idea that some
data should be freely available to
everyone to use and republish as
they wish, without restrictions from
copyright, patents or other
mechanisms of control. »
-- https://en.wikipedia.org/wiki/Open_data
Open Data

@nicolas_frankel
• France:
• https://www.data.gouv.fr/fr/
• Switzerland:
• https://opendata.swiss/en/
• European Union:
• https://data.europa.eu/euodp/en/data/
Some Open Data initiatives

@nicolas_frankel
1. Access
2. Format
3. Standard
4. Data correctness
Challenges

@nicolas_frankel
• Download a file
• Access it interactively through a
web-service
Access

@nicolas_frankel
In general, Open Data means Open
Format
• PDF
• CSV
• XML
• JSON
• etc.
Format

@nicolas_frankel
• Let’s pretend the format is XML
• Which grammar is used?
• A shared standard is required
• Congruent to a domain
Standard

@nicolas_frankel
Data correctness
"32.TA.66-43","16:20:00","16:20:00","8504304"
"32.TA.66-44","24:53:00","24:53:00","8500100"
"32.TA.66-44","25:00:00","25:00:00","8500162"
"32.TA.66-44","25:02:00","25:02:00","8500170"
"32.TA.66-45","23:32:00","23:32:00","8500170"
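
The rows above contain times such as 25:00:00, which GTFS uses for stops served after midnight on the same service day, so `LocalTime.parse` would reject them. A possible way to handle them, sketched under the simplifying assumption that the offset counts from midnight of the service day (time zones and daylight saving are ignored), with hypothetical names:

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical helper: treats a GTFS "HH:MM:SS" value as an offset, not a clock time.
class GtfsTimeSketch {

    // Parses "25:02:00" as 25 hours and 2 minutes after the start of the service day.
    static Duration parseOffset(String gtfsTime) {
        String[] parts = gtfsTime.split(":");
        return Duration.ofHours(Long.parseLong(parts[0]))
                .plusMinutes(Long.parseLong(parts[1]))
                .plusSeconds(Long.parseLong(parts[2]));
    }

    // Adds the offset to midnight of the service day, rolling over to the next date if needed.
    static LocalDateTime toDateTime(LocalDate serviceDay, String gtfsTime) {
        return serviceDay.atStartOfDay().plus(parseOffset(gtfsTime));
    }
}
```

With this approach a stop time of 25:02:00 on a given service day lands at 01:02:00 on the following calendar day.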

@nicolas_frankel
General Transit Feed Specification
“The General Transit Feed Specification (GTFS) […]
defines a common format for public transportation
schedules and associated geographic information.
GTFS feeds let public transit agencies publish their
transit data and developers write applications that
consume that data in an interoperable way.”

@nicolas_frankel
GTFS static model
Filename | Required | Defines
agency.txt | Required | Transit agencies with service represented in this dataset.
stops.txt | Required | Stops where vehicles pick up or drop off riders. Also defines stations and station entrances.
routes.txt | Required | Transit routes. A route is a group of trips that are displayed to riders as a single service.
trips.txt | Required | Trips for each route. A trip is a sequence of two or more stops that occur during a specific time period.
stop_times.txt | Required | Times that a vehicle arrives at and departs from stops for each trip.
calendar.txt | Conditionally required | Service dates specified using a weekly schedule with start and end dates. This file is required unless all dates of service are defined in calendar_dates.txt.
calendar_dates.txt | Conditionally required | Exceptions for the services defined in the calendar.txt. If calendar.txt is omitted, then calendar_dates.txt is required and must contain all dates of service.
fare_attributes.txt | Optional | Fare information for a transit agency's routes.

@nicolas_frankel
GTFS static model
Filename | Required | Defines
fare_rules.txt | Optional | Rules to apply fares for itineraries.
shapes.txt | Optional | Rules for mapping vehicle travel paths, sometimes referred to as route alignments.
frequencies.txt | Optional | Headway (time between trips) for headway-based service or a compressed representation of fixed-schedule service.
transfers.txt | Optional | Rules for making connections at transfer points between routes.
pathways.txt | Optional | Pathways linking together locations within stations.
levels.txt | Optional | Levels within stations.
feed_info.txt | Optional | Dataset metadata, including publisher, version, and expiration information.
translations.txt | Optional | Translated information of a transit agency.
attributions.txt | Optional | Specifies the attributions that are applied to the dataset.

@nicolas_frankel
GTFS dynamic model

@nicolas_frankel
• Open Data
• GTFS static available as
downloadable .txt files
• GTFS dynamic available as a REST
endpoint
Use-case: Swiss Public Transport

@nicolas_frankel
The available data model
Where’s the position?!

@nicolas_frankel
• Source: web service
• Split into trip updates
• Enrich with static trip data
• Enrich with static stop times data
• Transform hours into timestamp
• Enrich with static location data
• Sink: Hazelcast IMDG
The dynamic data pipeline
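
One enrichment step of such a pipeline can be sketched with plain Java streams rather than the Jet API. The method, the "trip id to route name" lookup and the output format are assumptions made for illustration; the demo itself may join the data differently.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical enrichment step: joins dynamic trip ids with static reference data.
class EnrichmentSketch {

    // Keeps only the trips known to the static data and attaches their route name.
    static List<String> enrich(List<String> tripIds, Map<String, String> routeNameByTripId) {
        return tripIds.stream()
                .filter(routeNameByTripId::containsKey)
                .map(id -> id + " -> " + routeNameByTripId.get(id))
                .collect(Collectors.toList());
    }
}
```

Dropping updates with no static counterpart is one possible policy; logging them or routing them to a separate sink would be equally valid choices.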

@nicolas_frankel
Architecture overview

@nicolas_frankel
Recap
• Streaming has a lot of benefits
• Leverage Open Data
• It’s the Wild West out there
• No standards
• Real-world data sucks!
• But you can get cool stuff done

@nicolas_frankel
• https://blog.frankel.ch/
• @nicolas_frankel
• https://jet.hazelcast.org/
• https://bit.ly/opendataswiss
• https://bit.ly/gtransportfs
• https://bit.ly/jet-train
Thanks a lot!
