Apache Kafka
The amount of data in the world is growing exponentially. The number of bytes being stored already far exceeds the number of stars in the observable universe.
Resting State - The data can be stored in data warehouses, relational databases, or on distributed file systems. In a real-time scenario, we can’t simply wait for it to pile up somewhere and then run a query or job at some interval of our choosing. We need something that has a very different world view of data: a technology that gives us access to data in its flowing state, and which allows us to work with these continuous and unbounded data streams quickly and efficiently. This is where Apache Kafka comes in.
Apache Kafka is a streaming platform for ingesting, storing, accessing, and processing streams of
data.
Apache Kafka
Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end
with a single battle-tested solution:
● To publish (write) and subscribe to (read) streams of events, including continuous import/export of your
data from other systems.
● To store streams of events durably and reliably for as long as you want.
● To process streams of events as they occur or retrospectively.
And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner.
Kafka can be deployed on bare-metal hardware, virtual machines, and containers, and on-premises as well as
in the cloud.
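As a small illustration of the publish side (not from the slides; the broker address and the "payments" topic are hypothetical), a Java producer writes an event to a topic like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish (write) one event to the "payments" topic; the key groups related events.
            producer.send(new ProducerRecord<>("payments", "bob", "{\"amount\": 42}"));
        }
    }
}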
Communication Model - Point to Point
● Systems interact with each other
directly
● Complex web of communication
● Tightly coupled systems
● Difficult to scale - Add or remove
new systems
Communication Model - Pub/Sub
Publish/subscribe messaging, or pub/sub messaging, is a form of asynchronous service-to-service communication used in serverless and microservices architectures.
Any message published to a topic is immediately received by all of the subscribers to the topic.
Pub/sub messaging can be used to enable event-driven architectures, or to decouple
applications in order to increase performance, reliability and scalability.
The Publish Subscribe model allows messages to be broadcast to different parts of a system
asynchronously.
Communication Model - Pub/Sub
● Producers simply publish their data to one
or more topics, without caring who comes
along to read the data.
● Topics are named streams (or channels) of
related data. They serve a similar purpose
as tables in a database (i.e. to group related
data).
● Consumers are processes that read (or subscribe to) data in one or more topics. They do not communicate directly with the producers, but rather listen to data on any stream they happen to be interested in.
● Consumers can work together as a group (called a consumer group) in order to distribute work across multiple processes, as sketched below.
[Diagram: producers publishing to topics, with consumers (organized into consumer groups) subscribing to those topics]
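A minimal sketch of one consumer group member, assuming a local broker and a hypothetical "payments" topic; every consumer configured with the same group.id shares the partitions of the subscribed topics:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payment-processors");                    // members of this group split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));  // subscribe to (read) the topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}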
Pub/Sub - Apache Kafka
● Kafka handles flowing streams of data
● Decoupled systems
● Stronger delivery guarantees. If a
consumer goes down, it will simply pick up
from where it left off when it comes back
online again
● Consumers can process data at a rate they
can handle. Unprocessed data is stored in
Kafka, in a durable and fault-tolerant
manner, until the consumer is ready to
process it.
● Systems can rebuild their state at any time by replaying the events in a topic (see the sketch below).
● Both producers and consumers run within the application layer, outside of the Kafka cluster (for example, inside a Java application).
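A sketch of the replay idea, assuming a local broker and a hypothetical "orders" topic: the consumer rewinds its assigned partitions to offset 0 and re-applies every event to rebuild its local state.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StateRebuilder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "state-rebuilder");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            consumer.poll(Duration.ofMillis(100));                      // join the group and receive partition assignments
            consumer.seekToBeginning(consumer.assignment());            // rewind every assigned partition to offset 0
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
                    // A real application would re-apply each event to its local state here.
                    System.out.printf("replaying offset %d: %s%n", record.offset(), record.value());
                }
            }
        }
    }
}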
How are data stored and consumed?
● Data are stored as commit logs in Kafka
● Logs are append-only data structures
which capture an ordered sequence of
events.
● Once a record is written to the log, it is
considered immutable.
● The logs are distributed across brokers in
the Kafka cluster
● Kafka refers to the position of each entry in its distributed log as an offset. Offsets start at 0, and they allow multiple consumer groups to each read from the same log while maintaining their own positions in the log / stream they are reading from (see the sketch below).
[Diagram: a log with offsets 0-7; the producer writes to the head of the log, while Consumer A reads at position 6 and Consumer B at position 2]
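As an illustration (a fragment that continues the poll loop of the earlier consumer sketch), each record carries the partition and offset it occupies in the log, and committing records the group's current position:

for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
    System.out.printf("partition=%d offset=%d value=%s%n",
            record.partition(), record.offset(), record.value());
}
consumer.commitSync();  // store this consumer group's position so it can pick up from here later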
Partitioning
● Topics are partitioned. Partitions are individual logs that data is produced to and consumed from.
● The commit log abstraction is implemented at the partition level; this is the level at which ordering is guaranteed, with each partition having its own set of offsets.
● Ideally, data will be distributed relatively evenly across all partitions in a topic, allowing producers and consumers to read and write data in parallel, which improves scalability.
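A minimal sketch, assuming a local broker; the "payments" topic, its partition count, and the keys are hypothetical. The Admin API creates a topic with several partitions, and the default partitioner then hashes each record key so that all events for the same key land in the same partition, preserving per-key ordering:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitionedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        // Create a "payments" topic with 3 partitions and a replication factor of 1.
        try (Admin admin = Admin.create(props)) {
            admin.createTopics(Collections.singletonList(new NewTopic("payments", 3, (short) 1))).all().get();
        }

        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both events share the key "bob", so both land in the same partition, in order.
            producer.send(new ProducerRecord<>("payments", "bob", "payment-1"));
            producer.send(new ProducerRecord<>("payments", "bob", "payment-2"));
        }
    }
}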
Before Kafka Streams
● Two main options for building Kafka-based stream processing applications:
○ Use the Consumer and Producer APIs directly (sketched below)
○ Use another stream processing framework (e.g. Apache Spark Streaming,
Apache Flink)
● The Consumer and Producer APIs lack many of the primitives that would qualify them as a stream processing API,
including:
○ local and fault tolerant state
○ a rich set of operators for transforming streams of data
○ more advanced representations of streams
○ sophisticated handling of time
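For illustration, the first option might look like the fragment below (the topic names and the consumerProps / producerProps configuration objects are hypothetical); everything beyond this simple read-transform-write loop, such as state, windowing, and time handling, is left to the application:

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
     KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
    consumer.subscribe(Collections.singletonList("tweets"));
    while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
            String transformed = record.value().toLowerCase();   // hand-rolled "stream processing"
            producer.send(new ProducerRecord<>("tweets-normalized", record.key(), transformed));
        }
    }
}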
Enter Kafka Streams
Unlike the Producer, Consumer, and Connect APIs, Kafka Streams is dedicated to helping you process real-time data streams, not just move data to and from Kafka. It makes it easy to consume real-time streams of events as they move through our data pipeline, apply data transformation logic using a rich set of stream processing operators and primitives, and optionally write new representations of the data back to Kafka (i.e. if we want to make the transformed or enriched events available to downstream systems).
[Diagram: two application-layer setups against a Kafka cluster. One does basic processing with the plain Consumer and Producer APIs; the other is a Kafka Streams application that uses embedded consumer and producer APIs to enrich and transform the data]
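A minimal Kafka Streams sketch; the application id, the "tweets" and "tweets-uppercased" topics, and the broker address are hypothetical. It reads a stream, applies one transformation, and writes the new representation back to Kafka:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> tweets = builder.stream("tweets");          // source processor
        tweets.mapValues(text -> text.toUpperCase())                        // stream processor
              .to("tweets-uppercased");                                     // sink processor

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));   // clean shutdown
    }
}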
Kafka Streams - Features
● Highly scalable, elastic, distributed, and fault-tolerant application.
● Stateful and stateless processing.
● Event-time processing with windowing, joins, and aggregations.
● We can use the most common transformation operations, which are already defined in the Kafka Streams DSL, or the lower-level Processor API, which allows us to define and connect custom processors.
● Low barrier to entry, which means it does not take much configuration and setup to run a small
scale trial of stream processing; the rest depends on your use case.
● No separate cluster requirements for processing (integrated with Kafka).
● Employs one-record-at-a-time processing to achieve millisecond processing latency, and supports event-time windowing operations that handle late-arriving records (see the sketch below).
● Integrates with Kafka Connect to connect to different applications and databases.
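A sketch of the windowing support: a fragment that would slot into the topology-building portion of the earlier Kafka Streams sketch. The "pageviews" topic, the 5-minute window, and the 1-minute grace period for late records are hypothetical:

KStream<String, String> pageviews = builder.stream("pageviews");
KTable<Windowed<String>, Long> counts = pageviews
        .groupByKey()                                                                          // group events by key
        .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1)))  // 5-minute event-time windows, 1 minute of grace for late records
        .count();                                                                              // count events per key per window

TimeWindows.ofSizeAndGrace is the newer (Kafka 3.x) spelling; older releases express the same thing as TimeWindows.of(...).grace(...).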
Processor Topology
● Kafka Streams leverages a programming paradigm called dataflow programming (DFP) - a data-centric method of representing programs as a series of inputs, outputs, and processing stages.
● The stream processing logic in a Kafka Streams application is structured as a directed acyclic
graph (DAG), where nodes represent a processing step, or processor, and the edges represent
input and output streams (where data flows from one processor to another).
Processor Topology
● Source processors - Sources are where information flows into the Kafka Streams application. Data is read from a Kafka topic and sent to one or more stream processors.
● Stream processors - These processors are responsible
for applying data processing / transformation logic on
the input stream. In the high-level DSL, these processors
are defined using a set of built-in operators that are
exposed by the Kafka Streams library. Some example
operators are: filter, map, flatMap, and join.
● Sink processors - Sinks are where enriched, transformed,
filtered, or otherwise processed records are written back
to Kafka, either to be handled by another stream
processing application or to be sent to a downstream
datastore via something like Kafka Connect.
[Diagram: a processor topology in which a source processor feeds stream processors, which in turn write to sink processors]
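For example, a small DSL topology might combine the three processor roles like this (a fragment with hypothetical topic names, reusing the StreamsBuilder from the earlier sketch):

KStream<String, String> orders = builder.stream("orders");                 // source processor: reads from a Kafka topic

orders.filter((key, value) -> value.contains("priority"))                  // stream processor: keeps only matching events
      .to("priority-orders");                                              // sink processor: writes the result back to Kafka

orders.flatMapValues(value -> Arrays.asList(value.split(",")))             // stream processor: splits each order into items
      .to("order-items");                                                  // sink processor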
High-level DSL vs Low-level Processor API
Low-level Processor API
Imperative. Provides lower-level access to your data (e.g. access to record metadata), the ability to
schedule periodic functions, more granular access to the application state, or more fine-grained
control over the timing of certain operations (see the sketch below).
High-level DSL
Declarative. Built on top of the low-level Processor API, but the interface each exposes is slightly different. The high-level DSL contains already-implemented methods ready to use. It is composed of two main abstractions: KStream and KTable (or its global variant, GlobalKTable).
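A sketch of the low-level Processor API, using the newer typed Processor interface; the processor itself and its wiring into a topology are hypothetical. It shows the kind of access the DSL hides: record metadata via the context, explicit forwarding, and a scheduled periodic function:

import java.time.Duration;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class TrimProcessor implements Processor<String, String, String, String> {
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        // Schedule a periodic function every 30 seconds of wall-clock time.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME,
                timestamp -> System.out.println("punctuate at " + timestamp));
    }

    @Override
    public void process(Record<String, String> record) {
        // Fine-grained control: inspect or transform each record, then decide what to forward.
        context.forward(record.withValue(record.value().trim()));
    }
}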
Streams and Tables
Instead of working with Kafka topics directly, the Kafka Streams DSL allows you to work with different representations of a topic, each of which is suitable for different use cases.
● A stream provides immutable data.
● A stream supports only inserting new events; existing events cannot be changed.
● Streams are persistent, durable, and fault tolerant.
● Events in a stream can be keyed, and there can be many events for one key, like “all of Bob’s payments.”
● A stream is like a table in a relational database (RDBMS) that has no unique key constraint and that is append only.
● A table provides mutable data.
● New events/rows can be inserted, and existing rows can be updated and deleted.
● A key (aka row key) identifies which row is being mutated.
● Like streams, tables are persistent, durable, and fault tolerant.
● A table behaves much like an RDBMS materialized view because it is changed automatically as soon as any of its input streams or tables change.
Streams and Tables
● KStream: an abstraction of a partitioned record stream, in which data is represented
using insert semantics (i.e. each event is considered to be independent of other
events).
● KTable: an abstraction of a partitioned table (i.e. changelog stream), in which data is
represented using update semantics (the latest representation of a given key is
tracked by the application). Since KTables are partitioned, each Kafka Streams task
contains only a subset of the full table.
● GlobalKTable: similar to a KTable, except each GlobalKTable contains a complete (i.e.
unpartitioned) copy of the underlying data.
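For example (a fragment with hypothetical topic names, using a StreamsBuilder as in the earlier sketches), the three abstractions are created like this:

KStream<String, String> payments = builder.stream("payments");               // insert semantics: every event is kept
KTable<String, String> customers = builder.table("customers");               // update semantics: latest value per key, partitioned across tasks
GlobalKTable<String, String> countries = builder.globalTable("countries");   // complete, unpartitioned copy on every application instance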
Stateless vs Stateful Processing
Stateless processing - each event handled by the Kafka Streams application is processed independently of other events, and only stream views are needed by the application. In other words, the application treats each event as a self-contained insert and requires no memory of previously seen events. Example: processing streams of tweets from Twitter to identify market sentiment for a product.
Stateful processing, on the other hand, needs to remember information about previously seen events in one or more steps of your processor topology, usually for the purpose of aggregating, windowing, or joining event streams. These applications are more complex under the hood since they need to track additional data, or state.
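A fragment contrasting the two styles (hypothetical "tweets" topic keyed by product id, reusing the builder from the earlier sketches): the filter needs no memory of earlier events, while the count must remember a running total per key, backed by a state store:

KStream<String, String> tweets = builder.stream("tweets");

// Stateless: each tweet is judged on its own; no memory of earlier events is needed.
KStream<String, String> kafkaMentions = tweets.filter((productId, text) -> text.contains("kafka"));

// Stateful: the running count per key must be remembered across events.
KTable<String, Long> mentionCounts = kafkaMentions.groupByKey().count();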
Stateful Processing - Persistent vs In-memory Store
● State can be stored either in memory (RAM) or in persistent storage.
● Kafka Streams uses RocksDB by default to store state in the streams application.
● Kafka Streams can also be configured with custom state store implementations, for example ones backed by an external datastore.
● Persistent state stores are operationally more complex and can be slower than a pure in-memory store, which always serves data from RAM.
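A fragment showing how the same aggregation can be materialized in either kind of store (the store names and the "tweets" topic are hypothetical); the persistent, RocksDB-backed store is the default:

KGroupedStream<String, String> grouped = builder.<String, String>stream("tweets").groupByKey();

// Persistent store (the default): state is kept on local disk in RocksDB and survives restarts.
KTable<String, Long> persistentCounts =
        grouped.count(Materialized.<String, Long>as(Stores.persistentKeyValueStore("counts-on-disk")));

// In-memory store: faster lookups, but the state lives only in RAM (and in the changelog topic).
KTable<String, Long> inMemoryCounts =
        grouped.count(Materialized.<String, Long>as(Stores.inMemoryKeyValueStore("counts-in-memory")));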
Kafka Connect
● Kafka Connect is a tool for scalably and reliably streaming data between
Apache Kafka and other data systems.
● It makes it simple to quickly define connectors that move large data sets
into and out of Kafka.
● Kafka Connect can ingest entire databases or collect metrics from all your
application servers into Kafka topics, making the data available for stream
processing with low latency.
Kafka Connect
Kafka Connect includes two types of connectors:
Source connector – Ingests entire databases and streams table updates to Kafka topics. A source
connector can also collect metrics from all your application servers and store these in Kafka topics,
making the data available for stream processing with low latency.
Sink connector – Delivers data from Kafka topics into secondary indexes such as Elasticsearch, or
batch systems such as Hadoop for offline analysis.
ksqlDB
● ksqlDB is an event streaming database released by Confluent; it grew out of KSQL, which was first released in 2017.
● It simplifies the way stream processing applications are built, deployed,
and maintained, by integrating two specialized components in the Kafka
ecosystem (Kafka Connect and Kafka Streams) into a single system, and by
giving us a high-level, SQL interface for interacting with these components.