Apache Kafka 101
Stream Processing
Apache Kafka
The amount of data in the world is growing exponentially. The number of bytes being stored in the
world already far exceeds the number of stars in the observable universe.
Resting State - The data can be stored in data warehouses, relational databases, or on distributed file
systems. In a real-time scenario, we can’t simply wait for it to pile up somewhere and then run a query
or job at some interval of our choosing. We need something that has a very different world view of
data: a technology that gives us access to data in its flowing state, and which allows us to work with
these continuous and unbounded data streams quickly and efficiently. This is where Apache Kafka
comes in.
Apache Kafka is a streaming platform for ingesting, storing, accessing, and processing streams of
data.
Apache Kafka
Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end
with a single battle-tested solution:
● To publish (write) and subscribe to (read) streams of events, including continuous import/export of your
data from other systems.
● To store streams of events durably and reliably for as long as you want.
● To process streams of events as they occur or retrospectively.
And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner.
Kafka can be deployed on bare-metal hardware, virtual machines, and containers, and on-premises as well as
in the cloud.
Communication Model - Point to Point
● Systems interact with each other
directly
● Complex web of communication
● Tightly coupled systems
● Difficult to scale - adding or removing
systems is hard
[Diagram: a mesh of systems, each communicating directly with every other]
Communication Model - Pub/Sub
Publish/subscribe messaging, or pub/sub messaging, is a form of asynchronous service-to-
service communication used in serverless and microservices architectures
Any message published to a topic is immediately received by all of the subscribers to the topic.
Pub/sub messaging can be used to enable event-driven architectures, or to decouple
applications in order to increase performance, reliability and scalability.
The Publish Subscribe model allows messages to be broadcast to different parts of a system
asynchronously.
Communication Model - Pub/Sub
● Producers simply publish their data to one
or more topics, without caring who comes
along to read the data.
● Topics are named streams (or channels) of
related data. They serve a similar purpose
as tables in a database (i.e. to group related
data).
● Consumers are processes that read (or
subscribe) to data in one or more topics.
They do not communicate directly with the
producers, but rather, listen to data on any
stream they happen to be interested in.
● Consumers can work together as a group
(called a consumer group) in order to
distribute work across multiple processes.
[Diagram: producers publishing to topics, with consumers and consumer groups subscribing to those topics]
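As a minimal sketch of the producer side (assuming a local broker on localhost:9092 and a hypothetical topic named "orders"), a Java producer publishes records to a topic without knowing who will read them:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish to the "orders" topic; the producer does not care who consumes it.
            producer.send(new ProducerRecord<>("orders", "order-1", "{\"item\":\"book\",\"qty\":2}"));
            producer.flush();
        }
    }
}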
Pub/Sub - Apache Kafka
● Kafka handles flowing stream of data
● Decoupled systems
● Stronger delivery guarantees. If a
consumer goes down, it will simply pick up
from where it left off when it comes back
online again
● Consumers can process data at a rate they
can handle. Unprocessed data is stored in
Kafka, in a durable and fault-tolerant
manner, until the consumer is ready to
process it.
● Systems can rebuild their state any time by
replaying the events in a topic.
● Both producers and consumers run within
the application layer, outside of the Kafka
cluster (for example, a Java application)
[Diagram: producers publishing to topics, with consumers and consumer groups subscribing to those topics]
How is data stored and consumed?
● Data is stored in commit logs in Kafka
● Logs are append-only data structures
which capture an ordered sequence of
events.
● Once a record is written to the log, it is
considered immutable.
● The logs are distributed across brokers in
the Kafka cluster
● Kafka refers to the position of each entry
in its distributed log as an offset. Offsets
start at 0 and they allow multiple
consumer groups to each read from the
same log, and maintain their own
positions in the log / stream they are
reading from.
[Diagram: a producer appending to a log with offsets 0-7; Consumer A is at offset 6 and Consumer B is at offset 2]
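A minimal consumer sketch (reusing the hypothetical "orders" topic) shows how a consumer group reads from the log and how each record carries its partition and offset:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("group.id", "order-readers");             // each consumer group tracks its own offsets
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record knows its partition and its position (offset) within that partition.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}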
Partitioning
● Topics are partitioned. Partitions are
individual logs where data is produced
and consumed from.
● The commit log abstraction is
implemented at the partition level; this is
the level at which ordering is guaranteed,
with each partition having its own set of
offsets.
● Ideally, data will be distributed relatively
evenly across all partitions in a topic so
that producers and consumers can read
and write data in parallel, which improves
scalability (see the keyed-producer sketch
below).
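As a brief illustration (again with the hypothetical "orders" topic), records produced with the same key are hashed to the same partition, which is what preserves per-key ordering:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedOrderProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so every event keyed "customer-42"
            // lands on the same partition and is ordered relative to the others.
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("orders", "customer-42", "{\"item\":\"lamp\"}"))
                    .get();   // block until the broker acknowledges the write
            System.out.println("partition=" + meta.partition() + " offset=" + meta.offset());
        }
    }
}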
Before Kafka Streams
● Two main options for building Kafka-based stream processing applications:
○ Use the Consumer and Producer APIs directly
○ Use another stream processing framework (e.g. Apache Spark Streaming,
Apache Flink)
● These APIs lack many of the primitives that would qualify them as a stream processing API,
including:
○ local and fault tolerant state
○ a rich set of operators for transforming streams of data
○ more advanced representations of streams
○ sophisticated handling of time
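A rough sketch of the first option - wiring the Consumer and Producer APIs together by hand - makes the gap visible: there is no built-in state, windowing, or time handling, only a consume-transform-produce loop. Topic names ("orders", "orders-uppercased") and the transformation are illustrative only.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class HandRolledProcessor {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "hand-rolled-processor");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    String transformed = record.value().toUpperCase();   // the "processing" step
                    producer.send(new ProducerRecord<>("orders-uppercased", record.key(), transformed));
                }
            }
        }
    }
}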
Enter Kafka Streams
Unlike the Producer, Consumer, and Connect
APIs, Kafka Streams is dedicated to helping
you process real-time data streams, not just
move data to and from Kafka. It makes it
easy to consume real-time streams of events
as they move through our data pipeline,
apply data transformation logic using a rich
set of stream processing operators and
primitives, and optionally write new
representations of the data back to Kafka
(i.e. if we want to make the transformed or
enriched events available to downstream
systems).
[Diagram: two applications in the app layer connected to a Kafka cluster - one doing basic processing with a plain consumer and producer, the other running Kafka Streams, whose embedded consumer and producer APIs enrich and transform the data]
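A minimal Kafka Streams sketch (topic names are hypothetical) that consumes a stream, applies a transformation, and writes the new representation back to Kafka could look like this:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");   // also the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

        // Transform each event and write the enriched representation back to Kafka
        // so downstream systems can consume it.
        orders.mapValues(value -> value.toUpperCase())
              .to("orders-uppercased", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}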
Kafka Streams - Features
● Highly scalable, elastic, distributed, and fault-tolerant application.
● Stateful and stateless processing.
● Event-time processing with windowing, joins, and aggregations.
● The most common transformation operations are already provided by the Kafka Streams
DSL; the lower-level Processor API lets us define and connect custom processors.
● Low barrier to entry, which means it does not take much configuration and setup to run a small
scale trial of stream processing; the rest depends on your use case.
● No separate cluster requirements for processing (integrated with Kafka).
● Employs one-record-at-a-time processing to achieve millisecond processing latency, and
supports event-time-based windowing operations even when records arrive late.
● Integrates with Kafka Connect to connect to different applications and databases.
Processor Topology
● Kafka Streams leverages a programming paradigm called dataflow programming (DFP) - a data-
centric method of representing programs as a series of inputs, outputs, and processing stages.
● The stream processing logic in a Kafka Streams application is structured as a directed acyclic
graph (DAG), where nodes represent a processing step, or processor, and the edges represent
input and output streams (where data flows from one processor to another).
Processor Topology
● Source processors - Sources are where information flows
into the Kafka Streams application. Data is read from a
Kafka topic and sent to one or more stream processors
● Stream processors - These processors are responsible
for applying data processing / transformation logic on
the input stream. In the high-level DSL, these processors
are defined using a set of built-in operators that are
exposed by the Kafka Streams library. Some example
operators are: filter, map, flatMap, and join.
● Sink processors - Sinks are where enriched, transformed,
filtered, or otherwise processed records are written back
to Kafka, either to be handled by another stream
processing application or to be sent to a downstream
datastore via something like Kafka Connect.
[Diagram: a source processor feeding stream processors, which in turn write to sink processors]
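As a sketch of how the three processor types appear in the high-level DSL (topic names are hypothetical), a filter and a flatMap sit between a source and a sink:

import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class TweetTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("tweets", Consumed.with(Serdes.String(), Serdes.String()))      // source processor
               .filter((key, text) -> !text.startsWith("RT"))                          // stream processor: drop retweets
               .flatMapValues(text -> Arrays.asList(text.toLowerCase().split("\\s+"))) // stream processor: split into words
               .to("tweet-words", Produced.with(Serdes.String(), Serdes.String()));    // sink processor

        return builder.build();
    }
}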
High-level DSL vs Low-level Processor API
Low-level Processor API
Imperative. Provides lower-level access to your data (e.g. access to record metadata), the ability to
schedule periodic functions, more granular access to the application state, or more fine-grained
control over the timing of certain operations
High-level DSL
Declarative. Built on top of the Low-level Processor API, but the interface each exposes is slightly
different. High-Level DSL contains already implemented methods ready to use. It is composed of two
main abstractions: KStream and KTable or GlobalKTable.
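For contrast, a rough sketch of the low-level Processor API (class, processor, and topic names are illustrative, and default String serdes are assumed in the application config), where named processors are wired together imperatively:

import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class UppercaseProcessor implements Processor<String, String, String, String> {
    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;   // gives access to record metadata, state stores, and punctuators
    }

    @Override
    public void process(Record<String, String> record) {
        context.forward(record.withValue(record.value().toUpperCase()));
    }

    public static Topology build() {
        Topology topology = new Topology();
        topology.addSource("Source", "orders");                           // read from the input topic
        topology.addProcessor("Uppercase", UppercaseProcessor::new, "Source");
        topology.addSink("Sink", "orders-uppercased", "Uppercase");       // write to the output topic
        return topology;
    }
}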
Streams and Tables
Instead of working with Kafka topics directly, the Kafka Streams DSL allows you to work with different
representations of a topic, each of which is suitable for different use cases.
Streams
● A stream provides immutable data.
● It supports only inserting new events; existing events cannot be changed.
● Streams are persistent, durable, and fault tolerant.
● Events in a stream can be keyed, and there can be many events for one key, like “all of
Bob’s payments.”
● A stream is like a table in a relational database (RDBMS) that has no unique key
constraint and is append only.
Tables
● A table provides mutable data.
● New events/rows can be inserted, and existing rows can be updated and deleted.
● A key (aka row key) identifies which row is being mutated.
● Like streams, tables are persistent, durable, and fault tolerant.
● A table behaves much like an RDBMS materialized view because it is updated
automatically as soon as any of its input streams or tables change.
Streams and Tables
● KStream: an abstraction of a partitioned record stream, in which data is represented
using insert semantics (i.e. each event is considered to be independent of other
events).
● KTable: an abstraction of a partitioned table (i.e. changelog stream), in which data is
represented using update semantics (the latest representation of a given key is
tracked by the application). Since KTables are partitioned, each Kafka Streams task
contains only a subset of the full table.
● GlobalKTable: similar to a KTable, except each GlobalKTable contains a complete (i.e.
unpartitioned) copy of the underlying data.
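A short sketch of how each abstraction is created from a topic (topic names are hypothetical):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class StreamsAndTables {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Insert semantics: every payment event is kept as an independent record.
        KStream<String, String> payments =
                builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()));

        // Update semantics: only the latest balance per account key is tracked,
        // and each task holds only the partitions assigned to it.
        KTable<String, String> balances =
                builder.table("account-balances", Consumed.with(Serdes.String(), Serdes.String()));

        // Like a KTable, but every application instance holds a full, unpartitioned copy.
        GlobalKTable<String, String> currencies =
                builder.globalTable("currency-codes", Consumed.with(Serdes.String(), Serdes.String()));
    }
}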
Stateless vs Stateful Processing
Stateless processing: each event handled by the Kafka Streams application is processed
independently of other events, and only stream views are needed by the application. In other
words, the application treats each event as a self-contained insert and requires no memory of
previously seen events. Example: processing a stream of tweets from Twitter to identify market
sentiment for a product.
Stateful processing, on the other hand, needs to remember information about previously seen
events in one or more steps of the processor topology, usually for the purpose of aggregating,
windowing, or joining event streams. Stateful applications are more complex under the hood
since they need to track additional data, or state.
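A brief sketch of a stateful step, counting events per key with groupByKey().count() (topic and store names are illustrative):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class PaymentCounts {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Counting requires remembering previous events, so Kafka Streams keeps the
        // running totals in a local state store backed by a changelog topic.
        KTable<String, Long> countsPerCustomer =
                builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()))
                       .groupByKey()
                       .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("payment-counts"));

        countsPerCustomer.toStream()
                         .to("payment-counts-topic", Produced.with(Serdes.String(), Serdes.Long()));
    }
}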
Stateful Processing - Persistent vs In-memory Store
● State can be stored either in memory (RAM) or in persistent storage.
● Kafka Streams uses RocksDB by default to store state in the streams application.
● State stores are pluggable, so Kafka Streams can be configured to use a different
store implementation to save and retrieve state (see the sketch below).
● Persistent state stores are operationally more complex and can be slower than a
pure in-memory store, which always serves data from RAM.
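A short sketch of choosing between the two store types (store and topic names are illustrative):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueBytesStoreSupplier;
import org.apache.kafka.streams.state.Stores;

public class StoreChoice {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Default behaviour: a persistent RocksDB-backed store that survives local restarts.
        KeyValueBytesStoreSupplier persistent = Stores.persistentKeyValueStore("counts-on-disk");

        // Alternative: a pure in-memory store; faster, but it must be rebuilt from the
        // changelog topic after a restart.
        KeyValueBytesStoreSupplier inMemory = Stores.inMemoryKeyValueStore("counts-in-memory");

        builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.<String, Long>as(inMemory)
                                  .withKeySerde(Serdes.String())
                                  .withValueSerde(Serdes.Long()));
    }
}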
Kafka Connect
● Kafka Connect is a tool for scalably and reliably streaming data between
Apache Kafka and other data systems.
● It makes it simple to quickly define connectors that move large data sets
into and out of Kafka.
● Kafka Connect can ingest entire databases or collect metrics from all your
application servers into Kafka topics, making the data available for stream
processing with low latency.
Kafka Connect
Kafka Connect includes two types of connectors:
Source connector – Ingests entire databases and streams table updates to Kafka topics. A source
connector can also collect metrics from all your application servers and store these in Kafka topics,
making the data available for stream processing with low latency.
Sink connector – Delivers data from Kafka topics into secondary indexes such as Elasticsearch, or
batch systems such as Hadoop for offline analysis.
ksqlDB
● ksqlDB is an open-source event streaming database released by Confluent in 2017
(originally under the name KSQL)
● It simplifies the way stream processing applications are built, deployed,
and maintained, by integrating two specialized components in the Kafka
ecosystem (Kafka Connect and Kafka Streams) into a single system, and by
giving us a high-level, SQL interface for interacting with these components.
ksqlDB Architecture
References
● https://kafka.apache.org/intro
● https://www.confluent.io/blog/apache-kafka-intro-how-kafka-works/
● https://learning.oreilly.com/library/view/mastering-kafka-streams/9781492062486/
● https://www.confluent.io/blog/kafka-streams-tables-part-1-event-streaming/
● https://aws.amazon.com/pub-sub-messaging/
● https://docs.ksqldb.io/en/latest/concepts/ksqldb-architecture/
