Shuhsi Lin
2017/06/09 at PyconTw 2017
Connect K of SMACK:
pykafka, kafka-python or ?
About Me
Data Software Engineer in EAD at Micron, a semiconductor manufacturer
Currently working with data and people
Lurking in PyHug, Taipei.py, and various meetups
Shuhsi Lin
sucitw gmail.com
K in SMACK
http://datastrophic.io/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka/
https://www.linkedin.com/pulse/smack-my-bdaas-why-2017-year-big-data-goes-tom-martin
http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
https://dzone.com/articles/short-interview-with-smack-tech-stack-1
https://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
● Apache Spark: Processing Engine.
● Apache Mesos: The Container.
● Akka: The Model.
● Apache Cassandra: The Storage.
● Apache Kafka: The Broker.
Agenda
» Pipeline to streaming
» What is Apache Kafka
⋄ Overview
⋄ Architecture
⋄ Use cases
» Kafka API
⋄ Python clients
» Conclusion and More about Kafka
What we will not focus on
» Reliability and durability
⋄ Scaling, replication, guarantees
⋄ ZooKeeper
» Log compaction
» Administration, Configuration, Operations
» Kafka Connect
» Kafka Streams
» Apache Kafka vs XXX
⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ,
ZeroMQ, Redis, and ....
What is
Stream Processing
3 Paradigms for Programming
1. Request/response
2. Batch
3. Stream processing
https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html
Request/response
Batch
Stream Processing
What is stream processing
» Data arises as streams of events
(orders, sales, clicks or trades)
» Databases are event streams
⋄ creating a backup or standby copy of a database
is replaying its change events
⋄ publishing the database changes
Data pipeline
https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
What often happens in a
complex data pipeline
● Complexity meant that the data
was always unreliable
● Reports were untrustworthy
● Derived indexes and stores were
questionable
● Everyone spent a lot of time
battling data quality issues of
all kinds
● Data discrepancies
Data pipeline
Data streaming
Apache Kafka 101
Where did the name “Kafka” come from?
https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system
http://slideplayer.com/slide/4221536/
https://en.wikipedia.org/wiki/Franz_Kafka
What is Apache Kafka?
Apache Kafka is a distributed system designed for streams. It is built to be
fault-tolerant, high-throughput, horizontally scalable, and allows geographically
distributing data streams and processing.
https://kafka.apache.org
Why Apache Kafka
Fast
Scalable
Durable
Distributed
https://pixabay.com/photo-2135057/
Stream data platform (Original mechanism)
https://www.confluent.io/blog/stream-data-platform-1/
Integration mechanism between systems
Kafka as a service
https://www.confluent.io/
What a streaming data platform can provide
» “Data integration” (ETL)
⋄ How to transport data between systems
⋄ Captures streams of events or data changes and
feeds these to other data systems
» “Stream processing” (messaging)
⋄ Continuous, real-time processing and
transformation of these streams, with the
results made available system-wide
Various systems at LinkedIn
https://www.confluent.io/blog/stream-data-platform-1/
Analytical data processing with very low latency
Kafka terminology
» Producer
» Consumer
⋄ Consumer group
⋄ offset
» Broker
» Topic
» Partition
» Message
» Replica
What Kafka Does
Publish & subscribe
● to streams of data like a messaging system
Process
● streams of data efficiently and in real time
Store
● streams of data safely in a distributed replicated cluster
https://kafka.apache.org/
Publish/Subscribe
P14 at
https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
P15 at https://www.slideshare.net/rahuldausa/real-time-analytics-with-apache-kafka-and-apache-spark
[Diagrams: how consumer offsets are updated — in v0.8 the consumer updates offsets in ZooKeeper (port 2181); in v0.10 the smart consumer commits offsets through the Kafka broker itself (port 9092)]
A modern stream-centric data architecture built around Apache Kafka
https://www.confluent.io/blog/stream-data-platform-1/
500 billion events per day
The key abstraction in Kafka is a
structured commit log of updates
append records to this log
https://www.confluent.io/blog/stream-data-platform-1/
Each of these data consumers
has its own position in the log
and advances independently.
This allows a reliable, ordered stream of updates
to be distributed to each consumer.
The log can be sharded and spread
over a cluster of machines, and
each shard is replicated for
fault-tolerance.
[Diagram: producers append to the log; consumers read TBs of data with parallel, ordered consumption (important to a change capture system for database updates)]
Topics and Partitions
» Topics are split into partitions
» Partitions are strongly ordered & immutable
» Partitions can exist on different servers
» Partitions enable scalability
» Producers assign a message to a partition within the topic
⋄ either round robin (simply to balance load)
⋄ or according to the message key (see the sketch below)
https://kafka.apache.org/documentation/#gettingStarted
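A minimal sketch of the two assignment strategies, assuming kafka-python, a local broker, and a hypothetical topic named "test":

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

# No key: the client balances messages across partitions (round robin style)
producer.send('test', b'some payload')

# With a key: messages with the same key always hash to the same partition,
# so per-key ordering is preserved
producer.send('test', key=b'user-42', value=b'clicked')

producer.flush()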
Offsets
» Messages are assigned an offset in the partition
» Consumers keep track of (offset, partition, topic), as sketched below
https://kafka.apache.org/documentation/#gettingStarted
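A sketch of those coordinates with kafka-python (broker and topic are placeholders); every record carries its position in the log:

from kafka import KafkaConsumer

consumer = KafkaConsumer('test', bootstrap_servers='localhost:9092')
for message in consumer:
    # each record knows exactly where it sits in the log
    print(message.topic, message.partition, message.offset, message.value)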
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups
Consumers and Partitions
» A consumer group consumes one topic
» A partition is always sent to the same consumer instance within the group (sketch below)
https://kafka.apache.org/documentation/#gettingStarted
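A sketch of group behaviour with kafka-python (broker address and group name are placeholders): run this script twice with the same group_id and the topic's partitions are split between the two instances, each partition going to exactly one of them.

from kafka import KafkaConsumer

consumer = KafkaConsumer('test',
                         bootstrap_servers='localhost:9092',
                         group_id='demo-group')
consumer.poll(timeout_ms=1000)   # join the group and receive an assignment
print(consumer.assignment())     # the partitions owned by this instance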
Consumer
● Messages are available to consumers only when they have been
committed
● Kafka does not push
○ Unlike JMS
● Reads by consumers are not destructive
○ Unlike JMS Topic
● (some) History available
○ Offline consumers can catch up
○ Consumers can re-consume from the past (sketch below)
● Delivery Guarantees
○ Ordering maintained
○ At-least-once (per consumer) by default; at-most-once and exactly-once can be
implemented
P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
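Re-consuming history can be sketched with kafka-python's seek API (topic and partition below are hypothetical):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
tp = TopicPartition('test', 0)
consumer.assign([tp])            # manual assignment (no group coordination)
consumer.seek_to_beginning(tp)   # rewind and replay the retained history
for message in consumer:
    print(message.offset, message.value)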
ZooKeeper: the coordination interface
between the Kafka broker and consumers
https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3
» Stores configuration data for distributed services
» Used primarily by brokers
» Used by consumers in 0.8 but not 0.9
Apache Kafka timeline
» 2010: Created at LinkedIn
» 2011-Nov: Apache Software Foundation incubator
» 2013-Nov: v0.8
⋄ New Producer
⋄ Reassign-partitions
» 2014: Confluent founded
» 2015-Nov: v0.9
⋄ Kafka Connect
⋄ Security
⋄ New Consumer
» 2016-May: v0.10
⋄ Kafka Streams
⋄ Rack awareness
» v0.10.2
⋄ Single Message Transforms for Kafka Connect
» Next version
TLS connection
SSL is supported only for the new Kafka Producer and Consumer (Kafka versions 0.9.0 and higher)
http://kafka.apache.org/documentation.html#security_ssl
http://docs.confluent.io/current/kafka/ssl.html
http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl
https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka
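With kafka-python, an SSL connection is configured roughly like this (broker address and certificate paths are placeholders):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='broker:9093',   # the broker's TLS listener
    security_protocol='SSL',
    ssl_cafile='ca-cert.pem',          # CA that signed the broker certificate
    ssl_certfile='client-cert.pem',    # client cert, if client auth is required
    ssl_keyfile='client-key.pem',
)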
Apache Kafka is considered as:
Stream data platform
» Commit log service
» Messaging system
» Circular buffer
Cons of Apache Kafka
» Consumer Complexity (smart, but poor client)
» Lack of tooling/monitoring (3rd party)
» Still pre 1.0 release
» Operationally, it’s more manual than desired
» Requires ZooKeeper
http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326 (Sep 26, 2015)
Use Cases
» Website Activity Tracking
» Log Aggregation
» Stream Processing
» Event Sourcing
» Commit logs
» Metrics (Performance index streaming)
⋄ CPU/IO/Memory usage
⋄ Application Specific:
⋄ Time taken to load a web-page
⋄ Time taken to build a web-page
⋄ No. of requests
⋄ No. of hits on a particular page/url
Event-driven Applications
» How Kafka is first adopted and how its role
evolves over time in an architecture
https://aws.amazon.com/tw/kafka/
https://www.slideshare.net/ConfluentInc/iot-data-platforms-processing-iot-data-with-apache-kafka
Conceptual Reference Architecture
for Real-Time Processing in HDP 2.2
https://hortonworks.com/blog/storm-kafka-together-real-time-data-refinery/ February 12, 2015
Event delivery system design in Spotify
https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming
http://helenaedelson.com/?p=1186 (2016/03)
2 + 2 Core APIs
Four Core APIs
» Producer API
» Consumer API
» Connect API
» Streams API
» Legacy APIs
$ cat < in.txt | grep "python" | tr a-z A-Z > out.txt
https://www.slideshare.net/ConfluentInc/apache-kafkaa-distributed-streaming-platform
Kafka Clients
» Java (officially maintained)
» C/C++ (librdkafka)
» Go (AKA golang)
» Erlang
» .NET
» Clojure
» Ruby
» Node.js
» Proxy (HTTP REST, etc)
» Perl
» stdin/stdout
» PHP
» Rust
» Alternative Java
» Storm
» Scala DSL
» Clojure
https://cwiki.apache.org/confluence/display/KAFKA/Clients
» Python
⋄ Confluent-kafka-python
⋄ Kafka-python
⋄ pykafka
Kafka Clients survey
https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey (February 14, 2017)
How users choose a Kafka client
Kafka Client: Language Adoption
Results from 187 responses
Reliability:
● Stability should be the priority
● Good error handling
● Good testing
● Good metrics and logging
[Survey chart: Python ranks 3rd in client-language adoption]
Create your own Kafka broker
https://github.com/Landoop/fast-data-dev
See your brokers and topics
● Kafka-topics-ui
○ Demo http://kafka-topics-ui.landoop.com/#/
● Kafka-connect-ui
○ Demo http://kafka-connect-ui.landoop.com/
● Kafka-manager (Yahoo)
● Kafka Eagle
● kafka-offset-monitor
Kafka Tool (GUI)
https://www.datadoghq.com/
Kafka Tool
Kafka UI (Landoop)
2 + 2 Core APIs
and Python clients
Kafka API Documents
https://kafka.apache.org/0102/javadoc/index.html?
Apache Kafka clients for Python
» pykafka
» kafka-python
» confluent-kafka-python
» librdkafka
⋄ the Apache Kafka C/C++ library
Pykafka
https://github.com/Parsely/pykafka
http://pykafka.readthedocs.io/en/latest/
» Similar level of abstraction
to the JVM Kafka client
» Optional C extension backed by librdkafka
https://blog.parse.ly/post/3886/pykafka-now/ (2016,June)
kafka-python
https://github.com/dpkp/kafka-python/
http://kafka-python.readthedocs.io/
API
● Producer
● Consumer
● Message
● TopicPartition
● KafkaError
● KafkaException
● kafka-python is designed to function
much like the official Java client,
with a sprinkling of pythonic
interfaces.
Confluent-kafka-python
Confluent's Python client for Apache Kafka and
the Confluent Platform.
Features:
● High performance
⋄ librdkafka
● Reliability
● Supported
● Future proof
https://github.com/confluentinc/confluent-kafka-python
http://docs.confluent.io/current/clients/confluent-kafka-python/index.html?
Producer API (JAVA)
https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
https://www.tutorialspoint.com/apache_kafka/apache_kafka_simple_producer_example.htm
● KafkaProducer – Sync and Async
○ close()
○ flush()
○ metrics()
○ partitionsFor( topic)
○ send(ProducerRecord<K,V> record)
Writing data to Kafka: A client that publishes records to the Kafka cluster.
Class KafkaProducer<K,V>
Class ProducerRecord<K,V>
● ProducerRecord( topic, V value)
● ProducerRecord( topic, Integer partition, K key, V value)
A key/value pair to be sent to Kafka.
Configuration Settings
(configuration is externalized in a property file)
● client.id
● producer.type
● acks
● retries
● bootstrap.servers
● linger.ms
● key.serializer
● value.serializer
● batch.size
● buffer.memory
Producer API -Pykafka
from pykafka import KafkaClient
from settings import ….
client = KafkaClient(hosts=bootstrap_servers)
topic = client.topics[topic_name.encode('utf-8')]  # topics are keyed by bytes
producer = topic.get_producer(use_rdkafka=use_rdkafka)
producer.produce(msg_payload)  # payload must be bytes
producer.stop()  # will flush the background queue
Class pykafka.producer.Producer()
Class pykafka.topic.Topic(cluster, topic_metadata)
http://pykafka.readthedocs.io/en/latest/api/producer.html
● produce(msg, partition_key=None)
● stop()
● get_producer(use_rdkafka=False,
**kwargs)
Performance assessment
https://blog.parse.ly/post/3886/pykafka-now/
Must be type bytes, or be
serializable to bytes via
configured value_serializer.
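For completeness, the consuming side in pykafka looks roughly like this (hosts and topic name are assumptions):

from pykafka import KafkaClient

client = KafkaClient(hosts='localhost:9092')
topic = client.topics[b'test']
consumer = topic.get_simple_consumer()   # get_balanced_consumer() for groups
for message in consumer:
    if message is not None:
        print(message.offset, message.value)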
Producer API -Kafka-Python
from kafka import KafkaConsumer, KafkaProducer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
p.send(TOPICS, MSG.encode('utf-8'))  # asynchronous send
p.flush()  # block until all buffered messages are sent
Class kafka.KafkaProducer(**configs)
https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
● close(timeout=None)
● flush(timeout=None)
● partitions_for(topic)
● send(topic, value=None, key=None,
partition=None, timestamp_ms=None)
http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
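send() is asynchronous and returns a future; a sync-style send can be sketched by blocking on it (broker and topic assumed):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
future = producer.send('test', b'hello')   # returns FutureRecordMetadata
metadata = future.get(timeout=10)          # block until the broker acks
print(metadata.topic, metadata.partition, metadata.offset)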
Producer API - confluent-kafka-python
from confluent_kafka import Producer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
p.produce(TOPICS, MSG.encode('utf-8'))  # asynchronous produce
p.flush()  # wait for outstanding deliveries
http://docs.confluent.io/current/clients/confluent-kafka-python/#producer
Class confluent_kafka.Producer(*kwargs)
● len()
● flush([timeout])
● poll([timeout])
● produce(topic[, value][, key][, partition][,
on_delivery][, timestamp])
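produce() is also asynchronous; the on_delivery callback reports per-message success or failure. A sketch (broker and topic assumed):

from confluent_kafka import Producer

def delivery_report(err, msg):
    # invoked from poll() or flush() once the delivery outcome is known
    if err is not None:
        print('delivery failed: {}'.format(err))
    else:
        print('delivered to {} [{}] @ {}'.format(
            msg.topic(), msg.partition(), msg.offset()))

p = Producer({'bootstrap.servers': 'localhost:9092'})
p.produce('test', b'hello', on_delivery=delivery_report)
p.poll(0)    # serve queued delivery callbacks
p.flush()    # wait for all outstanding messages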
Consumer
● Consumer group
○ group.id
○ session.timeout.ms
○ max.poll.records
○ heartbeat.interval.ms
● Offset Management
○ enable.auto.commit
○ auto.commit.interval.ms
○ auto.offset.reset
https://kafka.apache.org/documentation.html#newconsumerconfigs
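These Java-style property names map onto kafka-python constructor kwargs; a sketch with manual offset commits (broker, topic, and the process() handler are placeholders):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'test',
    bootstrap_servers='localhost:9092',
    group_id='demo-group',          # group.id
    enable_auto_commit=False,       # enable.auto.commit
    auto_offset_reset='earliest',   # auto.offset.reset
)
for message in consumer:
    process(message)                # hypothetical handler
    consumer.commit()               # commit offsets manually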
Consumer API (JAVA)
https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
● assign(<TopicPartition> partitions)
● assignment()
● beginningOffsets(<TopicPartition> partitions)
● close(long timeout, TimeUnit timeUnit)
● commitAsync(Map<TopicPartition,OffsetAndMetadata> offsets,
OffsetCommitCallback callback)
● commitSync(Map<TopicPartition,OffsetAndMetadata> offsets)
● committed(TopicPartition partition)
● endOffsets(<TopicPartition> partitions)
● listTopics()
● metrics()
● offsetsForTimes(Map<TopicPartition,Long> timestampsToSearch)
● partitionsFor(topic)
● pause(<TopicPartition> partitions)
Reading data from Kafka: A client that consumes records from a Kafka cluster.
Class KafkaConsumer<K,V>
● poll(long timeout)
● position(TopicPartition partition)
● resume(<TopicPartition> partitions)
● seek(TopicPartition partition, long offset)
● seekToBeginning(<TopicPartition> partitions)
● seekToEnd(<TopicPartition> partitions)
● subscribe(topics, ConsumerRebalanceListener
listener)
● subscribe(Pattern pattern,
ConsumerRebalanceListener listener)
● subscription()
● unsubscribe()
● wakeup()
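The Python clients mirror this surface; a minimal poll loop with confluent-kafka-python, under the usual local-broker assumptions:

from confluent_kafka import Consumer

c = Consumer({'bootstrap.servers': 'localhost:9092',
              'group.id': 'demo-group'})
c.subscribe(['test'])
try:
    while True:
        msg = c.poll(1.0)            # a Message, or None on timeout
        if msg is None:
            continue
        if msg.error():
            print('consumer error: {}'.format(msg.error()))
            continue
        print(msg.topic(), msg.partition(), msg.offset(), msg.value())
finally:
    c.close()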
Kafka shell scripts
Create a Kafka Topic
» Let's create a topic named "test" with a single partition and
only one replica:
⋄ bin/kafka-topics.sh --create --zookeeper zhost:2181
--replication-factor 1 --partitions 1 --topic test
» See that topic
⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181
bin/kafka-topics.sh
» Create, delete, describe, or change a topic.
Python Kafka Client Benchmarking
DEMO
1. http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
2. https://github.com/sucitw/benchmark-python-client-for-kafka
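The shape of the benchmark is simple; a minimal sketch with kafka-python (message count, size, and topic are arbitrary choices):

import time
from kafka import KafkaProducer

N = 100000
payload = b'x' * 100                      # 100-byte messages

producer = KafkaProducer(bootstrap_servers='localhost:9092')
start = time.time()
for _ in range(N):
    producer.send('benchmark', payload)
producer.flush()                          # wait for everything to be acked
print('{:.0f} msgs/sec'.format(N / (time.time() - start)))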
http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
Python Kafka Client Benchmarking
Conclusion:
pykafka, kafka-python or ?
https://github.com/Parsely/pykafka/issues/559
More about Kafka
More about Kafka
» Reliability and durability
⋄ Scaling, replication, guarantees, ZooKeeper
» Log compaction
» Administration, Configuration, Operations, Monitoring
» Kafka Connect
» Kafka Streams
» Schema Registry
» REST Proxy
» Apache Kafka vs XXX
⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis,
and ....
The Other 2 APIs
» Connect API
○ JDBC, HDFS, S3, ….
» Streams API
○ map, filter, aggregate, join
More references
1. The Log: What every software engineer should know about real-time data's unifying abstraction,
Jay Kreps, 2013
2. Pykafka and Kafka-python? https://github.com/Parsely/pykafka/issues/559
3. Why I am not a fan of Apache Kafka (2015-2016 Sep)
4. Kafka vs RabbitMQ
a. What are the differences between Apache Kafka and RabbitMQ?
b. Understanding When to use RabbitMQ or Apache Kafka
5. Kafka summit (2016~)
6. Future features of Kafka (Kafka Improvement Proposals)
7. Kafka- The Definitive Guide
We’re hiring
(104 link)
Connect K of SMACK: pykafka, kafka-python or ?