Kafka 101
Just enough knowledge
to break everything
(Simplified) Glossary
Kafka ~ Distributed messaging system (distributed Pub Sub)
Brokers ~ The machines where the data is stored
Topic ~ Queue(s) of messages on cluster
Producer & Consumer ~ Pub Sub clients for the topic
Avro ~ A serialization format
OVERVIEW
Kafka Why and How ?
Producer - Consumer
Topics
A common format : Avro
Where is the data ?
Isn’t that just one big single point of failure ?
Kafka Why and How ?
Without a centralised communication pipe
DATA SOURCES
DATA OPERATION
With a centralised communication pipe
DATA SOURCES
DATA OPERATION
Articulated around 3 parts
Publish & Subscribe using a messaging queue
● Topic represented by a dedicated queue
● Writer and Reader don’t know each other
● Processing data is the reader’s responsibility
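The three points above can be sketched with a toy in-memory broker. This is not Kafka's API: `ToyBroker`, `publish` and `read` are hypothetical names, and each "queue" is a plain Python list.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a cluster: one append-only list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        # The writer only names a topic; it never sees any reader.
        self._topics[topic].append(message)

    def read(self, topic, position):
        # The reader asks for everything from its own position onward.
        return self._topics[topic][position:]


broker = ToyBroker()
broker.publish("users", {"name": "Ada"})
broker.publish("users", {"name": "Linus"})

# Processing is the reader's job: here, extracting the names.
names = [msg["name"] for msg in broker.read("users", 0)]
print(names)
```

Reading does not remove anything from the list, so a second reader can start from position 0 and see the same messages.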
Processing in real time
Kafka storage
By default on kafka :
● Data is written to disk (zero-copy)
● Messages are retained for 6 months per topic (stock Kafka defaults to 7 days; the period is configurable)
● Topics are distributed for parallelism
● Topics are replicated for resilience
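Retention is set per topic in milliseconds, through the `retention.ms` topic setting. A minimal sketch of the arithmetic, assuming "6 months" means 180 days (the slide gives no exact figure); `retention_ms` is a hypothetical helper, not part of any client library.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def retention_ms(days):
    """Convert a retention period in days to a `retention.ms` value."""
    return days * MS_PER_DAY


# Assumption: "6 months" is taken as 180 days.
topic_config = {"retention.ms": str(retention_ms(180))}
print(topic_config)
```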
Producer - Consumer
Producer consumer model in Kafka
Kafka producer
Kafka producer pattern of publication
At-Least-Once
=> Wait for ack from cluster, send again if none arrives
At-Most-Once
=> Don’t wait for ack from cluster
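The difference between the two patterns shows up when something goes wrong. The toy loop below is a simulation, not a Kafka client: `send` and the three outcome labels are made up for the illustration.

```python
def send(messages, outcomes, wait_for_ack):
    """Toy producer loop; returns what ends up stored on the 'cluster'.

    Each entry of `outcomes` describes one send attempt:
      "ok"       -> message stored, ack received
      "ack_lost" -> message stored, ack never reaches the producer
      "lost"     -> message never reaches the cluster
    """
    stored = []
    attempt = iter(outcomes)
    for message in messages:
        while True:
            result = next(attempt)
            if result != "lost":
                stored.append(message)
            if result == "ok" or not wait_for_ack:
                break  # acked, or fire-and-forget
            # No ack seen and we are waiting for one: send again.
    return stored


outcomes = ["lost", "ack_lost", "ok", "ok"]
at_most_once = send(["a", "b"], outcomes, wait_for_ack=False)
at_least_once = send(["a", "b"], outcomes, wait_for_ack=True)
print(at_most_once)   # "a" never arrived
print(at_least_once)  # "a" arrived twice
```

At-most-once can lose a message, at-least-once can duplicate one: consumers of an at-least-once topic should tolerate duplicates.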
Kafka consumer
Kafka consumer pattern by default “latest”
Kafka consumer pattern “earliest”
Kafka consumer using a specific offset
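The three starting points can be sketched on a plain list standing in for one partition. In real clients the first two correspond to the `auto.offset.reset` values `latest` and `earliest`; `starting_offset` itself is a hypothetical helper, and the sketch assumes nothing has expired yet, so the earliest offset is 0.

```python
def starting_offset(log, mode="latest", offset=None):
    """Return the index of the first message the consumer will read."""
    if mode == "latest":
        return len(log)   # only messages produced from now on
    if mode == "earliest":
        return 0          # replay everything still stored
    if mode == "offset":
        if offset is None or not 0 <= offset <= len(log):
            raise ValueError("offset out of range")
        return offset     # resume from an explicit position
    raise ValueError(f"unknown mode: {mode}")


log = ["m0", "m1", "m2", "m3"]
print(log[starting_offset(log, "latest"):])
print(log[starting_offset(log, "earliest"):])
print(log[starting_offset(log, "offset", 2):])
```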
Topics and partitions
Topics are glorified log files (sic)
Splitting topics into partitions
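A rough sketch of how a message key can pick a partition, and how each partition keeps its own offsets. The hash here is `zlib.crc32`, chosen only to keep the example deterministic; it is not the hash Kafka's default partitioner uses, and `partition_for` and `append` are hypothetical names.

```python
import zlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]


def partition_for(key):
    # Stand-in hash: the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def append(key, value):
    """Append to the key's partition and return (partition, offset)."""
    p = partition_for(key)
    partitions[p].append(value)
    return p, len(partitions[p]) - 1


first = append("user-1", "login")
second = append("user-1", "logout")
print(first, second)
```

Messages with the same key land in the same partition, so their relative order is kept; there is no ordering across partitions.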
Consumer groups
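Within a consumer group, each partition is read by exactly one member. The sketch below deals partitions out round-robin, which is a simplification: real clients offer several assignment strategies, and `assign` is a hypothetical helper.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer of the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


print(assign([0, 1, 2], ["c1", "c2"]))
print(assign([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```

With more consumers than partitions, the extra consumers get nothing to read: the number of partitions caps the parallelism of a group.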
A common format : Avro
Avro example
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
● Binary file
● Strictly typed data structure
● Allows unions and default values
● Schema version attached to file
● Schema needed to Read/Write
● One schema but multiple versions
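To make "strictly typed" and "union" concrete, here is a toy check of a record against the User schema above. It uses only the standard library and does not do Avro's binary encoding; `matches` and `is_valid` are hypothetical helpers covering just the three types this schema uses.

```python
import json

SCHEMA = json.loads("""
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]}
""")

PYTHON_TYPES = {"string": str, "int": int, "null": type(None)}


def matches(value, avro_type):
    """True if value fits a primitive type or any branch of a union (a list)."""
    if isinstance(avro_type, list):
        return any(matches(value, branch) for branch in avro_type)
    return type(value) is PYTHON_TYPES[avro_type]


def is_valid(record, schema):
    """Every field must be present and match its declared type."""
    return all(
        field["name"] in record and matches(record[field["name"]], field["type"])
        for field in schema["fields"]
    )


print(is_valid({"name": "Ada", "favorite_number": 7, "favorite_color": None}, SCHEMA))
print(is_valid({"name": "Ada", "favorite_number": "seven", "favorite_color": None}, SCHEMA))
```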
Avro usage in kafka
Schema registry in action
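With a schema registry, a message usually carries a small schema ID instead of the full schema. The sketch below assumes the Confluent Schema Registry wire format (one zero "magic" byte, a 4-byte big-endian schema ID, then the Avro payload); the slides don't say which registry is used, so treat this as an assumption. `frame` and `unframe` are hypothetical helpers and the payload is treated as opaque bytes.

```python
import struct

MAGIC_BYTE = 0
HEADER = ">bI"  # 1 signed byte + 4-byte unsigned int, big-endian, no padding


def frame(schema_id, avro_payload):
    """Prefix the serialized record with the magic byte and schema ID."""
    return struct.pack(HEADER, MAGIC_BYTE, schema_id) + avro_payload


def unframe(message):
    """Split a framed message back into (schema_id, payload)."""
    magic, schema_id = struct.unpack(HEADER, message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte")
    return schema_id, message[5:]


framed = frame(42, b"payload")
print(framed)
print(unframe(framed))
```

The consumer reads the ID, fetches the matching schema version from the registry, and only then decodes the payload.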
Where is the data ?
Brokers are where most of the stuff happens
The data sits on the brokers’ disk(s).
Data flows to/from Kafka. It’s immutable: you can’t change it directly.
Dump the data
By default, data is kept for approx. 6 months, but it can stay there indefinitely.
In all cases, its expiration is totally independent of its consumption.
Retention
To increase space we can “simply” add a new broker.
Scalable
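The retention point above (expiration is independent of consumption) can be sketched as a function that takes no consumer state at all. This is a simplification: `expire` is a hypothetical helper working message by message, whereas real brokers delete whole log segments.

```python
def expire(log, now, retention_seconds):
    """Keep only messages younger than the retention period.

    `log` is a list of (timestamp, value) pairs. Consumer positions are not
    an input: whether anyone has read a message plays no role.
    """
    return [(ts, value) for ts, value in log if now - ts < retention_seconds]


log = [(0, "old"), (90, "new")]
print(expire(log, now=100, retention_seconds=50))
```

A consumer that falls behind by more than the retention period simply finds the old messages gone.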
Replication
Isn’t that just a big SPOF ?
Failures resilience
Partition follower failure
Partition leader failure
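When a partition leader fails, one of the followers that is fully caught up (an "in-sync" replica) takes over. A rough sketch of that choice, assuming the first surviving in-sync replica in the replica list wins; `elect_leader` is a hypothetical helper, not a broker API.

```python
def elect_leader(replicas, in_sync, failed):
    """Pick the first surviving in-sync replica as the new leader."""
    for broker in replicas:
        if broker in in_sync and broker not in failed:
            return broker
    return None  # no safe candidate: the partition stays offline


# Brokers 1, 2, 3 hold the partition; broker 1 (the leader) fails.
print(elect_leader([1, 2, 3], in_sync={1, 2, 3}, failed={1}))
# Only the failed leader was in sync: nobody can safely take over.
print(elect_leader([1, 2, 3], in_sync={1}, failed={1}))
```

A follower failure is the easy case: the leader keeps serving and the follower catches up when it comes back.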
Zookeeper: the puppet master
Kafka at JobTeaser
Talent bank’s use case
Stream “Latest”
1 topic per domain.entity
3 partitions per topic
Retention > weeks
Data team’s use case with JT MySQL
Stream full content of DB
1 topic per table
1 partition per topic
Retention > months
Data team’s use case with Salesforce
Stream “Latest”
1 topic per “Object”
1 partition
Retention < 1 week
(Complete) Glossary
Kafka -> Your new best friend
Topic -> Log file of the messages (exists at cluster level)
Offset -> Primary key of the message (at partition level)
Brokers -> The machines that fully handle the topics
Producer & Consumer -> Your job
Avro -> So much better than json ;)
Join the movement !
Valuable resources
Kafka for beginners : https://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/
Kafka overview : https://www.alibabacloud.com/blog/an-overview-of-kafka-distributed-message-system_594218
Kafka a database : https://speakerdeck.com/ept/is-kafka-a-database
Putting the Power of Kafka into the Hands of Data Scientists : https://multithreaded.stitchfix.com/blog/2018/09/05/datahighway/
Why we chose Kafka : https://tech.trello.com/why-we-chose-kafka/
Salesforce notifications to Kafka topics : https://glenmazza.net/blog/entry/salesforce-notifications-to-kafka-topics
Streaming data out of the monolith : https://medium.com/blablacar-tech/streaming-data-out-of-the-monolith-building-a-highly-reliable-cdc-stack-d71599131acb
Kafka client At Most Once, At Least Once, Exactly Once : https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o
Message serialization in Kafka using Avro part 1 : http://blog.cloudera.com/blog/2018/07/robust-message-serialization-in-apache-kafka-using-apache-avro-part-1/
Message serialization in Kafka using Avro part 2 : http://blog.cloudera.com/blog/2018/07/robust-message-serialization-in-apache-kafka-using-apache-avro-part-2/
Offset management in Kafka : https://fr.slideshare.net/jjkoshy/offset-management-in-kafka
Kafka listeners explained : https://rmoff.net/2018/08/02/kafka-listeners-explained/
The power of rebalancing in Kafka : https://www.youtube.com/watch?v=MmLezWRI3Ys
