End-to-end large messages processing with Kafka Streams & Kafka Connect | Philipp Schirmer | bakdata GmbH | March 19th, 2020
End-to-end large messages processing
with Kafka Streams & Kafka Connect
Philipp Schirmer | March 19th, 2020
Text-heavy data
Load data into Kafka, process and index in Elasticsearch
[Figure: Amazon S3 → Kafka → Elasticsearch]
Large messages in Kafka
NLP operations require the full document
But:
● Kafka prefers many small messages
● max.message.bytes (broker-level: message.max.bytes) limits the message size (default about 1 MB)
● Limit cannot be configured in some scenarios (e.g., Confluent Cloud)
● There will always be (few) messages exceeding the limit
Solutions: Chunking
Split document into multiple small parts
But: requires change of processing logic
● User must be aware of partial documents
● Expensive aggregations required
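To make the cost of chunking concrete, here is a minimal sketch of splitting and reassembling a serialized document. The class and method names are hypothetical and not taken from any library; in a real pipeline the reassembly would be an aggregation over several Kafka messages rather than a local loop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of chunking; not part of any library.
final class ChunkingSketch {

    // Split a serialized document into parts of at most maxChunkBytes bytes.
    static List<byte[]> split(byte[] document, int maxChunkBytes) {
        if (maxChunkBytes <= 0) {
            throw new IllegalArgumentException("maxChunkBytes must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < document.length; start += maxChunkBytes) {
            int end = Math.min(document.length, start + maxChunkBytes);
            chunks.add(Arrays.copyOfRange(document, start, end));
        }
        return chunks;
    }

    // The consumer must collect all parts, in order, before it can process the document.
    static byte[] reassemble(List<byte[]> chunks) {
        int total = chunks.stream().mapToInt(c -> c.length).sum();
        byte[] document = new byte[total];
        int offset = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, document, offset, chunk.length);
            offset += chunk.length;
        }
        return document;
    }

    public static void main(String[] args) {
        byte[] document = "a long text document".getBytes(StandardCharsets.UTF_8);
        List<byte[]> chunks = split(document, 8);
        System.out.println(chunks.size() + " chunks");
        System.out.println(new String(reassemble(chunks), StandardCharsets.UTF_8));
    }
}
```

Every downstream step that needs the full document has to run the equivalent of reassemble first, which is the change of processing logic the slide refers to.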
Solutions: External Storage
Store large messages in external system, e.g., Amazon S3, and send pointers to Kafka
s3://
Kafka Streams SerDe
Kafka stores binary messages
SerDes offer an API to interpret raw messages with Kafka Streams (Java)
KStream<String, String> stream = builder.stream("my_topic", Consumed.with(Serdes.String(), Serdes.String()));
S3-backed SerDe
Our S3-backed SerDe transparently handles storage and retrieval of messages
● Delegates actual (de-)serialization to a wrapped SerDe
● No changes to processing logic required, only configuration changes
● Small messages are sent to Kafka (almost) as before
Similar implementations for Kafka Connect and Faust (Python)
bakdata/kafka-s3-backed-serde (GitHub)
bakdata/faust-s3-backed-serializer (GitHub)
Serialization
1. Serialize message using actual SerDe
2. Check if message size in bytes exceeds configured limit
3. a. If message is small enough
      i. Send actual message and small message flag to Kafka
   b. If message is too large
      i. Store message on S3
      ii. Send S3 URI and large message flag to Kafka
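The serialization steps can be sketched roughly as follows. This is not the library's code: the flag values, the position of the flag as the first byte, the in-memory map standing in for S3, and all names are assumptions made for illustration. The input is the output of step 1, i.e. the bytes produced by the wrapped SerDe.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch; names, flag values, and layout are assumptions.
final class SerializationSketch {
    static final byte SMALL_MESSAGE_FLAG = 0; // assumed value
    static final byte LARGE_MESSAGE_FLAG = 1; // assumed value

    // Stand-in for S3: maps a URI to the stored bytes.
    static final Map<String, byte[]> STORE = new HashMap<>();

    // payload: bytes already serialized by the wrapped SerDe (step 1).
    static byte[] serialize(byte[] payload, int maxBytes, String basePath) {
        // Step 2: compare the size in bytes with the configured limit.
        if (payload.length <= maxBytes) {
            // Step 3a: send the message itself, prefixed with the small message flag.
            return withFlag(SMALL_MESSAGE_FLAG, payload);
        }
        // Step 3b: store the message externally and send only the URI and the flag.
        String uri = basePath + UUID.randomUUID();
        STORE.put(uri, payload);
        return withFlag(LARGE_MESSAGE_FLAG, uri.getBytes(StandardCharsets.UTF_8));
    }

    // Prepend a single flag byte to the body.
    static byte[] withFlag(byte flag, byte[] body) {
        byte[] out = new byte[body.length + 1];
        out[0] = flag;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }

    public static void main(String[] args) {
        String base = "s3://my-bucket/large-messages/my-topic/values/";
        byte[] small = serialize(new byte[] {1, 2, 3}, 10, base);
        byte[] large = serialize(new byte[20], 10, base);
        System.out.println("small record: " + small.length + " bytes, flag " + small[0]);
        System.out.println("large record: flag " + large[0] + ", stored objects: " + STORE.size());
    }
}
```

In this sketch a small message grows by exactly one byte, and a large message is replaced by a record whose size depends only on the length of the URI.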
Serialization - S3 URI
S3 URIs must be unique
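s3://my-bucket/large-messages/my-topic/values/12ef79ab1097ebca1abe5982ce1763aa
● s3://my-bucket/large-messages/: base path (s3backed.base.path)
● my-topic: topic
● values: key/value
● 12ef79ab1097ebca1abe5982ce1763aa: unique id (s3backed.id.generator)
Available id generators:
● Random UUID
● Murmur hash
● SHA-256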
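A minimal sketch of how such a URI could be assembled from its parts, with two of the listed id strategies (random UUID and SHA-256). The class, the method names, and the "keys" path segment are assumptions; only the "values" segment and the overall layout appear on the slide.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical sketch of URI construction; not the library's implementation.
final class S3UriSketch {

    // Random id: differs on every call, independent of the content.
    static String randomId() {
        return UUID.randomUUID().toString();
    }

    // Content-based id: the same payload always yields the same hex string.
    static String sha256Id(byte[] payload) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(payload);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // basePath / topic / (keys|values) / id; "keys" is an assumed segment name.
    static String buildUri(String basePath, String topic, boolean isKey, String id) {
        String base = basePath.endsWith("/") ? basePath : basePath + "/";
        return base + topic + "/" + (isKey ? "keys" : "values") + "/" + id;
    }

    public static void main(String[] args) {
        String id = sha256Id("some large document".getBytes());
        System.out.println(buildUri("s3://my-bucket/large-messages", "my-topic", false, id));
        System.out.println(buildUri("s3://my-bucket/large-messages", "my-topic", false, randomId()));
    }
}
```

Because the topic is part of the path, every topic ends up under its own prefix.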
Deserialization
1. Check flag
2. a. If flag denotes small message
      i. Proceed
   b. If flag denotes large message
      i. Retrieve message from S3
3. Deserialize message using actual SerDe
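The deserialization steps can be sketched in the same spirit. Again, the flag values, the flag as the first byte, the in-memory map standing in for S3, and all names are assumptions for illustration. The method returns the bytes that would then be handed to the wrapped SerDe in step 3.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names, flag values, and layout are assumptions.
final class DeserializationSketch {
    static final byte SMALL_MESSAGE_FLAG = 0; // assumed value
    static final byte LARGE_MESSAGE_FLAG = 1; // assumed value

    // Stand-in for S3: maps a URI to the stored bytes.
    static final Map<String, byte[]> STORE = new HashMap<>();

    // Returns the payload bytes for the wrapped SerDe (step 3).
    static byte[] resolve(byte[] record) {
        if (record.length == 0) {
            throw new IllegalArgumentException("missing flag byte");
        }
        // Step 1: check the flag; everything after it is the body.
        byte[] body = Arrays.copyOfRange(record, 1, record.length);
        if (record[0] == SMALL_MESSAGE_FLAG) {
            // Step 2a: the body already is the message.
            return body;
        }
        if (record[0] == LARGE_MESSAGE_FLAG) {
            // Step 2b: the body is a URI; fetch the message from external storage.
            String uri = new String(body, StandardCharsets.UTF_8);
            byte[] stored = STORE.get(uri);
            if (stored == null) {
                throw new IllegalStateException("no object at " + uri);
            }
            return stored;
        }
        throw new IllegalArgumentException("unknown flag: " + record[0]);
    }

    public static void main(String[] args) {
        String uri = "s3://my-bucket/large-messages/my-topic/values/some-id";
        STORE.put(uri, "large document".getBytes(StandardCharsets.UTF_8));

        byte[] uriBytes = uri.getBytes(StandardCharsets.UTF_8);
        byte[] record = new byte[uriBytes.length + 1];
        record[0] = LARGE_MESSAGE_FLAG;
        System.arraycopy(uriBytes, 0, record, 1, uriBytes.length);

        System.out.println(new String(resolve(record), StandardCharsets.UTF_8));
    }
}
```

The failure branch for a missing object corresponds to a dangling reference, for example when the stored object has expired before the Kafka message is read.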
Demo
https://github.com/bakdata/s3-backed-serde-demo
tf-idf
… is a numerical statistic that is intended to reflect how important a word is to a
document in a collection or corpus.
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
https://de.wikipedia.org/wiki/Tf-idf-Ma%C3%9F
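As a small worked example, the sketch below computes one common tf-idf variant: term frequency as the share of tokens in the document, inverse document frequency as the natural logarithm of corpus size over the number of documents containing the term. The demo may use a different variant; the class name and the toy corpus are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of one common tf-idf variant.
final class TfIdfSketch {

    // Share of the document's tokens that equal the term.
    static double tf(String term, List<String> document) {
        if (document.isEmpty()) {
            return 0.0;
        }
        long count = document.stream().filter(term::equals).count();
        return (double) count / document.size();
    }

    // log(N / number of documents containing the term); 0 if no document contains it.
    static double idf(String term, List<List<String>> corpus) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        if (containing == 0) {
            return 0.0;
        }
        return Math.log((double) corpus.size() / containing);
    }

    static double tfIdf(String term, List<String> document, List<List<String>> corpus) {
        return tf(term, document) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("kafka", "streams", "kafka");
        List<List<String>> corpus = Arrays.asList(
                doc,
                Arrays.asList("kafka", "connect"),
                Arrays.asList("elasticsearch"));
        System.out.println("kafka:   " + tfIdf("kafka", doc, corpus));
        System.out.println("streams: " + tfIdf("streams", doc, corpus));
    }
}
```

A term that occurs in every document gets an idf of zero in this variant, so it contributes nothing regardless of how often it appears. Computing term frequencies needs the complete document, which is why partial documents are a problem for this kind of processing.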
S3-backed SerDe - Retention
Kafka topics have a message retention
S3 supports object expiration
● Each rule applies to a prefix → each topic has its own prefix
● Timestamp of S3 object and Kafka message can differ
S3-backed SerDe - Limitations
● Communication with S3 is costly
● Small message overhead (1 byte)
● Partitioning affected if used for keys
● Dangling references if compaction is used
● No KSQL support (KSQL Formats)
Read more
Processing Large Messages with Kafka Streams (Medium)
bakdata/kafka-s3-backed-serde (GitHub)
bakdata/faust-s3-backed-serializer (GitHub)
Contact us
community@bakdata.com
@bakdata