Luan Moreno
Big Data Engineer at Pythian
maciel@pythian.com
Mateus Oliveira
Data In-Motion Specialist at One Way Solution
mateus.oliveira@owshq.com
Real-Time Analytics at Scale is Challenging - The Business Problem
Microservices
• Microservices & Datastores Problem
• K8S as De-Facto Orchestration Service
• Containerized Stateless & Stateful Applications
Toolset
• Modern Market Urges for Real-Time
• Immutable Infrastructure with CI
• Distributed Engines of Big Data
• Scale-Up & Down Resources
Real-Time Analytics
• Challenge for Querying Data In-Motion
• QPS in a Low Latency Fashion
• Answers in Real-Time for Business Level
• Scalable Data Pipelines
• Rapid Reaction to Business Change
Event Sourcing Engine
Apache Kafka as datahub and
event sourcing for real-time
data transportation
Querying Data
ability to efficiently retrieve &
store data from Apache Kafka
in real time
Microservices
Kubernetes as Cornerstone Solution
for Big Data Infrastructure
applications are now containerized, and the
microservices age gives the power to scale up
resources as needed in a timely fashion.
StatefulSets
in its early days, K8s didn't support stateful
applications (2017); now it’s possible, and big
data solutions can be implemented and
deployed efficiently.
Current Gen of Big Data
we've seen more and more big data applications
being converted to run on K8s or born on K8s,
e.g., Strimzi and Confluent Operator.
Cost-wise
containerized applications are designed to
consume fewer resources; you can deploy
many applications at once in K8s; imagine
having ingestion, processing, and serving
layers for the price of one.
Real-Time Data Pipeline on Kubernetes using Kafka, KafkaConnect, KSQLDB & Apache Pinot
Event Sourcing Engine
Strimzi Operator as Datahub and Event
Sourcing System for Real-Time Data Transport
Stream Processor
KSQLDB as SQL Processing Engine, Selecting
Data & Creating New Enriched Topics
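As a rough illustration of this stream-processor step, the sketch below builds (without sending) a request to the ksqlDB REST `/ksql` endpoint that creates an enriched stream from an existing one. The server URL, the stream names `orders` and `orders_enriched`, and the columns are hypothetical placeholders, not taken from the talk.

```python
import json
import urllib.request

# Hypothetical statement: derive an enriched topic from an existing stream.
# Stream and column names are placeholders, not from the talk.
ENRICH_STATEMENT = """
CREATE STREAM orders_enriched AS
  SELECT order_id, customer_id, amount, amount * 0.1 AS tax
  FROM orders
  EMIT CHANGES;
"""


def build_ksql_request(server_url: str, statement: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the ksqlDB /ksql endpoint."""
    payload = {"ksql": statement.strip(), "streamsProperties": {}}
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/ksql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/vnd.ksql.v1+json"},
        method="POST",
    )


# The host name below is an assumption (e.g. a Kubernetes Service name).
request = build_ksql_request("http://ksqldb-server:8088", ENRICH_STATEMENT)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would submit the statement, assuming a ksqlDB server is reachable at that address.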
Data Sources
Datastore Systems ~ Relational
& NoSQL Databases
Data Serving
Apache Pinot OLAP System
for Real-Time Insights at Scale
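To make the serving step more concrete, here is a minimal sketch of a Pinot REALTIME table config that consumes a Kafka topic, built as a Python dict and printed as JSON. The table, topic, broker address, and time column are hypothetical, and the key names follow the Pinot stream-ingestion documentation as commonly shown; they may differ between Pinot versions, so treat this as a starting point rather than the configuration used in the talk.

```python
import json


def realtime_table_config(table: str, topic: str, brokers: str, time_column: str) -> dict:
    """Return a minimal REALTIME table config that consumes a Kafka topic."""
    return {
        "tableName": table,
        "tableType": "REALTIME",
        "segmentsConfig": {
            "timeColumnName": time_column,
            "schemaName": table,  # assumes a schema with the same name exists
            "replication": "1",
        },
        "tenants": {},
        "tableIndexConfig": {
            "loadMode": "MMAP",
            "streamConfigs": {
                "streamType": "kafka",
                "stream.kafka.topic.name": topic,
                "stream.kafka.broker.list": brokers,
                "stream.kafka.consumer.type": "lowlevel",
                # Class names as documented for the Kafka 2.x plugin (assumption).
                "stream.kafka.consumer.factory.class.name":
                    "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
                "stream.kafka.decoder.class.name":
                    "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
            },
        },
        "metadata": {},
    }


# All four arguments are placeholders chosen for illustration.
config = realtime_table_config("orders", "orders_enriched", "kafka:9092", "event_time")
print(json.dumps(config, indent=2))
```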
Infrastructure Backbone
Big Data Products Deployed on
Kubernetes Cluster as Containers
KafkaConnect
Extraction of Events using
a YAML Configuration File
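The configuration-file approach can be sketched as follows, assuming the connector is declared through Strimzi's `KafkaConnector` custom resource. The connector name, Connect cluster name, JDBC connector class, and all `config` values are illustrative assumptions rather than the settings from the demo. The manifest is printed as JSON, which is also valid YAML input for `kubectl apply -f`.

```python
import json


def kafka_connector_manifest(name, connect_cluster, connector_class, config, tasks_max=1):
    """Build a Strimzi KafkaConnector resource as a plain dict."""
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaConnector",
        "metadata": {
            "name": name,
            # Label that ties the connector to a KafkaConnect cluster.
            "labels": {"strimzi.io/cluster": connect_cluster},
        },
        "spec": {
            "class": connector_class,
            "tasksMax": tasks_max,
            "config": config,
        },
    }


# Hypothetical relational source; every value here is a placeholder.
manifest = kafka_connector_manifest(
    name="orders-source",
    connect_cluster="my-connect-cluster",
    connector_class="io.confluent.connect.jdbc.JdbcSourceConnector",
    config={
        "connection.url": "jdbc:postgresql://postgres:5432/shop",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "table.whitelist": "orders",
        "topic.prefix": "src-",
    },
)
print(json.dumps(manifest, indent=2))
```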
Apache Pinot Advantages for Real-Time Analytics
Apache Pinot
Realtime Distributed OLAP DataStore, Designed for
Answering OLAP Queries with Low-Latency
Characteristics
• Blazing Fast OLAP Engine
• LinkedIn, Uber, Target, Slack, Stripe
• Designed for Real-Time Analytics
History
• Pinot Noir ~ Name of Grape for Wine
• Developed Internally at LinkedIn in 2014
• Open-Sourced in 2015
• Entered Apache Incubator in 2018
Key Capabilities
• Columnar-Oriented Storage
• Pluggable Indexing Technologies
• Horizontally Scalable & Fault-Tolerant
• Performs Anomaly Detection using ThirdEye
• Joins using Presto & Trino
Core Components
• Controller (Helix & Zookeeper) ~ Manage State & Health
• Broker ~ Route Queries
• Server ~ Host Segments (Data) ~ Realtime & Offline Tables
• Minion ~ Purge Data
Users
• PQL ~ SQL using Apache Calcite
• API ~ RestAPIs for Broker & Controller
• External Clients ~ JDBC, Java, Python & Go
Performance
• 100K+ Queries per Second at Millisecond Latency
• Ingestion of Millions of Events per Second
• 50+ User-Facing Analytics
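As an illustration of the broker REST API mentioned on this slide, the sketch below builds a SQL query request for a Pinot broker's `/query/sql` endpoint and turns a response's `resultTable` into a list of dicts. The broker address, table, and columns are hypothetical, and `sample_response` is a hand-written stand-in shaped like a broker response, not real output.

```python
import json
import urllib.request


def build_pinot_query(broker_url: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the broker's SQL endpoint."""
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/query/sql",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_as_dicts(response_body: dict) -> list:
    """Pair each result row with the column names from the data schema."""
    table = response_body["resultTable"]
    columns = table["dataSchema"]["columnNames"]
    return [dict(zip(columns, row)) for row in table["rows"]]


# Hypothetical broker address and query.
query = build_pinot_query(
    "http://pinot-broker:8099",
    "SELECT customer_id, COUNT(*) AS orders FROM orders GROUP BY customer_id LIMIT 10",
)

# Hand-written sample shaped like a broker response (not real data).
sample_response = {
    "resultTable": {
        "dataSchema": {
            "columnNames": ["customer_id", "orders"],
            "columnDataTypes": ["STRING", "LONG"],
        },
        "rows": [["c-1", 3], ["c-2", 1]],
    }
}
print(rows_as_dicts(sample_response))
```

Against a live cluster, the body returned by `urllib.request.urlopen(query)` would be parsed with `json.load` and passed to `rows_as_dicts` in place of the sample.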
Building Data Pipeline using KafkaConnect, KSQLDB & Apache Pinot
Demonstration
Kafka & KafkaConnect KSQLDB Apache Pinot
The Summary
Real-Time Analytics is Impactful
not easy to build and manage; you need to
choose the right toolset and strategy
K8s Simplifies Infrastructure
reduces IT ops and increases the agility of
the team by offering containerized
application management
Business Oriented
when we take out the heavy lifting of infrastructure
and choose the right toolset of big data, we can
focus on what matters: the business problem
Cost Reduction Benefit
working with both Kubernetes and open-source
tools to deliver big data pipelines, we can see
how cost can be dramatically reduced
Thank You

Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apache Pinot | Mateus Oliveira, One Way Solution & Luan Moreno Mederios, Pythian
