BUILDING REALTIME DATA
PIPELINES WITH KAFKA CONNECT
AND SPARK STREAMING
Guozhang Wang
Confluent
About Me: Guozhang Wang
• Engineer @ Confluent.
• Apache Kafka Committer, PMC Member.
• Before: Engineer @ LinkedIn, Kafka and Samza.
What do you REALLY need
for Stream Processing?
Spark Streaming! Is that All?
Data Can Come from / Go to ...
Real-time Data Integration:
getting data to all the right places
Option #1: One-off Tools
• Tools for each specific data system
• Examples:
• JdbcRDD, Cassandra-Spark connector, etc.
• Sqoop, Logstash to Kafka, etc.
Option #2: Kitchen Sink Tools
• Generic point-to-point data copy / ETL tools
• Examples:
• Enterprise application integration tools
Option #3: Streaming as Copying
• Use stream processing frameworks to copy data
• Examples:
• Spark Streaming: MyRDDWriter (foreachPartition)
• Storm, Samza, Flink, etc.
Real-time Integration: E, T & L
Example: LinkedIn back in 2010
Example: LinkedIn with Kafka
Apache Kafka
Large-scale streaming data import/export for Kafka
Kafka Connect
Data Model
Parallelism Model
Standalone Execution
Distributed Execution
Delivery Guarantees
• Offsets automatically committed and restored
• On restart: task checks offsets & rewinds
• At least once delivery – flush data, then commit
• Exactly once for connectors that support it (e.g. HDFS)
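The flush-then-commit ordering can be sketched with a toy task in plain Java. This is only an illustration under stated assumptions: `AtLeastOnceSketch`, `runTask`, `destination` and `committedOffset` are hypothetical names, not part of the Kafka Connect API, and the "crash" is simulated by returning early. It shows why a failure between flush and commit leads to duplicates rather than data loss.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sink task; every name here is hypothetical, not the Kafka Connect API. */
final class AtLeastOnceSketch {
    /** Stands in for the destination system. */
    final List<String> destination = new ArrayList<>();
    /** Stands in for the offset store kept by the framework. */
    long committedOffset = 0;

    /**
     * Copies up to batchSize records, starting at the last committed offset.
     * When crashBeforeCommit is true, the data is flushed but the offset is not committed.
     */
    void runTask(List<String> source, int batchSize, boolean crashBeforeCommit) {
        long position = committedOffset;      // on (re)start: rewind to the last commit
        List<String> buffer = new ArrayList<>();
        while (position < source.size() && buffer.size() < batchSize) {
            buffer.add(source.get((int) position));
            position++;
        }
        destination.addAll(buffer);           // step 1: flush data
        if (crashBeforeCommit) {
            return;                           // simulated failure between flush and commit
        }
        committedOffset = position;           // step 2: commit the offset
    }

    public static void main(String[] args) {
        AtLeastOnceSketch task = new AtLeastOnceSketch();
        List<String> source = List.of("a", "b", "c", "d");
        task.runTask(source, 2, false);  // copies a, b and commits offset 2
        task.runTask(source, 2, true);   // copies c, d but fails before the commit
        task.runTask(source, 2, false);  // restart: rewinds to offset 2, copies c, d again
        System.out.println(task.destination + " committed=" + task.committedOffset);
        // prints [a, b, c, d, c, d] committed=4
    }
}
```

The duplicated `c, d` is the price of at-least-once delivery; a connector that can write idempotently, or commit data and offsets atomically, could avoid it.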
Format Converters
• Serialization is abstracted away, so connectors stay format-agnostic
• Convert between the Kafka Connect Data API (used by connectors) and serialized bytes
• JSON and Avro currently supported
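A rough sketch of the converter idea in plain Java follows. It is not the Kafka Connect `Converter` interface: `SketchConverter`, `LineConverter` and the `key=value` line format are assumptions made for illustration, with the toy format standing in for JSON or Avro. The point is that connector-side code touches only records, while the converter alone knows the byte layout.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy converter demo; all names are hypothetical, not the Kafka Connect API. */
final class ConverterSketch {
    /** Simplified stand-in for a pluggable format converter. */
    interface SketchConverter {
        byte[] fromRecord(Map<String, String> record);
        Map<String, String> toRecord(byte[] bytes);
    }

    /** Toy line-based format; assumes keys contain no '=' and nothing contains a newline. */
    static final class LineConverter implements SketchConverter {
        @Override
        public byte[] fromRecord(Map<String, String> record) {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, String> e : record.entrySet()) {
                sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
            }
            return sb.toString().getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public Map<String, String> toRecord(byte[] bytes) {
            Map<String, String> record = new LinkedHashMap<>();
            String text = new String(bytes, StandardCharsets.UTF_8);
            for (String line : text.split("\n")) {
                if (line.isEmpty()) {
                    continue;                       // skip the empty tail of an empty record
                }
                int eq = line.indexOf('=');         // split at the first '='
                record.put(line.substring(0, eq), line.substring(eq + 1));
            }
            return record;
        }
    }

    /** Connector-side code sees only records; the converter owns the byte format. */
    static Map<String, String> roundTrip(SketchConverter converter, Map<String, String> record) {
        return converter.toRecord(converter.fromRecord(record));
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("id", "42");
        record.put("name", "alice");
        System.out.println(roundTrip(new LineConverter(), record));
    }
}
```

Swapping the format would mean passing a different `SketchConverter` to `roundTrip`; the calling code would not change.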
Connector Developer APIs
class Connector {
    abstract void start(props);                    // read the connector configuration
    abstract void stop();
    abstract Class<? extends Task> taskClass();    // Task implementation to run
    abstract List<Map<…>> taskConfigs(maxTasks);   // one config per task, at most maxTasks
    …
}
class Source/SinkTask {
    abstract void start(props);
    abstract void stop();
    abstract List<SourceRecord> poll();            // source tasks only
    abstract void put(records);                    // sink tasks only
    abstract void commit();
    …
}
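As an illustration of what a `taskConfigs(maxTasks)` implementation might do, the sketch below splits a list of tables across at most `maxTasks` task configurations. It is plain Java with no Kafka dependency; the class name `TaskConfigSketch`, the `"tables"` config key and the round-robin split are assumptions for this example, not something the slides or the Connect API prescribe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical connector-side helper; not part of the Kafka Connect API. */
final class TaskConfigSketch {
    /** Splits the tables round-robin into at most maxTasks task configs. */
    static List<Map<String, String>> taskConfigs(List<String> tables, int maxTasks) {
        int numTasks = Math.min(tables.size(), maxTasks);
        List<Map<String, String>> configs = new ArrayList<>();
        if (numTasks <= 0) {
            return configs;                          // nothing to copy, or no tasks allowed
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < tables.size(); i++) {
            groups.get(i % numTasks).add(tables.get(i));
        }
        for (List<String> group : groups) {
            Map<String, String> config = new HashMap<>();
            config.put("tables", String.join(",", group));   // "tables" is an assumed key
            configs.add(config);
        }
        return configs;
    }

    public static void main(String[] args) {
        List<String> tables = List.of("users", "orders", "items", "logs", "events");
        System.out.println(taskConfigs(tables, 2));
        // prints [{tables=users,items,events}, {tables=orders,logs}]
    }
}
```

Each returned map would then be handed to one task's `start(props)`, so a task only ever sees its own share of the work.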
Kafka Connect & Spark Streaming
Kafka Connect Today
• Confluent open source: HDFS, JDBC
• Connector Hub: connectors.confluent.io
• Examples: MySQL, MongoDB, Twitter, Solr, S3, MQTT, Couchbase, Vertica, Cassandra, Elasticsearch, HBase, Kudu, Attunity, JustOne, Striim, Bloomberg ...
• Improved connector control (0.10.0)
THANK YOU!
Guozhang Wang | guozhang@confluent.io | @guozhangwang

Confluent – Afternoon Break Sponsor for Spark Summit

• Jay Kreps – I Heart Logs book signing and giveaway

• 3:45pm – 4:15pm in Golden Gate
Kafka Training with Confluent University
• Kafka Developer and Operations Courses
• Visit www.confluent.io/training
Want more Kafka?
• Download Confluent Platform Enterprise (incl. Kafka Connect) at http://www.confluent.io/product
• Apache Kafka 0.10 upgrade documentation at http://docs.confluent.io/3.0.0/upgrade.html
Separation of Concerns
