Building Realtime Data Pipelines with Kafka Connect and Spark Streaming


This document discusses building real-time data pipelines using Kafka Connect and Spark Streaming, highlighting the need for efficient stream processing and various integration options. It details Kafka's data model, delivery guarantees, and the functionality of Kafka Connect including connector APIs. The presentation also emphasizes Confluent's offerings, including open-source connectors and training resources.


BUILDING REALTIME DATA
PIPELINES WITH KAFKA CONNECT
AND SPARK STREAMING
Guozhang Wang
Confluent

About Me: Guozhang Wang
• Engineer @ Confluent.
• Apache Kafka Committer, PMC Member.
• Before: Engineer @ LinkedIn, Kafka and Samza.

What do you REALLY need
for Stream Processing?


Real-time Data Integration:
getting data to all the right places

Option #1: One-off Tools
• Tools for each specific data system
• Examples:
• jdbcRDD, Cassandra-Spark connector, etc.
• Sqoop, logstash to Kafka, etc.

Option #2: Kitchen Sink Tools
• Generic point-to-point data copy / ETL tools
• Examples:
• Enterprise application integration tools

Option #3: Streaming as Copying
• Use stream processing frameworks to copy data
• Examples:
• Spark Streaming: MyRDDWriter (foreachPartition)
• Storm, Samza, Flink, etc.
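
The "streaming as copying" pattern can be sketched without Spark itself. The block below is a minimal sketch under stated assumptions: `RecordSink`, `InMemorySink`, and `PartitionWriter` are hypothetical names, and plain lists stand in for RDD partitions. In a Spark job, the inner loop would be the body of a `foreachPartition` callback. It illustrates opening one sink per partition, not one per record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sink interface; stands in for a JDBC, Cassandra, or HDFS client.
interface RecordSink extends AutoCloseable {
    void write(String record);

    @Override
    void close();
}

// In-memory sink used only for illustration; it appends to a shared list.
class InMemorySink implements RecordSink {
    private final List<String> target;

    InMemorySink(List<String> target) {
        this.target = target;
    }

    @Override
    public void write(String record) {
        target.add(record);
    }

    @Override
    public void close() {
        // A real client would flush and release its connection here.
    }
}

class PartitionWriter {
    // Writes each partition through its own sink and returns how many
    // sinks were opened (one per partition, not one per record).
    static int writeAll(List<List<String>> partitions, Supplier<RecordSink> sinkFactory) {
        int opened = 0;
        for (List<String> partition : partitions) {
            try (RecordSink sink = sinkFactory.get()) {
                opened++;
                for (String record : partition) {
                    sink.write(record);
                }
            }
        }
        return opened;
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        int opened = writeAll(List.of(List.of("a", "b"), List.of("c")),
                () -> new InMemorySink(out));
        System.out.println(out + " written through " + opened + " sinks");
    }
}
```

Note what the sketch leaves out: offset tracking, retries, and serialization. With this option, each copy job has to supply those itself.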

Example: LinkedIn with Kafka
Apache Kafka

Large-scale streaming data import/export for Kafka
Kafka Connect

Delivery Guarantees
• Offsets automatically committed and restored
• On restart: task checks offsets & rewinds
• At least once delivery – flush data, then commit
• Exactly once for connectors that support it (e.g. HDFS)
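
The "flush data, then commit" ordering can be illustrated with a small simulation. This is a sketch under stated assumptions, not Kafka Connect code: `AtLeastOnceDemo`, its `copy` method, and the crash parameter are hypothetical, and a list stands in for both the Kafka log and the destination system. It shows that a crash between flush and commit causes a batch to be delivered twice after the rewind, but never lost.

```java
import java.util.ArrayList;
import java.util.List;

class AtLeastOnceDemo {
    // Copies `log` to a sink in batches of `batchSize`. If
    // crashAfterFlushOfBatch >= 0, the task "crashes" once, right after
    // flushing that batch but before committing its offset, and restarts.
    static List<String> copy(List<String> log, int batchSize, int crashAfterFlushOfBatch) {
        List<String> sink = new ArrayList<>();
        int committedOffset = 0;  // survives restarts
        boolean crashed = false;

        while (committedOffset < log.size()) {
            // On (re)start: check the committed offset and rewind to it.
            int position = committedOffset;
            int batchIndex = position / batchSize;
            int end = Math.min(position + batchSize, log.size());

            // Step 1: flush the batch to the destination.
            sink.addAll(log.subList(position, end));

            if (!crashed && batchIndex == crashAfterFlushOfBatch) {
                crashed = true;  // crash before commit: offset not advanced
                continue;        // the restart re-reads the same batch
            }

            // Step 2: commit the offset only after the flush succeeded.
            committedOffset = end;
        }
        return sink;
    }

    public static void main(String[] args) {
        // The first batch appears twice in the output; nothing is missing.
        System.out.println(copy(List.of("r0", "r1", "r2", "r3"), 2, 0));
    }
}
```

Reversing the two steps (commit, then flush) would turn the duplicate into a lost batch. Avoiding the duplicate as well generally requires idempotent writes or storing offsets together with the data in the destination, which this sketch does not attempt.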

Format Converters
• Serialization is abstracted away, so connectors stay format-agnostic
• Convert between Kafka Connect Data API (Connectors)
and serialized bytes
• JSON and Avro currently supported
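
The converter idea can be sketched as a two-method contract between structured data and bytes. The interface below is a hypothetical simplification: the real Kafka Connect converter interface also deals with topics and schemas, and `SimpleConverter`, `KeyValueConverter`, and the toy `key=value;key=value` wire format are assumptions made for illustration, standing in for JSON or Avro.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified converter contract (no topics, no schemas).
interface SimpleConverter {
    byte[] fromConnectData(Map<String, String> value);

    Map<String, String> toConnectData(byte[] bytes);
}

// Toy "key=value;key=value" format. No escaping is done, so keys and
// values must not contain ';' or '='. Illustration only.
class KeyValueConverter implements SimpleConverter {
    @Override
    public byte[] fromConnectData(Map<String, String> value) {
        StringBuilder sb = new StringBuilder();
        // TreeMap gives a stable field order in the serialized form.
        for (Map.Entry<String, String> e : new TreeMap<>(value).entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public Map<String, String> toConnectData(byte[] bytes) {
        Map<String, String> value = new TreeMap<>();
        String text = new String(bytes, StandardCharsets.UTF_8);
        if (text.isEmpty()) {
            return value;
        }
        for (String field : text.split(";")) {
            int eq = field.indexOf('=');
            value.put(field.substring(0, eq), field.substring(eq + 1));
        }
        return value;
    }
}

class ConverterDemo {
    public static void main(String[] args) {
        SimpleConverter converter = new KeyValueConverter();
        byte[] bytes = converter.fromConnectData(Map.of("id", "42", "name", "kafka"));
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
        System.out.println(converter.toConnectData(bytes));
    }
}
```

Because a connector only ever sees the structured side, swapping `KeyValueConverter` for another implementation would change the bytes in Kafka without touching connector code.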

Connector Developer APIs
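abstract class Connector {
    // Lifecycle: called with the connector's configuration properties.
    abstract void start(Map<String, String> props);
    abstract void stop();
    // The Task implementation that performs the actual copying.
    abstract Class<? extends Task> taskClass();
    // Splits the job into at most maxTasks task configurations.
    abstract List<Map<…>> taskConfigs(int maxTasks);
    …
}
// SourceTask and SinkTask, shown together as on the slide.
abstract class Source/SinkTask {
    abstract void start(Map<String, String> props);
    abstract void stop();
    // SourceTask: read new records from the external system.
    abstract List<SourceRecord> poll();
    // SinkTask: write records from Kafka to the external system.
    abstract void put(Collection<SinkRecord> records);
    abstract void commit();
    …
}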
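
One way to read `taskConfigs(maxTasks)` is as a work-splitting step: the connector decides how to divide its job, and each resulting map configures one task. The block below is a minimal sketch that does not use the Kafka Connect library; `TableSplitter`, the round-robin strategy, and the `tables` config key are all hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TableSplitter {
    // Distributes table names round-robin over at most maxTasks task
    // configurations. The "tables" key is a hypothetical config name.
    static List<Map<String, String>> taskConfigs(List<String> tables, int maxTasks) {
        if (maxTasks <= 0) {
            throw new IllegalArgumentException("maxTasks must be positive");
        }
        // Never create more tasks than there are tables to copy.
        int numTasks = Math.min(maxTasks, tables.size());

        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < tables.size(); i++) {
            groups.get(i % numTasks).add(tables.get(i));
        }

        List<Map<String, String>> configs = new ArrayList<>();
        for (List<String> group : groups) {
            Map<String, String> config = new HashMap<>();
            config.put("tables", String.join(",", group));
            configs.add(config);
        }
        return configs;
    }

    public static void main(String[] args) {
        // Three tables over two tasks: one task gets two tables.
        System.out.println(taskConfigs(List.of("users", "orders", "items"), 2));
    }
}
```

Each returned map would then be handed to one task's `start(props)`, so a task only needs to know about its own share of the work.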

Kafka Connect Today
• Confluent open source: HDFS, JDBC
• Connector Hub: connectors.confluent.io
• Examples: MySQL, MongoDB, Twitter, Solr, S3, MQTT,
Couchbase, Vertica, Cassandra, Elasticsearch,
HBase, Kudu, Attunity, JustOne, Striim, Bloomberg, ...
• Improved connector control (0.10.0)
