The Data Driven Network
Kapil Surlaker
Director of Engineering
Bridging Batch and Streaming Data Integration with Gobblin
Shirshanka Das
Gobblin team
26th Apr, 2017
Big Data Meetup
github.com/linkedin/gobblin
@ApacheGobblin
gitter.im/gobblin
Data Integration: key requirements
Source, Sink Diversity
Batch + Streaming
Data Quality
So, we built Gobblin
(Diagram labels: SFTP, JDBC, REST)
Simplifying Data Integration
@LinkedIn
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more…
Apache incubation under way
(Diagram labels: SFTP, Azure Storage)
Other Open Source Systems in this Space
Sqoop, Flume, Falcon, NiFi, Kafka Connect
Flink, Spark, Samza, Apex
Similar in pieces, dissimilar in aggregate
Most are tied to a specific execution model (batch / stream)
Most are tied to a specific implementation or ecosystem (Kafka, Hadoop etc.)
Gobblin: Under the Hood
Gobblin: The Logical Pipeline
WorkUnit
A logical unit of work, typically bounded but not necessarily.
Kafka Topic: LoginEvent, Partition: 10, Offsets: 10-200
HDFS Folder: /data/Login, File: part-0.avro
Hive Dataset: Tracking.Login, date-partition=mm-dd-yy-hh
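The three examples above can be written down as plain data. This is only a sketch: the `WorkUnit` class and its field names are hypothetical and are not Gobblin's actual Java API. The `bounded` flag is an assumption used to mark a unit that has no end, such as a stream.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WorkUnit:
    """Hypothetical stand-in for a WorkUnit: a named slice of a source system."""
    source: str                                   # e.g. "kafka", "hdfs", "hive"
    dataset: str                                  # topic, folder, or table name
    props: Dict[str, str] = field(default_factory=dict)
    bounded: bool = True                          # False models an unbounded unit


# The three bounded examples from the slide.
kafka_unit = WorkUnit("kafka", "LoginEvent", {"partition": "10", "offsets": "10-200"})
hdfs_unit = WorkUnit("hdfs", "/data/Login", {"file": "part-0.avro"})
hive_unit = WorkUnit("hive", "Tracking.Login", {"date-partition": "mm-dd-yy-hh"})

# An unbounded variant: a start offset but no end offset.
stream_unit = WorkUnit("kafka", "LoginEvent",
                       {"partition": "10", "start.offset": "10"}, bounded=False)
```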
Source: A provider of WorkUnits
(typically a system like Kafka, HDFS etc.)
Task: A unit of execution that operates on a WorkUnit
Extracts records from the source, writes to the destination
Ends when WorkUnit is exhausted of records
(assigned to Thread in ThreadPool, Mapper in Map-Reduce etc.)
Extractor: A provider of records given a WorkUnit
Connects to Data Source
Deserializer of records
Converter: A 1:N mapper of input records to output records
Multiple converters can be chained
(e.g. Avro <-> JSON, Schema project, Encrypt)
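A rough sketch of what "1:N" and "chained" mean in practice: each converter takes one record and yields zero, one, or many records, and a chain feeds every output of one converter into the next. The function names (`project`, `explode`, `drop_if_missing`, `chain`) are made up for illustration and do not correspond to Gobblin classes.

```python
def project(fields):
    """1:1 converter: keep only the named fields."""
    def convert(record):
        yield {k: record[k] for k in fields if k in record}
    return convert


def drop_if_missing(field_name):
    """1:0-or-1 converter: drop records that lack a field."""
    def convert(record):
        if field_name in record:
            yield record
    return convert


def explode(field_name):
    """1:N converter: emit one record per item in a list-valued field."""
    def convert(record):
        for item in record.get(field_name, []):
            out = dict(record)
            out[field_name] = item
            yield out
    return convert


def chain(converters):
    """Compose converters; every output of one stage is fed to the next."""
    def convert(record):
        records = [record]
        for c in converters:
            records = [out for r in records for out in c(r)]
        return records
    return convert


pipeline = chain([drop_if_missing("member"), explode("pages"), project(["member", "pages"])])
two_out = pipeline({"member": 1, "pages": ["a", "b"], "ip": "x"})
none_out = pipeline({"pages": ["a"]})
```

Here `two_out` holds two projected records and `none_out` is empty, because the first converter dropped the record.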
Quality Checker: Can check if the quality of the output is satisfactory
Row-level (e.g. time value check)
Task-level (e.g. audit check, schema compatibility)
Writer: Writes to the destination
Connection to the destination, Serializer of records
Sync / Async
e.g. FsWriter, KafkaWriter, CouchbaseWriter
Publisher: Finalizes / Commits the data
Used for destinations that support atomicity
(e.g. move tmp staging directory to final output directory on HDFS)
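Putting the constructs together, a task can be sketched as a loop over extracted records, with the publisher as a separate commit step. This is a minimal in-memory illustration, not Gobblin's implementation: `run_task` and `publish` are hypothetical names, and Python lists stand in for the staging and final output directories.

```python
def run_task(records, converters, row_check, staging):
    """Extract -> convert -> row-level quality check -> write to staging."""
    written = 0
    for record in records:                       # records provided by the Extractor
        outputs = [record]
        for convert in converters:               # chained 1:N converters
            outputs = [o for r in outputs for o in convert(r)]
        for out in outputs:
            if row_check(out):                   # row-level quality check
                staging.append(out)              # Writer
                written += 1
    return written


def publish(staging, final, task_check):
    """Publisher: commit staged output only if the task-level check passes."""
    if not task_check(staging):
        return False
    final.extend(staging)                        # stands in for the directory move
    staging.clear()
    return True


records = [{"ts": 5}, {"ts": -1}, {"ts": 7}]
staging, final = [], []
tag_source = lambda r: [dict(r, source="login")]
written = run_task(records, [tag_source], lambda r: r["ts"] >= 0, staging)
committed = publish(staging, final, lambda rows: len(rows) > 0)
```

The record with a negative timestamp fails the row-level check, so only two records reach the final output.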
Gobblin: The Stateful Logical Pipeline
State Store (HDFS, S3, MySQL, ZK, …)
Load config, previous watermarks
Save watermarks
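The load and save steps can be sketched with a toy state store. Everything here is an assumption made for illustration: `FileStateStore` keeps one JSON file on local disk, whereas the slide lists HDFS, S3, MySQL and ZK as real backends, and the watermark is simply a list index.

```python
import json
import os
import tempfile


class FileStateStore:
    """Hypothetical state store: one JSON file mapping job name -> watermark."""

    def __init__(self, path):
        self.path = path

    def _read(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def load(self, job, default=0):
        return self._read().get(job, default)

    def save(self, job, watermark):
        state = self._read()
        state[job] = watermark
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)   # swap in the new file in one step


def run_once(store, job, source):
    """Load the previous watermark, pull everything after it, save the new one."""
    start = store.load(job)
    batch = source[start:]
    store.save(job, start + len(batch))
    return batch


store = FileStateStore(os.path.join(tempfile.mkdtemp(), "state.json"))
source = ["a", "b", "c"]
first = run_once(store, "PullFromWikipedia", source)
source.append("d")
second = run_once(store, "PullFromWikipedia", source)
third = run_once(store, "PullFromWikipedia", source)
```

The second run picks up only the record added since the first run, and the third run finds nothing new.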
Gobblin: Pipeline Specification
# Pipeline Name, Description
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example for Gobblin
# Source + configuration
source.class=gobblin.example.wikipedia.WikipediaSource
source.page.titles=LinkedIn,Wikipedia:Sandbox
source.revisions.cnt=5
wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
wikipedia.avro.schema={"namespace": "example.wikipedia.avro",…"null"]}]}
gobblin.wikipediaSource.maxRevisionsPerPage=10
# Converter
converter.classes=gobblin.example.wikipedia.WikipediaConverter
extract.namespace=gobblin.example.wikipedia
# Writer + configuration
writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner
# Publisher
data.publisher.type=gobblin.publisher.BaseDataPublisher
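Since the spec is a flat list of key=value properties, the stage each key belongs to can be read off its prefix. The sketch below parses a shortened copy of the spec and groups keys by their first dotted component. It is a simplification: real Java properties files also support other separators and line continuations, and the helper names are hypothetical.

```python
SPEC = """
# Pipeline name
job.name=PullFromWikipedia
source.class=gobblin.example.wikipedia.WikipediaSource
converter.classes=gobblin.example.wikipedia.WikipediaConverter
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def group_by_prefix(props):
    """Group keys by their first dotted component (job, source, writer, ...)."""
    groups = {}
    for key, value in props.items():
        prefix = key.split(".", 1)[0]
        groups.setdefault(prefix, {})[key] = value
    return groups


props = parse_properties(SPEC)
groups = group_by_prefix(props)
```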
Gobblin: Pipeline Deployment
Pipeline Specification: One Spec, Multiple Environments
Small (One Box): Standalone, Single Instance
Medium (Static Cluster): Standalone Cluster
Large (Elastic Cluster): AWS (EC2), Hadoop (YARN / MR)
Runs on Bare Metal / AWS / Azure / VM
Execution Model: Batch versus Streaming
Batch
Determine work, Acquire slots, Run, Checkpoint, Repeat
+ Cost-efficient, deterministic, repeatable
- Higher latency
- Setup, Checkpoint costs dominate if “micro-batching”
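The micro-batching point is simple arithmetic: fixed setup and checkpoint time is paid once per run, so the shorter the run, the larger its share. The numbers below (30 s setup, 10 s checkpoint) are made up purely for illustration.

```python
def overhead_fraction(setup_s, checkpoint_s, work_s):
    """Share of a batch run spent on fixed setup and checkpoint costs."""
    fixed = setup_s + checkpoint_s
    return fixed / (fixed + work_s)


# Assumed costs: 30 s to acquire slots and start, 10 s to checkpoint.
hourly = overhead_fraction(30, 10, 3600)   # one hour of useful work per run
micro = overhead_fraction(30, 10, 60)      # one minute of useful work per run
```

With these assumed costs the hourly batch spends about 1% of its time on overhead, while the one-minute micro-batch spends 40%.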
Streaming
Determine work streams, Run continuously, Checkpoint periodically
+ Low latency
- Higher cost because it is harder to provision accurately
- More sophistication needed to deal with change
Execution Model Scorecard
(Figure: Batch and Streaming compared for four pipelines: JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, Kafka <-> Kinesis)
Can we run in both models using the same system?
Pipeline Stages: Start
Batch: Determine work
Streaming: Determine work (unbounded WorkUnit)
Pipeline Stages: Run
Batch: Acquire slots, Run
Streaming: Run continuously, Checkpoint periodically, Shutdown gracefully
(Diagram labels: Watermark Manager, State Storage; signals: notify, ack, shutdown)
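One way to picture the ack path to the Watermark Manager: the writer acknowledges records as they become durable, possibly out of order, and only the highest offset with no unacknowledged gap below it is safe to checkpoint. The slide shows only the notify and ack arrows, so the contiguous-prefix rule and the `WatermarkTracker` class below are assumptions for illustration, not a description of Gobblin's code.

```python
class WatermarkTracker:
    """Hypothetical tracker: commit only the contiguous prefix of acked offsets."""

    def __init__(self, start):
        self.committable = start - 1   # nothing acknowledged yet
        self.acked = set()             # acks waiting for a gap below them to close

    def ack(self, offset):
        """Record a writer ack and advance past any gap that just closed."""
        self.acked.add(offset)
        while self.committable + 1 in self.acked:
            self.committable += 1
            self.acked.remove(self.committable)


tracker = WatermarkTracker(start=10)
tracker.ack(11)
tracker.ack(12)
before = tracker.committable   # offset 10 is still in flight
tracker.ack(10)
after = tracker.committable    # gap closed, 10..12 are all durable
```

A periodic checkpoint would then save `tracker.committable` to state storage.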
Pipeline Stages: End
Batch: Checkpoint, Commit
Streaming: Do nothing (NoOpPublisher)
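The three stages can be sketched as one loop that serves both models. In batch mode it drains a bounded work unit, then checkpoints and publishes once; in streaming mode it checkpoints as it goes and never publishes. This is an illustrative sketch with hypothetical names, and it checkpoints every N records rather than on a timer so that the example stays deterministic.

```python
import itertools


def run(work_unit, write, checkpoint, streaming=False, commit_every=2, publish=None):
    """Run one work unit in batch or streaming mode."""
    count = 0
    for count, record in enumerate(work_unit, start=1):
        write(record)
        if streaming and count % commit_every == 0:
            checkpoint(count)          # streaming: checkpoint periodically
    if not streaming:
        checkpoint(count)              # batch: checkpoint once at the end
        if publish:
            publish()                  # batch: commit the output
    return count


# Batch: a bounded work unit of three records.
batch_out, batch_marks = [], []
run(iter([1, 2, 3]), batch_out.append, batch_marks.append)

# Streaming: an unbounded counter, cut off after five records to mimic shutdown.
stream_out, stream_marks = [], []
unbounded = itertools.count(1)
run(itertools.islice(unbounded, 5), stream_out.append, stream_marks.append, streaming=True)
```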
Enabling Streaming mode
task.executionMode = streaming
(Diagram labels: Standalone Single Instance, Standalone Cluster, AWS, Hadoop (YARN / MR))
A Streaming Pipeline Spec: Kafka 2 Kafka
# A sample pull file that copies an input Kafka topic and
# produces to an output Kafka topic with sampling
# Pipeline Name, Description
job.name=Kafka2KafkaStreaming
job.group=Kafka
job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
job.lock.enabled=false
# Source, configuration
source.class=gobblin.source….KafkaSimpleStreamingSource
gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092
# Converter, configuration
# Sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10
# Writer, configuration
writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
# Publisher
data.publisher.type=gobblin.publisher.NoopPublisher
# Execution Mode, watermark storage configuration
task.executionMode=STREAMING
# Configure watermark storage for streaming
#streaming.watermarkStateStore.type=zk
#streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181
# Configure watermark commit settings for streaming
#streaming.watermark.commitIntervalMillis=2000
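The sampling step in this spec is a converter that passes each record with probability equal to the ratio. The sketch below assumes independent random sampling; whether Gobblin's `SamplingConverter` works exactly this way is not stated on the slide, and `sampling_converter` is a hypothetical name.

```python
import random


def sampling_converter(ratio, rng=None):
    """1:0-or-1 converter that keeps each record with probability `ratio`."""
    rng = rng or random.Random()

    def convert(record):
        if rng.random() < ratio:
            yield record

    return convert


# A fixed seed keeps the example repeatable.
sample = sampling_converter(0.10, random.Random(42))
kept = [r for i in range(10000) for r in sample({"offset": i})]

keep_all = sampling_converter(1.0)
keep_none = sampling_converter(0.0)
```

With a ratio of 0.10, roughly one record in ten survives; the exact count depends on the random draws.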
Gobblin Streaming: Cluster view
Cluster of processes
Apache Helix: work-unit assignment, fault-tolerance, reassignment
(Diagram labels: Stream Source; Cluster Master (Helix); Worker 1, Worker 2, Worker 3; Sink (Kafka, HDFS, …))
Active Workstreams in Gobblin
Gobblin as a Service
Global orchestrator with REST API for submitting logical flow specifications
Logical flow specifications compile down to physical pipeline specs
Global Throttling
Throttling capability to ensure Gobblin respects quotas globally (e.g. API calls, network bandwidth, Hadoop NameNode etc.)
Generic: can be used outside Gobblin
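As a rough picture of quota enforcement, the sketch below uses a token bucket, a common rate-limiting technique chosen here for illustration. The slide does not say which algorithm Gobblin's throttling uses, and this version is local to one process, whereas the slide describes a quota respected across the whole deployment. The clock is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Token bucket: permits refill at a fixed rate up to a burst capacity."""

    def __init__(self, rate_per_s, capacity, now=0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def try_acquire(self, now, permits=1):
        """Refill for the elapsed time, then take permits if enough are available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= permits:
            self.tokens -= permits
            return True
        return False


# Assumed quota: 2 API calls per second, burst of 2.
bucket = TokenBucket(rate_per_s=2, capacity=2)
results = [
    bucket.try_acquire(0.0),   # allowed, burst
    bucket.try_acquire(0.0),   # allowed, burst
    bucket.try_acquire(0.0),   # denied, bucket empty
    bucket.try_acquire(0.5),   # allowed, one token refilled after 0.5 s
    bucket.try_acquire(0.5),   # denied again
]
```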
Metadata driven
Integration with Metadata Service (c.f. WhereHows)
Policy driven replication, permissions, encryption etc.
Roadmap
Final LinkedIn Gobblin 0.10.0 release
Apache Incubator code donation and release
More Streaming runtimes
Integration with Apache Samza, LinkedIn Brooklin
GDPR Compliance: Data purge for Hadoop and other systems
Security improvements
Credential storage, Secure specs
Gobblin Team @ LinkedIn

Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn Apr 2017