Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn Apr 2017

The Data Driven Network
Kapil Surlaker
Director of Engineering
Bridging Batch and Streaming Data Integration with Gobblin
Shirshanka Das
Gobblin team
26th Apr, 2017
Big Data Meetup
github.com/linkedin/gobblin
@ApacheGobblin
gitter.im/gobblin

Data Integration: key requirements
Source, Sink Diversity
Batch + Streaming
Data Quality
So, we built Gobblin

Simplifying Data Integration @LinkedIn
[Diagram: example sources and sinks: SFTP, JDBC, REST, Azure Storage]
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more…
Apache incubation under way

Other Open Source Systems in this Space
Sqoop, Flume, Falcon, NiFi, Kafka Connect
Flink, Spark, Samza, Apex
Similar in pieces, dissimilar in aggregate
Most are tied to a specific execution model (batch or stream)
Most are tied to a specific implementation or ecosystem (Kafka, Hadoop, etc.)

Gobblin: The Logical Pipeline

WorkUnit
A logical unit of work, typically bounded, but not necessarily.
Kafka Topic: LoginEvent, Partition: 10, Offsets: 10-200
HDFS Folder: /data/Login, File: part-0.avro
Hive Dataset: Tracking.Login, date-partition=mm-dd-yy-hh
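
The Kafka example above can be sketched as a small data structure. This is only an illustration in Python, not Gobblin's WorkUnit class: the name KafkaWorkUnit, its fields, and the choice to treat offsets as inclusive are all assumptions. An end offset of None models the unbounded case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KafkaWorkUnit:
    """Hypothetical work unit: a slice of one Kafka topic partition."""
    topic: str
    partition: int
    start_offset: int
    end_offset: Optional[int]  # None models an unbounded (streaming) unit

    @property
    def bounded(self) -> bool:
        return self.end_offset is not None

    def record_count(self) -> Optional[int]:
        # Assumption: both offsets are inclusive.
        if self.end_offset is None:
            return None
        return self.end_offset - self.start_offset + 1


# The bounded unit from the slide, and an unbounded variant of it.
batch_unit = KafkaWorkUnit("LoginEvent", 10, 10, 200)
stream_unit = KafkaWorkUnit("LoginEvent", 10, 10, None)
```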

Source: A provider of WorkUnits
(typically a system like Kafka, HDFS etc.)

Task: A unit of execution that operates on a WorkUnit
Extracts records from the source, writes to the destination
Ends when WorkUnit is exhausted of records
(assigned to Thread in ThreadPool, Mapper in Map-Reduce etc.)

Extractor: A provider of records given a WorkUnit
Connects to Data Source
Deserializer of records

Converter: A 1:N mapper of input records to output records
Multiple converters can be chained
(e.g. Avro <-> JSON, Schema project, Encrypt)
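
A minimal sketch of what "1:N" and "chained" mean in practice, written with Python generators. The helper names (project, explode, chain) are hypothetical and are not Gobblin's Converter API. Each converter may emit zero, one, or many records per input, and every output of one converter is fed to the next.

```python
def project(fields):
    """Keep only the named fields (schema projection, 1:1)."""
    def convert(record):
        yield {k: record[k] for k in fields if k in record}
    return convert


def explode(field):
    """Emit one output record per element of a list-valued field (1:N)."""
    def convert(record):
        for item in record.get(field, []):
            yield {**record, field: item}
    return convert


def _apply(converter, stream):
    # Separate function so each stage binds its own converter.
    for record in stream:
        yield from converter(record)


def chain(converters, records):
    """Feed every output of one converter into the next one."""
    stream = iter(records)
    for converter in converters:
        stream = _apply(converter, stream)
    return stream


records = [
    {"user": "a", "pages": ["x", "y"], "secret": 1},
    {"user": "b", "pages": [], "secret": 2},   # explodes to zero records
]
out = list(chain([explode("pages"), project(["user", "pages"])], records))
```

Here the first input yields two output records and the second yields none, so the chain as a whole is still a 1:N mapping.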

Quality Checker: Checks whether the quality of the output is satisfactory
Row-level (e.g. time value check)
Task-level (e.g. audit check, schema compatibility)
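
The two levels can be illustrated as follows. This is a sketch under stated assumptions, not Gobblin's policy classes: the function names, the "timestamp" field, and the written/read ratio used as the audit check are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    records_read: int = 0
    records_written: int = 0


def row_time_check(record, min_ts, max_ts):
    """Row-level policy: the timestamp must fall inside the expected window."""
    ts = record.get("timestamp")
    return ts is not None and min_ts <= ts <= max_ts


def task_audit_check(stats, min_ratio=1.0):
    """Task-level policy: enough of the extracted records reached the writer."""
    if stats.records_read == 0:
        return True
    return stats.records_written / stats.records_read >= min_ratio


def run_checks(records, min_ts, max_ts):
    """Apply the row-level check to each record and collect task statistics."""
    stats = TaskStats()
    good = []
    for record in records:
        stats.records_read += 1
        if row_time_check(record, min_ts, max_ts):
            good.append(record)
            stats.records_written += 1
    return good, stats


good, stats = run_checks([{"timestamp": 5}, {"timestamp": 50}, {}], 0, 10)
```

The row-level check decides the fate of a single record, while the task-level check runs once over the aggregate and decides whether the task's output is acceptable at all.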

Writer: Writes to the destination
Connection to the destination, Serializer of records
Sync / Async
e.g. FsWriter, KafkaWriter, CouchbaseWriter

Publisher: Finalizes / Commits the data
Used for destinations that support atomicity
(e.g. move tmp staging directory to final output directory on HDFS)
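
The staging-then-move idea can be sketched on a local file system. This is not Gobblin's publisher: the publish function and the directory layout are hypothetical, and the sketch assumes a single publisher, with the staging and final paths on the same file system so that the directory rename is atomic.

```python
import os
import tempfile


def publish(staging_dir, final_dir):
    """Commit a task's output by renaming the staging directory into place."""
    if os.path.exists(final_dir):
        raise FileExistsError(final_dir)
    os.makedirs(os.path.dirname(final_dir), exist_ok=True)
    # Readers see either no output directory or the complete one.
    os.rename(staging_dir, final_dir)


root = tempfile.mkdtemp()
staging = os.path.join(root, "_tmp", "Login")
final = os.path.join(root, "data", "Login")

# The writer fills the staging directory first.
os.makedirs(staging)
with open(os.path.join(staging, "part-0.avro"), "wb") as f:
    f.write(b"records")

# The publisher then makes the whole directory visible in one step.
publish(staging, final)
```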

Stateful: State Store (HDFS, S3, MySQL, ZK, …)
Load config, previous watermarks
Save watermarks
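
A rough sketch of the load and save cycle, using a JSON file per job as a stand-in for the state store. FileStateStore, its method names, and the job and watermark keys are hypothetical, chosen only for this example.

```python
import json
import os
import tempfile


class FileStateStore:
    """Hypothetical state store that keeps one JSON file per job."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, job_name):
        return os.path.join(self.root, job_name + ".json")

    def load_watermarks(self, job_name):
        """Return the watermarks saved by the previous run, or {} on first run."""
        try:
            with open(self._path(job_name)) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def save_watermarks(self, job_name, watermarks):
        """Write to a temp file, then rename, so readers never see a partial file."""
        tmp = self._path(job_name) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(watermarks, f)
        os.replace(tmp, self._path(job_name))


store = FileStateStore(tempfile.mkdtemp())
first_run = store.load_watermarks("PullFromKafka")               # nothing saved yet
store.save_watermarks("PullFromKafka", {"LoginEvent-10": 200})   # end of run 1
second_run = store.load_watermarks("PullFromKafka")              # run 2 resumes here
```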

Gobblin: Pipeline Specification
# Pipeline name, description
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example for Gobblin

# Source + configuration
source.class=gobblin.example.wikipedia.WikipediaSource
source.page.titles=LinkedIn,Wikipedia:Sandbox
source.revisions.cnt=5
wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
wikipedia.avro.schema={"namespace": "example.wikipedia.avro",…"null"]}]}
gobblin.wikipediaSource.maxRevisionsPerPage=10

# Converter
converter.classes=gobblin.example.wikipedia.WikipediaConverter

extract.namespace=gobblin.example.wikipedia

# Writer + configuration
writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner

# Publisher
data.publisher.type=gobblin.publisher.BaseDataPublisher
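
To make the mapping from spec keys to pipeline stages concrete, here is a small Python sketch that reads such a spec and groups it by stage. It is not how Gobblin loads job files: parse_pull_file and stages are hypothetical helpers, the parser ignores the line continuations and escapes that real Java properties files allow, and treating converter.classes as a comma-separated list is an assumption.

```python
def parse_pull_file(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def stages(props):
    """Group the spec into the logical pipeline stages it configures."""
    return {
        "source": props.get("source.class"),
        "converters": [c for c in props.get("converter.classes", "").split(",") if c],
        "writer": {k: v for k, v in props.items() if k.startswith("writer.")},
        "publisher": props.get("data.publisher.type"),
    }


spec = """
job.name=PullFromWikipedia
source.class=gobblin.example.wikipedia.WikipediaSource
converter.classes=gobblin.example.wikipedia.WikipediaConverter
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher
"""
pipeline = stages(parse_pull_file(spec))
```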

Gobblin: Pipeline Deployment
One Spec, Multiple Environments
Pipeline Specification
Scale: Small (One Box), Medium (Static Cluster), Large (Elastic Cluster)
Deployment modes: Standalone Single Instance, Standalone Cluster, AWS (EC2), Hadoop (YARN / MR)
Infrastructure: Bare Metal / AWS / Azure / VM

Execution Model: Batch versus Streaming
Batch
Determine work, Acquire slots, Run, Checkpoint, Repeat
+ Cost-efficient, deterministic, repeatable
- Higher latency
- Setup, Checkpoint costs dominate if “micro-batching”

Execution Model: Batch versus Streaming
Streaming
Determine work streams, Run continuously, Checkpoint periodically
+ Low latency
- Higher cost, because it is harder to provision accurately
- More sophistication needed to deal with change

Execution Model Scorecard
[Figure: use cases scored as Batch or Streaming: JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, Kafka <-> Kinesis]

Can we run in both models using the same system?

Pipeline Stages: Start
Batch: Determine work
Streaming: Determine work (unbounded WorkUnit)

Pipeline Stages: Run
Batch: Acquire slots, Run
Streaming: Run continuously, Checkpoint periodically, Shutdown gracefully
[Diagram: Watermark Manager and State Storage, with notify, ack and shutdown signals]
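
One way to picture the ack path is a tracker that only advances the checkpointable watermark over a gap-free run of acknowledged offsets, so that asynchronous writes completing out of order never cause a record to be skipped on restart. This is a sketch of that idea only. WatermarkTracker is a hypothetical name and this is not the implementation of Gobblin's watermark manager.

```python
class WatermarkTracker:
    """Hypothetical tracker: commits the highest offset with no unacked gap below it."""

    def __init__(self, start_offset):
        self.committable = start_offset - 1   # last offset that is safe to checkpoint
        self.acked = set()                    # acked offsets above committable

    def ack(self, offset):
        """Called when the writer confirms that a record was written."""
        if offset > self.committable:
            self.acked.add(offset)
        # Advance over any contiguous run of acked offsets.
        while self.committable + 1 in self.acked:
            self.committable += 1
            self.acked.remove(self.committable)

    def checkpoint(self, state):
        """Called periodically to persist the committable watermark."""
        state["watermark"] = self.committable
        return state


tracker = WatermarkTracker(start_offset=10)
for offset in (10, 12, 11, 14):   # acks arrive out of order, 13 is still in flight
    tracker.ack(offset)
state = tracker.checkpoint({})
```

Offset 14 has been acknowledged, but the checkpoint stops at 12 because 13 has not been confirmed yet.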

Pipeline Stages: End
Batch: Checkpoint, Commit
Streaming: Do nothing (NoOpPublisher)
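
The three stages can be folded into one task loop that behaves differently per mode. The following is a simplified sketch, not Gobblin's task runner: run_task and its parameters are hypothetical, checkpoints are counted in records rather than time, and the streaming branch saves a final watermark on graceful shutdown, which is an assumption of this example.

```python
import itertools


def run_task(records, write, save_watermark, mode="batch",
             checkpoint_every=3, should_stop=lambda: False):
    """Run one task over an iterator of (offset, record) pairs."""
    last = None
    for n, (offset, record) in enumerate(records, start=1):
        write(record)
        last = offset
        if mode == "streaming":
            if n % checkpoint_every == 0:
                save_watermark(last)      # checkpoint periodically
            if should_stop():
                break                     # shut down gracefully
    if last is not None:
        save_watermark(last)              # final checkpoint
    return last


# Batch: a bounded work unit, one checkpoint at the end.
batch_saved = []
run_task([(0, "a"), (1, "b"), (2, "c"), (3, "d")], lambda r: None, batch_saved.append)

# Streaming: an unbounded work unit, stopped by an external signal.
written, stream_saved = [], []
unbounded = ((i, f"r{i}") for i in itertools.count())
run_task(unbounded, written.append, stream_saved.append, mode="streaming",
         should_stop=lambda: len(written) >= 7)
```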

Enabling Streaming mode
task.executionMode = streaming
Standalone:
Single Instance
AWS
Hadoop (YARN / MR)
Standalone Cluster

A Streaming Pipeline Spec: Kafka 2 Kafka
# A sample pull file that copies an input Kafka topic and
# produces to an output Kafka topic with sampling

# Pipeline name, description
job.name=Kafka2KafkaStreaming
job.group=Kafka
job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
job.lock.enabled=false

# Source, configuration
source.class=gobblin.source….KafkaSimpleStreamingSource
gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092

# Converter, configuration
# Sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10

# Writer, configuration
writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

# Publisher
data.publisher.type=gobblin.publisher.NoopPublisher

# Execution mode, watermark storage configuration
task.executionMode=STREAMING
# Configure watermark storage for streaming
#streaming.watermarkStateStore.type=zk
#streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181
# Configure watermark commit settings for streaming
#streaming.watermark.commitIntervalMillis=2000
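
The sampling step in this spec is a converter that emits either zero or one record per input. A minimal Python sketch of that behavior, assuming independent random sampling per record, is shown below. It is not the source of Gobblin's SamplingConverter, and the seed parameter exists only to make the example repeatable.

```python
import random


def sampling_converter(ratio, seed=None):
    """Return a converter that emits each record with probability `ratio`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)

    def convert(record):
        # random() is in [0.0, 1.0), so ratio=1.0 keeps everything
        # and ratio=0.0 keeps nothing.
        if rng.random() < ratio:
            yield record

    return convert


convert = sampling_converter(0.10, seed=42)
kept = [r for i in range(10_000) for r in convert(i)]
```

With a ratio of 0.10, roughly one record in ten reaches the writer, so the output topic carries about a tenth of the input volume.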

Gobblin Streaming: Cluster view
Cluster of processes
Apache Helix: work-unit assignment, fault-tolerance, reassignment
[Diagram: Stream Source -> Cluster (Master, Helix, Worker 1, Worker 2, Worker 3) -> Sink (Kafka, HDFS, …)]
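
Assignment and reassignment can be illustrated with a toy scheduler. This is not Apache Helix's API or algorithm. The assign and reassign_on_failure functions, the round-robin policy, and the worker and partition names are assumptions made for the sketch.

```python
def assign(work_units, workers):
    """Round-robin work units over the live workers."""
    if not workers:
        raise ValueError("no live workers")
    assignment = {w: [] for w in workers}
    for i, unit in enumerate(work_units):
        assignment[workers[i % len(workers)]].append(unit)
    return assignment


def reassign_on_failure(assignment, failed):
    """Move the failed worker's units to the least-loaded survivors."""
    orphans = assignment.pop(failed, [])
    if orphans and not assignment:
        raise ValueError("no live workers")
    for unit in orphans:
        target = min(assignment, key=lambda w: len(assignment[w]))
        assignment[target].append(unit)
    return assignment


units = [f"partition-{i}" for i in range(6)]
assignment = assign(units, ["worker-1", "worker-2", "worker-3"])
assignment = reassign_on_failure(assignment, "worker-2")
```

After the failure, the two partitions that worker-2 owned are spread over the remaining workers, and every partition still has exactly one owner.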

Active Workstreams in Gobblin
Gobblin as a Service
Global orchestrator with REST API for submitting logical flow specifications
Logical flow specifications compile down to physical pipeline specs
Global Throttling
Throttling capability to ensure Gobblin respects quotas globally (e.g. API calls, network bandwidth, Hadoop NameNode, etc.)
Generic: can be used outside Gobblin
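
A token bucket is one common way to enforce this kind of quota, shown here as a single-process sketch. The slides do not say which algorithm Gobblin's throttling uses, so this choice is an assumption, as are the TokenBucket class and its parameters. A global limiter would additionally need shared state across processes, which this example leaves out.

```python
import time


class TokenBucket:
    """Hypothetical rate limiter: at most `rate` permits per second on average."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, permits=1):
        """Take `permits` tokens if available, without blocking."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if permits <= self.tokens:
            self.tokens -= permits
            return True
        return False


# A fake clock keeps the example deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]   # third call exceeds the quota
fake_now[0] = 0.5                                   # half a second later: one new token
after_wait = [bucket.try_acquire() for _ in range(2)]
```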
Metadata driven
Integration with Metadata Service (cf. WhereHows)
Policy driven replication, permissions, encryption etc.

Roadmap
Final LinkedIn Gobblin 0.10.0 release
Apache Incubator code donation and release
More Streaming runtimes
Integration with Apache Samza, LinkedIn Brooklin
GDPR Compliance: Data purge for Hadoop and other systems
Security improvements
Credential storage, Secure specs
