Dok Talks #119 - Cloud-Native Data Pipelines

Cloud Native Data Pipelines
MARCH 3, 2022, DATA ON KUBERNETES

‣ Why Cloud Native Data Pipelines
 
Bringing cloud-native to ETL

‣ Excursion: Kafka and Kafka Connect
 
How Kafka Connect works and
considerations for deploying on Kubernetes

‣ Quarkus and Jib
 
Java snippets of a pipeline, build with Jib
container builder, and deploy via API
Contents
Hakan Lofcali

CTO

DataCater GmbH

▸ Software Industry has evolved from Dev & Ops -> DevOps -> GitOps

▸ ETL space has not caught up

▸ Runtimes of ETL tooling diverge significantly from dev -> test -> prod

▸ Scalability has to be taken care of alongside business logic

▸ Divergence of infrastructure description and computations to be executed
PROBLEM DESCRIPTION
ETL needs evolving, too

CLOUD NATIVE PRINCIPLES
Auto Scale, Image Immutability, Declarative Description

▸ Start with streaming & event-sourcing for continuity, predictable resource
consumption, and ease of horizontally scaling workers

▸ Reduce state in the pipeline pod by externalising state to Apache Kafka and
event-sourcing from various systems with Kafka Connect

▸ Declare computations [filters, transformations] and build an image containing all
needed computations
CLOUD-NATIVE DATA PIPELINES
Apply Cloud-Native Principles to ETL
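
The idea of declaring computations [filters, transformations] and applying them in order can be sketched in plain Java. This is only an illustration under assumptions: the names DeclarativePipeline, Step, filter, transform, and run are hypothetical and are not DataCater's API, and records are plain strings instead of Kafka messages.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

public class DeclarativePipeline {

    /** One declared step: maps a record to a new value, or to empty to drop it. */
    public interface Step extends Function<String, Optional<String>> {}

    /** Hypothetical helper: a filter step drops records that fail the predicate. */
    public static Step filter(Predicate<String> keep) {
        return value -> keep.test(value) ? Optional.of(value) : Optional.empty();
    }

    /** Hypothetical helper: a transformation step rewrites the record. */
    public static Step transform(Function<String, String> fn) {
        return value -> Optional.of(fn.apply(value));
    }

    /** Applies the declared steps in order; an empty result means "emit nothing". */
    public static Optional<String> run(List<Step> steps, String record) {
        Optional<String> current = Optional.of(record);
        for (Step step : steps) {
            if (!current.isPresent()) {
                break; // an earlier filter dropped the record
            }
            current = step.apply(current.get());
        }
        return current;
    }

    public static void main(String[] args) {
        // The pipeline is data: a list of steps that could come from YAML, an API, or a UI.
        List<Step> pipeline = List.of(
            filter(v -> !v.trim().isEmpty()),
            transform(String::trim),
            transform(String::toUpperCase));

        System.out.println(run(pipeline, "  hello ")); // prints Optional[HELLO]
        System.out.println(run(pipeline, "   "));      // prints Optional.empty
    }
}
```

Because the steps are a plain list, the same description can be built by any frontend and packaged into one image together with the code that runs it.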

TARGET ARCHITECTURE
‣ Multiple frontends for defining Data
Pipelines
 
YAML, API, and UI need to produce the
same pipeline

‣ DataCater allows no-code and Python
transformations
 
Filters and Transformations are
packaged into containers

‣ Pipeline -> Kafka Streams app
 
All the goodness of cloud-native in
Java

▸ Kafka is the de facto industry standard for messaging. Kafka's API has been adopted by
many other technologies in the same space, e.g. Google PubSub, Redpanda

▸ Kafka brokers distribute messages to consumers and expect acknowledgements of
retrieval.

▸ Messages are stored in topics as partitioned, append-only logs.

▸ Kafka Connect can be thought of as a translation layer between Kafka and other systems.

▸ Framework for creating messages from / for events of external systems such as
databases, cloud events, data warehouses etc.
KAFKA AND KAFKA CONNECT
Short intro to Kafka and Kafka Connect
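
A partitioned, append-only log can be modelled in a few lines of plain Java. This is a toy sketch, not Kafka's implementation: the class PartitionedLog and its methods are hypothetical, and String.hashCode stands in for the hash Kafka's default partitioner applies to the serialized key.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionedLog {

    // One append-only list per partition; the list index is the offset.
    private final List<List<String>> partitions = new ArrayList<>();

    public PartitionedLog(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Picks a partition from the key's hash, so equal keys share a partition. */
    public int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Appends the value to the key's partition and returns its offset there. */
    public long append(String key, String value) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(value);
        return log.size() - 1;
    }

    /** Returns a copy of every value at or after the offset; reading removes nothing. */
    public List<String> readFrom(int partition, long offset) {
        List<String> log = partitions.get(partition);
        return new ArrayList<>(log.subList((int) offset, log.size()));
    }

    public static void main(String[] args) {
        PartitionedLog log = new PartitionedLog(3);
        log.append("user-1", "created");
        long offset = log.append("user-1", "updated");
        System.out.println(offset); // prints 1: second record in that partition
        System.out.println(log.readFrom(log.partitionFor("user-1"), 0)); // prints [created, updated]
    }
}
```

Since reads never delete data, several consumers can read the same partition independently, each tracking its own offset.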

STRIMZI DEPLOYS KAFKA
* Violates Self-containment principle
 
** Hopefully obsolete soon, cluster coordination within Kubernetes :D

▸ Source / Sink Connectors are deployed into the same cluster

▸ There is no resource descriptor for a single process

▸ Kafka Connect Pods will probably run more than a single connector

▸ Combined with point two, this can get incredibly painful as a rogue connector will
impact all connectors in that pod
KAFKA CONNECT - THE UGLY BITS
Kafka Connect and Self Containment

KAFKA CONNECT - SOLUTION
Kafka Connect: Self Containment and Scaling
* Connectors also connect to Kafka Cluster; lines not introduced for visuals

Let’s go down the deep end

CREATING A PIPELINE - JAVA APPLICATION


ACKNOWLEDGEMENTS
public Multi<String> basicPipe(double number)
▸ In messaging / streaming, consumers need to acknowledge the messages they receive

▸ Quarkus / SmallRye's Multi class handles this automatically for us

▸ If we do not want anything to be written, we return an empty Multi and still acknowledge
the incoming message as received

▸ In Kafka terms, these are offsets, and by acknowledging / committing an offset, we avoid
duplication of data in sinks [in SmallRye: the throttled commit strategy gives at-least-once delivery; exactly-once additionally needs Kafka transactions]
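
How acknowledgements relate to committed offsets can be sketched in plain Java. This is a simplified model under assumptions, not SmallRye's code: the class OffsetTracker is hypothetical, and it only shows the bookkeeping rule that an offset is committed once every earlier message has been acknowledged too.

```java
import java.util.TreeSet;

public class OffsetTracker {

    // Offsets that were acknowledged but cannot be committed yet.
    private final TreeSet<Long> acked = new TreeSet<>();

    // First offset that has not been committed; a restart resumes here.
    private long nextToCommit;

    public OffsetTracker(long startOffset) {
        this.nextToCommit = startOffset;
    }

    /** Marks one message as processed; acknowledgements may arrive out of order. */
    public void ack(long offset) {
        if (offset >= nextToCommit) {
            acked.add(offset);
        }
    }

    /**
     * Advances past every acknowledged offset that directly follows the last
     * commit and returns the position a consumer would resume from.
     */
    public long commit() {
        while (acked.remove(nextToCommit)) {
            nextToCommit++;
        }
        return nextToCommit;
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker(0);
        tracker.ack(0);
        tracker.ack(2);                       // offset 1 is still in flight
        System.out.println(tracker.commit()); // prints 1
        tracker.ack(1);
        System.out.println(tracker.commit()); // prints 3
    }
}
```

In this model, a crash after acknowledging offset 2 but before offset 1 would resume from offset 1, so offset 2 is read again. Returning an empty result for a message still counts as an acknowledgement, which keeps the committed position moving.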

▸ Quarkus uses, for example, SmallRye for streaming and automatically exposes a metrics
endpoint for each pipeline.

▸ In dev mode, Quarkus starts Vectorized/Redpanda (shout-out to redpanda.com), so there is
no need to have a Kafka cluster running.

▸ Dev tools are impeccable, from method profiling to test coverage, all in one interface.
PERKS OF QUARKUS
Quarkus and its libraries pack a punch

CREATING A PIPELINE - JIB CONTAINER

▸ Utilise caching; an initial build might make sense so that the base image is ready when
the application starts up on a new node.

▸ The Jib container builder implements default credential retrievers for registry credentials. It
accepts dockerconfig files, basic auth, and OAuth2.

▸ Kubernetes secrets are not included as a credential retriever, only via mounts -> key
rotation could be problematic here

▸ Detailed log messages are really helpful in debugging
JIB CONTAINER BUILDER LEARNINGS
Jib considerations and perks

▸ Making ETL more cloud-native still has open issues

▸ Stronger self-containment needed

▸ New and evolving tools [most < v1.0.0], rough edges encountered

▸ No specification for declarative data pipeline descriptions exists yet; we at DataCater are trying to take the first steps here

▸ We can already reap the benefits of it, thanks to

▸ Strong messaging technology such as Apache Kafka

▸ Great dev ecosystem around Java Quarkus

▸ Strimzi providing means to operate Apache Kafka & co. easily
SUMMARY
Still a way to go, but the ecosystem is getting better

THANK YOU DOK
Big Thanks for tuning in and

big thanks to the teams behind …
