Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019

1. Why Beam at Lyft
2. Beam cross-language support
3. Python Streaming on Flink
4. What’s next
Apache Beam Apache Flink

● Stream / Schema Registry
● Deployment Tooling
● Metrics & Dashboards
● Alerts
● Logging
● Amazon EC2
● Amazon S3
● Wavefront
● Salt (Config / Orca)
● Docker

● Many big data ecosystem projects are Java / JVM based
○ Barrier to entry for teams that want to adopt streaming but
don’t have the Java skills
● Support use cases for different language environments
○ Python primary option for Machine Learning
● Cost of many API styles and runtime environments
● (Currently no good option for native Python + Streaming)

1. Unified model (Batch + strEAM):
What / Where / When / How
2. SDKs (Java, Python, Go, ...) & DSLs (SQL, Scala, …)
3. Runners for Existing Distributed Processing
Backends (Google Dataflow, Spark, Flink, …)
4. IOs: Data store Sources / Sinks
Apache Beam is a unified programming model designed to
provide efficient and portable data processing pipelines

1. End users: who want to write pipelines in a
language that’s familiar.
2. SDK writers: who want to make Beam
concepts available in new languages.
Includes IOs: connectors to data stores.
3. Runner writers: who have a distributed
processing environment and want to
support Beam pipelines
Beam Model: Pipeline Construction
● Beam Java
● Beam Python
● Other Languages
Beam Model: Fn Runners
● Apache Flink
● Apache Spark
● Cloud Dataflow
Execution (one box per runner)
https://s.apache.org/apache-beam-project-overview

● Started with Java SDK and Java Runners
● 2016: Initiate cross-language support effort
● 2017: Python SDK on Dataflow
● 2018: Go SDK (for portable runners)
● 2018: Python on Flink MVP
● Next: Cross-language pipelines, Samza and other (?) runners

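import apache_beam as beam
from apache_beam import CombinePerKey, Map, WindowInto
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)
from apache_beam.transforms.window import FixedWindows

# runner and pipeline_options are assumed to be defined elsewhere.
p = beam.Pipeline(runner=runner, options=pipeline_options)
(p
 | ReadFromText("/path/to/text*")
 | Map(lambda line: ...)
 | WindowInto(FixedWindows(120),
              trigger=AfterWatermark(
                  early=AfterProcessingTime(60),
                  late=AfterCount(1)),
              accumulation_mode=AccumulationMode.ACCUMULATING)
 | CombinePerKey(sum)
 | WriteToText("/path/to/outputs")
)
result = p.run()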
( What, Where, When, How )
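
As a rough illustration of the "what" and "where" parts of the pipeline above, the following plain-Python sketch (no Beam involved) assigns timestamped key/value pairs to fixed 120-second windows and sums the values per key and window. Triggers and accumulation ("when" and "how") are not modeled, and the function name and sample data are made up for illustration.

```python
from collections import defaultdict

WINDOW_SIZE = 120  # seconds, matching FixedWindows(120) on the slide


def sum_per_key_and_window(events, window_size=WINDOW_SIZE):
    """Sum values per (key, window start) for (timestamp, key, value) events."""
    totals = defaultdict(int)
    for timestamp, key, value in events:
        # A fixed window is identified by its start: the timestamp rounded
        # down to a multiple of the window size.
        window_start = timestamp - (timestamp % window_size)
        totals[(key, window_start)] += value
    return dict(totals)


# Hypothetical sample data: (event time in seconds, key, value)
events = [(5, "a", 1), (100, "a", 2), (130, "a", 4), (119, "b", 8)]
result = sum_per_key_and_window(events)
```

Here the two "a" events at 5 s and 100 s share the window starting at 0, while the event at 130 s lands in the window starting at 120.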

SDKs (pipeline construction):
● Python: input | Sum.PerKey()
● Java: input.apply(Sum.integersPerKey())
● SQL (via Java): SELECT key, SUM(value) FROM input GROUP BY key
● ⋮
Pipeline representation:
● Sum Per Key (Java objects)
● Sum Per Key (Dataflow JSON API)
Runners:
● Cloud Dataflow
● Apache Spark
● Apache Flink
● Apache Apex
● Gearpump
● Apache Samza
● Apache Nemo (incubating)
● IBM Streams
https://s.apache.org/state-of-beam-sfo-2018

SDKs (pipeline construction):
● Python: input | Sum.PerKey()
● Go: stats.Sum(s, input)
● SQL (via Java): SELECT key, SUM(value) FROM input GROUP BY key
● Java: input.apply(Sum.integersPerKey())
● ⋮
Pipeline representation:
● Sum Per Key (Java objects)
● Sum Per Key (Portable protos)
Runners:
● Apache Spark
● Apache Flink
● Apache Apex
● Gearpump
● Cloud Dataflow
● Apache Samza
● Apache Nemo (incubating)
● IBM Streams
https://s.apache.org/state-of-beam-sfo-2018

Components:
● Job Service, Artifact Staging, Job Manager
● Fn Services: Provision, Control, Data, Artifact Retrieval, State, Logging
● Cluster, Runner
● Dependencies (optional)
● SDK Worker (UDFs), SDK Worker (Python)

Job submission:
python -m apache_beam.examples.wordcount \
    --input=/etc/profile \
    --output=/tmp/py-wordcount-direct \
    --runner=PortableRunner \
    --job_endpoint=localhost:8099 \
    --streaming

gRPC interfaces for communication between SDK
harness and Runner
https://s.apache.org/beam-fn-api
● Control: Used to tell the SDK which UDFs to execute and when to execute
them.
● Data: Used to move data between the language specific SDK harness and
the runner.
● State: Used to support user state, side inputs, and group by key
reiteration.
● Logging: Used to aggregate logging information from the language
specific SDK harness.
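
To make the division of labor between these planes more concrete, here is a toy, in-process model. It is only a sketch: the real Fn API uses gRPC services and protobuf messages, whereas this example uses Python queues, and every class, method, and marker name in it is hypothetical. The state plane is not modeled.

```python
import queue


class ToySdkHarness:
    """Stands in for a language-specific SDK worker holding registered UDFs."""

    def __init__(self, udfs):
        self.udfs = udfs  # maps a UDF id to a Python callable
        self.logs = []    # stand-in for the logging plane

    def process_bundle(self, udf_id, data_in, data_out):
        # Control plane: the runner names which UDF to execute.
        udf = self.udfs[udf_id]
        count = 0
        # Data plane: elements arrive on one channel, results leave on another.
        while True:
            element = data_in.get()
            if element is None:  # hypothetical end-of-bundle marker
                break
            data_out.put(udf(element))
            count += 1
        self.logs.append(f"{udf_id}: processed {count} elements")
        return count


# The "runner" side: register a UDF, send a bundle, collect the results.
harness = ToySdkHarness({"upper": str.upper})
data_in, data_out = queue.Queue(), queue.Queue()
for element in ["a", "b", None]:
    data_in.put(element)
processed = harness.process_bundle("upper", data_in, data_out)
outputs = [data_out.get() for _ in range(processed)]
```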

Bundle size matters!
● Amortize overhead over many elements
● Watermark hold effect on latency
https://s.apache.org/beam-fn-api-processing-a-bundle
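
The amortization argument can be sketched with a simple cost model: if each bundle carries a fixed overhead, the average cost per element falls as the bundle grows. The model and the numbers below are assumptions chosen for illustration, not measurements from the talk.

```python
def cost_per_element_ms(bundle_size, bundle_overhead_ms, element_cost_ms):
    """Average time per element when a fixed per-bundle cost is shared."""
    return bundle_overhead_ms / bundle_size + element_cost_ms


# Hypothetical numbers: 2 ms fixed overhead per bundle, 0.05 ms per element.
costs = {size: cost_per_element_ms(size, 2.0, 0.05) for size in (1, 10, 100, 1000)}
for size, cost in costs.items():
    print(f"bundle size {size:>4}: {cost:.3f} ms per element")
```

Under this model the overhead share shrinks quickly with bundle size, but the second bullet on the slide points at the trade-off: larger bundles take longer to finish, which affects latency.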

https://s.apache.org/beam-fn-api-send-and-receive-data

● Provide Job Service endpoint (Job Management API)
● Translate portable pipeline representation to native (Flink) API
● Provide gRPC endpoints for control/data/logging/state plane
● Manage SDK worker processes that execute user code
● Manage bundle execution (with arbitrary user code) via Fn API
● Manage state for side inputs, user state/timers
Reference runner provides a common implementation baseline for JVM-based
runners (/runners/java-fn-execution), and we have a portable
ValidatesRunner integration test suite in Python!

● Job Server packaging (fat jar)
● Pipeline translators for batch (DataSet) and
streaming (DataStream)
○ Translation/operators for primitive URNs: Impulse,
Flatten, GBK, Assign Windows, Executable Stage,
Reshuffle
● Side input handlers based on Flink State
● User State and Timer integration
● Flink Job Launch (same as old, non-portable runner)
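
One way to picture the translation step is a dispatch table from primitive transform kinds to translator functions. The sketch below is a simplified stand-in: a real runner keys on URN strings from the portable pipeline proto and emits Flink operators, while this example keys on the short names from the slide and returns descriptive strings. All function names and output labels are hypothetical.

```python
def translate_pipeline(transforms, translators):
    """Translate each (name, kind) transform using a per-kind translator."""
    plan = []
    for name, kind in transforms:
        if kind not in translators:
            raise NotImplementedError(f"no translator for {kind!r} ({name})")
        plan.append(translators[kind](name))
    return plan


# Hypothetical translators; each returns a description of a native operator.
TRANSLATORS = {
    "Impulse": lambda name: f"source[{name}]",
    "Flatten": lambda name: f"union[{name}]",
    "GBK": lambda name: f"keyed-group[{name}]",
    "Assign Windows": lambda name: f"window-assign[{name}]",
    "Executable Stage": lambda name: f"sdk-harness-operator[{name}]",
    "Reshuffle": lambda name: f"rebalance[{name}]",
}

pipeline = [("start", "Impulse"), ("decode", "Executable Stage"), ("group", "GBK")]
plan = translate_pipeline(pipeline, TRANSLATORS)
```

A transform kind without a registered translator fails fast, so missing primitives surface at translation time rather than during execution.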

● Translator extension for streaming sources
○ Kinesis, Kafka consumers that we also use in Java Flink jobs
○ Message decoding, watermarking
● Python execution environment for SDK workers
○ Tailored to internal deployment tooling
○ Docker-free, frozen virtual envs
https://github.com/lyft/beam/tree/release-2.10.0-lyft

Fn API
● Fn API Overhead 15% ?
● Fused stages
● Bundle size
● Parallelize SDK workers
● TODO: Cython, protobuf C++ bindings
decode, …, window count
(messages
 | 'reshuffle' >> beam.Reshuffle()
 | 'decode' >> beam.Map(lambda x: (__import__('random').randint(0, 511), 1))
 | 'noop1' >> beam.Map(lambda x: x)
 | 'window' >> beam.WindowInto(window.GlobalWindows(),
                               trigger=Repeatedly(AfterProcessingTime(5 * 1000)),
                               accumulation_mode=AccumulationMode.DISCARDING)
 | 'group' >> beam.GroupByKey()
 | 'count' >> beam.Map(count)
)
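
The "fused stages" bullet refers to running adjacent element-wise steps together in one stage, so an element does not cross the runner/SDK boundary between them. A minimal plain-Python sketch of that idea follows; it is not Beam's fusion implementation, and the `fuse` helper and the deterministic `decode` stand-in are hypothetical.

```python
from functools import reduce


def fuse(*fns):
    """Compose element-wise functions so they run back to back in one stage."""
    return lambda element: reduce(lambda value, fn: fn(value), fns, element)


# Stand-ins for the 'decode' and 'noop1' steps of the benchmark pipeline.
decode = lambda x: (len(x) % 512, 1)  # hypothetical; the slide uses a random key
noop1 = lambda x: x

fused_stage = fuse(decode, noop1)
outputs = [fused_stage(message) for message in [b"abc", b"hello"]]
```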

● c5.4xlarge machines (16 vCPU, 32 GB)
● 16 SDK workers / machine
● 1000 ms or 1000 records / bundle
● ~ 17,500 transforms / second / worker
● Python user code will be the gating factor
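
The "1000 ms or 1000 records / bundle" setting means a bundle is closed by whichever limit is reached first. The generator below sketches that rule over a list of pre-timestamped elements. It is an illustration only: the function and parameter names are hypothetical, and a real runner would close bundles with a timer rather than waiting for the next arrival.

```python
def bundles(timed_elements, max_records=1000, max_millis=1000):
    """Group (arrival_ms, element) pairs into bundles closed by count or age."""
    bundle, opened_at = [], None
    for arrival_ms, element in timed_elements:
        # Time limit: close the open bundle if it is too old.
        if bundle and arrival_ms - opened_at >= max_millis:
            yield bundle
            bundle = []
        if not bundle:
            opened_at = arrival_ms
        bundle.append(element)
        # Count limit: close the bundle once it is full.
        if len(bundle) >= max_records:
            yield bundle
            bundle = []
    if bundle:
        yield bundle


# Hypothetical arrivals: (arrival time in ms, element)
by_count = list(bundles(
    [(0, "a"), (10, "b"), (20, "c"), (1500, "d"), (1600, "e")], max_records=3))
by_time = list(bundles([(0, "a"), (500, "b"), (1200, "c")]))
```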

● Pipelines written in non-JVM languages on JVM runners
○ Python, Go
● Full isolation of user code
○ Native CPython execution w/o library restrictions
● Configurable SDK worker execution
○ Docker, Process, Embedded, ...
● Multiple languages in a single pipeline (future)
○ Use Java Beam IO with Python
○ Use TFX with Java
○ ...

Feature Support Matrix (Beam 2.10.0)
https://s.apache.org/apache-beam-portability-support-table

Roadmap
https://beam.apache.org/roadmap/portability/

● Streaming Connectors for Python SDK
○ Mixing and matching connectors written in different languages?
○ Splittable DoFn (SDF)
● Python 3
● User Documentation
● More portable runners

Beam Portability Framework
https://beam.apache.org/roadmap/portability/
https://beam.apache.org/contribute/design-documents/#portability
Apache Beam
https://beam.apache.org
https://s.apache.org/slack-invite #beam #beam-portability
https://beam.apache.org/community/contact-us/
Follow @ApacheBeam on Twitter
