1
2
1. Why Beam at Lyft
2. Beam cross-language support
3. Python Streaming on Flink
4. What’s next
Apache Beam Apache Flink
3
Why Beam at Lyft
4
5
6
[Diagram] Stream / Schema Registry; Deployment Tooling; Metrics & Dashboards; Alerts; Logging; Amazon EC2; Amazon S3; Wavefront; Salt (Config / Orca); Docker
7
● Many big data ecosystem projects are Java / JVM based
○ Barrier to entry for teams that want to adopt streaming but don’t have the Java skills
● Support use cases for different language environments
○ Python primary option for Machine Learning
● Cost of many API styles and runtime environments
● (Currently no good option for native Python + Streaming)
8
1. Unified model (Batch + strEAM): What / Where / When / How
2. SDKs (Java, Python, Go, ...) & DSLs (SQL, Scala, …)
3. Runners for Existing Distributed Processing Backends (Google Dataflow, Spark, Flink, …)
4. IOs: Data store Sources / Sinks
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
9
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages. Includes IOs: connectors to data stores.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines
[Diagram] Beam Model: Pipeline Construction (Beam Java, Beam Python, Other Languages); Beam Model: Fn Runners (Apache Flink, Apache Spark, Cloud Dataflow); Execution
https://s.apache.org/apache-beam-project-overview
10
● Started with Java SDK and Java Runners
● 2016: Initiate cross-language support effort
● 2017: Python SDK on Dataflow
● 2018: Go SDK (for portable runners)
● 2018: Python on Flink MVP
● Next: Cross-language pipelines, Samza and other (?) runners
11
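import apache_beam as beam
from apache_beam import CombinePerKey, Map, WindowInto
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)
from apache_beam.transforms.window import FixedWindows

# runner and pipeline_options are defined elsewhere.
p = beam.Pipeline(runner=runner, options=pipeline_options)
(p
 | ReadFromText("/path/to/text*")
 | Map(lambda line: ...)
 | WindowInto(FixedWindows(120),
              trigger=AfterWatermark(
                  early=AfterProcessingTime(60),
                  late=AfterCount(1)),
              accumulation_mode=AccumulationMode.ACCUMULATING)
 | CombinePerKey(sum)
 | WriteToText("/path/to/outputs")
)
result = p.run()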
( What, Where, When, How )
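To make the "what" and "where" of the pipeline above concrete, here is a minimal plain-Python sketch of what FixedWindows(120) followed by CombinePerKey(sum) computes. It is an illustration only, not Beam code: it ignores triggers, late data and accumulation mode, and the event tuples are hypothetical.

```python
from collections import defaultdict

WINDOW_SIZE_S = 120  # same size as FixedWindows(120) on the slide


def window_start(timestamp_s, size_s=WINDOW_SIZE_S):
    """Return the start of the fixed window that contains timestamp_s."""
    return timestamp_s - (timestamp_s % size_s)


def sum_per_key_per_window(events, size_s=WINDOW_SIZE_S):
    """Sum values per (window start, key) for (timestamp_s, key, value) events."""
    totals = defaultdict(int)
    for timestamp_s, key, value in events:
        totals[(window_start(timestamp_s, size_s), key)] += value
    return dict(totals)


# Hypothetical input: (event time in seconds, key, value).
events = [(5, "a", 1), (110, "a", 2), (130, "a", 4), (119, "b", 8), (120, "b", 16)]
result = sum_per_key_per_window(events)
for (start, key), total in sorted(result.items()):
    print(f"window [{start}, {start + WINDOW_SIZE_S}) key={key} sum={total}")
```

The "when" and "how" parts (early and late firings, accumulating panes) are what the trigger and accumulation_mode arguments add on top of this grouping.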
12
Beam Portability
13
[Diagram] Sum Per Key in each SDK:
Python: input | Sum.PerKey()
Java: input.apply(Sum.integersPerKey())
SQL (via Java): SELECT key, SUM(value) FROM input GROUP BY key
Pipeline representations: Java objects; Dataflow JSON API
Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams
https://s.apache.org/state-of-beam-sfo-2018
14
[Diagram] Sum Per Key in each SDK:
Python: input | Sum.PerKey()
Go: stats.Sum(s, input)
SQL (via Java): SELECT key, SUM(value) FROM input GROUP BY key
Java: input.apply(Sum.integersPerKey())
Pipeline representations: Java objects; Portable protos
Runners: Apache Spark, Apache Flink, Apache Apex, Gearpump, Cloud Dataflow, Apache Samza, Apache Nemo (incubating), IBM Streams
https://s.apache.org/state-of-beam-sfo-2018
15
[Diagram] Job Service; Artifact Staging; Job Manager; Fn Services (Provision, Control, Data, Artifact Retrieval, State, Logging); Cluster; Runner; Dependencies (optional); SDK Worker (UDFs); SDK Worker (UDFs); SDK Worker (Python)
python -m apache_beam.examples.wordcount \
  --input=/etc/profile \
  --output=/tmp/py-wordcount-direct \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --streaming
16
gRPC interfaces for communication between SDK harness and Runner
https://s.apache.org/beam-fn-api
● Control: Used to tell the SDK which UDFs to execute and when to execute them.
● Data: Used to move data between the language specific SDK harness and the runner.
● State: Used to support user state, side inputs, and group by key reiteration.
● Logging: Used to aggregate logging information from the language specific SDK harness.
17
Bundle size matters!
● Amortize overhead over many elements
● Watermark hold effect on latency
https://s.apache.org/beam-fn-api-processing-a-bundle
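The amortization point can be shown with a little arithmetic. The sketch below assumes a simple cost model (a fixed overhead per bundle plus a constant cost per element); the model and both numbers are hypothetical and only illustrate the shape of the trade-off, they are not measurements from the talk.

```python
def per_element_cost_us(bundle_size, per_bundle_overhead_us, per_element_work_us):
    """Average cost per element when a fixed bundle overhead is shared by all elements."""
    if bundle_size < 1:
        raise ValueError("bundle_size must be at least 1")
    return per_element_work_us + per_bundle_overhead_us / bundle_size


# Hypothetical numbers, chosen only to show the shape of the curve.
OVERHEAD_US = 1000.0  # assumed fixed cost per bundle
WORK_US = 10.0        # assumed cost of the user code per element

costs = {n: per_element_cost_us(n, OVERHEAD_US, WORK_US) for n in (1, 10, 100, 1000)}
for n, cost in costs.items():
    print(f"bundle size {n:>4}: {cost:.1f} us per element")
```

Under this model the per-element cost falls quickly as bundles grow, but a larger bundle also takes longer to finish, which is where the latency effect of the watermark hold comes in.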
18
https://s.apache.org/beam-fn-api-send-and-receive-data
19
Beam Flink Runner
20
● Provide Job Service endpoint (Job Management API)
● Translate portable pipeline representation to native (Flink) API
● Provide gRPC endpoints for control/data/logging/state plane
● Manage SDK worker processes that execute user code
● Manage bundle execution (with arbitrary user code) via Fn API
● Manage state for side inputs, user state/timers
Reference runner provides a common implementation baseline for JVM-based runners (/runners/java-fn-execution), and we have a portable ValidatesRunner integration test suite in Python!
21
● Job Server packaging (fat jar)
● Pipeline translators for batch (DataSet) and streaming (DataStream)
○ Translation/operators for primitive URNs: Impulse, Flatten, GBK, Assign Windows, Executable Stage, Reshuffle
● Side input handlers based on Flink State
● User State and Timer integration
● Flink Job Launch (same as old, non-portable runner)
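One way to picture the translators for primitive URNs is a lookup table from transform URN to a translation step. The sketch below is not the Flink runner's code: the URN strings are assumptions written in the style of Beam's "beam:transform:...:v1" naming and should be checked against the Beam portability protos, and translate_pipeline is a hypothetical helper that only returns labels.

```python
# Assumed URN strings (verify against the Beam portability protos) mapped to
# the primitive names listed on the slide.
PRIMITIVE_TRANSLATORS = {
    "beam:transform:impulse:v1": "Impulse",
    "beam:transform:flatten:v1": "Flatten",
    "beam:transform:group_by_key:v1": "GBK",
    "beam:transform:window_into:v1": "Assign Windows",
    "beam:runner:executable_stage:v1": "Executable Stage",
    "beam:transform:reshuffle:v1": "Reshuffle",
}


def translate_pipeline(transforms):
    """Map (name, urn) pairs to operator labels, failing on unknown URNs."""
    operators = []
    for name, urn in transforms:
        if urn not in PRIMITIVE_TRANSLATORS:
            raise NotImplementedError(f"no translator for {urn!r} ({name})")
        operators.append(f"{name}: {PRIMITIVE_TRANSLATORS[urn]}")
    return operators


# Hypothetical two-step pipeline.
operators = translate_pipeline([
    ("read", "beam:transform:impulse:v1"),
    ("group", "beam:transform:group_by_key:v1"),
])
print(operators)
```

Failing loudly on an unknown URN mirrors the idea that a runner only needs to understand a small set of primitives, with everything else executed by the SDK worker inside an executable stage.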
22
● Translator extension for streaming sources
○ Kinesis, Kafka consumers that we also use in Java Flink jobs
○ Message decoding, watermarking
● Python execution environment for SDK workers
○ Tailored to internal deployment tooling
○ Docker-free, frozen virtual envs
https://github.com/lyft/beam/tree/release-2.10.0-lyft
23
Fn API
● Fn API Overhead 15% ?
● Fused stages
● Bundle size
● Parallelize SDK workers
● TODO: Cython, protobuf C++ bindings
decode, …, window count
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime, Repeatedly

# messages (the input PCollection) and count are defined elsewhere.
(messages
 | 'reshuffle' >> beam.Reshuffle()
 | 'decode' >> beam.Map(lambda x: (__import__('random').randint(0, 511), 1))
 | 'noop1' >> beam.Map(lambda x: x)
 | 'noop2' >> beam.Map(lambda x: x)
 | 'noop3' >> beam.Map(lambda x: x)
 | 'window' >> beam.WindowInto(window.GlobalWindows(),
                               trigger=Repeatedly(AfterProcessingTime(5 * 1000)),
                               accumulation_mode=AccumulationMode.DISCARDING)
 | 'group' >> beam.GroupByKey()
 | 'count' >> beam.Map(count)
)
24
● c5.4xlarge machines (16 vCPU, 32 GB)
● 16 SDK workers / machine
● 1000 ms or 1000 records / bundle
● ~ 17,500 transforms / second / worker
● Python user code will be the gating factor
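The "1000 ms or 1000 records / bundle" setting can be read as a simple cut-off rule, and the per-machine rate follows from the other two figures. The sketch below assumes that reading; BundleLimiter is a hypothetical helper, not a Beam or Flink class, and it only checks the age limit when an element arrives, whereas a real runner would also need a timer.

```python
import time


class BundleLimiter:
    """Closes a bundle after max_records elements or max_age_s seconds, whichever comes first."""

    def __init__(self, max_records=1000, max_age_s=1.0, clock=time.monotonic):
        self.max_records = max_records
        self.max_age_s = max_age_s
        self._clock = clock
        self._count = 0
        self._started_at = None

    def add(self):
        """Record one element; return True if the bundle should be finished now."""
        if self._started_at is None:
            self._started_at = self._clock()
        self._count += 1
        too_many = self._count >= self.max_records
        too_old = self._clock() - self._started_at >= self.max_age_s
        if too_many or too_old:
            # Reset so the next element starts a new bundle.
            self._count = 0
            self._started_at = None
            return True
        return False


# Throughput arithmetic using the figures on the slide.
WORKERS_PER_MACHINE = 16
TRANSFORMS_PER_SECOND_PER_WORKER = 17_500
per_machine = WORKERS_PER_MACHINE * TRANSFORMS_PER_SECOND_PER_WORKER
print(f"~{per_machine:,} transforms / second / machine")
```

With 16 workers at roughly 17,500 transforms per second each, one machine handles on the order of 280,000 transforms per second in this benchmark.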
25
What’s next
26
● Pipelines written in non-JVM languages on JVM runners
○ Python, Go
● Full isolation of user code
○ Native CPython execution w/o library restrictions
● Configurable SDK worker execution
○ Docker, Process, Embedded, ...
● Multiple languages in a single pipeline (future)
○ Use Java Beam IO with Python
○ Use TFX with Java
○ ...
27
Feature Support Matrix (Beam 2.10.0)
https://s.apache.org/apache-beam-portability-support-table
28
Roadmap
https://beam.apache.org/roadmap/portability/
29
● Streaming Connectors for Python SDK
○ Mixing and matching connectors written in different languages ?
○ Splittable DoFn (SDF)
● Python 3
● User Documentation
● More portable runners
30
Beam Portability Framework
https://beam.apache.org/roadmap/portability/
https://beam.apache.org/contribute/design-documents/#portability
Apache Beam
https://beam.apache.org
https://s.apache.org/slack-invite #beam #beam-portability
https://beam.apache.org/community/contact-us/
Follow @ApacheBeam on Twitter
31


Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
