www.parallelm.com
Python Streaming API
Zohar Mizrahi
Senior Software Architect
ParallelM
Flink Forward 2017,
Berlin, Germany
What Will Be Covered
● ParallelM
● Python in Machine Learning
● Python Batch API
● Python Streaming API
● Live Demo
ParallelM
ParallelM accelerates time to value of AI initiatives
by helping ML Ops and Data Science teams
deploy and manage Machine Learning (ML) in production.
We have invested heavily in Flink because of its
exceptional design and its ability to handle high-speed,
real-time, true stream processing.
Python In Machine Learning
● Popular in data analysis (NumPy, SciPy, Matplotlib,
pandas, etc.)
● Very easy to learn
● Very easy to read
● Does not require compilation
● Awesome online community
Python Batch Processing Overview
Components shown in the diagram:
● Client (JVM), with the Python script
● HDFS
● JobManager (JVM)
● TaskManager (JVM), with the Python script and MMAP files
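The MMAP files in the overview can be illustrated with Python's standard `mmap` module. This is a minimal sketch for CPython 3, assuming one length-prefixed record per file; the record layout and the names `write_record` and `read_record` are hypothetical and do not reproduce Flink's actual exchange format.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct(">I")  # 4-byte big-endian length prefix (assumed layout)


def write_record(path, payload, size=4096):
    """Write one length-prefixed record into a memory-mapped file."""
    if HEADER.size + len(payload) > size:
        raise ValueError("payload does not fit in the mapped region")
    with open(path, "wb") as f:
        f.truncate(size)  # mmap needs a file with a fixed, non-zero size
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), size) as region:
            region[:HEADER.size] = HEADER.pack(len(payload))
            region[HEADER.size:HEADER.size + len(payload)] = payload
            region.flush()


def read_record(path):
    """Read the record back, as the process on the other side would."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as region:
            (length,) = HEADER.unpack(region[:HEADER.size])
            return bytes(region[HEADER.size:HEADER.size + length])


def demo():
    """Round-trip one record through a temporary mapped file."""
    path = os.path.join(tempfile.mkdtemp(), "exchange.bin")
    write_record(path, b"hello flink")
    return read_record(path)
```

Both sides map the same file, so the payload is shared through the page cache rather than sent over a socket.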
… but Flink is a Stream Processing framework ...
What about Python for Streaming Data?
Proposed Python Streaming API Architecture
Components shown in the diagram:
● Client (JVM): Jython, Python script, Java Streaming API
● HDFS
● JobManager (JVM)
● TaskManager (JVM): Jython, Python script UDFs
● Python Streaming API
○ Python naming
○ Serialization/Deserialization
○ Thin layer
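The "Python naming" and "thin layer" points can be pictured as a wrapper that accepts snake_case names and forwards them to camelCase methods. The sketch below is only an illustration: `JavaLikeStream` is a hypothetical stand-in for a Java object, and `ThinWrapper` is not the class used in the pull request.

```python
import re


def to_camel_case(name):
    """Convert a snake_case name such as 'flat_map' to 'flatMap'."""
    return re.sub(r"_([a-z])", lambda m: m.group(1).upper(), name)


class JavaLikeStream(object):
    """Hypothetical stand-in for a Java object with camelCase methods."""

    def __init__(self, items):
        self.items = list(items)

    def flatMap(self, fn):
        return JavaLikeStream(out for item in self.items for out in fn(item))

    def keyBy(self, fn):
        # Sorting merely stands in for partitioning the stream by key.
        return JavaLikeStream(sorted(self.items, key=fn))


class ThinWrapper(object):
    """Expose Python-style names and delegate to the wrapped object."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        target = getattr(self._wrapped, to_camel_case(name))
        if not callable(target):
            return target

        def call(*args, **kwargs):
            result = target(*args, **kwargs)
            # Re-wrap stream results so that calls can be chained.
            if isinstance(result, JavaLikeStream):
                return ThinWrapper(result)
            return result

        return call


stream = ThinWrapper(JavaLikeStream(["b a", "c"]))
words = stream.flat_map(lambda line: line.split()).key_by(lambda w: w).items
```

The wrapper holds no logic of its own: every call is translated by name and passed straight through, which is what keeps such a layer thin.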
Jython
● Python engine in Java (http://www.jython.org/)
● Possible to use CPython extensions like NumPy or SciPy
with JyNI (http://jyni.org/)
● Glitches
○ The latest supported Python version is 2.7
○ No official statement on upcoming support for Python 3.x
Which Jython Challenges Did I Have to Solve?
● Java class serial ID mismatch
○ Triggered by executing the same script multiple times
● Separate namespaces for Python classes with the same name
but different code
○ Triggered by executing different scripts that define the same class name
● Python path and import issues
○ A Python script may import additional files and folders
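The class-name collision can be illustrated in plain Python: each script is executed in its own module object, named after a hash of its source, so two `Tokenizer` classes never share a namespace. This is a sketch of one possible approach, not necessarily the one taken in the pull request; `load_script` and the script bodies are hypothetical.

```python
import hashlib
import types


def load_script(source, label):
    """Execute script source in its own module, named after its content hash."""
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:8]
    module = types.ModuleType("%s_%s" % (label, digest))
    exec(compile(source, module.__name__, "exec"), module.__dict__)
    return module


# Two scripts that define the same class name with different code.
SCRIPT_A = "class Tokenizer(object):\n    def run(self, s):\n        return s.split()\n"
SCRIPT_B = "class Tokenizer(object):\n    def run(self, s):\n        return list(s)\n"

mod_a = load_script(SCRIPT_A, "job")
mod_b = load_script(SCRIPT_B, "job")
```

Because the module name depends on the source text, resubmitting an unchanged script maps to the same name, while a changed script gets a new one.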
Performance Considerations
● Initialisation of the Jython framework imposes a fixed
overhead of 2 to 5 seconds:
○ Client - on every job submission
○ TaskManager - only once, on the first submitted job
● Java/Scala vs. Python
○ No high-scale tests have been conducted yet
○ There is room for optimization
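The "only once, on the first submitted job" behaviour is the usual lazy-initialisation pattern. A minimal sketch, assuming a process-wide shared instance; `get_interpreter` is a hypothetical name and the short sleep only stands in for the real start-up cost.

```python
import time

_interpreter = None
init_count = 0


def get_interpreter(startup_seconds=0.05):
    """Return a shared interpreter stand-in, paying the start-up cost only once."""
    global _interpreter, init_count
    if _interpreter is None:
        time.sleep(startup_seconds)  # stands in for the 2 to 5 s Jython start-up
        _interpreter = object()
        init_count += 1
    return _interpreter


def timed_call():
    """Measure how long one call to get_interpreter() takes."""
    start = time.perf_counter()
    get_interpreter()
    return time.perf_counter() - start


first = timed_call()   # pays the start-up cost
second = timed_call()  # reuses the existing instance
```

A long-lived TaskManager process benefits from this on every job after the first, whereas a client process that starts anew for each submission pays the cost each time.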
Python Script - main
def main():
    env = PythonStreamExecutionEnvironment.get_execution_environment()
    env.read_text_file("/tmp/book.txt") \
        .flat_map(Tokenizer()) \
        .key_by(Selector()) \
        .time_window(milliseconds(50)) \
        .reduce(Sum()) \
        .print()
    env.execute(True)
Python Script - UDF
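# Flink's Java interfaces, imported through Jython
from org.apache.flink.api.common.functions import FlatMapFunction, ReduceFunction
from org.apache.flink.api.java.functions import KeySelector

class Tokenizer(FlatMapFunction):
    # Emit a (1, word) pair for every word in the line
    def flatMap(self, value, collector):
        for word in value.lower().split():
            collector.collect((1, word))

class Selector(KeySelector):
    # Key each pair by its word
    def getKey(self, input):
        return input[1]

class Sum(ReduceFunction):
    # Add up the counts of two pairs that share a word
    def reduce(self, input1, input2):
        count1, word1 = input1
        count2, word2 = input2
        return (count1 + count2, word1)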
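What the three UDFs compute together can be checked without Flink. The sketch below re-declares them as plain Python classes and applies them to a finite list of lines; windowing is ignored, and `ListCollector` and `run_locally` are hypothetical helpers introduced only for this illustration.

```python
from functools import reduce
from itertools import groupby


class ListCollector(list):
    """A plain list that mimics the collector's collect() method."""

    def collect(self, item):
        self.append(item)


class Tokenizer(object):
    def flatMap(self, value, collector):
        for word in value.lower().split():
            collector.collect((1, word))


class Selector(object):
    def getKey(self, input):
        return input[1]


class Sum(object):
    def reduce(self, input1, input2):
        count1, word1 = input1
        count2, word2 = input2
        return (count1 + count2, word1)


def run_locally(lines):
    """Apply flat_map, key_by and reduce to a finite list of lines."""
    tokens = ListCollector()
    for line in lines:
        Tokenizer().flatMap(line, tokens)
    get_key = Selector().getKey
    tokens.sort(key=get_key)  # groupby needs equal keys to be adjacent
    return [reduce(Sum().reduce, group)
            for _, group in groupby(tokens, key=get_key)]


counts = run_locally(["To be or not", "to be"])
# counts == [(2, 'be'), (1, 'not'), (1, 'or'), (2, 'to')]
```

In the streaming job the same reduction runs per key and per 50 ms window, so the counts there depend on which elements fall into each window.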
Status / API Coverage
● Pending pull request (#3838)
● New project under:
flink-libraries/flink-streaming-python
● Partial coverage of the whole streaming API (Beta)
Tests / Examples
● Internal tests are under:
flink-libraries/flink-streaming-python/src/test/python/org/apache/flink/streaming/python/api
● One complete example:
flink-examples/flink-examples-streaming/src/main/python/fibonacci.py
How to execute
> ./bin/pyflink-stream.sh /tmp/fibonacci.py - --local
Notes:
● New command line tool: pyflink-stream.sh
● Command line arguments: after the dash (`-`)
● For local execution: env.execute(True)
● Cluster mode requires HDFS
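A script could turn the `--local` argument into the boolean passed to `env.execute`. This sketch assumes that everything after the dash reaches the script as an ordinary argument list; the slides do not show how pyflink-stream.sh delivers it, and `parse_script_args` is a hypothetical helper.

```python
import argparse


def parse_script_args(argv):
    """Parse the arguments that follow the dash on the command line."""
    parser = argparse.ArgumentParser(description="hypothetical script options")
    parser.add_argument("--local", action="store_true",
                        help="run in a local environment instead of a cluster")
    return parser.parse_args(argv)


args = parse_script_args(["--local"])
# In a real script the flag would be passed on, e.g. env.execute(args.local)
```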
Demo Time
● Fibonacci Python example
● Functionality
○ Calculates the Fibonacci series up to an upper bound
● Input
○ “<x>, <y>” - stream of pairs of numbers
● Output
○ ((<x>, <y>), <#iters>) - the original pair along with the number of iterations
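The per-record computation in the demo can be sketched in plain Python. The bound of 100 and the stopping rule (stop as soon as either term reaches the bound) are assumptions made for this illustration; the slides only state that the series is calculated up to an upper bound.

```python
def fibonacci_iterations(x, y, bound=100):
    """Count Fibonacci steps from (x, y) until a term reaches the bound."""
    first, second = x, y
    iterations = 0
    while first < bound and second < bound:
        first, second = second, first + second
        iterations += 1
    return ((x, y), iterations)


def parse_pair(line):
    """Parse one input record of the form '<x>, <y>'."""
    x, y = (int(part) for part in line.split(","))
    return x, y


results = [fibonacci_iterations(*parse_pair(line))
           for line in ["1, 1", "34, 55"]]
# results == [((1, 1), 10), ((34, 55), 2)]
```

In the streaming version each pair would circulate through the job until it crosses the bound, carrying its iteration counter along.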
Flink Forward Berlin 2017: Zohar Mizrahi - Python Streaming API