Abstract
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua
franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing
robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms.
In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to
"run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its
architecture, including the portability model. We’ll focus on the present state of the community and the
current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss
where Beam is going next, including completion of the portability framework and Streaming SQL.
Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the
glue that interconnects the big data ecosystem.
This session is an Intermediate talk in our IoT and Streaming track. It focuses on Apache Flink,
Apache Kafka, Apache Spark, Cloud, and Other, and is geared towards Architect, Data Scientist, Data
Analyst, Developer / Engineer, and Operations / IT audiences.
Feel free to reuse some of these slides for your own talk
on Apache Beam!
If you do, please add a proper reference / quote / credit.
Present and future of
unified, portable and
efficient data processing
with Apache Beam
Davor Bonaci
PMC Chair, Apache Beam
Apache Beam: Open Source data processing APIs
● Expresses data-parallel batch and streaming
algorithms using one unified API
● Cleanly separates data processing logic
from runtime requirements
● Supports execution on multiple distributed
processing runtime environments
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
Agenda
1. Project timeline so far
2. Expressing data-parallel pipelines with the Beam model
3. The Beam vision for portability
a. Extensibility to integrate the Big Data ecosystem
4. Project roadmap
Apache Beam at DataWorks Summit
● Birds-of-a-feather: IoT, Streaming and Data Flow
○ Panel: Aldrin Piri, Davor Bonaci, Karthik Ramasamy, Jeremy Dyer
○ Yesterday @ 5:40 pm
● Foundations of streaming SQL: stream & table theory
○ Anton Kedin, Software Engineer @ Google
○ Today @ 11:30 am
What have we accomplished so far?
Milestones:
● 02/01/2016: Enter Apache Incubator
● 01/10/2017: Graduation as a top-level project
● 5/16/2017: First stable release
● 3/20/2018: Latest release (2.4.0)
Phases:
● 2016: Incubation
● Early 2016: API stabilization
● Late 2017 & 2018: Enterprise growth
Expressing
data-parallel pipelines
with the Beam model
A unified model for batch and
streaming
Processing time vs. event time
The Beam Model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
The Beam Model: What is being computed?
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
    // Assign each element to a two-minute fixed window in event time.
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
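As a rough illustration of what "where in event time" means, the following plain-Java sketch (it does not use the Beam SDK) assigns each event to a two-minute window from its event timestamp and sums scores per window and key. The class and method names are hypothetical, and windows are assumed to be aligned to the epoch with no offset.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of fixed event-time windows; not Beam SDK code. */
class FixedWindowSketch {
    // Two-minute windows, matching the duration used on the slide.
    static final long WINDOW_MILLIS = 2 * 60 * 1000L;

    /** Start of the window that contains the given event timestamp (milliseconds). */
    static long windowStart(long eventTimeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
    }

    /** Sums scores per "windowStart/key", independent of arrival order. */
    static Map<String, Integer> sumPerWindowAndKey(long[] eventTimes, String[] keys, int[] scores) {
        Map<String, Integer> sums = new TreeMap<>();
        for (int i = 0; i < eventTimes.length; i++) {
            String windowAndKey = windowStart(eventTimes[i]) + "/" + keys[i];
            sums.merge(windowAndKey, scores[i], Integer::sum);
        }
        return sums;
    }
}
```

Because the window is derived from the event timestamp rather than the arrival time, an element that arrives out of order still lands in the window it belongs to.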
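The Beam Model: When in processing time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        // The slide's AtWatermark() is shorthand for this Java SDK trigger.
        // The SDK also expects withAllowedLateness(...) and an accumulation mode here.
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey());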
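A minimal plain-Java sketch of the idea behind a watermark-based trigger: whether a result for a window counts as early, on time, or late depends on where the watermark stands relative to the end of the window. The names below are hypothetical and the boundary handling is simplified; this is not how any particular runner implements it.

```java
/** Plain-Java illustration of classifying firings relative to the watermark. */
class WatermarkSketch {
    enum Timing { EARLY, ON_TIME, LATE }

    /**
     * Classifies a firing for a window that ends at windowEndMillis.
     * Simplification: the watermark is treated as past the window once it
     * reaches the window end.
     */
    static Timing classify(long windowEndMillis, long watermarkMillis, boolean onTimePaneFired) {
        if (watermarkMillis < windowEndMillis) {
            return Timing.EARLY; // the window may still receive on-time data
        }
        // After the watermark passes, the first firing is on time; later ones are late.
        return onTimePaneFired ? Timing.LATE : Timing.ON_TIME;
    }
}
```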
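The Beam Model: How do refinements relate?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        // Closest Java SDK equivalents of the slide's AtWatermark(), AtPeriod(...), and AtCount(1).
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // The SDK also expects an allowed-lateness bound via withAllowedLateness(...).
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());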
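To make the "how do refinements relate" question concrete, here is a small plain-Java sketch (not Beam SDK code; all names are hypothetical) that contrasts accumulating panes, where each firing reports the running total for the window, with discarding panes, where each firing reports only what arrived since the previous firing.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of accumulating versus discarding panes. */
class AccumulationSketch {
    /**
     * Returns the value emitted at each firing. panes[i] holds the scores that
     * arrived between firing i-1 and firing i of the same window.
     */
    static List<Integer> paneOutputs(int[][] panes, boolean accumulating) {
        List<Integer> outputs = new ArrayList<>();
        int runningTotal = 0;
        for (int[] pane : panes) {
            int paneSum = 0;
            for (int score : pane) {
                paneSum += score;
            }
            runningTotal += paneSum;
            // Accumulating: emit everything seen so far. Discarding: emit only the delta.
            outputs.add(accumulating ? runningTotal : paneSum);
        }
        return outputs;
    }
}
```

For panes {5, 1}, {3}, and {4}, accumulating mode yields 6, 9, 13, while discarding mode yields 6, 3, 4. A downstream consumer overwrites its result in the first case and adds to it in the second.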
Customizing What Where When How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
The Beam vision for
portability
“Write once, run anywhere”
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions at the core of Apache Beam
● Choice of SDK: Users write their pipelines in a language that’s familiar and integrated with their other tooling
● Choice of Runners: Users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not
● Scalability for Developers: Clean APIs allow developers to contribute modules independently
[Diagram: Language A, B, and C SDKs; the Beam Model; Runners 1, 2, and 3]
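The separation between describing a pipeline and executing it can be sketched in a few lines of plain Java. This is a hypothetical miniature, not the Beam API: the `Pipeline`, `Runner`, `SequentialRunner`, and `ParallelRunner` types below are invented for illustration, and real runners translate a much richer model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical miniature of "pipeline definition separate from execution". */
class PortabilitySketch {
    /** A pipeline is only a description: an ordered list of element-wise steps. */
    static final class Pipeline {
        final List<Function<Integer, Integer>> steps = new ArrayList<>();

        Pipeline apply(Function<Integer, Integer> step) {
            steps.add(step);
            return this;
        }
    }

    /** A runner decides how the described steps are executed. */
    interface Runner {
        List<Integer> run(Pipeline pipeline, List<Integer> input);
    }

    /** Applies every step to one element; shared by both runners. */
    static Integer applyAll(Pipeline pipeline, Integer element) {
        Integer value = element;
        for (Function<Integer, Integer> step : pipeline.steps) {
            value = step.apply(value);
        }
        return value;
    }

    /** Executes the steps one element at a time, in order. */
    static final class SequentialRunner implements Runner {
        public List<Integer> run(Pipeline pipeline, List<Integer> input) {
            List<Integer> out = new ArrayList<>();
            for (Integer element : input) {
                out.add(applyAll(pipeline, element));
            }
            return out;
        }
    }

    /** Executes the same steps on a parallel stream; encounter order is preserved. */
    static final class ParallelRunner implements Runner {
        public List<Integer> run(Pipeline pipeline, List<Integer> input) {
            return input.parallelStream()
                .map(element -> applyAll(pipeline, element))
                .collect(Collectors.toList());
        }
    }
}
```

The same `Pipeline` object produces the same result on either runner, which is the property the "choice of runners" bullet relies on.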
Beam Vision: as of June 2018
● Beam’s Java SDK runs on multiple runtime environments, including:
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Apache Gearpump (incubating)
● Cross-language infrastructure is in progress.
○ Portable Flink runner is close!
○ Portable Spark runner is coming later
[Diagram: "Beam Model: Pipeline Construction" (Java, Python, Go); "Beam Model: Fn Runners"; runners Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and Cloud Dataflow]
Example Beam Runners
Apache Spark
● Open-source
cluster-computing
framework
● Large ecosystem of
APIs and tools
● Runs on premise or in
the cloud
Apache Flink
● Open-source
distributed data
processing engine
● High-throughput and
low-latency stream
processing
● Runs on premise or in
the cloud
Google Cloud Dataflow
● Fully-managed service
for batch and stream
data processing
● Provides dynamic
auto-scaling,
monitoring tools, and
tight integration with
Google Cloud
Platform
How to think about Apache Beam?
How do you build an abstraction layer?
[Diagram: Apache Spark, Cloud Dataflow, Apache Flink, and two engines labeled "????????"]
Beam: the intersection of runner functionality?
Beam: the union of runner functionality?
Beam: the future!
Categorizing Runner Capabilities
https://beam.apache.org/documentation/runners/capability-matrix/
Getting Started with Apache Beam
Quickstarts
● Java SDK
● Python SDK
Example walkthroughs
● Word Count
● Mobile Gaming
Extensive documentation
Extensibility to integrate the
entire Big Data ecosystem
“Integrating Up, Down, and Sideways”
Extensibility points
● Software Development Kits (SDKs)
● Runners
● Domain-specific extensions (DSLs)
● Libraries of transformations
● IOs
● File systems
Software Development Kits (SDKs)
[Diagram: Language A, B, and C SDKs; the Beam Model; Runners 1, 2, and 3]
Runners
[Diagram: Language A, B, and C SDKs; the Beam Model; Runners 1, 2, and 3]
Domain-specific extensions (DSLs)
[Diagram: DSLs 1, 2, and 3; Language A, B, and C SDKs; the Beam Model]
Libraries of transformations
[Diagram: Libraries 1, 2, and 3; Language A, B, and C SDKs; the Beam Model]
IO connectors
[Diagram: IO connectors 1, 2, and 3; Language A, B, and C SDKs; the Beam Model]
File systems
[Diagram: File systems 1, 2, and 3; Language A, B, and C SDKs; the Beam Model]
Ecosystem integration
● I have an engine
→ write a Beam runner
● I want to extend Beam to new languages
→ write an SDK
● I want to adapt an SDK to a target audience
→ write a DSL
● I want a component that can be part of a bigger data-processing pipeline
→ write a library of transformations
● I have a data storage or messaging system
→ write an IO connector or a file system connector
Apache Beam is
a glue that integrates
the big data ecosystem
Project roadmap
The future: usability & completion
of vision
Beam Vision: as of June 2018
● Beam’s Java SDK runs on multiple runtime environments, including:
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Apache Gearpump (incubating)
● Cross-language infrastructure is in progress.
○ Portable Flink runner is close!
○ Portable Spark runner is coming later
[Diagram: "Beam Model: Pipeline Construction" (Java, Python, Go); "Beam Model: Fn Runners"; runners Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and Cloud Dataflow]
API usability improvements
// Before: inputs are read from the ProcessContext.
collection.apply(ParDo.of(new DoFn<MyType, MyType>() {
    @ProcessElement
    public void process(ProcessContext c, IntervalWindow w) {
    }
}));
// After: each input is declared as an annotated method parameter.
collection.apply(ParDo.of(new DoFn<MyType, MyType>() {
    @ProcessElement
    public void process(@Element MyType element,
                        @Timestamp Instant instant,
                        IntervalWindow window,
                        PaneInfo paneInfo,
                        OutputReceiver<MyType> out) {
    }
}));
Schemas
● Beam currently treats elements as opaque blobs.
● Understanding structure of elements enables simplification of
common tasks and optimizations!
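One way to picture why structure helps: with an opaque blob, a generic "select these fields" step cannot be written without user-supplied decoding logic, whereas a row with named fields can be projected generically. The plain-Java sketch below uses a `Map` as a stand-in for a structured row; it is an assumption for illustration only and is not Beam's schema API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration of projecting fields from a structured row. */
class SchemaSketch {
    /** Returns a new row that contains only the named fields, in the order requested. */
    static Map<String, Object> select(Map<String, Object> row, String... fieldNames) {
        Map<String, Object> projected = new LinkedHashMap<>();
        for (String name : fieldNames) {
            if (!row.containsKey(name)) {
                throw new IllegalArgumentException("Unknown field: " + name);
            }
            projected.put(name, row.get(name));
        }
        return projected;
    }
}
```

Because the field names are known, an unknown field can also be rejected up front instead of failing deep inside user code.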
Canonical streaming use cases
1. Extract-Transform-Load: Transforming and cleaning data as it arrives and loading it into a long-term storage layer.
2. Streaming Analytics: Analysis and aggregation of data streams that produce a table or a real-time dashboard.
3. Real-time Actions: Detecting situations within the event stream and triggering actions in real time.
Work in progress: Streaming analytics
PCollection<Row> filteredNames = testApps.apply(
BeamSql.query(
"SELECT appId, description, rowtime "
+ "FROM PCOLLECTION "
+ "WHERE id=1"));
Work in progress: Complex event processing
Other work in progress
● Performance testing infrastructure
● Build system improvements
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
Still coming up...
● Foundations of streaming SQL: stream & table theory
○ Anton Kedin, Software Engineer @ Google
○ Today @ 11:30 am
