Realtime Visibility with Flink
Rafi Aroch | @RafiAroch
Data Architect
Create a competitive mobility market in order to achieve urban mobility efficiency
Our Vision: Become the Marketplace of Mobility
Our Solution: The One-Stop-Shop Mobility Marketplace
[Diagram: Demand and Supply sides of the Marketplace. Labels: Hotel, Airlines, Private Hire, Airplane, Bicycle, Retail, Healthcare, End-Users, Car Rental, Bus, Train, Taxi. Channels: Consumer Application (B2C), SDK (B2D), Partners (B2B2C).]
Example Use Case
https://thevault.amsterdam/
Datalake Architecture
Requirements
Stakeholders: BI, Support, Data Scientists
• Real-time view of what’s going on
• Build models offline & make decisions in real-time
• Easy-to-work-with data to avoid complex ETLs
Limitations
• Processing is batch based
• Data freshness limited to 1-2 hours
• Raw data was too complex to work with
• Data scattered in many tables requires many ETLs & JOINS
Real Time Visibility with Flink
We needed higher-quality data, served at low latency
Data Fusion
“Data fusion is the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.” (Wikipedia)
Meet Apache Flink
Apache Flink® - Stateful Computations over Data Streams
Streaming Architecture
Ride Fusion - Result
• Processing is done in stream, on every event
• fused-ride
  • Topic is published on every update
  • Events go back to the Datalake for analytical use
• fused-ride-agg
  • Topic is updated every 1 min with active rides
  • Events are inserted into Postgres for user-facing applications
• Data is better arranged, flattened, transformed
• Many RAW topics are fused into one topic. All required data should be contained in the fused-ride event, without JOINs
• Prediction services can evaluate and respond in real-time
Ride Fusion - Topology
Ride Fusion
Session Windows
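A session window groups the events of one key until that key stays inactive for longer than a configured gap. The sketch below is a plain-Python model of that grouping rule, not Flink API code; the function name `session_windows`, the ride keys, and the gap value are illustrative assumptions that do not come from the slides.

```python
from collections import defaultdict


def session_windows(events, gap):
    """Group (key, timestamp) events into per-key sessions.

    A new session starts whenever the time since the previous event of the
    same key exceeds `gap`. Returns {key: [[t1, t2, ...], ...]}.
    """
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)

    sessions = {}
    for key, stamps in by_key.items():
        stamps.sort()
        windows = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - windows[-1][-1] > gap:
                windows.append([ts])    # inactivity gap exceeded: new session
            else:
                windows[-1].append(ts)  # still inside the current session
        sessions[key] = windows
    return sessions


# Hypothetical events: (ride key, timestamp in seconds)
events = [("ride-1", 0), ("ride-1", 40), ("ride-2", 10), ("ride-1", 500)]
result = session_windows(events, gap=300)
```

With these inputs, the third `ride-1` event arrives 460 seconds after the previous one, so it opens a second session for that key.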
SessionWindow JOIN
• INNER JOIN – may not emit any results!
• Pairwise combinations are passed on to the JoinFunction (N * M)
• Enables custom JOIN logic
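The inner-join behavior listed above can be modeled in a few lines. This is a plain-Python sketch of what happens to the two sides of one window for one key, not Flink API code; `window_inner_join`, `merge`, and the event fields are hypothetical names chosen for illustration.

```python
from itertools import product


def window_inner_join(left, right, join_fn):
    """Join both sides of ONE window for ONE key.

    Every left element is paired with every right element (N * M calls to
    join_fn). If either side is empty, nothing is emitted.
    """
    return [join_fn(l, r) for l, r in product(left, right)]


def merge(ride, payment):
    # Custom join logic lives here; this one simply merges the two dicts.
    return {**ride, **payment}


ride_events = [
    {"ride_id": 7, "status": "started"},
    {"ride_id": 7, "status": "ended"},
]
payment_events = [{"ride_id": 7, "amount": 12.5}]

joined = window_inner_join(ride_events, payment_events, merge)  # 2 * 1 results
missing = window_inner_join(ride_events, [], merge)             # no results
```

The second call shows why an inner join may emit nothing: a ride with no payment event in the same window produces no output at all.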
Union + keyBy + ProcessFunction
Inspired by the dA Platform Streaming Ledger SDK & CoGroupedStreams[1]
• FULL JOIN behavior
• Event emitted per update (N + M)
• Good for simple JOIN logic
• Requires extra work
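A rough model of the union + keyBy + ProcessFunction approach: all sources are merged into one stream, state is kept per key, and the fused record is emitted after every update. This is plain Python standing in for Flink keyed state, not Flink API code; `fuse_stream`, the source tags, and the "last write wins" merge rule are assumptions made for the sketch.

```python
def fuse_stream(tagged_events):
    """Fuse a unioned stream of (source, key, payload) events.

    Yields a snapshot of the fused record for the event's key after every
    update, so N + M input events produce N + M output events.
    """
    state = {}  # stands in for per-key keyed state
    for _source, key, payload in tagged_events:
        fused = state.setdefault(key, {"ride_id": key})
        fused.update(payload)  # simple merge logic: last write wins
        yield dict(fused)      # emit a copy on every update


# Hypothetical events from two raw topics, interleaved as they might arrive
unioned = [
    ("ride", 7, {"status": "started"}),
    ("payment", 7, {"amount": 12.5}),
    ("ride", 7, {"status": "ended"}),
]

emitted = list(fuse_stream(unioned))
```

The first emitted record has no `amount` yet, which is the FULL JOIN behavior: output appears even when only one side has arrived. The model never clears `state`; in a real job, state cleanup is presumably part of the extra work.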
Protobuf
• Protobuf is not natively supported
• Only custom Kryo serializer is an option
  • Performance hit
  • No state migration
• Implemented ProtoTypeInfo + ProtoTypeSerializer
• Need to specify TypeInformation where needed to avoid Kryo fallback
• Planned to be added in Flink 1.9 - FLINK-11333
Unified Batch & Stream
• We wanted the ability to re-use the same streaming code for batch. Used for backfill from/to S3
• Unified Batch & Stream is not straightforward (yet)
• StreamingFileSink does not support:
  • Bounded streams
  • Protobuf to Parquet writer
• Using the “older” BucketingSink with custom Parquet writer
• Getting a lot of focus for next releases - FLINK-11875
Dockerized Flink on ECS
• No online resources (Only K8S, compose & standalone)
• Binding to a host from within Docker
• Port collisions when multiple containers sit on the same host
• Dynamic port-mapping
• Using custom entrypoint script for host and port setup
• Jenkins pipeline + Terraform for CI/CD
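The slides do not show the entrypoint script itself. As a loose sketch of the idea, the function below turns host and port values discovered at container start into Flink configuration lines. The environment variable names (`HOST_IP`, `RPC_PORT`, `DATA_PORT`) are hypothetical, and how the real script discovers the host address and the dynamically mapped ports on ECS is not stated in the talk.

```python
def flink_overrides(env):
    """Build flink-conf.yaml override lines from environment values.

    `env` is a dict such as os.environ. HOST_IP is required; the port
    variables are optional and skipped when absent.
    """
    lines = [f"taskmanager.host: {env['HOST_IP']}"]
    optional = (
        ("taskmanager.rpc.port", "RPC_PORT"),
        ("taskmanager.data.port", "DATA_PORT"),
    )
    for config_key, var in optional:
        if var in env:
            lines.append(f"{config_key}: {env[var]}")
    return lines


# Hypothetical values an entrypoint might have resolved for this container
overrides = flink_overrides(
    {"HOST_IP": "10.0.1.23", "RPC_PORT": "32768", "DATA_PORT": "32769"}
)
```

An entrypoint could append these lines to the Flink configuration file before starting the TaskManager, so that each container advertises the host address and its own ports rather than colliding with its neighbors.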
Q&A
Thank You
Rafi Aroch
@RafiAroch


Editor's Notes

  • #5: add bikes / train
  • #6: There are different products, each with multiple Kafka topics
  • #7: We have 3 main stakeholders. They have typical requirements from their data
  • #8: What were the limitations with our current architecture?
  • #9: It’s challenging to do BI like that
  • #10: We came to the conclusion…
  • #11: We needed to take our raw data and fuse it into something better
  • #12: A few words on Flink to whoever is hearing about Flink for the first time
  • #13: We added dockerized Flink on ECS, consuming from the RAW topics and fusing the events into a fused-ride event. Fused-ride-agg is a big aggregated event containing all updates so far
  • #14: what were we able to achieve
  • #15: add custom trigger when "close event" arrives + 5h inactivity; check if aggregate cleans up after 1 minute (make process function with state); maybe add a join explanation chart; maybe add chart from raw events to fused event
  • #16: what were we able to achieve
  • #17: Let's go over some of the main concepts required in this job
  • #18: Let’s look at 2 approaches to implement Data Fusion with Flink. The first one is by using Session Window JOIN
  • #19: add custom trigger when "close event" arrives + 5h inactivity; check if aggregate cleans up after 1 minute (make process function with state); maybe add a join explanation chart; maybe add chart from raw events to fused event