Orchestrating big data and ML pipelines at Lyft
Lyft Talks #4
MODERATOR
Olexii Verkunich
Technical Recruiter
overkunich@lyft.com
Join the Ride! And get a $5K sign-on bonus
(Applicable for ENG openings in Minsk only. Must apply and/or get a job offer between Nov 18 and Jan 30 to qualify)
#PrimerosPasos
Constantine Slisenka
SWE (Software Engineer)
cslisenka@lyft.com
Agenda
1. Use-cases: big data, ML
2. Big data at Lyft: scale, ecosystem
3. Orchestration: Airflow, Flyte
4. Flyte live demo, Q&A
Use-cases
#big data
#ml
Keeping map data accurate and fresh
We have high-quality curated map data from various sources, including open data like OpenStreetMap.
We improve the map's accuracy by processing imagery and GPS telemetry to recognize objects on the road such as road signs, closures, and traffic cones.
The impact is more optimal route calculation, better ETAs, and better trip cost estimation.
Calculating and suggesting pickup spots
We have the history of previous pickups.
We recommend the best nearby pickup options to users.
The impact is a better user experience with less driver friction: pickups in optimal locations and more rides due to fewer cancellations.
Detecting missing and inaccurate destinations
We have anomalies in rides.
We detect when our data shows patterns that suggest inaccurate destinations.
The impact is a better user experience: users can find their destinations effectively, which leads to more rides.
Route calculation, ETA/price estimation
We have rich map data and telemetry from drivers' mobile devices.
We generate real-time speed profiles and build a probabilistic model of routes.
The impact is better routes and better ETA and price estimates.
Forecasting traffic, demand, and supply
We have driver location data, information about events, and ride history.
We forecast demand and supply to understand market balance.
The impact is efficient pricing that serves more rides, and more informed decisions about which incentives to give drivers (e.g. bonus zones).
Big data at Lyft
#ecosystem
#scale
infrastructure compute engines stream processing
development, reporting orchestration, ETL
storage and metadata
50PB data in S3
376K jobs last month
5M containers last month
+400GB last month
ETL: 650K pipeline runs, 24M task executions
8 773 789 938 145 analytical events last month
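To put the figures above in perspective, here is a small back-of-the-envelope calculation. It assumes that the pipeline-run and task-execution counts both belong to the ETL figure, that the published counts are rounded, and that a month has 30 days; none of these assumptions are stated on the slide.

```python
# Figures taken from the slide (rounded as published).
PIPELINE_RUNS = 650_000
TASK_EXECUTIONS = 24_000_000
ANALYTICAL_EVENTS = 8_773_789_938_145

# Assumption: a 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Average number of task executions in one pipeline run (roughly 37).
tasks_per_run = TASK_EXECUTIONS / PIPELINE_RUNS

# Average analytical event rate (roughly 3.4 million per second).
events_per_second = ANALYTICAL_EVENTS / SECONDS_PER_MONTH

print(f"{tasks_per_run:.1f} task executions per pipeline run")
print(f"{events_per_second:,.0f} analytical events per second")
```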
Orchestration
- Run pipelines (scheduled and ad hoc)
- Provide a Python DSL
- Provide integrations with third-party systems (Hive, Presto, Spark, ...)
- Are not compute engines
- Good for batch execution
- Not for data streaming
Orchestration engines at Lyft
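As a rough illustration of what "running a pipeline" means for an orchestration engine, the sketch below executes tasks in dependency order. It uses only the Python standard library and is not Airflow or Flyte code; `run_pipeline` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task once, after all of its upstream tasks have finished.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    """
    # Make every task a node, including tasks without upstream dependencies.
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        # The orchestrator only triggers work; a real engine would hand
        # this step to a compute system such as Hive, Presto, or Spark.
        tasks[name]()
    return order


log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}
order = run_pipeline(tasks, dependencies)
print(order)
```

The engine itself does no heavy computation here, which mirrors the point that orchestrators are not compute engines.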
What is the difference between Flyte and Airflow?
Why was Flyte created?
Why do we use both?
Should I use Flyte or Airflow for my project?
Airflow DAGs
starting from 1.10.0
- Quick and simple to start
- Many integrations (operators)
- Good support for sensor tasks
- Monolithic
- Fixed set of workers
- Does not manage infrastructure
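A sensor task waits for an external condition (for example, a table partition landing) before downstream tasks run. The sketch below shows the general polling idea in plain Python; it is not Airflow's implementation, and `wait_for` and its parameter names are chosen for illustration only.

```python
import time


def wait_for(condition, poke_interval=60.0, timeout=3600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poke_interval)


# Usage with a fake condition that becomes true on the third check.
checks = iter([False, False, True])
partition_ready = wait_for(lambda: next(checks), poke_interval=0.0, timeout=5.0)
print(partition_ready)
```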
No multi-tenancy
- No environment separation: DEV, STAGE, PROD
- Impossible to set up custom libraries and dependencies per DAG
Limited functionality
- No versioning of DAGs (no way to compare outputs of version A vs. B)
- No caching of task results (Airflow is not data aware)
Monolithic scheduler
- The centralized scheduler becomes a bottleneck
No resource management
- Heavy tasks may overwhelm a worker
- Impossible to set per-task resource quotas such as max memory/CPU
TARS
- Airflow development environment for testing and backfilling
- Kubernetes pod with ETL software, libraries, and CLI tools
(TARS is also the robot from the movie Interstellar)
Good tool for classic ETLs that use a standard set of operators and orchestrate third-party systems, when a custom environment and multi-tenancy are not required
Nov 2016: Flyte V0 built for the ETA team at Lyft
Nov 2019: Flyte was open sourced at KubeCon!
Q2 2020: Spotify and Freenome join Flyte as collaborators
Q1 2021: Flyte was donated to LF AI & Data Foundation; Union.ai started
Q3 2021: 15 collaborator organizations, 100+ contributors; Spotify contributes flytekit-java
flyte.org
Multi-tenancy
The workspace is organized into projects
Projects or individual tasks can have different environments
Projects are organized into domains: development, staging, production
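The project/domain layout can be pictured with a small data model. This is a hypothetical sketch in plain Python, not the Flyte API: the `Project` class, the image names, and the `project-domain` naming scheme are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

DOMAINS = ("development", "staging", "production")


@dataclass
class Project:
    name: str
    default_image: str
    # Optional per-task overrides of the project-wide environment.
    task_images: dict = field(default_factory=dict)

    def image_for(self, task_name):
        """Return the task's own image, or fall back to the project default."""
        return self.task_images.get(task_name, self.default_image)


def execution_namespace(project, domain):
    """Build an isolated namespace name for one project in one domain."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    return f"{project.name}-{domain}"


eta = Project("eta", default_image="eta-base:1", task_images={"train": "eta-gpu:3"})
print(eta.image_for("train"), eta.image_for("report"))
print(execution_namespace(eta, "staging"))
```

Keeping the environment on the project, with optional per-task overrides, is one way to let teams use different dependencies without affecting each other.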
Flyte workflows
Language agnostic: can be written in Python and Java; any Docker image can be a task
Versioned: each version is a separate Docker image
Data aware: strong typing for inputs and outputs; Flyte executes tasks based on data dependencies; results can be cached
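Being data aware is what makes result caching possible: if a task's name, version, and inputs are unchanged, a stored output can be reused. The sketch below illustrates that idea with the standard library only; it is not flytekit code, and the `task` decorator and `average_speed` function are hypothetical.

```python
import functools
import hashlib
import json

_cache = {}


def task(version):
    """Cache a task's result keyed by its name, version, and keyword inputs."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(**inputs):
            # Inputs must be JSON-serializable for this simple key scheme.
            payload = json.dumps([fn.__name__, version, inputs], sort_keys=True)
            key = hashlib.sha256(payload.encode()).hexdigest()
            if key not in _cache:
                _cache[key] = fn(**inputs)
            return _cache[key]
        return wrapper
    return decorate


calls = []


@task(version="1")
def average_speed(distance_km: float, hours: float) -> float:
    calls.append((distance_km, hours))
    return distance_km / hours


first = average_speed(distance_km=120.0, hours=2.0)
second = average_speed(distance_km=120.0, hours=2.0)  # served from the cache
third = average_speed(distance_km=90.0, hours=2.0)    # new inputs, runs again
print(first, second, third, len(calls))
```

Including the version in the key means that changing the task's code (and bumping its version) invalidates earlier results.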
Flytekit (Flyte SDK)
operatorframework.io
- Task execution and resource isolation are managed by Kubernetes
- Throttling and queueing are handled by Flyte Propeller
- Multi-tenant
- Allows an isolated environment per task or project
- Supports workflow versioning
- Data aware, can cache task results
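The combination of per-task resource requests and queueing can be sketched as a simple admission check. This is a toy model under stated assumptions, not how Kubernetes or Flyte Propeller actually schedules work; `TaskRequest`, `admit`, and the capacity numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRequest:
    name: str
    cpu: float        # requested CPU cores
    memory_gb: float  # requested memory in GB


def admit(requests, cpu_capacity, memory_capacity_gb):
    """Split requests into tasks that fit now and tasks that must wait."""
    running, queued = [], []
    free_cpu, free_mem = cpu_capacity, memory_capacity_gb
    for req in requests:
        if req.cpu <= free_cpu and req.memory_gb <= free_mem:
            running.append(req.name)
            free_cpu -= req.cpu
            free_mem -= req.memory_gb
        else:
            # The task waits in a queue instead of overwhelming a worker.
            queued.append(req.name)
    return running, queued


requests = [
    TaskRequest("featurize", cpu=2, memory_gb=4),
    TaskRequest("train", cpu=8, memory_gb=32),
    TaskRequest("report", cpu=1, memory_gb=1),
]
running, queued = admit(requests, cpu_capacity=4, memory_capacity_gb=16)
print(running, queued)
```

Because each task declares what it needs, a heavy task is held back rather than starving the lighter ones that share the cluster.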
Overhead
- Ephemeral infrastructure adds startup time overhead
- Teams need to support their Docker images
Anti-patterns
- Table sensing is done much more elegantly in Airflow (use an event-driven approach)
- Not suited for complex parallel computation
Good tool if you need a multi-tenant environment, custom dependencies per task or project, workflow versioning, and compute isolation
Airflow
- Good for classic ETL jobs
- Quick and simple to start
- Good support for table sensing
- No multi-tenancy
- No workflow versioning
- Monolithic, fixed set of workers
- No infrastructure and environment isolation
Flyte
- Good for custom jobs (like ML)
- Multi-tenant, API-friendly
- Supports workflow versioning
- Overhead from ephemeral infrastructure and image maintenance
- Infrastructure and environment isolation based on Kubernetes
Choose the right tool for the right job*
(Some cases can be implemented well on both engines)
Flyte
# live demo
Please ask questions in the
chat!
The best question gets
something special from
our speakers
#PrimerosPasos
Raffle Time!
Now Hiring in Minsk and Kyiv!
Backend, Data, ML Engineers
And many more on our careers page! (lyft.com/careers)
Join the Ride!
Connect with us!
LYFT.COM/CAREERS



Editor's Notes

  • #8: Basemap Collection (imagery): Yury Kunitski Lev Dragunov Map delivery: Thom Dedecko There is lots of good detection imagery for closures here: https://docs.google.com/document/d/123rpa5nop5OtjLhq-YvUcF6V9jAYgqSu39ayYb7SQ0k/edit and you can find some systems overview stuff here, including image of our fleetview camera: https://docs.google.com/presentation/d/11jWVR_EZ8Vgar2yBAwHmc3CwZLRbdmfjq20sycDFCno/edit#slide=id.g2d1a3323cd_0_58
  • #9: Journey Rendezvous Xiaomeng Chen
  • #10: Journey Data Andrey Kravtsov
  • #11: Mapping (Localization) Artsem Semianenka
  • #12: Market signals, forecasting Matthew Smith
  • #14: Market signals, forecasting Matthew Smith
  • #15: Market signals, forecasting Matthew Smith
  • #27: https://confluence.lyft.net/pages/viewpage.action?pageId=207326078
  • #32: https://blog.container-solutions.com/kubernetes-operators-explained ‘Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. Operators follow Kubernetes principles, notably the control loop’. https://kubernetes.io/docs/concepts/extend-kubernetes/operator/