ML Model Serving at Twitter

ML models are core to many Twitter products and services. The ML infrastructure serves trillions of predictions daily across ads, recommendations, safety, and other systems. Model serving faces challenges around performance, robustness, real-time adaptation, and scaling. Twitter addresses these with optimizations such as request batching, transforms shared across models, and a load factor driven by request success rate. Models are updated online, and the service is built to withstand traffic spikes. A parameter server architecture distributes incremental model updates across large, distributed serving groups.

ML Model Serving @Twitter
Joe Xie, Yue Lu and Jack Guo
Twitter: @Joe_Xie, @Yue, @JackGuo8

Outline
• ML Infra Overview
• Model Serving Challenges
• Deep Dive in Solutions
– Performance Optimization
– Robust & Resilient
– Real-time Online Learning
– Scaling with Parameter Server
• Model Serving Scenarios

ML Infra - Overview
• ML is increasingly at the core of everything we build at Twitter
• ML infra supports many product teams
– ads ranking, ads targeting, timeline ranking, product safety, recommendation, moments ranking, trends

ML Infra – Core Prediction Engine
• Large-scale online SGD learning
• Architecture
– Transform: MDL, decision tree
– Feature crossing
– Logistic regression: in-house JVM learner or Vowpal Wabbit
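
The feature-crossing and online SGD logistic regression stages listed above can be sketched roughly as follows. This is a minimal illustration, not Twitter's in-house learner or Vowpal Wabbit: the class name, the pairwise-product crossing scheme, the feature names, and the learning rate are all assumptions, and the transform stage (MDL, decision tree) is left out.

```python
import math
from itertools import combinations


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cross_features(features):
    """Return the features plus every pairwise cross (value = product)."""
    crossed = dict(features)
    for (a, va), (b, vb) in combinations(sorted(features.items()), 2):
        crossed[f"{a}x{b}"] = va * vb
    return crossed


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with plain SGD."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight, created lazily

    def _score(self, x):
        return sum(self.weights.get(name, 0.0) * value for name, value in x.items())

    def predict(self, features):
        return sigmoid(self._score(cross_features(features)))

    def update(self, features, label):
        """Take one SGD step on the log loss; return the pre-update prediction."""
        x = cross_features(features)
        p = sigmoid(self._score(x))
        for name, value in x.items():
            step = self.learning_rate * (p - label) * value
            self.weights[name] = self.weights.get(name, 0.0) - step
        return p


# Hypothetical feature names, for illustration only.
model = OnlineLogisticRegression()
for _ in range(20):
    model.update({"is_follower": 1.0, "has_media": 1.0}, label=1)
print(model.predict({"is_follower": 1.0, "has_media": 1.0}))
```

Because each update touches only the weights of features present in one example, the same model object can keep serving predictions while it learns, which is the property the online learning slides rely on.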

Model Serving Challenges
• Performant: Trillions of predictions served daily
• Robust & Resilient: Traffic spikes during events, e.g. Super Bowl, Oscars, World Cup
• Real-time: News, events, trends, hashtags, ads. Dynamically adapt to changes that unfold within a few hours or even minutes
• Scalability: Horizontal scaling to handle organic growth, new features, and advanced modeling

Performant – Prediction Engine Optimization
• Reduce serialization cost
– Model collocation
– Batch request API
• Reduce compute cost
– Feature id instead of string name
– Transform sharing across models
– Feature cross done on the fly
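
A rough sketch of how these optimizations could fit together: several models collocated in one process, a batch request API, integer feature ids in place of string names, one transform pass shared by all models, and crosses built on the fly instead of being sent over the wire. The class name, the bucketizing transform, the boundaries, and the weights are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bucketize(value, boundaries):
    """Shared transform (assumed): index of the bucket that value falls into."""
    return sum(value >= b for b in boundaries)


class CollocatedModels:
    """Several models in one process, scored through a single batch call."""

    def __init__(self, models, boundaries):
        self.models = models          # model name -> {token or cross: weight}
        self.boundaries = boundaries  # feature id -> bucket boundaries
        self.transform_calls = 0      # counter to show the transform is shared

    def _transform(self, raw):
        self.transform_calls += 1
        return tuple(sorted((fid, bucketize(v, self.boundaries[fid]))
                            for fid, v in raw.items()))

    def predict_batch(self, requests):
        """requests: list of {feature_id: raw_value}; returns one score dict each."""
        results = []
        for raw in requests:
            tokens = self._transform(raw)  # computed once, reused by every model
            # Crosses are derived here, on the fly, from the transformed tokens.
            crosses = [(a, b) for i, a in enumerate(tokens) for b in tokens[i + 1:]]
            keys = list(tokens) + crosses
            results.append({
                name: sigmoid(sum(weights.get(k, 0.0) for k in keys))
                for name, weights in self.models.items()
            })
        return results


# Feature ids 0 and 1 stand in for string names resolved once at load time.
service = CollocatedModels(
    models={"ctr": {(0, 2): 1.0, ((0, 2), (1, 1)): 0.5}, "spam": {}},
    boundaries={0: [18, 30, 50], 1: [50, 140]},
)
scores = service.predict_batch([{0: 35, 1: 100}, {0: 20, 1: 10}])
```

In this sketch a batch of two requests scored by two models triggers two transform passes rather than four, and a single call replaces what would otherwise be four separately serialized requests.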

Robust & Resilient
• Resilient
– Load factor throttles traffic based on the request success rate
• Robust
– Snapshot models at a fixed interval
– Abnormal traffic detection
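
One way a success-rate-driven load factor might work is sketched below. The slides do not describe the actual control rule, so the additive-increase, multiplicative-decrease policy, the thresholds, and all names here are assumptions.

```python
import random


class LoadFactorController:
    """Keeps a load factor in (0, 1]: the fraction of requests admitted."""

    def __init__(self, target_success_rate=0.99, decrease=0.5,
                 increase=0.05, min_factor=0.05):
        self.target_success_rate = target_success_rate
        self.decrease = decrease      # multiplier applied when the service is unhealthy
        self.increase = increase      # additive recovery step when it is healthy
        self.min_factor = min_factor  # never shed all traffic
        self.load_factor = 1.0

    def record_window(self, successes, total):
        """Adjust the load factor from one observation window of request outcomes."""
        if total == 0:
            return self.load_factor
        if successes / total < self.target_success_rate:
            self.load_factor = max(self.min_factor, self.load_factor * self.decrease)
        else:
            self.load_factor = min(1.0, self.load_factor + self.increase)
        return self.load_factor

    def admit(self, rng=random.random):
        """Admit a request with probability equal to the current load factor."""
        return rng() < self.load_factor


controller = LoadFactorController()
controller.record_window(successes=90, total=100)   # below target: factor drops to 0.5
controller.record_window(successes=100, total=100)  # healthy again: factor recovers a step
```

Cutting quickly and recovering slowly is a common choice for this kind of controller, since it sheds load fast during a spike and avoids oscillating once the success rate returns to normal.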

Real time – Online Learning
[Diagram: Training Traffic and Client Read Requests both reach a Prediction Service Instance, which holds the Model]
Scaling – Challenges
• Network fan-out: Each prediction service instance has to receive all training traffic
• Training traffic limit: Training throughput is limited by the capacity of a single instance
• Inefficient serving: A large share of each instance's resources is allocated to training
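
The three challenges above can be made concrete with a small cost model. The figures are invented for illustration, and the model assumes that a training example costs as much to process as a read request; neither comes from the slides.

```python
def integrated_training_load(num_instances, training_qps, instance_capacity_qps):
    """Cost of having every serving instance consume the full training stream."""
    if training_qps > instance_capacity_qps:
        # Training throughput is capped by what a single instance can process.
        raise ValueError("training stream exceeds the capacity of one instance")
    return {
        # Network fan-out: every instance receives every training example.
        "fanout_messages_per_sec": num_instances * training_qps,
        # Share of each instance spent on training rather than serving.
        "training_share_of_instance": training_qps / instance_capacity_qps,
        "read_qps_left_per_instance": instance_capacity_qps - training_qps,
    }


small = integrated_training_load(num_instances=100, training_qps=20_000,
                                 instance_capacity_qps=50_000)
large = integrated_training_load(num_instances=200, training_qps=20_000,
                                 instance_capacity_qps=50_000)
```

Doubling the fleet doubles the fan-out traffic while leaving the per-instance training share unchanged, so under these assumptions adding instances grows read capacity but never reduces the training cost each instance pays.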

Scaling – Parameter Server
• Incremental model updates instead of integrated training
[Diagram labels: Training Traffic, 'Server Node', Server Group, Model Updates, 'Worker Node', Serving Group, Model, Client Read Requests]
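
The idea of shipping incremental model updates instead of training on every serving instance can be sketched as follows. The class names `TrainingNode` and `ServingNode`, the versioning scheme, and the delta format are assumptions made for this example; the sketch does not claim to match the roles of the 'Server Node' and 'Worker Node' in the slide.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, features):
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def apply_delta(weights, delta):
    """Add a sparse weight delta into a weight table."""
    for name, change in delta.items():
        weights[name] = weights.get(name, 0.0) + change


class TrainingNode:
    """Consumes training traffic and emits sparse, versioned weight deltas."""

    def __init__(self, learning_rate=0.1):
        self.weights = {}
        self.learning_rate = learning_rate
        self.version = 0

    def train(self, features, label):
        p = sigmoid(score(self.weights, features))
        # Only the features seen in this example change, so the update is small.
        delta = {name: -self.learning_rate * (p - label) * value
                 for name, value in features.items() if value != 0.0}
        apply_delta(self.weights, delta)
        self.version += 1
        return self.version, delta


class ServingNode:
    """Answers read requests; never sees raw training traffic."""

    def __init__(self):
        self.weights = {}
        self.version = 0

    def apply_update(self, version, delta):
        if version != self.version + 1:
            raise ValueError(f"expected update {self.version + 1}, got {version}")
        apply_delta(self.weights, delta)
        self.version = version

    def predict(self, features):
        return sigmoid(score(self.weights, features))


trainer = TrainingNode()
serving_group = [ServingNode() for _ in range(3)]
examples = [({"f1": 1.0}, 1), ({"f2": 1.0}, 0), ({"f1": 1.0, "f2": 1.0}, 1)]
for features, label in examples:
    version, delta = trainer.train(features, label)
    for node in serving_group:  # broadcast the delta, not the training example
        node.apply_update(version, delta)
```

Each serving node ends up with the same weights as the trainer while doing only cheap additions, which is how this layout addresses the fan-out and wasted-capacity problems of integrated training.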

Model Serving Scenarios
• Static model in-memory integration
• Static model standalone service
• Online learning service with integrated training
• Parameter server with incremental model updates
