Flink-powered stream processing platform at Pinterest
Rainie Li
Software engineer@Pinterest
Kanchi Masalia
Software engineer@Pinterest
Agenda
1. Introduction
2. Challenges & Use cases
3. Platform missions & Frameworks
4. Ongoing Work
5. Q&A
Flink powered stream processing platform at Pinterest
Confidential
|
©
Pinterest
Streaming use cases on Xenon platform
(Chart labels: OKR promised; OKR delivered; ~2x over; ~3x scale)
Why Real Time Stream Processing
● Ads real-time spend and reporting - Calculate spend against budget limits in near real time
to quickly adjust budget pacing and update advertisers with more timely reporting results
● Fast User Signals - Make user content signals available quickly after content creation and use
these signals in ML pipelines for a personalized and fresh user experience
● Real-time Trust & Safety - Reduce levels of unsafe content as close to content creation time as possible
● Fast Insights (Content activation) - Distribute fresh Creator content and surface engagement
metrics to Creators so they can refine their content with minimal feedback delay
● Product Authority (Shopping) - Deliver a trustworthy shopping product experience for users
by updating product metadata in near real time
● Fast Experimentation - Accurately deliver metrics to engineers for faster experiment setup,
verification, and evaluation
Existing Issues
● Fragmented technologies
○ Self-managed Kafka Streams jobs (Ads Infra)
○ Overwatch platform for small batch Spark jobs (Ads Data,
Measurement)
● Lack of developer support
● Availability & scalability issues
Who are we?
● We are a team of engineers, SREs, PM and EM that builds the
stateful stream data processing platform called Xenon at Pinterest.
● We support around 100 engineers who build and operate 100+ Flink
applications.
● We run (near) real-time applications at 300M messages per
second and process 150TB of data per second.
● We have enabled 10+ top level company KRs in the past 3 years.
Xenon platform Mission
● Stability: reliably host all deployed Flink-based stream processing
applications
● Dev Velocity: quickly productionize new use cases / features to
meet business and product needs
● Cloud Efficiency: efficiently operate infrastructure and strive for best
practices
Xenon - Pinterest stream processing platform
(Architecture diagram)
Layers:
● The Developer APIs
● The Resource Management & Job Execution Layer
● The Deployment Stack
Components:
● Cluster Management (YARN)
● NRTG
● Common Libraries and Connectors
● Flink SQL
● Job State Management (Checkpoints, Backups, Restores, Edits)
● Security / Auth (PII/FGAC)
● Job Health & Diagnosis (Dr. Squirrel)
● CI/CD
● Hermez
● Job Management Service
PinStats Analytic
Use case
“Overall, users … cited that currently
they have difficulties monitoring content
performance due to a lack of real-time
data being available, which they find
frustrating.”
Creator Content Use cases
Fast user signals: Make user content signals available quickly after content
creation
Safety: Reduce levels of unsafe content as close to content creation time as possible
(Diagram labels: Content Creation; Audience Targeting; Content Understanding; Quality; Interests & Annotations; Embeddings; Performance)
Ads real-time
spend and
reporting
Calculate spend against budget limits in
real time to quickly adjust budget and
update advertisers with more timely
results
Xenon platform Mission No. 1 - Stability
● Xenon Stability Strategy
● Job Deployment Framework - Hermez and Job Submission Service
● Job Management Service - runtime monitoring of Pinterest stateful streaming
applications and automatic failover to a different AZ
Xenon Job Deployment Framework
(Diagram labels: Repo; Jenkins; Artifactory; S3; Hermez; Job Submission Service; YARN clusters; steps numbered 1 to 8)
Xenon Jobs / Hermez workloads
154
Production Xenon use cases
>90
179
Deployments every day
Highlights
Stability and Tier 1 support
● Enhanced JSS State Machine
● Supported job level dedicated S3 buckets
User experience
● Hermez supported most recent checkpoint deployment
● Hermez supported kill job and distributed shell
● Enriched savepoint information on Hermez
● Track daily & monthly deployment success rate
Metrics
● Job submission latency
Xenon Job Management Service
Monitoring
● Job Status
● Critical metrics (QPS)
● Checkpointing health
● Job/task health
● Notify users
Auto Recovery
Auto recover failed jobs
from:
● Last completed
checkpoint
● Most recent savepoint
● Fresh State
AZ Failure
Resilience
Auto failover jobs to
backup clusters in different
AZs when primary
cluster/AZ goes down
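The recovery and failover behavior listed above can be sketched as a small decision routine. This is only an illustration, not the JMS implementation: it assumes the three restore sources are tried in the order the slide lists them (the slide does not say so explicitly), and the names `JobState`, `choose_restore_point`, and `choose_cluster` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class JobState:
    """Hypothetical record of the restore points known for a failed job."""
    last_completed_checkpoint: Optional[str] = None  # path of a checkpoint, if any
    most_recent_savepoint: Optional[str] = None      # path of a savepoint, if any


def choose_restore_point(state: JobState) -> Tuple[str, Optional[str]]:
    """Pick a restore source: checkpoint first, then savepoint, then fresh state.

    The fallback order is an assumption based on the order of the bullets.
    """
    if state.last_completed_checkpoint:
        return ("checkpoint", state.last_completed_checkpoint)
    if state.most_recent_savepoint:
        return ("savepoint", state.most_recent_savepoint)
    return ("fresh", None)


def choose_cluster(primary: str, backups: List[str], healthy: Set[str]) -> Optional[str]:
    """Return the primary cluster if healthy, otherwise the first healthy backup."""
    for cluster in [primary, *backups]:
        if cluster in healthy:
            return cluster
    return None  # no healthy cluster in any AZ
```

For example, a job whose primary cluster in `az-a` is down and whose last checkpoint is known would be restarted from that checkpoint on the first healthy backup cluster.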
JMS Architecture
(Diagram labels: user; Hermez; JSS; Xenon JMS; Monitoring; Auto Recovery; Deployment; Failover; Statsboard; ZK clusters; Flink API; YARN clusters in AZ-a, AZ-b, and AZ-c)
● Jobs under management: >90
● Faster recovery time: 10X
● Jobs recovered every week: >7
Xenon platform Mission No. 2 - Developer Velocity
● Near Real Time Galaxy - Pinterest stateful streaming application Job
development framework
● CICD - Pinterest stateful streaming application change rollout flow
● Dr. Squirrel - Pinterest self-service streaming application
troubleshooting portal
● Working model - New Use Case Onboarding Process
NRTG
Definition:
● Pinterest stateful streaming application Job development framework
History:
● Galaxy: a high-level managed execution platform for producing and
consuming signals (e.g. entity features) about Pinterest entities (such
as pins, boards, users).
● NRTG (Near Real Time Galaxy): follows the same Galaxy dataflow
API used in batch and extends it to streaming applications.
NRTG components (khaki boxes below)
VIP Navboost Signal (Map Transforms, Async RPC calls, Backfill)
● User code focuses only on Business logic. ✅
● Tune Flink operators using configs. ✅
● ROI: Kappa architecture - roadmap to shutting down an $800K double compute GPU cluster for visual-search batch. 🚧
(Diagram labels: Xenon Flink Application; Code; Config)
Xenon CICD framework - big picture
● Bring the CICD practice from stateless online services to the stateful streaming world
● Leverage the same CICD infrastructure
● Customize the CICD pipeline for validating and deploying Flink-based streaming
applications
● Safely roll out Xenon user / platform changes with minimal
human effort in validation
Xenon CICD pipelines - details
● auto-triggered based on cron rule and availability of new artifacts
● stability checks
○ job submission success
○ no restart-loop
○ savepoint generation success
○ ACA metrics validation
○ auto-recovery from TM/JM failure
● Prod deploy: decider-controlled, safe operations on prod job during
business hours
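The stability checks and the prod-deploy gate above can be pictured as a simple pass/fail pipeline. The sketch below is an assumption about the general shape only: the function names, the treatment of a crashing check as a failure, and the 09:00 to 17:00 business-hours window are all hypothetical, and the real pipeline runs on the shared CICD infrastructure rather than in a script like this.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]  # a check returns True when it passes


def run_stability_checks(checks: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Run every named check and report which ones failed.

    A check that raises is counted as failed (an assumption of this sketch).
    """
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


def may_deploy_to_prod(checks_passed: bool, decider_enabled: bool, hour: int,
                       business_hours: Tuple[int, int] = (9, 17)) -> bool:
    """Allow a prod deploy only if checks passed, the decider is on,
    and the current hour falls inside the (hypothetical) business-hours window."""
    start, end = business_hours
    return bool(checks_passed and decider_enabled and start <= hour < end)
```

A caller would register one check per bullet (job submission, restart-loop, savepoint generation, metrics validation, TM/JM failure recovery) and pass the overall result into `may_deploy_to_prod`.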
Xenon CICD framework - User Interface
Xenon CICD Pipeline UI
● Pipeline execution history
● Pipeline operation: disable / enable / trigger
● Links to Pipeline YAML and Spinnaker
Spinnaker UI
● Pipeline parameters
● Pipeline execution status
● Details about each Stage
Job Debugging tool - Dr. Squirrel
Definition:
● One-stop shop for Flink job troubleshooting
Features:
● Surface suspicious stats to Xenon users instead of users searching for them
○ GC, CPU, memory, backpressure, exceptions, bad config...
● Provide instructions on top of suspicious stats
Goal:
● Cut down troubleshooting time, lower the required Flink internal knowledge for
troubleshooting, increase the dev velocity
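One way to picture "surface suspicious stats and attach instructions" is a small rule table evaluated against a job's metrics. This is a sketch under stated assumptions, not Dr. Squirrel's actual design: the metric names, thresholds, and advice strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    is_suspicious: Callable[[Metrics], bool]  # True when the stat looks off
    instruction: str                          # what the user should try next


# All metric names and thresholds below are hypothetical.
RULES: List[Rule] = [
    Rule("gc", lambda m: m.get("gc_time_ratio", 0.0) > 0.2,
         "High GC time: consider increasing TaskManager heap."),
    Rule("backpressure", lambda m: m.get("backpressure_ratio", 0.0) > 0.5,
         "Sustained backpressure: look for a slow downstream operator or sink."),
    Rule("exceptions", lambda m: m.get("exceptions_per_min", 0.0) > 0,
         "Exceptions seen: check the most recent stack traces."),
]


def diagnose(metrics: Metrics, rules: List[Rule] = RULES) -> List[str]:
    """Return one line of advice per rule that flags the given metrics."""
    return [f"{r.name}: {r.instruction}" for r in rules if r.is_suspicious(metrics)]
```

Keeping rules as data means a new check (for example a bad-config detector) is one more entry in the table rather than a change to the evaluation code.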
Dr. Squirrel UI
Architecture - Part 1
Architecture - Part 2
Working model - New Use Case Onboarding Process
● Xenon team provides managed bootstrap of new use case:
○ best practices in terms of choosing framework and deciding job graph
○ Dev environment setup
○ a buildable and deployable skeleton project (bazel, java, test, configs)
○ Hermez workloads creation
○ CICD pipeline
○ YARN queue
○ dashboard / alerts with default settings
● Xenon developers write and test business logic code
● Support auto-generation of NRTG- and Flink SQL-based projects
Outcome: reduce the onboarding time by 3+ weeks
Xenon platform Mission No. 3 - Cloud efficiency (ongoing)
● Auto Scaling - Auto tuning & Auto scaling up/down Flink applications
● Cluster upgrade - Automatic job migration during platform upgrade
● Resource Optimization - Load balance Xenon clusters
● Evaluate k8s
Auto Scaling
● Service to dynamically adjust job parallelism based on metrics: Kafka lag, CPU utilization, and
backpressure.
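A scaling decision driven by those three signals could look roughly like the function below. It is a minimal sketch, assuming simple threshold rules and a multiplicative step; the thresholds, the step factor, the parallelism cap, and the name `target_parallelism` are all hypothetical and say nothing about how the actual service weighs the metrics.

```python
import math


def target_parallelism(current: int, kafka_lag: int, cpu_utilization: float,
                       backpressured: bool, *, max_parallelism: int = 128,
                       lag_threshold: int = 100_000, cpu_high: float = 0.8,
                       cpu_low: float = 0.3, step: float = 1.5) -> int:
    """Propose a new job parallelism from Kafka lag, CPU utilization, and backpressure.

    Scale up when any signal indicates overload; scale down only when the job
    has no lag and CPU is low; otherwise keep the current value.
    """
    if backpressured or kafka_lag > lag_threshold or cpu_utilization > cpu_high:
        proposed = math.ceil(current * step)
    elif kafka_lag == 0 and cpu_utilization < cpu_low:
        proposed = math.floor(current / step)
    else:
        proposed = current
    # Clamp to a valid range: at least 1, at most the configured cap.
    return max(1, min(max_parallelism, proposed))
```

With these defaults, a job at parallelism 10 with a lag of 500,000 messages would be proposed at 15, and an idle job at parallelism 10 would be proposed at 6.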
Questions?
Anumol Sebastian
Chenqi Liu
Hannah Chen
Divye Kapoor
Kanchi Masalia
Lu Niu
Rainie Li
Teja Thotapalli
Nishant More
Samuel Bahr
Heng Zhang
Kevin Browne
Sergii Marchenko
Ashish Jhaveri
Dinesh Kumar Sekar
Chen Qin
Shaowen Wang
YOU?!
Q & A
Thank you

