Reducing MTTR and False Escalations:
Event Correlation at LinkedIn
Michael Kehoe
Staff Site Reliability Engineer
LinkedIn
Have you ever?
2
False Escalations
• Been woken because your service is unhealthy because of a dependency?
• Been woken because someone believes your service is responsible?
• Spent hours trying to work out why your service is broken?
3
Agenda
• Project Problem Statement
• Project Goals
• Architecture Considerations
• Correlation Engine Overview
• Results & Takeaways
• Questions
$ whoami
4
Michael Kehoe
• Staff Site Reliability Engineer (SRE) @ LinkedIn
• Production-SRE team
• Funny accent = Australian + 3 years American
$ whatis PROD-SRE
5
Michael Kehoe
• Production-SRE
• Develop applications to improve MTTD and MTTR
• Build tools for efficient site issue troubleshooting, issue detection & correlation
• Provide direction on site monitoring
• Assist in restoring stability to services during site critical issues
6
Problem Statement
Service Complexity
Learning Curve MTTR
Reliability
Project Technical Goal
7
Problem Statement
Find the problem with a service within a given time period (or ongoing) using:
• Unified API
• Web Frontend
Project Success Criteria
8
Problem Statement
• Reduce MTTR on incidents
• Reduce false/needless escalations
Expected Use-Cases
9
Problem Statement
Applicable use-cases:
• A service has high latency or error rates
• Find the problematic service(s)
Non-applicable use-cases:
• External monitoring services show slow page-load times
10
Architecture Considerations
• Real-Time metrics analytics (stream processing)
• Ad-Hoc metrics analytics
• Alert Correlation
Evaluation
11
Architecture Considerations
• Real-Time metrics analytics (stream processing)
  • Pros
    • Fast response time
    • Ability to do advanced analytics in real-time
  • Cons
    • Resource intensive (especially at LinkedIn scale)
Evaluation
12
Architecture Considerations
• Ad-Hoc metric analytics
  • Pros
    • Smaller resource footprint
  • Cons
    • Analysis time is slow
Evaluation
13
Architecture Considerations
• Alert Correlation
  • Pros
    • Leverage already existing alerts
    • Strong signal-to-noise ratio
  • Cons
    • Analysis constrained to alerts only (boolean state)
Evaluation
14
Architecture Considerations
• Real-time analytics is expensive, but useful
• Ad-Hoc metric analytics is slower, but cheaper
• Alert Correlation gives us strong signal
15
Correlation Engine Overview
At LinkedIn, we had two smaller projects that we could leverage:
• Drilldown + Site-Stabilizer: Near-Time metric analytics & event correlation
• inVisualize: Alert Correlation
Existing knowledge available
Where to get started
16
Correlation Engine Overview
The ability to correlate is great!
But you need to understand dependencies
Build a callgraph!
Callgraph
17
Correlation Engine Overview
LinkedIn applications emit metrics on a per-API and per-dependency basis
Map metrics to understand dependencies
This makes a callgraph platform simple to build!
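As a rough illustration of mapping per-dependency metrics to a callgraph, here is a minimal Python sketch. The sample layout (caller, callee, call count, latency), the service names and the `build_callgraph` helper are all assumptions made for this example; the slides do not describe LinkedIn's metric schema or the Callgraph implementation.

```python
from collections import defaultdict

# Hypothetical per-dependency metric samples: (caller, callee, call_count, latency_ms).
# The field layout is an assumption for illustration, not LinkedIn's metric schema.
SAMPLES = [
    ("my-frontend", "service-a", 1200, 35.0),
    ("my-frontend", "service-b", 300, 12.5),
    ("service-a", "service-c", 900, 80.0),
    ("service-a", "service-c", 100, 120.0),
]


def build_callgraph(samples):
    """Aggregate samples into caller -> callee -> {calls, avg_latency_ms}."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))
    for caller, callee, calls, latency_ms in samples:
        edge = totals[caller][callee]
        edge[0] += calls
        edge[1] += latency_ms * calls  # weight latency by call volume
    graph = {}
    for caller, callees in totals.items():
        graph[caller] = {
            callee: {"calls": calls, "avg_latency_ms": weighted / calls}
            for callee, (calls, weighted) in callees.items()
            if calls > 0
        }
    return graph


callgraph = build_callgraph(SAMPLES)
print(sorted(callgraph["my-frontend"]))
```

Each edge keeps the two values the deck says are collected (call count and latency), so later stages can rank dependencies by traffic.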
Callgraph
18
Correlation Engine Overview
Callgraph-be
Voldemort (RO Datastore)
Espresso (RW Datastore)
Collect:
• Call count
• Latency
drilldown (Near-Time analytics)
19
Correlation Engine Overview
• Uses the callgraph to identify high-value dependencies (and their associated metrics)
• Analyses high-value metrics in 5-minute chunks
• Uses k-means (an unsupervised algorithm) to find similar trends between service metrics
• Queryable API
• Outputs correlation confidence scores, normalised between 0 and 100
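The slide names k-means but gives no detail on features, cluster count or scoring. The Python sketch below is only one way the idea could look: it assumes one-minute samples within a 5-minute window, z-scores each series so that only the trend shape matters, and uses k = 2 with a plain Lloyd's iteration. The metric names and values are made up, and the 0-100 confidence score is not modelled.

```python
import math

# Hypothetical one-minute latency samples (ms) for one 5-minute window.
WINDOW = {
    "my-frontend.latency": [100, 102, 150, 210, 260],
    "service-b.latency": [12, 13, 12, 11, 12],
    "service-c.latency": [40, 41, 62, 85, 104],
    "service-d.latency": [30, 32, 30, 28, 30],
}


def zscore(series):
    """Scale a series to zero mean and unit variance so only its shape matters."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [(x - mean) / std if std else 0.0 for x in series]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


names = list(WINDOW)
labels = kmeans([zscore(WINDOW[n]) for n in names], k=2)
target = labels[names.index("my-frontend.latency")]
similar = [
    n for n, label in zip(names, labels)
    if label == target and n != "my-frontend.latency"
]
print(similar)
```

Metrics that land in the same cluster as the service under investigation are the candidates for a shared trend. Seeding from the first k points keeps the example deterministic; a real deployment would more likely use a randomised or k-means++ style initialisation.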
inVisualize (Alert Correlation)
20
Correlation Engine Overview
• inVisualize analyses alerts (in real time) from each service
• Uses the callgraph to determine the unhealthy service and the affected services
• Queryable API
• Results normalised between 0 and 100
• Visualizes impact
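A minimal sketch of the callgraph step, in Python, under a simplifying assumption that is not from the slides: an alerting service is treated as the likely culprit when none of its downstream dependencies (direct or transitive) are alerting, and every other alerting service is treated as affected. The callgraph, service names and `correlate_alerts` helper are hypothetical, and the 0-100 normalisation is left out.

```python
# Hypothetical callgraph: caller -> set of downstream dependencies.
CALLGRAPH = {
    "my-frontend": {"service-a", "service-b"},
    "service-a": {"service-c"},
    "service-b": set(),
    "service-c": set(),
}


def correlate_alerts(callgraph, alerting):
    """Split alerting services into likely culprits and affected callers."""

    def alerting_downstream(service, seen):
        # Depth-first walk over dependencies; `seen` guards against cycles.
        for dep in callgraph.get(service, ()):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting or alerting_downstream(dep, seen):
                return True
        return False

    culprits = {s for s in alerting if not alerting_downstream(s, {s})}
    return culprits, alerting - culprits


culprits, affected = correlate_alerts(
    CALLGRAPH, {"my-frontend", "service-a", "service-c"}
)
print(sorted(culprits), sorted(affected))
```

Because alerts are boolean, this view can say which service sits at the bottom of the alerting chain but not how badly it is degraded, which matches the limitation listed for alert correlation earlier in the deck.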
inVisualize
21
Correlation Engine Overview
Site-Stabilizer
22
Correlation Engine Overview
• Backend service
• Collates recommendations from Drilldown & inVisualize
• Decorates recommendations with:
  • Scheduled changes
  • Deployment events
  • A/B experiment changes
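The decoration step could be sketched in Python as below. Everything here is illustrative: the `ChangeEvent` and `Recommendation` types, the 30-minute lookback window and the example timestamps are assumptions, not details of Site-Stabilizer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    service: str
    kind: str  # e.g. "deployment", "scheduled-change", "ab-experiment"
    at: datetime


@dataclass
class Recommendation:
    service: str
    confidence: int  # 0-100, as produced upstream
    events: list = field(default_factory=list)


def decorate(recommendations, events, incident_start, lookback=timedelta(minutes=30)):
    """Attach change events seen on the same service shortly before the incident."""
    window_start = incident_start - lookback
    for rec in recommendations:
        rec.events = [
            e for e in events
            if e.service == rec.service and window_start <= e.at <= incident_start
        ]
    return recommendations


# Arbitrary example data.
start = datetime(2017, 1, 1, 12, 0)
recs = decorate(
    [Recommendation("service-c", 91)],
    [
        ChangeEvent("service-c", "deployment", start - timedelta(minutes=10)),
        ChangeEvent("service-c", "deployment", start - timedelta(hours=3)),
        ChangeEvent("service-b", "ab-experiment", start - timedelta(minutes=5)),
    ],
    incident_start=start,
)
print([e.kind for e in recs[0].events])
```

Only the deployment ten minutes before the incident is attached; the older deployment and the change on another service fall outside the assumed window or service match.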
Architecture
23
Correlation Engine Overview
Callgraph-api
Callgraph-be
drilldown invisualize
site-stabilizer
Correlate-fe
24
Correlation Engine Overview
• API for automation
  • Auto-remediation
  • Alert suppression
• UI for manual introspection
Correlate-fe
25
Correlation Engine Overview
The user interface gives:
Responsible service
Correlation Confidence
Root cause
SRE team
Analysis
Architecture
26
Correlation Engine Overview
Callgraph-api
Callgraph-be
correlate-fe
drilldown invisualize
site-stabilizer
Latency Alert
NURSE
Nurse Plan arguments
• service-name: my-frontend
• req_confidence = 85
• escalate = True
Escalate to correct SRE
Find what’s wrong with ‘my-frontend’ in DatacenterB
Iris
Alert Correlation API
Service: Service-C
Confidence: 91%
Reason: ‘Service-C’ has high latency after a deploy
Service Owner: SRE
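The decision implied by the plan arguments above can be sketched in Python. The `req_confidence` and `escalate` names and the example response values come from the slide; the `CorrelationResult` type, the `plan_action` helper, the action labels and the fallback behaviour below the threshold are assumptions for illustration, not the Nurse or Correlation API.

```python
from dataclasses import dataclass


@dataclass
class CorrelationResult:
    service: str
    confidence: int  # 0-100
    reason: str
    owner: str


def plan_action(result, req_confidence=85, escalate=True):
    """Return (action, target) for one correlation result."""
    if result.confidence < req_confidence:
        # Assumed fallback: not confident enough to redirect the alert.
        return ("page-original-oncall", None)
    if escalate:
        return ("escalate", result.owner)
    return ("annotate", result.service)


result = CorrelationResult(
    service="Service-C",
    confidence=91,
    reason="'Service-C' has high latency after a deploy",
    owner="SRE",
)
print(plan_action(result))
```

With the slide's values, a confidence of 91 clears the required 85, so the alert is escalated to the owner of Service-C rather than to the owner of my-frontend.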
28
Early Results
Siteops (NOC) has greater visibility on the site
Reducing MTTR
Reducing false escalations
29
Conclusion
Understand which correlation approach makes sense for you
Understand your dependencies
Build, Integrate and benefit!
30
Team
Govindaluri Kishore
Renjith Rajan
Reynold Perumpilly
Rusty Wickell
Michael Kehoe
31
Questions?
©2014 LinkedIn Corporation. All Rights Reserved.
Callgraph
33
Correlation Engine Overview
Callgraph-be
RestLi (Internal APIs)
Voldemort (RO Datastore)
Espresso (RW Datastore)
Call count
Latency
