Observability & More
Alon Fliess
Chief Architect
alonf@codevalue.net
@alon_fliess
http://alonfliess.me
http://codevalue.net
Cloudflare blames ‘bad software’ deployment for today’s outage
About Me
 Alon Fliess:
 Chief Software Architect & Co-Founder at OzCode & CodeValue
 More than 30 years of hands-on experience
 Microsoft Regional Director & Microsoft Azure MVP
 Spend most of my time in project analysis, architecture, design
 Code at night
Azure Israel
 https://www.meetup.com/AzureIsrael
4
Agenda
 DevOps, the true story
 Microservice Architecture, the complexity shift
 Ops & Monitoring
 Site Reliability Engineers (SREs)
 Developers & Observability
 Business (marketing, sales, management) and observability
 Application Performance Monitoring
 How does it work?
 Distributed Tracing
 Production problem solving
5
The Essence of DevOps
 Better Software, Faster! When Development and Operations Synergize
 Covers the *entire* Application Lifecycle
6
Microservice Architecture == Complexity Shift
7
Ops – Vital Signs: Heartbeat, Blood Pressure, Temperature
8
What Do Site Reliability Engineers (SREs) Want?
9
What Do Developers Want?
10
What Do Marketing & Sales Teams Want?
11
What is Observability? (Twitter 2013)
12
Gartner
Critical Capabilities for APM (May 2019)
13
Business Analysis
Anomaly Detection
IT Operations
DevOps Release
Application Support
Application Development
Application Owner
Use Cases
14
APM Players
Dynatrace
AppDynamics (Cisco)
Datadog
Splunk
Broadcom (CA Technologies)
New Relic
Riverbed
IBM
Instana
Oracle
Tingyun
SolarWinds
ManageEngine
Micro Focus
15
How Does Monitoring & Tracing Work?
16
Operating Systems
    APM system tracking agent installed on the machine
    CPU, Memory, I/O, Network
Code Tracing
    Instrumentation
        Manual
        Auto
    Runtime data collection
Instrumentation – Original Pseudo Code
17
Function AddToBasket(var productId, var quantity)
    if (quantity < 0)
        return false
    var product = Dal.GetProductById(productId)
    BasketService.Add(product, quantity)
    return true
Instrumentation – Add Logging on Errors
18
Function AddToBasket(var productId, var quantity)
    if (quantity < 0)
        Log("Error: Negative quantity value")
        return false
    var product = Dal.GetProductById(productId)
    BasketService.Add(product, quantity)
    return true
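A rough Python rendering of the logging step, as a sketch only. The `PRODUCTS` table, `BASKET` list, and `get_product_by_id` helper are hypothetical stand-ins for the deck's `Dal` and `BasketService`.

```python
import logging

logger = logging.getLogger("basket")

# Hypothetical stand-ins for the data-access layer and the basket service.
PRODUCTS = {1: "keyboard", 2: "mouse"}
BASKET = []


def get_product_by_id(product_id):
    return PRODUCTS[product_id]


def add_to_basket(product_id, quantity):
    """Add a product to the basket, logging rejected requests."""
    if quantity < 0:
        # Passing values as arguments keeps the message template constant,
        # which makes log lines easier to group and search.
        logger.error("Negative quantity value: product_id=%s quantity=%s",
                     product_id, quantity)
        return False
    product = get_product_by_id(product_id)
    BASKET.append((product, quantity))
    return True
```

The log line carries the offending values, so the error can be tied to a specific request rather than just counted.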
Instrumentation – Add Metrics of Usage and Errors
19
Function AddToBasket(var productId, var quantity)
    metrics.Count("AddToBasket", 1)
    if (quantity < 0)
        Log("Error: Negative quantity value")
        metrics.Count("AddToBasketFailure", 1)
        return false
    var product = Dal.GetProductById(productId)
    BasketService.Add(product, quantity)
    return true
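The two counters above are enough to derive a failure rate. The sketch below assumes a tiny in-memory `Metrics` class (hypothetical, not any particular APM client) and omits the product lookup.

```python
from collections import Counter


class Metrics:
    """Tiny in-memory counter store; a real agent would ship these out."""

    def __init__(self):
        self.counters = Counter()

    def count(self, name, value=1):
        self.counters[name] += value

    def failure_rate(self, total_name, failure_name):
        total = self.counters[total_name]
        return self.counters[failure_name] / total if total else 0.0


metrics = Metrics()


def add_to_basket(product_id, quantity):
    metrics.count("AddToBasket")
    if quantity < 0:
        metrics.count("AddToBasketFailure")
        return False
    return True  # product lookup and basket update omitted


for qty in [1, 2, -1, 5]:
    add_to_basket(42, qty)

rate = metrics.failure_rate("AddToBasket", "AddToBasketFailure")
print(f"{rate:.0%} of AddToBasket requests failed")  # prints "25% of AddToBasket requests failed"
```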
Instrumentation – Measure Latency
20
Function AddToBasket(var productId, var quantity)
    metrics.Count("AddToBasket", 1)
    start = time()
    if (quantity < 0)
        Log("Error: Negative quantity value")
        metrics.Count("AddToBasketFailure", 1)
        return false
    var product = Dal.GetProductById(productId)
    BasketService.Add(product, quantity)
    metrics.Measure("AddToBasket", time() - start)
    return true
Instrumentation – Measure Latency Everywhere
21
Function AddToBasket(var productId, var quantity)
    metrics.Count("AddToBasket", 1)
    start = time()
    if (quantity < 0)
        Log("Error: Negative quantity value")
        metrics.Count("AddToBasketFailure", 1)
        return false
    var product = Dal.GetProductById(productId)
    metrics.Measure("AddToBasket_GetProductById", time() - start)
    BasketService.Add(product, quantity)
    metrics.Measure("AddToBasket", time() - start)
    return true
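One way to keep latency measurement from cluttering the function is a timing context manager. This is a sketch: the `measure` helper and `latencies` store are hypothetical, and the database call is simulated with a short sleep.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# name -> list of observed durations in seconds
latencies = defaultdict(list)


@contextmanager
def measure(name):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[name].append(time.perf_counter() - start)


def get_product_by_id(product_id):
    time.sleep(0.01)  # simulate a database round trip
    return {"id": product_id}


def add_to_basket(product_id, quantity):
    with measure("AddToBasket"):
        if quantity < 0:
            return False
        with measure("AddToBasket_GetProductById"):
            product = get_product_by_id(product_id)
        return product is not None


add_to_basket(7, 1)
add_to_basket(7, -1)
```

Unlike the pseudocode above, this version also records a duration for the rejected request, because the `finally` clause runs on every exit path. Whether failed calls belong in the latency series is a design decision.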
Instrumentation – Add Debugging Information
22
Function AddToBasket(var productId, var quantity)
    debug.AddParameters("AddToBasket", [["ProductId", productId], ["quantity", quantity]])
    metrics.Count("AddToBasket", 1)
    start = time()
    if (quantity < 0)
        Log("Error: Negative quantity value")
        metrics.Count("AddToBasketFailure", 1)
        debug.AddError("AddToBasket", GetErrorData())
        return false
    var product = Dal.GetProductById(productId)
    debug.AddValue("AddToBasket", [["product", product]])
    metrics.Measure("AddToBasket_GetProductById", time() - start)
    BasketService.Add(product, quantity)
    metrics.Measure("AddToBasket", time() - start)
    return true
Instrumentation – Original vs. Instrumented Code
23
Function AddToBasket(var productId, var quantity)
    debug.AddParameters("AddToBasket", [["ProductId", productId], ["quantity", quantity]])
    metrics.Count("AddToBasket", 1)
    start = time()
    if (quantity < 0)
        Log("Error: Negative quantity value")
        metrics.Count("AddToBasketFailure", 1)
        debug.AddError("AddToBasket", GetErrorData())
        return false
    var product = Dal.GetProductById(productId)
    debug.AddValue("AddToBasket", [["product", product]])
    metrics.Measure("AddToBasket_GetProductById", time() - start)
    BasketService.Add(product, quantity)
    metrics.Measure("AddToBasket", time() - start)
    return true
Instrumentation and Tracing Automation
 Aspect Oriented Approach
 Communication level instrumentation
 Pipeline interception – technology dependent
 Resource performance counters – DB statistics for example
 Code Instrumentation
 Manual – deploy a package and call it
 Automatic – bytecode instrumentation libraries and tools
 Distributed Tracing
 Passing call context between services
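In Python, the aspect-oriented approach can be approximated with a decorator that wraps a function without touching its body. The `instrumented` decorator and `calls` list below are hypothetical names for illustration. Bytecode-level agents achieve a similar effect without any source change.

```python
import functools
import time

calls = []  # (function name, duration in seconds, error or None)


def instrumented(func):
    """Wrap func so that every call is timed and its errors are recorded."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            calls.append((func.__name__, time.perf_counter() - start, error))
    return wrapper


@instrumented
def add_to_basket(product_id, quantity):
    # Business logic only; the instrumentation lives in the decorator.
    if quantity < 0:
        raise ValueError("Negative quantity value")
    return True
```

The function body is back to roughly the size of the original pseudocode, while usage, latency, and error data are still collected.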
24
Distributed Tracing
25
[Figure: a call with Id:123 passes from the Application to Service A and Service B; each hop is recorded as a Span]
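A minimal sketch of how a trace id might travel between services so that their spans can be joined later. The header names `X-Trace-Id` and `X-Parent-Span-Id` are assumptions chosen for illustration; real systems usually follow a standard such as the W3C Trace Context `traceparent` header. The service calls are plain function calls standing in for HTTP requests.

```python
import uuid

spans = []  # collected span records, as a tracing backend might store them


def start_span(name, headers):
    """Create a span that continues the trace described by incoming headers."""
    span = {
        "trace_id": headers.get("X-Trace-Id") or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": headers.get("X-Parent-Span-Id"),
        "name": name,
    }
    spans.append(span)
    return span


def outgoing_headers(span):
    """Headers the caller attaches so the next service joins the same trace."""
    return {"X-Trace-Id": span["trace_id"], "X-Parent-Span-Id": span["span_id"]}


def service_b(headers):
    start_span("ServiceB.handle", headers)


def service_a(headers):
    span = start_span("ServiceA.handle", headers)
    service_b(outgoing_headers(span))  # stands in for an HTTP call


def application():
    span = start_span("Application.request", {})
    service_a(outgoing_headers(span))


application()
```

All three spans share one trace id, and each child points at its caller through `parent_id`, which is what lets a tracing UI draw the request as a tree.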
Instrumentation – Call Context
26
Function AddToBasket(var productId, var quantity, var context)
    debug.AddParameters(context, "AddToBasket", [["ProductId", productId], ["quantity", quantity]])
    metrics.Count(context, "AddToBasket", 1)
    start = time()
    if (quantity < 0)
        Log(context, "Error: Negative quantity value")
        metrics.Count(context, "AddToBasketFailure", 1)
        debug.AddError(context, "AddToBasket", GetErrorData())
        return false
    var product = Dal.GetProductById(context, productId)
    debug.AddValue(context, "AddToBasket", [["product", product]])
    metrics.Measure(context, "AddToBasket_GetProductById", time() - start)
    BasketService.Add(context, product, quantity)
    metrics.Measure(context, "AddToBasket", time() - start)
    return true
Context:
Call Id
URL
HTTP Method
DB Host
User Info
Timing Info
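Threading a `context` argument through every call is one option. In Python, the standard `contextvars` module offers an alternative where helpers read the current call context implicitly. The sketch below uses hypothetical `log` and `handle_request` functions and keeps only two of the context fields listed above.

```python
import contextvars

# Holds the per-request call context; each thread or asyncio task sees its own value.
call_context = contextvars.ContextVar("call_context", default=None)

log_lines = []


def log(message):
    ctx = call_context.get() or {}
    log_lines.append(f"[call_id={ctx.get('call_id')} user={ctx.get('user')}] {message}")


def add_to_basket(product_id, quantity):
    # No explicit context parameter: helpers read it from call_context.
    if quantity < 0:
        log("Error: Negative quantity value")
        return False
    return True


def handle_request(call_id, user, product_id, quantity):
    token = call_context.set({"call_id": call_id, "user": user})
    try:
        return add_to_basket(product_id, quantity)
    finally:
        call_context.reset(token)  # do not leak context into the next request


handle_request("123", "dana", 42, -1)
```

Every log line now carries the call id and user, which helps when a problem only shows up for specific users or URLs.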
Instrumentation – Using Span
27
Function AddToBasket(var productId, var quantity, var context)
    span = trace.BeginSpan(context, {"AddToBasket", productId, quantity})
    if (quantity < 0)
        span.Error("Negative quantity value")
        return false
    var product = Dal.GetProductById(context, productId)
    span.AddValue("product", product)
    BasketService.Add(context, product, quantity)
    span.End()
    return true
Span:
Call Id
URL
HTTP Method
DB Host
User Info
Timing Info
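A span can be sketched in Python as a context manager, so that it is always ended, including on the early-return path (the pseudocode above does not say whether `span.Error` also closes the span). The `Span` class and `begin_span` helper are hypothetical and far simpler than a real tracing library.

```python
import time
from contextlib import contextmanager

finished_spans = []


class Span:
    def __init__(self, name, context):
        self.name = name
        self.call_id = context.get("call_id")
        self.values = {}
        self.error = None
        self.start = time.perf_counter()
        self.duration = None

    def add_value(self, key, value):
        self.values[key] = value

    def set_error(self, message):
        self.error = message

    def end(self):
        self.duration = time.perf_counter() - self.start
        finished_spans.append(self)


@contextmanager
def begin_span(name, context):
    """Yield a span and end it on every exit path, including early returns."""
    span = Span(name, context)
    try:
        yield span
    except Exception as exc:
        span.set_error(repr(exc))
        raise
    finally:
        span.end()


def add_to_basket(product_id, quantity, context):
    with begin_span("AddToBasket", context) as span:
        span.add_value("product_id", product_id)
        if quantity < 0:
            span.set_error("Negative quantity value")
            return False
        span.add_value("product", {"id": product_id})
        return True
```

One span object replaces the separate log, metric, and debug calls: it carries the timing, the captured values, and the error for a single operation.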
OpenTracing & OpenCensus
28
What Do SREs & Developers Want – From Each Other?
29
New Relic APM Dashboard
APM Error Analysis – Not Enough Information
Error Rate
Request information
Stack trace
 APM systems can assist in health monitoring and fault first aid
Production Problem Solving Challenges
 Can’t mess with data
 No debugging tools
 Code is optimized
 Older source code version
 Can’t impact performance
 Data must stay in a secure environment
 Data is private and contains PII
 Very hard to reproduce the bug
Problem Solving With APM
33
Production Problem Solving Platforms
 OzCode
 OverOps
 Rookout
 Application Insights
34
Problem Solving With a Production Debugger
35
OzCode Production Debugger
36
Summary
37
Q & A
38

Observability and more architecture next 2020

Editor's Notes

  • #8: MSA: many small parts that are deployed and communicate. Simple components, complex combination. Very hard to follow a request that spans many services. Must have an automation process to overcome the complexity. Must have health monitoring, performance monitoring, and cross-service error handling. TOOLS!!!
  • #9: More than CI/CD. Ops – first-aid medic: take vital signs. CPU, network, I/O, memory. Request throughput and latency.
  • #10: Wants an easy life. Eats the meal that the Dev team cooked; the customer of the Dev team. Bugs, problem solving. Needs to know the current situation with the current problems. For example, can roll back to a previous version, but needs to know the status of the bug fix.
  • #11: Information, debuggability. Reproduce the problem.
  • #12: Analytics, business insights, usage.
  • #13: As Twitter has moved from a monolithic to a distributed architecture, our scalability has increased dramatically. Zipkin – a distributed tracing system (https://zipkin.io/)
  • #14: Business Analysis: business-related KPIs. IT Service Monitoring: health of key services. Root Cause Analysis: finding the cause of a failure or degradation. Anomaly Detection: identifying system observations that do not conform to expected behavior. Distributed Profiling: tracking transactions across a mesh of interconnected nodes, followed by detection of where along the path the degradation appears to be happening. Application Debugging: production debugging capabilities, based on distributed data collection.
  • #20: Enables saying: "15% of our requests fail".
  • #26: Errors & problems root cause may be the result of
  • #27: The problem happens only with specific users or URLs
  • #28: The problem happens only with specific users or URLs
  • #29: OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software. OpenTelemetry provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application. You can analyze them using Prometheus, Jaeger, and other observability tools.
  • #38: DevOps, the true story; Microservice Architecture, the complexity shift; Ops & Monitoring; Site Reliability Engineers; Developers & Observability; Business (marketing, sales, management) and observability; Application Performance Monitoring; How does it work?; Distributed Tracing; Production problem solving