DOSE: Deployment and Operations
for Software Engineers
Postproduction
© Len Bass 2019 2
Open Application Model
Key concepts
• Incident - an event that could lead to
loss of, or disruption to, an
organization's operations, services or
functions.
• May be minor, such as running out of disk
space
• May be major, such as a data breach
• Telemetry – collection of information for
monitoring environmental conditions
Overview
• Telemetry
• Incident response
• Live testing
Scenario
• It is 3:00AM and your pager goes off.
• There is a problem with your service!
• You get out of bed, log on to the production
environment, and look at the services dashboard.
• One instance of your service has high latency.
• You drill down and discover the problem is a slow disk.
• You move temporary files for your service to another
disk and place the message “replace disk” on the
operators' queue.
Troubleshooting process
• First step is to isolate problem
• Current service
• Upstream service (too many requests)
• Downstream service (too slow)
• Second step is to decide whether it is a hardware or
software problem
• What has changed in the software?
• Has hardware shown signs of problems with other
services?
• If only one of multiple instances has
problems, look for hardware causes first.
© Len Bass 2015 6
Single service – single server
• Look at the following data
• CPU
• Memory
• I/O activity
• Number of requests
• Response time to inbound requests
• Response time for outbound requests
• Error rates
• Look for abnormal values
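One way to make "look for abnormal values" concrete is to compare the latest reading of each metric with its recent history. The sketch below is only an illustration: the metric names, the sample readings, and the three-standard-deviation threshold are assumptions, not something the slides prescribe.

```python
from statistics import mean, stdev


def abnormal_metrics(baseline, current, threshold=3.0):
    """Return the names of metrics whose current value is far from the baseline.

    baseline maps a metric name to a list of past readings; current maps the
    same names to the latest reading. A reading counts as abnormal when it is
    more than `threshold` standard deviations away from the baseline mean.
    """
    flagged = []
    for name, history in baseline.items():
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat baseline: any deviation at all is worth a look.
            if current[name] != mu:
                flagged.append(name)
            continue
        if abs(current[name] - mu) / sigma > threshold:
            flagged.append(name)
    return flagged


# Hypothetical readings for one server (assumed values for illustration).
baseline = {
    "cpu_percent": [35, 40, 38, 42, 37],
    "io_wait_ms": [4, 5, 6, 5, 4],
    "error_rate": [0.01, 0.02, 0.01, 0.01, 0.02],
}
current = {"cpu_percent": 39, "io_wait_ms": 80, "error_rate": 0.01}

print(abnormal_metrics(baseline, current))
```

With these sample numbers only the I/O wait stands out, which matches the slow-disk scenario from the start of the deck.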
Single service – multiple
servers
• Multiple servers served through a load balancer
• Look at the same set of data as for a single server
• CPU
• Memory
• I/O activity
• Number of requests
• Response time to inbound requests
• Response time for outbound requests
• Error rates
• Look at aggregate values over multiple servers
Isolating problem
• Is the problem with this service, its client, or
the services it depends on?
• If the problem is with this service, is it
manifested across all servers or just
one? That is, drill down into the aggregates to
get individual values.
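The move from an aggregate to individual values can be sketched as follows. The server names, the latency readings, and the "twice the median server" cut-off are assumptions chosen for illustration.

```python
from statistics import mean, median

# Hypothetical response times (ms) per server behind one load balancer.
latency_ms = {
    "server-a": [110, 120, 115],
    "server-b": [105, 118, 112],
    "server-c": [900, 950, 1020],
}


def aggregate_latency(per_server):
    """Mean latency over every reading from every server."""
    all_readings = [v for readings in per_server.values() for v in readings]
    return mean(all_readings)


def drill_down(per_server, factor=2.0):
    """Return servers whose mean latency exceeds `factor` times the median server mean.

    An empty result while the aggregate is high suggests the problem is
    manifested across all servers rather than on a single one.
    """
    server_means = {name: mean(readings) for name, readings in per_server.items()}
    typical = median(server_means.values())
    return sorted(name for name, m in server_means.items() if m > factor * typical)


print(round(aggregate_latency(latency_ms), 1))
print(drill_down(latency_ms))
```

Comparing each server with the median server, rather than with the overall mean, keeps one very slow server from hiding itself by dragging the reference value up.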
Multiple services – multiple
servers
• Same basic strategy
• Isolate the problem by looking at
aggregates
• Drill down to determine which service and server
contribute to the problem
• Look at what has changed in the software and
whether the hardware has manifested
problems earlier
Overall requirements from this
troubleshooting sequence
• Gather a variety of different kinds of data
• Either resource usage or things that
contribute to resource usage
• Ensure each data item can be traced to its
source and activity
• Collect data into a location where it can
be queried and drilled into.
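These requirements can be illustrated with a small in-memory sketch in which every reading carries its source and activity and can be filtered step by step. The `Reading` and `ReadingStore` names and all field names are hypothetical; a real deployment would use a monitoring product rather than a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    timestamp: float  # seconds since the epoch
    service: str      # source: which service produced the reading
    server: str       # source: which server it ran on
    activity: str     # what was being done, e.g. a request identifier
    metric: str
    value: float


class ReadingStore:
    """A single place where readings are collected and can be queried."""

    def __init__(self):
        self._readings = []

    def add(self, reading):
        self._readings.append(reading)

    def query(self, **filters):
        """Return readings whose fields equal every given filter value."""
        return [
            r for r in self._readings
            if all(getattr(r, field) == wanted for field, wanted in filters.items())
        ]


store = ReadingStore()
store.add(Reading(1.0, "checkout", "server-a", "req-1", "latency_ms", 110.0))
store.add(Reading(2.0, "checkout", "server-c", "req-2", "latency_ms", 950.0))
store.add(Reading(3.0, "catalog", "server-a", "req-3", "latency_ms", 90.0))

# Start broad, then drill into one server.
by_service = store.query(metric="latency_ms", service="checkout")
by_server = store.query(metric="latency_ms", service="checkout", server="server-c")
print(len(by_service), by_server[0].value)
```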
Information needs
• Metrics collected by infrastructure
• Logs from instance with relevant information
• Central repository for logs
• Dashboard that displays metrics
• Alerting system
• Monitoring latency of instances
• Rule: if high latency then alarm
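The rule "if high latency then alarm" might be evaluated roughly as below. The 500 ms threshold and the requirement of three consecutive breaches are assumptions added here so that a single spike does not page anyone; the slides state only the rule itself.

```python
def evaluate_latency_rule(latencies_ms, threshold_ms=500.0, min_breaches=3):
    """Return True when the last `min_breaches` readings all exceed the threshold."""
    recent = latencies_ms[-min_breaches:]
    return len(recent) == min_breaches and all(v > threshold_ms for v in recent)


# Hypothetical latency series for one instance, oldest reading first.
sustained = [100, 600, 700, 800]
single_spike = [100, 600, 700, 200]

print(evaluate_latency_rule(sustained))
print(evaluate_latency_rule(single_spike))
```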
Architecture of Monitoring
System
[Figure: monitoring system architecture. Labels recovered from the diagram: Configuration Management System; Operator/User tracking; Operation logs; Monitoring System (Monitoring data storage, Visualization, Alarm evaluation); Alerts; Big Data Analytics; Traditional BI; Intrusion Detection; Other applications; Other systems; System 1 (Application, Middleware, OS, Agent); System 2 (Application, Middleware, OS); agent-based; agentless; Health checks.]
Logs
• A log is an append-only data structure
• Written by each software system
• Located in a fixed directory within the operating
system
• Enumerates events from within the software system
• Entry/exit
• Troubleshooting
• DB modifications
• …
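A minimal sketch of an append-only log follows: each event becomes one line added at the end of a file, and existing lines are never rewritten. The JSON-lines format, the field names, and the temporary directory are assumptions for the example; the slide describes a fixed directory instead.

```python
import json
import os
import tempfile
import time


def append_event(path, event, **fields):
    """Append one event as a JSON line; mode "a" never rewrites earlier lines."""
    record = {"ts": time.time(), "event": event, **fields}
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")


def read_events(path):
    """Read the log back in the order the events were written."""
    with open(path, encoding="utf-8") as log:
        return [json.loads(line) for line in log]


# A temporary directory stands in for the fixed log directory.
log_path = os.path.join(tempfile.mkdtemp(), "service.log")
append_event(log_path, "entry", operation="get_order")
append_event(log_path, "db_modification", table="orders", rows=1)
append_event(log_path, "exit", operation="get_order")

events = [e["event"] for e in read_events(log_path)]
print(events)
```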
Instance Log
[Figure: the monitoring system architecture diagram, repeated with the annotation:] Daemon on instance copies logs to central repository
Logs on Entry/Exit
• Recall that Protocol Buffers automatically
generate procedures that are called on
entry/exit to a service
• These procedures can be made to call
logging service with parameters and
identification information.
• Logs on entry/exit can be made without
additional developer activity
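The idea of logging on entry and exit without touching the service body can be sketched with a hand-written wrapper. This is not generated Protocol Buffers code and uses no Protocol Buffers API; the decorator `log_entry_exit`, the `get_order` service, and the in-memory `ListHandler` are hypothetical stand-ins for the generated procedures and the logging service.

```python
import functools
import logging

logger = logging.getLogger("entry_exit")


def log_entry_exit(func):
    """Wrap a service procedure so that entry and exit are logged automatically."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("entry %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on normal return and on exceptions alike.
            logger.info("exit %s", func.__name__)
    return wrapper


class ListHandler(logging.Handler):
    """Keeps messages in memory so the example can show what was logged."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)


@log_entry_exit
def get_order(order_id):
    # The service body contains no logging code of its own.
    return {"order_id": order_id, "status": "shipped"}


result = get_order(42)
print(handler.messages)
```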
Metrics
• Metrics are measures of activity over
some period of time
• Collected automatically by the infrastructure
from externally visible activities of the VM
• CPU
• I/O
• etc.
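"A measure of activity over some period of time" can be shown with CPU utilization computed from two samples of cumulative counters. The `CpuSample` type and the tick values are assumptions for illustration; real infrastructure exposes its own counters.

```python
from collections import namedtuple

# Cumulative counters sampled at two points in time (hypothetical values).
CpuSample = namedtuple("CpuSample", ["busy_ticks", "total_ticks"])


def cpu_utilization(start, end):
    """Fraction of the interval between two samples that the CPU was busy."""
    busy = end.busy_ticks - start.busy_ticks
    total = end.total_ticks - start.total_ticks
    if total <= 0:
        raise ValueError("the end sample must be taken after the start sample")
    return busy / total


start = CpuSample(busy_ticks=1000, total_ticks=4000)
end = CpuSample(busy_ticks=1600, total_ticks=5000)

print(cpu_utilization(start, end))
```

The metric belongs to the interval between the two samples, not to either instant, which is why a collection period always accompanies a metric.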
[Figure: the monitoring system architecture diagram, repeated with the annotation:] Metrics collected by infrastructure
Repository
• Logs and metrics are placed in a central repository
• Repository generates alarms based on rules
• Provides a central location for examination when a
problem occurs
• Displays information in a dashboard that allows
drilling down to understand the source of particular
readings.
[Figure: the monitoring system architecture diagram, repeated with the annotation:] Central Repository with alerting and dashboard
Overview
• Telemetry
• Incident response
• Live testing
Incident response
• Incident occurs
• May be revealed by telemetry data or caused
externally.
• Incident response is the management of the aftermath of
the incident.
• Ideal response:
• Restore the system to production
• Analyze the cause of the incident
• Prevent the incident from recurring
Two incident response
philosophies
• You build it, you run it (originated by
Amazon)
• Site Reliability Engineers (SRE)
(originated by Google)
You build it, you run it
“There is another lesson here: Giving developers operational
responsibilities has greatly enhanced the quality of the services, both
from a customer and a technology point of view. The traditional model is
that you take your software to the wall that separates development and
operations and throw it over and then forget about it. Not at Amazon.
You build it, you run it. This brings developers into contact with the day-
to-day operation of their software. It also brings them into day-to-day
contact with the customer. This customer feedback loop is essential for
improving the quality of the service.”
- Werner Vogels
https://queue.acm.org/detail.cfm?id=1142065
SRE
• Separate organizational unit whose
responsibility is to manage incidents.
• Coordination enables detection of system
outage patterns
• SRE team rotates pager duty
• Term for an SRE is ~2-3 years. The work is high-stress
and SREs burn out. Former SREs go back to a
production unit.
SRE mindset
• "Here’s what you do when someone breaks
something or finds something very difficult to
debug: You say thank you. Thank you for
finding this edge case. Thank you for
highlighting this overcomplicated part of our
system. Thank you for pointing out this gap in
our docs. And then you go make it so nobody
can break it the same way again."
• Tanya Reilly https://landing.google.com/sre/
Overview
• Telemetry
• Incident response
• Live testing
Live testing
• Netflix has a “Simian Army” to perform testing after a service
is in production.
• Chaos Monkey kills production processes
• Latency Monkey introduces extra latency into the
network.
• Various other monkeys perform janitor services
• Looking for certificates or licenses about to expire
• Ensuring appropriate localization
• Cleaning up unused resources
• Ensuring security groups are appropriately used.
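The fault-injection side of live testing can be imitated in a toy, in-memory simulation: take one instance down at random and check that the service still answers. This is not Netflix's tooling; the `Instance` and `Service` classes, the instance names, and the fixed random seed are all assumptions made for the sketch.

```python
import random


class Instance:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Service:
    """Routes each request to the first instance that is still alive."""

    def __init__(self, instances):
        self.instances = instances

    def handle(self, request):
        for instance in self.instances:
            if instance.alive:
                return instance.handle(request)
        raise RuntimeError("no live instances")


def kill_random_instance(service, rng):
    """Simulated fault injection: take one live instance out of service."""
    victim = rng.choice([i for i in service.instances if i.alive])
    victim.alive = False
    return victim


service = Service([Instance("i-1"), Instance("i-2"), Instance("i-3")])
victim = kill_random_instance(service, random.Random(0))  # seeded for repeatability
response = service.handle("GET /orders")

print(f"killed {victim.name}; response: {response}")
```

The check that matters is the last call: if the service cannot answer after losing a single instance, the test has exposed a resilience gap before a real failure does.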
Summary
• Developers may carry pagers and be first
responders
• Determining the problem requires access to a
wide variety of data
• Logs
• Metrics
• Postproduction testing may introduce errors
or provide janitorial services
