snyk.io
Logging and Observability
Anton Drukh, VP Engineering
Agenda
What is observability?
How do we do it at Snyk
What should you expect
So what is Observability?
Monitoring tells you whether the system works.
Observability lets you ask why it's not working.
- @xaprb, 2017
So what is Observability?
Monitoring is for operating software/systems
Instrumentation is for writing software
Observability is for understanding systems
- @mipsytipsy, 2017
Monitoring ~= treating your code as a black box
Instrumentation ~= letting a black box into your code
Risk of black-boxing yourself
What does logging have to do with it?
Just a bunch of `printf`s?
“Strings are where data go to die”
- Tim Wilde, 2018, honeycomb.io blog post
“Not sure where the problem is, so I added more logs”
- Every developer, every time
I promised a ‘90s game analogy...
To those born in the 2000s...
Analogy explained
The lemmings are your requests:
They come in by themselves
You need to get them to safety
The way you structure your commands is coding
The way things roll out once you are done is production!
The number of ways lemmings can die is the number of error flows in your code
Once set up properly, all lemmings follow ± the same path, as successful requests usually do
Snyk right now
13 developers
- 2 product delivery oriented teams
- Take care of their stuff Ops-BE-FE
- Everyone does oncall (1 week every 3 months)
~ 30 microservices running on a GKE cluster
- NodeJS & Python
- Each service writes single-line JSONs to stdout
- Push to logz.io with K8S metadata (node, pod, container)
Observability at the service level:
- Concurrency, Error rate, Latency, Throughput
- @xaprb, 2018
Observability at the request level:
- Traceability - time spent in each stage / service
- Error-to-root-cause detection
So… let’s log everything and see?
How do we log
Separate log generation and collection
- Not big fans of complex clients
- Stdout is your friend
- Logging can impact performance!
Structured logging
- Not “Customer %d purchased %s, %s and %s”
- Instead, ‘{customer: 123, purchasedItemIds: [...], msg: “purchase made”}’
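A minimal sketch of structured logging to stdout, using only the Python standard library. The helper names `format_event` and `log_event` and the item IDs are hypothetical, chosen for illustration; this is not the code Snyk runs.

```python
import json
import sys
import time


def format_event(msg, level="info", **context):
    """Return one log event serialised as a single-line JSON string."""
    record = {"time": time.time(), "level": level, "msg": msg}
    record.update(context)  # context fields stay as separate, queryable keys
    # json.dumps emits no raw newlines by default, so one event is one line
    return json.dumps(record, default=str)


def log_event(msg, level="info", stream=sys.stdout, **context):
    """Write the event to stdout; collection and shipping happen elsewhere."""
    stream.write(format_event(msg, level, **context) + "\n")


# Hypothetical purchase: the message is constant, the data lives in fields.
log_event("purchase made", customer=123, purchasedItemIds=[17, 42])
```

The service only generates the line. Whatever collects stdout (a log shipper on the node, for instance) can change without touching the service code.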
When do we log
90% of the value is captured by
- Collecting context for the log entry throughout the request lifetime
- Logging on response
For service-level observability, consider logging on entry
For request-level observability, consider logging on exceptional cases: important ‘forks’ in request handling
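One way to collect context during a request and write it once on response is sketched below. The `RequestLog` class, the `handle_purchase` handler and the field names are assumptions made for illustration, not taken from the talk's codebase.

```python
import json
import sys


class RequestLog:
    """Accumulates context during one request and emits a single event."""

    def __init__(self, request_id):
        self.context = {"requestId": request_id}

    def add(self, **fields):
        # Called from anywhere in the handler; nothing is written yet.
        self.context.update(fields)

    def emit(self, msg, level="info", stream=sys.stdout):
        record = {"level": level, "msg": msg, **self.context}
        line = json.dumps(record, default=str)
        stream.write(line + "\n")
        return line


def handle_purchase(request_id, customer, item_ids):
    """Hypothetical handler: gathers context, logs once when replying."""
    log = RequestLog(request_id)
    log.add(customer=customer)
    log.add(itemCount=len(item_ids))
    if not item_ids:
        # An important fork in request handling gets its own marker.
        log.add(emptyBasket=True)
        status = 400
    else:
        status = 200
    log.add(status=status)
    log.emit("Replying")  # one line per request, written on response
    return status


handle_purchase("req-1", customer=123, item_ids=[17, 42])
```

Each request produces one line that carries everything learned along the way, rather than several partial lines that have to be stitched together later.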
What do we log
90% of the value is captured by
- Request ID, query params and request body details
- Duration of request handling
- Status code of response
- Error, if such occurred
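A rough sketch of capturing these fields around a handler call. The wrapper `handle_with_logging`, the two sample handlers and the field names (`durationMs`, `requestId`) are hypothetical, and generating a UUID per request is an assumption; a real service might take the ID from an incoming header instead.

```python
import json
import sys
import time
import uuid


def handle_with_logging(handler, query, stream=sys.stdout):
    """Run handler(query) and log one 'Replying' event with the core fields."""
    record = {"msg": "Replying", "requestId": str(uuid.uuid4()), "query": query}
    started = time.monotonic()
    try:
        status = handler(query)
    except Exception as exc:  # unexpected failure: report as a server error
        status = 500
        record["error"] = f"{type(exc).__name__}: {exc}"
    record["status"] = status
    record["durationMs"] = round((time.monotonic() - started) * 1000, 3)
    stream.write(json.dumps(record, default=str) + "\n")
    return record


def ok_handler(query):
    """Hypothetical handler that succeeds."""
    return 200


def failing_handler(query):
    """Hypothetical handler that raises."""
    raise ValueError("unknown item")


handle_with_logging(ok_handler, {"item": "42"})
handle_with_logging(failing_handler, {"item": "0"})
```

The `error` key is present only when something went wrong, so filtering for failed requests is a check on whether the field exists.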
Live demo
Clone https://github.com/snyk/koa2-bunyan-server
Worth mentioning
Describe the event being logged with a const string (e.g. ‘Replying’, ‘Failed to process purchase’)
Use a context object for all parameters (e.g. ‘{duration, status}’ on the ‘Replying’ event)
Use log level to indicate severity of event (e.g. info = normal operation, warn = client error, error = server error)
It’s a team effort to keep logging relevant and standard!
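These conventions could look roughly like this in code. The constant names and the functions `level_for_status` and `reply_event` are hypothetical, and mapping 4xx to warn and 5xx to error is one reading of "client error" and "server error" in HTTP terms.

```python
import json

# Event names are constants, never built with string formatting,
# so every occurrence can be found with one exact-match query.
EVENT_REPLYING = "Replying"
EVENT_PURCHASE_FAILED = "Failed to process purchase"


def level_for_status(status):
    """Map an HTTP status code to a log level."""
    if status >= 500:
        return "error"  # server error
    if status >= 400:
        return "warn"   # client error
    return "info"       # normal operation


def reply_event(status, duration_ms):
    """Build the 'Replying' event; all parameters go in the context object."""
    context = {"duration": duration_ms, "status": status}
    return {"level": level_for_status(status), "msg": EVENT_REPLYING, **context}


print(json.dumps(reply_event(404, 12.5)))
```

Keeping the message constant and the level derived from the status means a dashboard can group by `msg` and `level` without parsing any free text.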
Caveats
Indexing failures will happen - track them and fix them
- Same key with different value types
- Overly long log lines (beware of performance impact)
Sanitise sensitive data (e.g. auth tokens, PII)
- Beware of performance impact
At large volumes, summarise or skip logs
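A sketch of sanitising and truncating an event before it is written. The key list, the length cap and the function name `sanitise` are assumptions for illustration; a real list of sensitive keys has to come from the team's own data model.

```python
SENSITIVE_KEYS = {"authorization", "password", "token", "email"}  # assumed list
MAX_STRING_LENGTH = 200  # assumed cap; tune to the indexer's limits


def sanitise(value, key=""):
    """Return a copy of value that is safe to log."""
    if key.lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: sanitise(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitise(item) for item in value]
    if isinstance(value, str) and len(value) > MAX_STRING_LENGTH:
        return value[:MAX_STRING_LENGTH] + "...[truncated]"
    return value


event = {
    "msg": "Replying",
    "headers": {"Authorization": "Bearer abc", "accept": "*/*"},
    "body": {"users": [{"email": "a@b.c", "id": 1}]},
}
print(sanitise(event))
```

The function walks every field of every event, which is exactly the performance cost the slide warns about, so it is worth measuring on realistic payloads.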
What’s the gain?
Troubleshooting time is shortened
Improvements and fixes become scientific experiments
Alerts and dashboards based on a single source of truth
Developers contribute to and benefit from service operability
Request count by date and status
Errors breakdown by service
Request breakdown by type
Request duration percentiles (50th, 95th, 99th)
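To illustrate why percentiles say more than an average, here is a small sketch over made-up request durations. The sample data is hypothetical, and the nearest-rank method used here is only one of several percentile definitions; a log platform may compute its percentiles differently.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical durations in ms: 98 fast requests and two very slow ones.
durations = [20] * 98 + [5000, 9000]

mean = sum(durations) / len(durations)
print("mean", mean)
for p in (50, 95, 99):
    print(f"p{p}", percentile(durations, p))
```

With this data the mean is about 160 ms, a duration no request actually had, while the 50th and 95th percentiles are both 20 ms and the 99th is 5000 ms, which points straight at the slow tail.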
The real gain is for the team
Developers own their code in production
Strong support for our CI/CD workflow
Easier for developers to move between codebases
Support and sales engineering teams use the same logs!
Problems get identified and resolved quickly
Ask me about...
Why averages are bad and percentiles are the real thing
How to log errors and survive
How to sanitise sensitive data from leaking into logs
How to train your team on standard, structured logging
Anything else, really!
Thank you!
Questions? :)
