Logging and observability

snyk.io
Logging and Observability
Anton Drukh, VP Engineering

Agenda
What is observability?
How do we do it at Snyk
What should you expect

So what is Observability?
Monitoring tells you whether the system works.
Observability lets you ask why it's not working.
- @xaprb, 2017

So what is Observability?
Monitoring is for operating software/systems
Instrumentation is for writing software
Observability is for understanding systems
- @mipsytipsy, 2017

Monitoring ~= treating your code as a black box
Instrumentation ~= letting a black box into your code
Risk of black-boxing yourself

Just a bunch of `printf`s?
“Strings are where data go to die”
- Tim Wilde, 2018, honeycomb.io blog post
“Not sure where the problem is, so I added more logs”
- Every developer, every time
What does logging have to do with it?

I promised a ‘90s game analogy...

To those born in the 2000s...

Analogy explained
The lemmings are your requests:
They come in by themselves
You need to get them to safety
The way you structure your commands is coding
The way things roll out once you are done is production!
The number of ways lemmings can die is the number of error flows in your code
Once set up properly, all lemmings follow more or less the same path, as successful requests usually do

Snyk right now
13 developers
- 2 product delivery oriented teams
- Each team owns its services end to end (Ops, BE, FE)
- Everyone does on-call (1 week every 3 months)
~ 30 microservices running on a GKE cluster
- NodeJS & Python
- Each service writes single-line JSONs to stdout
- Push to logz.io with K8S metadata (node, pod, container)

Observability at the service level:
- Concurrency, Error rate, Latency, Throughput
- @xaprb, 2018
Observability at the request level:
- Traceability - time spent in each stage / service
- Error-to-root-cause detection
So… let’s log everything and see?
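
As a rough illustration of the service-level view, the sketch below derives throughput and error rate from a handful of response log lines. The sample lines and the field names (`ts`, `status`, `duration`) are assumptions made for this example, not the format Snyk uses; concurrency is left out because it needs request start and end times.

```python
import json

# Hypothetical single-line JSON response logs; the field names are assumptions.
LOG_LINES = [
    '{"msg": "Replying", "ts": 0.0, "status": 200, "duration": 12}',
    '{"msg": "Replying", "ts": 0.5, "status": 200, "duration": 20}',
    '{"msg": "Replying", "ts": 1.0, "status": 500, "duration": 340}',
    '{"msg": "Replying", "ts": 2.0, "status": 404, "duration": 8}',
]


def service_summary(lines):
    """Derive throughput, error rate and worst latency from response events."""
    events = [json.loads(line) for line in lines]
    # Observation window in seconds, from the first to the last event.
    window = max(e["ts"] for e in events) - min(e["ts"] for e in events)
    server_errors = sum(1 for e in events if e["status"] >= 500)
    return {
        "throughput_per_s": len(events) / window if window else float(len(events)),
        "error_rate": server_errors / len(events),
        "max_duration_ms": max(e["duration"] for e in events),
    }


summary = service_summary(LOG_LINES)
print(summary)
```

In practice this aggregation would more likely be done by the log platform's queries and dashboards than by application code.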

How do we log
Separate log generation and collection
- Not big fans of complex clients
- Stdout is your friend
- Logging can impact performance!
Structured logging
- Not “Customer %d purchased %s, %s and %s”
- Instead, ‘{customer: 123, purchasedItemIds: [...], msg: “purchase made”}’
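
A minimal sketch of the structured approach, assuming nothing beyond the Python standard library: each event becomes one single-line JSON object on stdout, and collection is left to whatever ships stdout elsewhere. The `log` helper and the item IDs are hypothetical and used only for illustration.

```python
import json
import sys
import time


def log(level, msg, **context):
    """Write one event as a single-line JSON object to stdout."""
    record = {"time": time.time(), "level": level, "msg": msg, **context}
    # default=str keeps logging from raising on values JSON cannot encode.
    line = json.dumps(record, separators=(",", ":"), default=str)
    sys.stdout.write(line + "\n")
    return line


# The event name is a fixed string; every variable part is a separate field.
line = log("info", "purchase made", customer=123, purchasedItemIds=[7, 8, 9])
```

Because the customer and the item IDs are fields rather than parts of a formatted string, they stay searchable and aggregatable after indexing.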

When do we log
90% of the value is captured by
- Collecting context for log throughout the request lifetime
- Logging on response
For service level observability, consider logging on entry
For request level observability, consider logging on exceptional cases - important ‘forks’ in request handling
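
One way to collect context throughout the request lifetime and log only on response is a per-request bag of fields that any stage can add to. The sketch below uses Python's `contextvars` for that; the function names and the two handler stages are hypothetical.

```python
import contextvars
import json

# Per-request bag of fields; a ContextVar keeps concurrent requests separate.
_request_context = contextvars.ContextVar("request_context")


def start_request(request_id):
    """Begin collecting context for a new request."""
    _request_context.set({"requestId": request_id})


def add_context(**fields):
    """Attach fields from anywhere in the call stack; nothing is printed yet."""
    _request_context.get().update(fields)


def reply(status):
    """Emit the single 'Replying' line carrying everything collected so far."""
    record = {"msg": "Replying", "status": status, **_request_context.get()}
    line = json.dumps(record)
    print(line)
    return line


def load_customer(customer_id):  # hypothetical handler stage
    add_context(customer=customer_id)


def charge(item_ids):  # hypothetical handler stage
    add_context(purchasedItemIds=item_ids)


start_request("req-1")
load_customer(123)
charge([7, 8])
line = reply(200)
```

The request produces one log line, yet that line still records what each stage learned along the way.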

What do we log
90% of the value is captured by
- Request ID, query params and request body details
- Duration of request handling
- Status code of response
- Error, if such occurred
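
The fields listed above can be gathered by a thin wrapper around the request handler. This is a sketch under assumptions: the request is a plain dict, a handler returns a status code, and any uncaught exception is reported as a 500. None of these names come from the demo repository.

```python
import json
import time
import uuid


def handle_with_logging(handler, request):
    """Run handler(request) and log one line with ID, input, duration, status, error."""
    record = {
        "requestId": str(uuid.uuid4()),
        "query": request.get("query", {}),
        "body": request.get("body", {}),
    }
    started = time.monotonic()
    try:
        status = handler(request)
    except Exception as exc:
        # Assumption: an uncaught exception maps to a 500 response.
        status = 500
        record["error"] = {"type": type(exc).__name__, "message": str(exc)}
    record["durationMs"] = round((time.monotonic() - started) * 1000, 3)
    record["status"] = status
    record["msg"] = "Replying"
    print(json.dumps(record))
    return record


def ok_handler(request):  # hypothetical handler returning a status code
    return 200


def failing_handler(request):  # hypothetical handler that raises
    raise ValueError("unknown item")


ok = handle_with_logging(ok_handler, {"query": {"page": "1"}})
failed = handle_with_logging(failing_handler, {"body": {"item": 42}})
```

`time.monotonic()` is used for the duration because it is not affected by wall-clock adjustments.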

Live demo
Clone https://github.com/snyk/koa2-bunyan-server

Worth mentioning
Describe the event being logged with a constant string (e.g. ‘Replying’, ‘Failed to process purchase’)
Use a context object for all parameters (e.g. ‘{duration, status}’ on the ‘Replying’ event)
Use log level to indicate severity of the event (e.g. info = normal operation, warn = client error, error = server error)
It’s a team effort to keep logging relevant and standard!
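
These conventions can be sketched in a few lines. The mapping from HTTP status to log level below follows the slide (info for normal operation, warn for client errors, error for server errors); the function names and the exact status thresholds are assumptions.

```python
import json

# Event names are constants, so they can be searched and aggregated exactly.
REPLYING = "Replying"


def level_for_status(status):
    """info = normal operation, warn = client error, error = server error."""
    if status >= 500:
        return "error"
    if status >= 400:
        return "warn"
    return "info"


def log_reply(status, duration):
    """Log the 'Replying' event with its parameters in a context object."""
    record = {
        "level": level_for_status(status),
        "msg": REPLYING,
        "status": status,
        "duration": duration,
    }
    print(json.dumps(record))
    return record


log_reply(200, 12)
log_reply(404, 8)
log_reply(503, 340)
```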

Caveats
Indexing failures will happen - track them and fix them
- Same key with different value types
- Overly long log lines (beware of performance impact)
Sanitise sensitive data (e.g. auth tokens, PII)
- Beware of performance impact
At large volumes, summarise or skip logs
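
One possible shape for sanitising is a recursive pass over the context object before it is serialised. The key list below is an assumption chosen for illustration; a real list would be agreed by the team. The pass visits every value, so its cost grows with the size of the logged object, which is the performance impact the slide warns about.

```python
# Assumed list of sensitive key names, compared case-insensitively.
SENSITIVE_KEYS = {"authorization", "token", "password", "email"}


def sanitise(value):
    """Return a copy with values under sensitive keys replaced by a marker."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else sanitise(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [sanitise(item) for item in value]
    return value


clean = sanitise({
    "customer": 123,
    "headers": {"Authorization": "Bearer abc"},
    "items": [{"token": "t", "id": 7}],
})
print(clean)
```

Matching on key names only catches fields that are named predictably; secrets embedded in free-text values would need a different check.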

What’s the gain?
Troubleshooting time is shortened
Improvements and fixes become scientific experiments
Alerts and dashboards based on a single source of truth
Developers contribute to and benefit from service operability

Request count by date and status

Errors breakdown by service

Request breakdown by type

Request duration percentiles (50th, 95th, 99th)
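
To show what such a chart is built from, the sketch below computes the 50th, 95th and 99th percentiles over a made-up set of logged durations and compares them with the mean. The numbers are hypothetical: 95 fast requests and 5 very slow ones.

```python
import statistics

# Hypothetical request durations in milliseconds: mostly fast, a few very slow.
durations = [20] * 95 + [2000] * 5

mean = statistics.mean(durations)

# quantiles(n=100) returns the 99 cut points p1..p99.
cuts = statistics.quantiles(durations, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print("mean:", mean)
print("p50:", p50, "p95:", p95, "p99:", p99)
```

With this data the median shows what a typical request sees and the 99th percentile shows the slow tail, while the mean sits between the two and describes neither.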

The real gain is for the team
Developers own their code in production
Strong support for our CI/CD workflow
Easier for developers to move between codebases
Support and sales engineering teams use the same logs!
Problems get identified and resolved quickly

Ask me about...
Why averages are bad and percentiles are the real thing
How to log errors and survive
How to sanitise sensitive data from leaking into logs
How to train your team on standard, structured logging
Anything else, really!

Thank you!
Questions? :)
