OSMC 2019 | How to improve database Observability by Charles Judith

How to improve database observability?
@Charles_JUDITH

About me
● Senior Site Reliability Engineer at Criteo
● Working on monitoring topics for a few years
● Currently providing the (open source) database service at Criteo
● Previously worked on the observability stack at Criteo
● @Charles_JUDITH on Twitter

Agenda
1. Context
2. First iteration
3. Second iteration
4. Next steps
5. Resources

Context
● Feedback from my experience at Criteo
● MariaDB/MySQL setup on multiple data centers
● Bare metal servers

Context
● Hidden issues (backup, usage, …)
● Incident resolution was based on vague information
● No alerting
● Dashboard with metrics

Goal
● Alerting
● No more hidden issues
● Dashboards for everyone
● An observable platform!
● The DBA team shouldn’t be a “blocker” for the users!

« OBSERVABILITY IS A MEASURE OF HOW WELL
INTERNAL STATES OF A SYSTEM CAN BE
INFERRED FROM KNOWLEDGE OF ITS EXTERNAL
OUTPUTS. »
SOURCE: WIKIPEDIA

What I think about observability
● It’s not only about the tools
● It’s not a fancy name to say “monitoring”
● It’s more about “transparency”

Why does a system need to be observable?
● Is it working as expected by the users?
● To answer basic questions about your service/platform
● Increase the visibility for you and your users/customers
● Long-term trend analysis
● “If you can’t measure it, you can’t manage it”

Observability is fundamental for reliability
Analogy to Maslow’s hierarchy of needs

The observability effects
● Giving superpowers
● It’s like a roller coaster
● You need to be patient

USE method
● USE was introduced by @brendangregg
● Utilization: disk, CPU usage, …
● Saturation: disk I/O
● Errors: network interface errors
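
As a rough illustration of the USE method, the sketch below reduces one resource sample to the three values. The `ResourceSample` fields and the numbers are made up for the example; they are not taken from the Criteo setup.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource (all field names are illustrative)."""
    name: str
    busy_seconds: float      # time the resource spent doing work in the interval
    interval_seconds: float  # length of the sampling interval
    queued_work: int         # work items waiting for the resource
    error_count: int         # errors reported during the interval


def use_report(sample: ResourceSample) -> dict:
    """Summarise a sample as Utilization, Saturation and Errors."""
    utilization = sample.busy_seconds / sample.interval_seconds
    return {
        "resource": sample.name,
        "utilization_pct": round(100 * utilization, 1),
        "saturation": sample.queued_work,
        "errors": sample.error_count,
    }


# Hypothetical disk that was busy 45 s out of a 60 s interval.
disk = ResourceSample("disk", busy_seconds=45.0, interval_seconds=60.0,
                      queued_work=3, error_count=0)
print(use_report(disk))
```

Here saturation is approximated by a queue length, which is one common choice; other resources may need a different proxy.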

The four golden signals
● Introduced in the Google SRE book
● Latency: response time, queue/wait time
● Traffic: A measure of how much demand is being placed on the service
● Errors: The rate of requests that fail
● Saturation: How “full” is the service

RED method
● RED was introduced by @tom_wilkie
● (Request) Rate - the number of requests, per second, your services are serving.
● (Request) Errors - the number of failed requests per second.
● (Request) Duration - distributions of the amount of time each request takes.
● Subset of “The Four Golden Signals”
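
A minimal sketch of how the three RED values could be derived from a window of request records. The record shape `(duration_seconds, succeeded)` and the nearest-rank percentile are assumptions made for the example.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Compute Rate, Errors and Duration for one observation window.

    requests: list of (duration_seconds, succeeded) tuples (hypothetical shape).
    """
    if not requests:
        return {"rate": 0.0, "errors": 0.0, "p50": None, "p95": None}
    durations = sorted(duration for duration, _ in requests)
    failed = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "rate": len(requests) / window_seconds,   # requests per second
        "errors": failed / window_seconds,        # failed requests per second
        "p50": percentile(durations, 50),         # duration distribution
        "p95": percentile(durations, 95),
    }


sample = [(0.010, True), (0.020, True), (0.030, True), (0.500, False)]
print(red_summary(sample, window_seconds=2))
```

Reporting percentiles rather than an average keeps the slow outlier visible, which is the point of treating duration as a distribution.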

The seven golden signals
● CELT + USE introduced by @xaprb
● Concurrency: number of simultaneous requests
● Error rate
● Latency: response time
● Throughput: queries per second (QPS)
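
The CELT half of these signals can be sketched from two snapshots of server counters. `Questions` and `Threads_running` are real MySQL/MariaDB status variables; `errors_total` and `query_seconds_total` are hypothetical counters assumed here for illustration, since how errors and query time are collected depends on the setup.

```python
def celt(previous, current, interval_seconds):
    """Derive Concurrency, Error rate, Latency and Throughput from two snapshots."""
    queries = current["Questions"] - previous["Questions"]
    errors = current["errors_total"] - previous["errors_total"]
    busy = current["query_seconds_total"] - previous["query_seconds_total"]
    return {
        # Gauge: statements executing at the time of the second snapshot.
        "concurrency": current["Threads_running"],
        # Fraction of statements that failed during the interval.
        "error_rate": errors / queries if queries else 0.0,
        # Average time per statement during the interval, in seconds.
        "latency": busy / queries if queries else 0.0,
        # Statements per second (QPS).
        "throughput": queries / interval_seconds,
    }


before = {"Questions": 1000, "Threads_running": 4,
          "errors_total": 5, "query_seconds_total": 20.0}
after = {"Questions": 1600, "Threads_running": 6,
         "errors_total": 11, "query_seconds_total": 23.0}
print(celt(before, after, interval_seconds=60))
```

Working on counter deltas rather than raw values means the result describes the interval between the snapshots, not the whole server uptime.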

CASE method
● CASE was introduced by @gphat
● Context-heavy
● Actionable
● Symptom-based
● Evaluated

Preferred approach
● The seven golden signals
● Good to measure the service quality
● System and application metrics are valuable in our case

How to collect the metrics?
● collectd
● Node exporter
● MySQLD exporter
● Python MySQL plugin for collectd
● A few others
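
Exporters such as the ones above publish metrics as plain text, one sample per line. The sketch below reads such a payload into a dictionary; the sample text is illustrative, and the parser assumes no trailing timestamps, so it is not a full implementation of the exposition format.

```python
def parse_exposition(text):
    """Parse 'name value' lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token on the line.
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


# Illustrative payload, shaped like what a MySQL exporter might return.
payload = """
# HELP mysql_up Whether the MySQL server is up.
# TYPE mysql_up gauge
mysql_up 1
mysql_global_status_threads_connected 12
mysql_global_status_threads_running 3
"""
metrics = parse_exposition(payload)
print(metrics["mysql_global_status_threads_connected"])
```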

What to do with all these metrics?
● Pick some useful “indicators” like:
○ thread usage
○ service status
○ backup status, duration, size
○ replication lag
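
One way to turn the indicators above into alerts is a simple threshold check. The field names and threshold values below are assumptions chosen for the example, not the ones used at Criteo.

```python
def evaluate_indicators(status, thresholds):
    """Return human-readable alerts for every indicator over its threshold."""
    alerts = []
    thread_usage = status["threads_connected"] / status["max_connections"]
    if thread_usage > thresholds["thread_usage"]:
        alerts.append(f"thread usage at {thread_usage:.0%}")
    if not status["service_up"]:
        alerts.append("database service is down")
    if status["replication_lag_seconds"] > thresholds["replication_lag_seconds"]:
        alerts.append(
            f"replication lag is {status['replication_lag_seconds']} seconds")
    if status["hours_since_last_backup"] > thresholds["backup_age_hours"]:
        alerts.append(
            f"last successful backup was {status['hours_since_last_backup']} hours ago")
    return alerts


# Hypothetical thresholds and a hypothetical server snapshot.
limits = {"thread_usage": 0.8, "replication_lag_seconds": 30, "backup_age_hours": 26}
snapshot = {"threads_connected": 180, "max_connections": 200, "service_up": True,
            "replication_lag_seconds": 5, "hours_since_last_backup": 30}
for alert in evaluate_indicators(snapshot, limits):
    print(alert)
```

Checking backup age alongside service status is what surfaces the "hidden issues" mentioned earlier: a stale backup raises no error on its own.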

How to show/use those metrics?

Disk partition full with tmp_table

Database cleanup and OPTIMIZE TABLE

DATABASES EXPOSE LOTS OF METRICS ABOUT
THEIR STATUS, BUT MUCH LESS ABOUT THE
DETAILS OF THEIR WORKLOAD.

Current status
● We have system and database metrics
● Alerting
● Dashboards with metrics easily available for everyone

“WE THINK OUR DATABASE IS SLOW?”
“Last week we noticed that
the database was slow.”

Logs
● Log all the SQL queries (general log)
● Install an agent to ship those logs with “custom fields”
● Configure MySQL/MariaDB to log the slow queries
● Use Rsyslog with a custom template!
● Make the logs available for our users
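
To give an idea of what shipping slow queries with custom fields involves, here is a sketch that turns one slow-log entry into a JSON record. It assumes a MySQL 5.7-style entry (MariaDB's layout differs slightly), and the `datacenter` and `cluster` fields are hypothetical examples of custom fields.

```python
import json
import re

METRICS_RE = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)"
    r"\s+Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)
USER_RE = re.compile(r"# User@Host: (?P<user>[^\[\s]+)\[")


def parse_slow_entry(entry, custom_fields):
    """Turn one slow-log entry into a flat dict enriched with custom fields."""
    record = dict(custom_fields)
    sql_lines = []
    for line in entry.strip().splitlines():
        line = line.strip()
        metrics = METRICS_RE.match(line)
        user = USER_RE.match(line)
        if metrics:
            record["query_time"] = float(metrics["query_time"])
            record["lock_time"] = float(metrics["lock_time"])
            record["rows_sent"] = int(metrics["rows_sent"])
            record["rows_examined"] = int(metrics["rows_examined"])
        elif user:
            record["user"] = user["user"]
        elif line and not line.startswith("#") and not line.startswith("SET timestamp"):
            sql_lines.append(line)  # the statement itself, possibly multi-line
    record["sql"] = " ".join(sql_lines)
    return record


entry = """
# Time: 2019-11-05T10:15:30.123456Z
# User@Host: app[app] @ web01 [10.0.0.5]  Id:    42
# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 10  Rows_examined: 100000
SET timestamp=1572948930;
SELECT * FROM orders WHERE customer_id = 7;
"""
record = parse_slow_entry(entry, {"datacenter": "dc1", "cluster": "orders"})
print(json.dumps(record))
```

The multi-line layout of each entry is a large part of why shipping slow queries is harder than shipping ordinary one-line logs.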

Benefits
● The DBA is not a blocker for the developers
● Better visibility on the database service
● Happy customers/developers/users

Conclusions
● The visibility and transparency
● Effective monitoring
● Shipping slow queries is not easy
● In this case, metrics and logs are a good combo, but we want more!

Next steps
● Continue to improve the SQL logging
● Leverage sys_schema
● Metrics per database
● Publish the SLA
● Open source our probe for MySQL/MariaDB

Resources
https://github.com/CharlesJUDITH/database-observability-toolkit
