Monitoring with Prometheus
By Kasper Nissen
@phennex

Hi!
My name is Kasper

What am I going to cover?
- Monitoring - why and what?
- Prometheus - an introduction
- Short demo

DEMO Part 1
https://github.com/kaspernissen/automation_night_demo

What to monitor?
Analyzing long-term trends

What to monitor?
Comparing over time or experiment groups

What to monitor?
Alerting

What to monitor?
Building dashboards

Conducting ad hoc retrospective analysis

Purpose:
What is broken, and why?

What to monitor?
Hosts
CPU, Memory, I/O, Network, Filesystem

What to monitor?
Containers
CPU, Memory, I/O, Restarts, Throttling

What to monitor?
Applications
Throughput, Latency

The Four Golden Signals
Source: Site Reliability Engineering: How Google Runs Production Systems

What to monitor?
Latency
The time it takes to service a request.
It is important to distinguish between the latency of successful and failed requests.

What to monitor?
Traffic
A measure of how much demand is being placed on your system, measured in a high-level system-specific metric.

What to monitor?
Errors
The rate of requests that fail, either explicitly (e.g. HTTP 500s) or implicitly (e.g. an HTTP 200 success response with the wrong content).

What to monitor?
Saturation
How “full” your service is: a measure of the fraction of system capacity in use, emphasizing the resources that are most constrained (e.g. in a memory-constrained system, show memory).
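
As a rough illustration of the four signals, here is a minimal Python sketch that derives them from a window of request records. Everything in it is an assumption made for the example: the `Request` record, the `golden_signals` helper, the choice of mean latency, and memory as the constrained resource. It only counts explicit failures (HTTP 5xx); implicit failures would need a check on the response content.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    duration_s: float  # time taken to service the request
    status: int        # HTTP status code


def golden_signals(requests, window_s, mem_used, mem_total):
    """Compute the four signals for one observation window (hypothetical helper)."""
    ok = [r.duration_s for r in requests if r.status < 500]
    failed = [r.duration_s for r in requests if r.status >= 500]
    return {
        # Latency, kept separate for successful and failed requests.
        "latency_ok_s": mean(ok) if ok else 0.0,
        "latency_failed_s": mean(failed) if failed else 0.0,
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that failed explicitly.
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        # Saturation: fraction of the constrained resource in use.
        "saturation": mem_used / mem_total,
    }


requests = [
    Request(0.1, 200),
    Request(0.3, 200),
    Request(2.0, 500),
    Request(0.2, 200),
]
signals = golden_signals(requests, window_s=60.0, mem_used=6.0, mem_total=8.0)
print(signals)
```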

What to monitor?
Prometheus
Prometheus was presented as the protector and benefactor of mankind.

Prometheus
- Heavily inspired by Borgmon
- Built by ex-Googlers at SoundCloud
- Pull-based (scrapes at regular intervals)
- Many integration possibilities

What is Prometheus?
- Monitoring system and time series database
- Instrumentation
- Metrics collection and storage
- Querying
- Alerting
- Dashboards / Graphing / Trending
Source: https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/

Prometheus focuses on
- Operational systems monitoring
- Dynamic cloud environments

Prometheus does not do
- Raw log / event collection (use ELK stack)
- Request tracing (use opentracing.io)
- “Magic” anomaly detection
- Durable long-term storage
- Automatic horizontal scaling
- User / auth management

Prometheus Architecture
[Architecture diagram. Labels: long-lived jobs, short-lived jobs, Pushgateway, Alertmanager, Grafana]

The Data model
Notation:
<metric name>{<label name>=<label value>, …}
Example:
api_http_requests_total{method="POST", handler="/messages"}
Every time series is uniquely identified by its metric name and a set of key-value pairs, also known as labels.
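
To make the identity rule concrete, here is a minimal Python sketch. It is an illustration only, not how Prometheus stores series internally; `series_id` is a hypothetical helper that renders a metric name and its labels in the notation above, sorting the labels so that their order does not matter.

```python
def series_id(name, labels):
    """Render a metric name and its labels in the notation shown above.

    Hypothetical helper for illustration. Labels are sorted so the same
    name and label set always produce the same identifier.
    """
    if not labels:
        return name
    pairs = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


post = series_id("api_http_requests_total", {"method": "POST", "handler": "/messages"})
same = series_id("api_http_requests_total", {"handler": "/messages", "method": "POST"})
get = series_id("api_http_requests_total", {"method": "GET", "handler": "/messages"})

print(post)
print(post == same)  # same name and label set: the same time series
print(post == get)   # one label value differs: a different time series
```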

How to get metrics?
- Directly instrumented software
- Software that is not directly instrumented, via an exporter
Source: https://promcon.io/2016-berlin/talks/so-you-want-to-write-an-exporter/
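
The exporter idea can be sketched with nothing but the Python standard library: a small HTTP server that translates values from some other system into the Prometheus text format on `/metrics`. This is a minimal sketch under stated assumptions, not a production exporter. The metric names `demo_queue_depth` and `demo_jobs_done_total`, the hard-coded values, and port 8000 are all made up for the example; a real exporter would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(queue_depth, jobs_done):
    """Render two hypothetical metrics in the Prometheus text format."""
    lines = [
        "# HELP demo_queue_depth Items currently waiting in the queue.",
        "# TYPE demo_queue_depth gauge",
        f"demo_queue_depth {queue_depth}",
        "# HELP demo_jobs_done_total Jobs completed since start.",
        "# TYPE demo_jobs_done_total counter",
        f"demo_jobs_done_total {jobs_done}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # A real exporter would read these values from the target system.
        body = render_metrics(queue_depth=3, jobs_done=42).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus would then be configured to scrape `localhost:8000/metrics` at a regular interval.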

Directly instrumented software
cAdvisor
Doorman
Etcd
Kubernetes-Mesos
Kubernetes
RobustIRC
SkyDNS
Weave Flux

Official Prometheus Exporters
Node/system metrics exporter
AWS CloudWatch exporter
Blackbox exporter
Collectd exporter
Consul exporter
Graphite exporter
HAProxy exporter
InfluxDB exporter
JMX exporter
Memcached exporter
Mesos task exporter
MySQL server exporter
SNMP exporter
StatsD exporter

3rd party exporters
Databases
Aerospike exporter
ClickHouse exporter
CouchDB exporter
MongoDB exporter
PgBouncer exporter
PostgreSQL exporter
ProxySQL exporter
Redis exporter
RethinkDB exporter
SQL query result set metrics exporter

3rd party exporters
Hardware related
apcupsd exporter
IoT Edison exporter
IPMI exporter
knxd exporter
Ubiquiti UniFi exporter
Messaging systems
NATS exporter
NSQ exporter
RabbitMQ exporter
RabbitMQ Management Plugin exporter
Mirth Connect exporter

3rd party exporters
Storage
Ceph exporter
ScaleIO exporter
HTTP
Apache exporter
Nginx metric library
Passenger exporter
Varnish exporter
WebDriver exporter
APIs
Docker Hub exporter
GitHub exporter
OpenWeatherMap exporter
Rancher exporter
Speedtest.net exporter
Logging
Google's mtail log data extractor
Grok exporter
Other monitoring systems
Cloud Foundry Firehose exporter
scollector exporter
Heka dashboard exporter
Heka exporter
Munin exporter
New Relic exporter
Miscellaneous
BIG-IP exporter
BIND exporter
BOSH exporter
Jenkins exporter
Meteor JS web framework exporter
Minecraft exporter module
PowerDNS exporter
rTorrent exporter
SMTP/Maildir MDA blackbox prober
Xen exporter

PromQL
- Non-SQL query language
- Better for metrics computation
- Only does reads

PromQL - Operators
Arithmetic: + (addition), - (subtraction), * (multiplication), / (division), % (modulo), ^ (exponentiation)
Comparison: == (equal), != (not-equal), > (greater-than), < (less-than), >= (greater-or-equal), <= (less-or-equal)
Logical/set: and (intersection), or (union), unless (complement)
… and vector matching
Source: https://prometheus.io

PromQL - Aggregation Operators
sum, min, max, avg, stddev, stdvar, count, count_values, bottomk, topk, quantile
Source: https://prometheus.io
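
As a rough picture of what two of these operators compute, the Python sketch below mimics `sum by (job) (...)` and `topk(2, ...)` over a small set of samples. The sample data and the helpers `sum_by` and `topk` are invented for the illustration; this is not Prometheus code.

```python
from collections import defaultdict

# A hypothetical instant vector: one (labels, value) sample per time series.
samples = [
    ({"job": "api", "instance": "a"}, 12.0),
    ({"job": "api", "instance": "b"}, 8.0),
    ({"job": "web", "instance": "c"}, 5.0),
]


def sum_by(label, vector):
    """Sum sample values per distinct value of one label."""
    totals = defaultdict(float)
    for labels, value in vector:
        totals[labels[label]] += value
    return dict(totals)


def topk(k, vector):
    """Return the k samples with the largest values."""
    return sorted(vector, key=lambda sample: sample[1], reverse=True)[:k]


print(sum_by("job", samples))
print(topk(2, samples))
```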

PromQL - Examples
rate(api_http_requests_total[5m])
errors{job="foo"} / total{job="foo"}
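
To show what a rate over a counter means, here is a simplified Python sketch. `simple_rate` is a hypothetical helper: it computes the per-second increase across the samples in a window and treats a drop in value as a counter reset. It does not extrapolate to the window boundaries the way the real `rate()` does, so treat it as an approximation of the idea only.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_s, value) samples.

    Simplified illustration: a drop in value is treated as a counter reset,
    and no extrapolation to the window boundaries is done.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # After a reset the counter restarts from zero,
            # so the whole new value counts as increase.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


steady = [(0, 0), (60, 60), (120, 120)]
with_reset = [(0, 100), (60, 160), (120, 20), (180, 80)]

print(simple_rate(steady))
print(simple_rate(with_reset))
```

The second query on the slide divides two such series with matching labels, which gives the fraction of requests that are errors.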

DEMO Part 2
https://github.com/kaspernissen/automation_night_demo

What to monitor?
Symptom-based alerting
Be proactive

What to monitor?
Prevent alert fatigue
- Use ticketing systems (avoid email spam)
- Warnings are tasks, just like new features

What to monitor?
Provide runbooks
- Keep them concise
- Explanation, hints, links
- Dynamic - include recent observations

What to monitor?
Practice outages
“Fire drills”, “game days” - repeat regularly

Start being proactive.
Don't be firefighters.

Hope is NOT a strategy
Source: Site Reliability Engineering, How Google Runs Production Systems (2016), B. Beyer et al.

If you wanna know more…
- prometheus.io
- promcon.io
- The Site Reliability Engineering book
- Podcasts:
  - https://dev.to/sedaily/prometheus-monitoring-with-brian-brazil
  - https://dev.to/sedaily/the-art-of-monitoring-with-james-turnbull (prefers push-based monitoring, unlike Prometheus)
  - https://dev.to/sedaily/prometheus-with-julius-volz

The 3rd project in CNCF: opentracing.io

Thank you!
kaspernissen@gmail.com
