The Hitchhiker’s Guide to Prometheus
Remco Overdijk
"A Metric, The Hitchhiker's Guide to Prometheus says, is
about the most massively useful thing someone doing
Monitoring can have. It has great practical value. You can
wave your Metric in emergencies as a distress signal, and
produce pretty Graphs at the same time."
1. The Landscape
What are we running and why?
2. Core Concepts
How does Prometheus work?
3. Demo Time!
It’s a Tools in Action talk after all, right?
4. Tips & Tricks
Getting the most out of your Prometheus Experience
5. Questions?
I’m probably going to answer “42” to most of them..
So many things to tell, so little time..
The Hitchhiker’s Guide to Prometheus
• Started out in TES, doing Metrics, Monitoring & Logging.
(Graphite, Statsd, Grafana, Nagios, Logstash, ElasticSearch, Kibana, etc. )
• Currently in DPI, doing CI/CD and bringing Gitlab/Spinnaker to the Cloud.
That requires a lot of monitoring…
• Member of the Cloud9 MML Circle, doing Prometheus
• Core Contributor to the R2D2 module that manages Prometheus and Monitoring/Alerting resources
within Cloud9
• Worked on implementing Prometheus and Grafana, while also using these stacks for monitoring
production systems.
• NightOwl for SRT Platform; I know how pagers work.
Who are you, and why are you telling us this?
Introduction
The Landscape
What are we running?
Data Center VS Cloud
VMs and Servers VS Containers in Kubernetes

                Data Center                      Cloud
Monitoring      Nagios + Thruk + Lookingglass    Prometheus
Metrics         Graphite + Statsd                Prometheus (+ InfluxDB/Thanos)
Alerting        SMS modems in physical servers   AlertManager, Iris, OnCall, Grafana
Visualization   Grafana                          Grafana
Logging         ElasticSearch + Kibana           StackDriver, ElasticSearch + Kibana
• Applications in Kubernetes are much more dynamic than we’re used to.
  • No static IP addresses.
  • No static number of servers (well, pods actually..)
  • Kubernetes can reschedule / relocate pods at will.
  • Prometheus uses Service Discovery to find targets.
• Both Nagios and Graphite have scaling issues and are too rigid.
  • Prometheus is Pull instead of Push based and doesn’t require execution for every single check.
  • Combines Metrics & Monitoring into a single stack, but focuses on Monitoring.
• Being based on BorgMon, it works out of the box with a lot of Kubernetes /
Cloud native components and the services supporting them.
• StackDriver is not a full-fledged alternative due to features, retention and cost.
Why didn’t you come up with something else?
So, why Prometheus?
• Out of the box, Prometheus doesn’t scale endlessly without compromises either
(but Thanos will).
• Scalability is solved through retention, manual sharding and vertical scaling,
which all have clear drawbacks.
• HA is solved through duplication (polling twice from independent instances
with individual TSDBs).
• Prometheus development is very focused, which shows in certain aspects.
Well.. No.
Is this the answer to everything then?
All the pods & services
Infrastructure Overview
[Diagram: Kubernetes {DEV, STG, PRO} clusters and datacenters. Components shown: Prometheus (x2), AlertManager (x3), Grafana, PushGateway, IRIS, OnCall, SMS / Call Provider, HipChat, Operator, Remote Storage Adapter, InfluxDB, YOUR App!, Kubernetes, Exporters]
Core Concepts
How does it work and what makes it tick?
- Counters
- A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only
increase or be reset to zero on restart. (1, 2, 5, 9, 0, 2, 7)
- Gauges
- A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
(1, 4, 2, 5, 8)
- Histograms
- A histogram samples observations (usually things like request durations or response sizes) and counts them in
configurable buckets. It also provides a sum of all observed values.
- Summaries
- Similar to a histogram, a summary samples observations (usually things like request durations and response
sizes). While it also provides a total count of observations and a sum of all observed values, it calculates
configurable quantiles over a sliding time window.
- Quantiles are convenient for expressing, for example, the median (the 0.5-quantile) and the 95th percentile (the 0.95-quantile).
Supported Types
Making Metrics
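To make the difference between the types more concrete, here is a minimal sketch in plain Python. The class names (Counter, Gauge, Histogram) and the bucket bounds are assumptions chosen for illustration; this is not the API of any Prometheus client library.

```python
import math


class Counter:
    """Cumulative value that only goes up (a restart would reset it to 0)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Single value that can go up and down arbitrarily."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations in cumulative buckets and tracks their sum."""

    def __init__(self, bounds):
        # Hypothetical upper bounds; +Inf catches every observation.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Buckets are cumulative: an observation counts in every bucket
        # whose upper bound is greater than or equal to the value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1


requests = Counter()
in_flight = Gauge()
latency = Histogram([0.1, 0.5, 1.0])

for seconds in (0.05, 0.3, 0.7, 2.0):
    requests.inc()
    latency.observe(seconds)

in_flight.set(3)
in_flight.set(1)  # a gauge may go down again
```

A summary would keep the same sum and count, but calculate quantiles over a sliding time window instead of filling fixed buckets.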
- Instead of creating separate checks for every metric that should be monitored for your
application, you expose one (or more) HTTP endpoints containing all metrics.
- It’s your responsibility to make this endpoint Available, Fast and Reliable.
- Multiple Frameworks and Libraries can help you provision and maintain such an
endpoint.
- Axle comes with built-in support for MicroMeter, which does everything for you.
- Backspin support is coming soon™.
- Example: http://localhost:30000/metrics
The concept of Scraping HTTP Metric Endpoints
Exposing Metrics: Push VS Pull
# HELP prometheus_tsdb_head_min_time Minimum time bound of the head block.
# TYPE prometheus_tsdb_head_min_time gauge
prometheus_tsdb_head_min_time 1.5282792e+12
# HELP prometheus_tsdb_head_samples_appended_total Total number of appended samples.
# TYPE prometheus_tsdb_head_samples_appended_total counter
prometheus_tsdb_head_samples_appended_total 2.9485092e+07
# HELP prometheus_tsdb_head_series Total number of series in the head block.
# TYPE prometheus_tsdb_head_series gauge
prometheus_tsdb_head_series 19956
# HELP prometheus_tsdb_head_series_created_total Total number of series created in the head
# TYPE prometheus_tsdb_head_series_created_total gauge
prometheus_tsdb_head_series_created_total 56888
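The text format shown above is simple enough to produce and read by hand. The sketch below renders a few metrics into that format and parses the samples back. The function names and the tuple-based input shape are assumptions made for this example; real applications would normally rely on a client library.

```python
def render_exposition(metrics):
    """Render metrics in the Prometheus text exposition format.

    `metrics` is a list of (name, type, help_text, value) tuples; this
    input shape is an assumption made for the sketch.
    """
    lines = []
    for name, metric_type, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def parse_samples(text):
    """Parse unlabelled samples back into a {name: float} dict."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.split()
        samples[name] = float(value)
    return samples


page = render_exposition([
    ("prometheus_tsdb_head_series", "gauge",
     "Total number of series in the head block.", 19956),
    ("prometheus_tsdb_head_samples_appended_total", "counter",
     "Total number of appended samples.", 2.9485092e+07),
])
samples = parse_samples(page)
```

Serving `page` from an HTTP handler at /metrics is all that is needed for Prometheus to scrape it; the parser only handles samples without labels or timestamps.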
- An actual Query Language that looks a lot more like SQL than Graphite.
- You’ll need to learn a new language, but it’s only a single language for creating Graphs and Alerts; for
monitoring and long term metrics.
- Allows for a lot of flexibility, but can be a bit harder to grasp when starting out.
- Supports functions, operators, regex, arithmetic and expressions.
- Four expression types are supported:
- Instant Vectors (like http_requests_total{environment=~"staging|testing|development", method!="GET"})
- Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp
(instant): in the simplest form, only a metric name is specified. This results in an instant vector containing elements for all time
series that have this metric name.
- Range Vectors (like http_requests_total{job="prometheus"}[5m] )
- Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant.
Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time
values should be fetched for each resulting range vector element.
- Scalars
- Strings
PromQL
Querying Metrics
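The way an instant vector selector picks series can be approximated in a few lines of Python. This is only a sketch: the in-memory SERIES list and its values are made up, and Python's `re` module stands in for the RE2 syntax that PromQL actually uses. The selector evaluated at the end is the one from the Instant Vectors example.

```python
import re

# Hypothetical in-memory series: a label set plus its latest sample value.
SERIES = [
    ({"__name__": "http_requests_total", "environment": "staging", "method": "POST"}, 12.0),
    ({"__name__": "http_requests_total", "environment": "staging", "method": "GET"}, 40.0),
    ({"__name__": "http_requests_total", "environment": "production", "method": "POST"}, 7.0),
]

# Each operator maps to a predicate over (label value, matcher operand).
# Regex matchers are fully anchored, hence re.fullmatch.
OPERATORS = {
    "=": lambda value, operand: value == operand,
    "!=": lambda value, operand: value != operand,
    "=~": lambda value, operand: re.fullmatch(operand, value) is not None,
    "!~": lambda value, operand: re.fullmatch(operand, value) is None,
}


def instant_vector(name, matchers):
    """Return (labels, value) pairs for series that satisfy every matcher."""
    result = []
    for labels, value in SERIES:
        if labels["__name__"] != name:
            continue
        # A missing label is treated as the empty string.
        if all(OPERATORS[op](labels.get(label, ""), operand)
               for label, op, operand in matchers):
            result.append((labels, value))
    return result


# http_requests_total{environment=~"staging|testing|development", method!="GET"}
selected = instant_vector("http_requests_total", [
    ("environment", "=~", "staging|testing|development"),
    ("method", "!=", "GET"),
])
```

A range vector would apply the same matching, but return all samples within the requested window for each series instead of a single value.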
- Custom Resource Type provided by Prometheus-operator
- Abstraction of Prometheus “job” and Service Discovery
- Allows for easy ingestion of new endpoints through their k8s service
- Example:
ServiceMonitors
Getting your endpoint monitored
[Diagram: YOUR App!, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus]
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: https
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: k8s-app
  selector:
    matchLabels:
      k8s-app: node-exporter
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: node-exporter
  name: node-exporter
spec:
  ports:
  - name: https
    port: 9100
    protocol: TCP
    targetPort: https
  selector:
    app: node-exporter
  type: ClusterIP
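The two manifests are tied together purely by labels: the ServiceMonitor selects Services through matchLabels, and the Service in turn selects pods through its own selector. A minimal sketch of that matching logic follows; the pod names are hypothetical and the real operator does considerably more than this.

```python
def selects(match_labels, resource_labels):
    """True when every key/value pair in match_labels is set on the resource."""
    return all(resource_labels.get(key) == value
               for key, value in match_labels.items())


# Values taken from the ServiceMonitor and Service examples.
service_monitor_match_labels = {"k8s-app": "node-exporter"}
service_labels = {"k8s-app": "node-exporter"}
service_pod_selector = {"app": "node-exporter"}

# Hypothetical pods, only here to show the second hop.
pods = [
    {"name": "node-exporter-abc12", "labels": {"app": "node-exporter"}},
    {"name": "grafana-0", "labels": {"app": "grafana"}},
]

# Hop 1: does the ServiceMonitor pick up the Service?
monitored = selects(service_monitor_match_labels, service_labels)

# Hop 2: which pods sit behind the Service and end up as scrape targets?
targets = [pod["name"] for pod in pods
           if selects(service_pod_selector, pod["labels"])]
```

Note that the two hops use different label keys (k8s-app on the Service, app on the pods), which is an easy thing to get wrong when writing your own ServiceMonitor.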
- The same tool you were probably already using.
- The central interface for cloud insights
- Contains a specialized query editor for Prometheus data sources.
- Prometheus currently doesn’t store metrics older than one month for performance reasons.
- Multiple solutions for long term metrics exist, but it’s a work in progress.
Dashboarding with Grafana
Creating Insights
[Diagram: Prometheus (x2), Grafana, HipChat, Remote Storage Adapter, InfluxDB]
Trouble in Paradise
Creating Alerts, choosing your weapon
WARNINGS – Notifications during work hours
- No direct intervention is required
- Usually picked up by members of the team
developing / maintaining a system.
- Alert delivery is NOT guaranteed.
Use Grafana with HipChat or Email alerts
CRITICALS – 24x7 Text Messages with Escalation
- Actionable events that require immediate attention
by an Engineer on Duty, who does not necessarily
have intimate knowledge of your system.
- Response is required to silence/end the alert.
- Provisioned through RuleList (R2D2 / Operator)
Use AlertManager / Iris / OnCall
Yes, It’s PromQL as well!
Alert Basics
%YAML 1.1
---
kind: PrometheusAlertRule
data:
  test.rules: |
    groups:
    - name: Load
      interval: 30s
      rules:
      - alert: HighLoad
        expr: rate(web_http_responses_total[1m]) > 1
        for: 1m
        labels:
          severity: attention
        annotations:
          description: The rate of HTTP requests is too high.
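The `for: 1m` clause means the expression has to stay true for a full minute before the alert fires; until then it is only pending. A simplified sketch of that state machine, using the 30s evaluation interval and the threshold of 1 from the rule, could look like this. The function name and the sample values are assumptions for illustration.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) for each rule evaluation.

    `samples` is a list of (timestamp_seconds, value) pairs, one per
    evaluation. States are "inactive", "pending" and "firing".
    """
    active_since = None
    states = []
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held = timestamp - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            active_since = None  # condition cleared, start over
            state = "inactive"
        states.append((timestamp, state))
    return states


# Hypothetical request rates, evaluated every 30 seconds.
evaluations = [(0, 0.5), (30, 1.2), (60, 1.5), (90, 1.4), (120, 0.8)]
states = alert_states(evaluations, threshold=1, for_seconds=60)
```

With these values the alert is pending at 30s and 60s, fires at 90s and returns to inactive at 120s, so a short spike never reaches the pager.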
- Alerts should be actionable: Somebody has to do something, now.
- They should be simple: Someone without intimate knowledge of the system should ideally be
able to solve the alert.
- They should be urgent and require human intervention: No point in waking someone up if they
shouldn’t have to do something, or when tomorrow afternoon would be soon enough.
- Provide accurate descriptions and a playbook where possible.
- Basic system monitoring should be based on SLIs/SLOs rather than infra metrics.
- Prefer AM/Iris/OnCall if you’re serious about your alert.
Creating the perfect alert
Alert Perfection
[Diagram: Prometheus, AlertManager (x3), Grafana, IRIS, OnCall, SMS / Call Provider, HipChat]
• A long list of exporters is available at https://prometheus.io/docs/instrumenting/exporters/
• A number of these come preconfigured with our Kubernetes clusters and provide additional metrics
When artisanal endpoints don’t cut the cake
Exporters - Additional sources of metrics
Databases
Aerospike exporter
ClickHouse exporter
Consul exporter (official)
CouchDB exporter
ElasticSearch exporter
Memcached exporter (official)
MongoDB exporter
MSSQL server exporter
MySQL server exporter (official)
OpenTSDB Exporter
Oracle DB Exporter
PgBouncer exporter
PostgreSQL exporter
ProxySQL exporter
RavenDB exporter
Redis exporter
RethinkDB exporter
SQL exporter
Tarantool metric library
Hardware related
apcupsd exporter
Collins exporter
IoT Edison exporter
IPMI exporter
knxd exporter
Node/system metrics exporter (official)
Ubiquiti UniFi exporter
Messaging systems
Beanstalkd exporter
Gearman exporter
Kafka exporter
NATS exporter
NSQ exporter
Mirth Connect exporter
MQTT blackbox exporter
RabbitMQ exporter
RabbitMQ Management Plugin exporter
Storage
Ceph exporter
Ceph RADOSGW exporter
Gluster exporter
Hadoop HDFS FSImage exporter
Lustre exporter
ScaleIO exporter
HTTP
Apache exporter
HAProxy exporter (official)
Nginx metric library
Nginx VTS exporter
Passenger exporter
Tinyproxy exporter
Varnish exporter
WebDriver exporter
APIs
AWS ECS exporter
AWS Health exporter
AWS SQS exporter
Cloudflare exporter
DigitalOcean exporter
Docker Cloud exporter
Docker Hub exporter
GitHub exporter
InstaClustr exporter
Mozilla Observatory exporter
OpenWeatherMap exporter
Pagespeed exporter
Rancher exporter
Speedtest exporter
Logging
Fluentd exporter
Google's mtail log data extractor
Grok exporter
Other monitoring systems
Akamai Cloudmonitor exporter
AWS CloudWatch exporter (official)
Cloud Foundry Firehose exporter
Collectd exporter (official)
Google Stackdriver exporter
Graphite exporter (official)
Heka dashboard exporter
Heka exporter
InfluxDB exporter (official)
JavaMelody exporter
JMX exporter (official)
Munin exporter
Nagios / Naemon exporter
New Relic exporter
NRPE exporter
Osquery exporter
Pingdom exporter
scollector exporter
Sensu exporter
SNMP exporter (official)
StatsD exporter (official)
Miscellaneous
Bamboo exporter
BIG-IP exporter
BIND exporter
Bitbucket exporter
Blackbox exporter (official)
BOSH exporter
cAdvisor
Confluence exporter
Dovecot exporter
eBPF exporter
Jenkins exporter
JIRA exporter
Kannel exporter
Kemp LoadBalancer exporter
Meteor JS web framework exporter
Minecraft exporter module
PHP-FPM exporter
PowerDNS exporter
Process exporter
rTorrent exporter
SABnzbd exporter
Script exporter
Shield exporter
SMTP/Maildir MDA blackbox prober
SoftEther exporter
Transmission exporter
Unbound exporter
Xen exporter
• StackDriver Exporter – Get your GCP Project’s native metrics into Prometheus.
• Blackbox Exporter – Monitor Golden Signals on any system, without knowledge of its inner workings.
• Nginx Exporter – Used in Ingresses.
• SNMP Exporter – Bring your own MIBs.
• Statsd Exporter – Push your statsd metrics to a sidecar container.
• Node Exporter – Provides system metrics for VMs and physical systems (like Kubernetes nodes).
• cAdvisor – Get generic container metrics.
• Etcd
• Kubernetes
• Minio (Gitlab Runner Caching)
The most commonly used
Exporters - Highlights
[Diagram: Exporter, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus]
• For situations where you are unable to serve an HTTP metrics page for a reliable period of time.
• Ideal for short running tasks like Kubernetes CronJobs, Hadoop Jobs, Scripts, etc.
• Allows you to Push (through an HTTP call) Metrics to a buffering service, which in turn exposes them to
Prometheus.
• Metrics will live forever on the Gateway, so be careful of what you push and how you name them.
• Avoid this route if possible, since it scales very badly and is NOT redundant. Bring your own endpoint if
and when possible.
• PRO-Tip: If you have an ephemeral job, also push the timestamp of the last successful job completion.
The Push Gateway
Metrics for ephemeral jobs
[Diagram: YOUR App!, Push Gateway, Prometheus]
echo "ultimate_answer 42.0" | curl --data-binary @- http://gateway:9091/metrics/job/magrathea/instance/zaphod-001/group/vogon/opex/DPI
ultimate_answer{group="vogon",instance="zaphod-001",job="magrathea",opex="DPI"} 42.0
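The same push can be scripted from a job. The sketch below builds the grouping URL used in the curl example and a payload that also carries a completion timestamp, as suggested in the PRO-Tip. The helper names and the metric name job_last_success_timestamp_seconds are assumptions, and the actual push is left commented out because it needs a reachable gateway.

```python
import urllib.request
from urllib.parse import quote


def pushgateway_url(base, job, grouping):
    """Build the push URL: <base>/metrics/job/<job>/<label>/<value>/..."""
    parts = [base.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in grouping.items():
        parts += [quote(label, safe=""), quote(value, safe="")]
    return "/".join(parts)


def build_payload(answer, completed_at):
    """Text-format body; the timestamp metric name is hypothetical."""
    return (f"ultimate_answer {answer}\n"
            f"job_last_success_timestamp_seconds {completed_at}\n")


def push(url, body):
    """POST the payload to the gateway and return the HTTP status."""
    request = urllib.request.Request(url, data=body.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status


url = pushgateway_url(
    "http://gateway:9091",
    "magrathea",
    {"instance": "zaphod-001", "group": "vogon", "opex": "DPI"},
)
payload = build_payload(42.0, 1528279200)
# push(url, payload)  # left commented out: requires a live Push Gateway
```

Alerting on the age of the pushed timestamp is what tells you that a job silently stopped running, since the last pushed values stay on the gateway forever.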
Demo Time!
• Kubernetes Running on Docker for macOS.
• Out of the box Prometheus on Kubernetes from https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus
• Services are running without an Ingress, so we’re accessing them directly, using NodePorts.
• We’re going to add our own Full Featured Axle Service by creating a Deployment and a Service to match
it, adding a ServiceMonitor, watching Service Discovery do its thing, graphing one of the metrics and
creating an alert for it.
• Prometheus: http://localhost:30000/graph
• AlertManager: http://localhost:31000/#/alerts
• Grafana: http://localhost:32000/d/9dP_FHImz/pods
Getting started in 5 minutes
Today’s Quick Demo
Tips & Tricks
Getting the most out of your Prometheus Experience
• Metrics in Prometheus are multi-dimensional: they consist of names and labels.
• Names are generic identifiers to tell WHAT you are measuring, in what format.
• Metric Names SHOULD have a single (base!) unit, added as a suffix describing that unit. (bytes, seconds,
meters)
• Labels describe characteristics, and are usually used to identify WHERE those metrics are coming from,
and can be multi-faceted.
• Prometheus saves a separate Time Series for each name/labels combination, so you have to ensure
label cardinality does not get too high, or you will kill Prometheus in the end. (Bad examples: usernames,
internet IP addresses, hashes).
• Read https://prometheus.io/docs/practices/naming/ before you start making your own!
Keep things running smoothly by not making a mess.
Metric Naming
api_http_requests_total { type="create|update|delete", method="GET|POST|DELETE" }
api_request_duration_seconds { stage="extract|transform|load" }
api_errors_total { endpoint="listProducts|updatePricing", code="500|404|418 I'm a teapot" }
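Because every name/labels combination becomes its own time series, the worst-case number of series for a metric is the product of the number of values per label. A small sketch of that arithmetic, using the label values of api_http_requests_total and a hypothetical username label with 10,000 values:

```python
from itertools import product
from math import prod


def series_count(label_values):
    """Upper bound on the number of series: product of values per label."""
    return prod(len(values) for values in label_values.values())


def label_sets(label_values):
    """Enumerate every possible label combination."""
    names = list(label_values)
    return [dict(zip(names, combination))
            for combination in product(*label_values.values())]


api_http_requests_total = {
    "type": ["create", "update", "delete"],
    "method": ["GET", "POST", "DELETE"],
}
bounded = series_count(api_http_requests_total)

# Hypothetical bad idea: adding a username label with 10,000 distinct users.
with_username = dict(api_http_requests_total,
                     username=[f"user{i}" for i in range(10000)])
unbounded = series_count(with_username)
```

Two small labels give at most 9 series; adding the username label multiplies that to 90,000 for a single metric, which is why unbounded values make poor labels.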
• An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of
the level of service that is provided.
• An SLO is a service level objective: a target value or range of values for a service level that is
measured by an SLI. A natural structure for SLOs is thus
[SLI ≤ target], or [lower bound ≤ SLI ≤ upper bound].
• Symptoms vs Causes: Monitor things that users will notice when using your system.
• Latency - The time it takes to service a request.
• Traffic - A measure of how much demand is being placed on your system, measured in a
high-level system-specific metric. For a web service, this measurement is usually HTTP
requests per second.
• Errors - The rate of requests that fail (like HTTP 500s).
• Saturation - How "full" your service is. A measure of your system fraction, emphasizing the
resources that are most constrained.
What should you be monitoring?
The Golden Signals
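As a worked example of the [SLI ≤ target] structure, the sketch below derives an error-ratio SLI from the Traffic and Errors signals and checks it against an objective. The request counts and the 0.1% target are made-up numbers, and the function names are assumptions for illustration.

```python
def error_ratio_sli(total_requests, failed_requests):
    """SLI: fraction of requests that failed."""
    if total_requests == 0:
        return 0.0  # no traffic, so nothing failed
    return failed_requests / total_requests


def meets_slo(sli, target):
    """SLO of the form [SLI <= target]."""
    return sli <= target


def within_slo(lower_bound, sli, upper_bound):
    """SLO of the form [lower bound <= SLI <= upper bound]."""
    return lower_bound <= sli <= upper_bound


# Hypothetical numbers: 15 failures out of 20,000 requests, target 0.1%.
sli = error_ratio_sli(total_requests=20000, failed_requests=15)
ok = meets_slo(sli, target=0.001)

# A worse period: 50 failures out of 20,000 requests.
degraded = meets_slo(error_ratio_sli(20000, 50), target=0.001)
```

In PromQL the same ratio would typically be the rate of failed requests divided by the rate of all requests, with the comparison against the target living in an alert expression.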
• BlackBox Exporter for periodic requests and their Metrics (Success, Latency, Errors)
• Nginx Ingress Metrics for a man-in-the-middle view of your application (Flow, Latency, Errors)
• Your own application’s Metrics for insights, details and an under-the-hood view.
Combining Metric Sources for an unbiased view
Bringing it all together
[Diagram: Your App, Ingress and Blackbox Exporter, providing App Metrics, Ingress Metrics and Poll Metrics]
Prometheus scrape config:
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]  # Look for a HTTP 200 response.
  static_configs:
  - targets:
    - http://myapp.behindingress.io  # Target to probe with http
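Behind the scenes, a blackbox probe is just an HTTP request to the exporter's /probe endpoint, with the module and the target passed as query parameters. The sketch below builds such a URL. The exporter address blackbox-exporter:9115 is an assumption, and in a typical setup relabeling rules (not shown in the scrape config) take care of moving the target into the `target` parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def probe_url(exporter_address, target, module="http_2xx"):
    """Build the URL that is requested for a single probe."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter_address}/probe?{query}"


# Hypothetical exporter address; the target comes from the scrape config.
url = probe_url("blackbox-exporter:9115", "http://myapp.behindingress.io")

# Decoding the query string again shows what the exporter receives.
params = parse_qs(urlsplit(url).query)
```

Opening that URL in a browser is a quick way to debug a probe, since the exporter answers with the probe's metrics in the regular text format.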
• Introducing the GenericServiceMonitor and DCServiceMonitor
• These types allow you to define endpoints outside of Kubernetes, and allow
you to monitor on-premises services.
• DCServiceMonitor works based on bol_applications and as such is bol.com
specific.
• GenericServiceMonitor works on static endpoints.
My stuff runs in the DC and I want to keep it there.
So what about non-Cloud resources?
kind: Prometheus/DCServiceMonitor
name: tst-sdd-app
spec:
  port: 8080
  path: /internal/metrics

kind: Prometheus/GenericServiceMonitor
name: dev-atscale-app
spec:
  hosts:
  - ip: 1.2.3.4
    hostname: some.host.name
  port: 8080
  path: /internal/metrics
  opex: srt-bificsps
• Always initialize your metrics at zero when possible, or you won’t know the significance of the
first value.
• How do you know if your application is OK when the metrics stopped working? The up metric
might also disappear when Service Discovery no longer detects your service. Always use
absent() to check for existence of up!
• Apply (i)rate()/increase() first and then sum(), not sum() and then (i)rate()/increase(): only
rate(), irate() and increase() handle counter resets safely, and they can only do so on individual series.
• The rate function takes a time series over a time range and calculates the per-second average rate of
increase, based on the first and last data points within that range and adjusted for counter resets
(http://localhost:32000/d/h3RZO2Iik/rate-vs-irate?orgId=1 )
• By contrast, irate is an instant rate. It only looks at the last two points within the
range passed to it and calculates a per-second rate.
• To complement the saturation signal, Prometheus has predict_linear() for Gauges.
• All the metrics? http://localhost:30000/federate?match[]={__name__%3D~%22[a-z].*%22}
Things you’ll encounter once you start making queries
Other tips
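The advice about the order of rate() and sum() is easier to see with numbers. The sketch below implements simplified versions of increase, rate and irate; unlike Prometheus they do not extrapolate to the edges of the window. It then compares both orders for two hypothetical pods, one of which restarts.

```python
def increase(samples):
    """Total increase over the window, compensating for counter resets.

    `samples` is a list of (timestamp_seconds, value), oldest first.
    """
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as increase.
        total += current if current < previous else current - previous
    return total


def rate(samples):
    """Per-second average rate of increase over the window."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


def irate(samples):
    """Instant rate: only the last two samples in the window are used."""
    return rate(samples[-2:])


# Hypothetical counters, both truly growing by 1 request per second.
pod_a = [(0, 100.0), (60, 160.0), (120, 220.0)]
pod_b = [(0, 500.0), (60, 560.0), (120, 20.0)]  # restarted before t=120

# Correct order: rate per series, then sum.
rate_then_sum = rate(pod_a) + rate(pod_b)

# Wrong order: summing first hides which series was reset.
summed = [(t, a + b) for (t, a), (_, b) in zip(pod_a, pod_b)]
sum_then_rate = rate(summed)
```

Taking the rate per pod gives roughly 1.67 requests per second in total, slightly low because the increments lost in the restart are unknown. Summing first makes the drop from 720 to 240 look like a reset of the combined counter and reports 3.0, almost double the real traffic.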
Questions?
Don’t bother to ask me the Ultimate Question of Life, the
Universe and Everything, because you already know the answer.
(and yes, I know where my towel is.)
Remco Overdijk
roverdijk@bol.com
So Long!
And thanks for all the fish.

More Related Content

PDF
Love the Problem, Not Your Solution
Ash Maurya
 
PDF
Invision Design Systems Handbook
Harsha MV
 
PDF
Usability testing
Priyanka Rana
 
PPTX
Agile metrics
Ankit Tandon
 
PPTX
Agile Methodology and Tools
Naresh Gajuveni
 
PPTX
UX in Real Life
Narek Kozmoyan
 
PPTX
Water-Scrum-Fall: The Good, the Bad, and the [Scrum]Butt-Ugly
Brad Appleton
 
PDF
DevOps vs Traditional IT Ops (DevOps Days ignite talk by Oliver White)
ZeroTurnaround
 
Love the Problem, Not Your Solution
Ash Maurya
 
Invision Design Systems Handbook
Harsha MV
 
Usability testing
Priyanka Rana
 
Agile metrics
Ankit Tandon
 
Agile Methodology and Tools
Naresh Gajuveni
 
UX in Real Life
Narek Kozmoyan
 
Water-Scrum-Fall: The Good, the Bad, and the [Scrum]Butt-Ugly
Brad Appleton
 
DevOps vs Traditional IT Ops (DevOps Days ignite talk by Oliver White)
ZeroTurnaround
 

What's hot (20)

PDF
Ppt design thinking_chandra_kusuma_xii-ips
Chandra Kusuma
 
PDF
Portfolio Kanban
Pawel Brodzinski
 
PDF
Kanban for Portfolio Management
Gaetano Mazzanti
 
PDF
Approaches to scaling agile
Srinath Ramakrishnan
 
PPTX
Intro to ux and how to design a thoughtful ui
Thanos Makaronas
 
PDF
Visualization in Agile
Vineet Patni
 
PPTX
SMAC
Mphasis
 
PDF
Measuring What Matters: A UX Approach to Metrics :: UX Days Tokyo [April 2015]
Kate Rutter
 
PDF
Before After PowerPoint Presentation Slides
SlideTeam
 
PDF
Growing up with agile - how the Spotify 'model' has evolved
Peter Antman
 
PPTX
Chatbot ppt
Geff Thomas
 
PDF
Lyssa Adkins & Michael Spayd (Keynote)
AgileNZ Conference
 
PPTX
How to Break the Requirements into User Stories
ShriKant Vashishtha
 
PPSX
UX Explained
Mind Over Machines
 
PDF
How spotify makes product
Ali Sarrafi
 
PDF
Impact Maps and Story Maps: delivering what really matters
Christian Hassa
 
PPTX
Learning a Personalized Homepage
Justin Basilico
 
PDF
Usability Testing 101 - an introduction
Elizabeth Snowdon
 
PPT
מצגת מלווה לשיעור בנושא העור
Carmit Cohen
 
Ppt design thinking_chandra_kusuma_xii-ips
Chandra Kusuma
 
Portfolio Kanban
Pawel Brodzinski
 
Kanban for Portfolio Management
Gaetano Mazzanti
 
Approaches to scaling agile
Srinath Ramakrishnan
 
Intro to ux and how to design a thoughtful ui
Thanos Makaronas
 
Visualization in Agile
Vineet Patni
 
SMAC
Mphasis
 
Measuring What Matters: A UX Approach to Metrics :: UX Days Tokyo [April 2015]
Kate Rutter
 
Before After PowerPoint Presentation Slides
SlideTeam
 
Growing up with agile - how the Spotify 'model' has evolved
Peter Antman
 
Chatbot ppt
Geff Thomas
 
Lyssa Adkins & Michael Spayd (Keynote)
AgileNZ Conference
 
How to Break the Requirements into User Stories
ShriKant Vashishtha
 
UX Explained
Mind Over Machines
 
How spotify makes product
Ali Sarrafi
 
Impact Maps and Story Maps: delivering what really matters
Christian Hassa
 
Learning a Personalized Homepage
Justin Basilico
 
Usability Testing 101 - an introduction
Elizabeth Snowdon
 
מצגת מלווה לשיעור בנושא העור
Carmit Cohen
 
Ad

Similar to The hitchhiker’s guide to Prometheus (20)

PDF
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Brian Brazil
 
PPTX
Prometheus for Monitoring Metrics (Fermilab 2018)
Brian Brazil
 
PDF
Microservices and Prometheus (Microservices NYC 2016)
Brian Brazil
 
PDF
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Brian Brazil
 
PPTX
An Introduction to Prometheus (GrafanaCon 2016)
Brian Brazil
 
PDF
Prometheus and Docker (Docker Galway, November 2015)
Brian Brazil
 
PDF
Prometheus (Microsoft, 2016)
Brian Brazil
 
PDF
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Accumulo Summit
 
PDF
Monitoring with prometheus at scale
Juraj Hantak
 
PDF
Monitoring with prometheus at scale
Adam Hamsik
 
PDF
Distributed Tracing
distributedtracing
 
PDF
RxJava@Android
Maxim Volgin
 
PDF
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Brian Brazil
 
PPTX
Mini training - Reactive Extensions (Rx)
Betclic Everest Group Tech Team
 
PPTX
Distributed tracing 101
Itiel Shwartz
 
PDF
Go Observability (in practice)
Eran Levy
 
PPTX
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Sridhar Kumar N
 
PDF
Slack in the Age of Prometheus
George Luong
 
PPT
Monitoring using Prometheus and Grafana
Arvind Kumar G.S
 
PDF
Prometheus Overview
Brian Brazil
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Brian Brazil
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Brian Brazil
 
Microservices and Prometheus (Microservices NYC 2016)
Brian Brazil
 
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Brian Brazil
 
An Introduction to Prometheus (GrafanaCon 2016)
Brian Brazil
 
Prometheus and Docker (Docker Galway, November 2015)
Brian Brazil
 
Prometheus (Microsoft, 2016)
Brian Brazil
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Accumulo Summit
 
Monitoring with prometheus at scale
Juraj Hantak
 
Monitoring with prometheus at scale
Adam Hamsik
 
Distributed Tracing
distributedtracing
 
RxJava@Android
Maxim Volgin
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Brian Brazil
 
Mini training - Reactive Extensions (Rx)
Betclic Everest Group Tech Team
 
Distributed tracing 101
Itiel Shwartz
 
Go Observability (in practice)
Eran Levy
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Sridhar Kumar N
 
Slack in the Age of Prometheus
George Luong
 
Monitoring using Prometheus and Grafana
Arvind Kumar G.S
 
Prometheus Overview
Brian Brazil
 
Ad

More from Bol.com Techlab (20)

PDF
Test long and prosper
Bol.com Techlab
 
PDF
The Reactive Rollercoaster
Bol.com Techlab
 
PDF
Best painkiller for Java headache
Bol.com Techlab
 
PDF
Organizing a conference in 80 days
Bol.com Techlab
 
PDF
Three steps to untangle data traffic jams
Bol.com Techlab
 
PDF
Understanding Operating Systems by breaking them
Bol.com Techlab
 
PDF
How to train your dragon
Bol.com Techlab
 
PDF
The hitchhiker’s guide to Prometheus
Bol.com Techlab
 
PDF
Software for drafting a cold beer
Bol.com Techlab
 
PDF
Going to the cloud: Forget EVERYTHING you know!
Bol.com Techlab
 
PDF
How to create your presentation in an iterative way
Bol.com Techlab
 
PDF
Wax on, wax off
Bol.com Techlab
 
PDF
Jupyter and Pandas to the rescue!
Bol.com Techlab
 
PDF
How the best of Design and Development come together
Bol.com Techlab
 
PDF
The addition to your team you never knew you needed
Bol.com Techlab
 
PDF
Gravitational waves: A new era in astronomy
Bol.com Techlab
 
PDF
Consumer Driven Contract Testing
Bol.com Techlab
 
PDF
I want to go fast! - Exposing performance bottlenecks
Bol.com Techlab
 
PDF
Kubernetes: love at first sight?
Bol.com Techlab
 
PDF
Blockchain: the magical database in the cloud?
Bol.com Techlab
 
Test long and prosper
Bol.com Techlab
 
The Reactive Rollercoaster
Bol.com Techlab
 
Best painkiller for Java headache
Bol.com Techlab
 
Organizing a conference in 80 days
Bol.com Techlab
 
Three steps to untangle data traffic jams
Bol.com Techlab
 
Understanding Operating Systems by breaking them
Bol.com Techlab
 
How to train your dragon
Bol.com Techlab
 
The hitchhiker’s guide to Prometheus
Bol.com Techlab
 
Software for drafting a cold beer
Bol.com Techlab
 
Going to the cloud: Forget EVERYTHING you know!
Bol.com Techlab
 
How to create your presentation in an iterative way
Bol.com Techlab
 
Wax on, wax off
Bol.com Techlab
 
Jupyter and Pandas to the rescue!
Bol.com Techlab
 
How the best of Design and Development come together
Bol.com Techlab
 
The addition to your team you never knew you needed
Bol.com Techlab
 
Gravitational waves: A new era in astronomy
Bol.com Techlab
 
Consumer Driven Contract Testing
Bol.com Techlab
 
I want to go fast! - Exposing performance bottlenecks
Bol.com Techlab
 
Kubernetes: love at first sight?
Bol.com Techlab
 
Blockchain: the magical database in the cloud?
Bol.com Techlab
 

Recently uploaded (20)

PDF
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
PDF
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
PDF
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
PDF
AI-Cloud-Business-Management-Platforms-The-Key-to-Efficiency-Growth.pdf
Artjoker Software Development Company
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PPTX
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
PDF
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
PDF
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
PPTX
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
sujalchauhan1305
 
PDF
The Future of Artificial Intelligence (AI)
Mukul
 
PDF
Advances in Ultra High Voltage (UHV) Transmission and Distribution Systems.pdf
Nabajyoti Banik
 
PDF
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
PPTX
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
PDF
How-Cloud-Computing-Impacts-Businesses-in-2025-and-Beyond.pdf
Artjoker Software Development Company
 
PDF
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
PDF
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
PDF
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 
PDF
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
PPTX
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 
Economic Impact of Data Centres to the Malaysian Economy
flintglobalapac
 
Structs to JSON: How Go Powers REST APIs
Emily Achieng
 
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
AI-Cloud-Business-Management-Platforms-The-Key-to-Efficiency-Growth.pdf
Artjoker Software Development Company
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
cloud computing vai.pptx for the project
vaibhavdobariyal79
 
Accelerating Oracle Database 23ai Troubleshooting with Oracle AHF Fleet Insig...
Sandesh Rao
 
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
sujalchauhan1305
 
The Future of Artificial Intelligence (AI)
Mukul
 
Advances in Ultra High Voltage (UHV) Transmission and Distribution Systems.pdf
Nabajyoti Banik
 
Automating ArcGIS Content Discovery with FME: A Real World Use Case
Safe Software
 
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
How-Cloud-Computing-Impacts-Businesses-in-2025-and-Beyond.pdf
Artjoker Software Development Company
 
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
Anchore
 
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Principled Technologies
 
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
AI and Robotics for Human Well-being.pptx
JAYMIN SUTHAR
 

The hitchhiker’s guide to Prometheus

  • 1. The hitchhiker’s guide to Remco Overdijk 1 "A Metric, The Hitchhiker's Guide to Prometheus says, is about the most massively useful thing someone doing Monitoring can have. It has great practical value. You can wave your Metric in emergencies as a distress signal, and produce pretty Graphs at the same time."
  • 2. 1. The Landscape What are we running and why? 2. Core Concepts How does Prometheus work? 3. Demo Time! It’s a Tools in Action talk after all, right? 4. Tips & Tricks Getting the most out of your Prometheus Experience 5. Questions? I’m probably going to answer “42” to most of them.. So many things to tell, so little time.. 2 The Hitchhiker’s Guide to Prometheus
  • 3. • Started out in TES, doing Metrics, Monitoring & Logging. (Graphite, Statsd, Grafana, Nagios, Logstash, ElasticSearch, Kibana, etc. ) • Currently in DPI, doing CI/CD and bringing Gitlab/Spinnaker to the Cloud. That requires a lot of monitoring… • Member of the Cloud9 MML Circle, doing Prometheus • Core Contributor to the R2D2 module that manages Prometheus and Monitoring/Alerting resources within Cloud9 • Worked on implementing Prometheus and Grafana, while also using these stacks for monitoring production systems. • NightOwl for SRT Platform; I know how pagers work. Who are you, and why are you telling us this? 3 Introduction
  • 5. Data Center VS Cloud: VMs and Servers VS containers in Kubernetes
      - Data Center (VMs and Servers):
          - Monitoring: Nagios + Thruk + Lookingglass
          - Metrics: Graphite + Statsd
          - Alerting: SMS modems in physical servers
          - Visualization: Grafana
          - Logging: ElasticSearch + Kibana
      - Cloud (containers in Kubernetes):
          - Monitoring: Prometheus
          - Metrics: Prometheus (+ InfluxDB/Thanos)
          - Alerting: AlertManager, Iris, OnCall, Grafana
          - Visualization: Grafana
          - Logging: StackDriver, ElasticSearch + Kibana
  • 6. So, why Prometheus? Why didn’t you come up with something else?
      - Applications in Kubernetes are much more dynamic than we’re used to.
          - No static IP addresses.
          - No static number of servers (well, pods actually..).
          - Kubernetes can reschedule / relocate pods at will.
          - Prometheus uses Service Discovery to find targets.
      - Both Nagios and Graphite have scaling issues and are too rigid.
          - Prometheus is Pull instead of Push based and doesn’t require execution for every single check.
          - It combines Metrics & Monitoring into a single stack, but focuses on Monitoring.
      - Being based on BorgMon, it works out of the box with a lot of Kubernetes / Cloud native components and the services supporting them.
      - StackDriver is not a full-fledged alternative due to features, retention and cost.
  • 7. Is this the answer to everything then? Well.. No.
      - Out of the box, Prometheus also doesn’t scale endlessly without compromises (but Thanos will).
      - Scalability is solved through retention, manual sharding and vertical scaling, which all have clear drawbacks.
      - HA is solved through duplication (polling twice from independent instances with individual TSDBs).
      - Prometheus development is very focused, which shows in certain aspects.
  • 8. Infrastructure Overview: All the pods & services. (Diagram components: Kubernetes {DEV, STG, PRO} clusters, datacenters, Prometheus (two instances), AlertManager (three instances), Grafana, PushGateway, IRIS, OnCall, SMS / Call Provider, HipChat, Operator, Remote Storage Adapter, InfluxDB, YOUR App!, Kubernetes Exporters.)
  • 9. Core Concepts: How does it work and what makes it tick?
  • 10. Making Metrics: Supported Types
      - Counters: A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. (1, 2, 5, 9, 0, 2, 7)
      - Gauges: A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. (1, 4, 2, 5, 8)
      - Histograms: A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
      - Summaries: Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
      - Quantiles are convenient for expressing, for example, the median (0.5-quantile) and the 95th percentile (0.95-quantile).
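The histogram description on the slide above can be made concrete with a small sketch. This is not the Prometheus client library; the class name `Histogram`, its fields and the bucket boundaries are assumptions chosen for illustration. It only shows the idea of cumulative buckets plus a running sum and count.

```python
import math


class Histogram:
    """Toy histogram with cumulative buckets, a running sum and a count."""

    def __init__(self, upper_bounds):
        # The +Inf bucket always exists and catches every observation.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Buckets are cumulative: an observation increments every bucket
        # whose upper bound is greater than or equal to the value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.total += value
        self.count += 1


# Hypothetical request durations in seconds.
hist = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.7, 2.0):
    hist.observe(seconds)
```

Because the buckets are cumulative, the last (+Inf) bucket always equals the total number of observations, which is what lets a query engine estimate quantiles from the bucket counts afterwards.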
  • 11. Exposing Metrics: Push VS Pull. The concept of Scraping HTTP Metric Endpoints
      - Instead of creating separate checks for every metric that should be monitored for your application, you expose a single (or multiple..) HTTP Endpoint containing all metrics.
      - It’s your responsibility to make this endpoint Available, Fast and Reliable.
      - Multiple Frameworks and Libraries can help you provision and maintain such an endpoint.
      - Axle comes with built-in support for Micrometer, which does everything for you.
      - Backspin support is coming soon™.
      - Example: http://localhost:30000/metrics
          # HELP prometheus_tsdb_head_min_time Minimum time bound of the head block.
          # TYPE prometheus_tsdb_head_min_time gauge
          prometheus_tsdb_head_min_time 1.5282792e+12
          # HELP prometheus_tsdb_head_samples_appended_total Total number of appended samples.
          # TYPE prometheus_tsdb_head_samples_appended_total counter
          prometheus_tsdb_head_samples_appended_total 2.9485092e+07
          # HELP prometheus_tsdb_head_series Total number of series in the head block.
          # TYPE prometheus_tsdb_head_series gauge
          prometheus_tsdb_head_series 19956
          # HELP prometheus_tsdb_head_series_created_total Total number of series created in the head
          # TYPE prometheus_tsdb_head_series_created_total gauge
          prometheus_tsdb_head_series_created_total 56888
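To illustrate what such a metrics page consists of, here is a minimal sketch that renders a single sample in the same text format as the example output on the slide. The function name `render_metric` and the metric `demo_requests_total` are made up for this example, and label values are not escaped, which a real client library would take care of.

```python
def render_metric(name, metric_type, help_text, value, labels=None):
    """Render one sample in the Prometheus text exposition format."""
    label_text = ""
    if labels:
        # Labels are written as key="value" pairs inside curly braces.
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        label_text = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name}{label_text} {value}\n"
    )


# Hypothetical counter exposed by an application.
page = render_metric(
    "demo_requests_total",
    "counter",
    "Total number of demo requests.",
    3,
    {"method": "GET"},
)
```

In practice a framework or client library generates this page and serves it over HTTP; the sketch only shows that the format is plain text with one HELP line, one TYPE line and one line per sample.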
  • 12. Querying Metrics: PromQL
      - An actual Query Language that looks a lot more like SQL than Graphite.
      - You’ll need to learn a new language, but it’s only a single language for creating Graphs and Alerts; for monitoring and long term metrics.
      - Allows for a lot of flexibility, but can be a bit harder to grasp when starting out.
      - Supports functions, operators, regex, arithmetic and expressions.
      - Four expression types are supported:
          - Instant Vectors (like http_requests_total{environment=~"staging|testing|development", method!="GET"}): Instant vector selectors allow the selection of a set of time series and a single sample value for each at a given timestamp (instant). In the simplest form, only a metric name is specified. This results in an instant vector containing elements for all time series that have this metric name.
          - Range Vectors (like http_requests_total{job="prometheus"}[5m]): Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element.
          - Scalars
          - Strings
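The instant vector selector from the slide can be mimicked in a few lines to show how the four label matchers (`=`, `!=`, `=~`, `!~`) narrow down a set of series. This is only a sketch: the `matches` helper and the sample series are invented for illustration, and Python's `re` module is used as a stand-in for the RE2 syntax Prometheus actually uses.

```python
import re


def matches(labels, matchers):
    """Return True if a series' labels satisfy every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # regexes are fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


# Hypothetical series; the metric name is stored in the __name__ label.
series = [
    {"__name__": "http_requests_total", "environment": "staging", "method": "POST"},
    {"__name__": "http_requests_total", "environment": "staging", "method": "GET"},
    {"__name__": "http_requests_total", "environment": "production", "method": "POST"},
]

# http_requests_total{environment=~"staging|testing|development", method!="GET"}
selector = [
    ("__name__", "=", "http_requests_total"),
    ("environment", "=~", "staging|testing|development"),
    ("method", "!=", "GET"),
]
selected = [s for s in series if matches(s, selector)]
```

Only the staging POST series survives: the second series fails the `method!="GET"` matcher and the third fails the environment regex.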
  • 13. Getting your endpoint monitored: ServiceMonitors
      - Custom Resource Type provided by Prometheus-operator.
      - Abstraction of Prometheus “job” and Service Discovery.
      - Allows for easy ingestion of new endpoints through their k8s service.
      - (Diagram: YOUR App!, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus.)
      - Example:
          apiVersion: monitoring.coreos.com/v1
          kind: ServiceMonitor
          spec:
            endpoints:
            - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
              interval: 30s
              port: https
              scheme: https
              tlsConfig:
                insecureSkipVerify: true
            jobLabel: k8s-app
            selector:
              matchLabels:
                k8s-app: node-exporter
          ---
          apiVersion: v1
          kind: Service
          metadata:
            labels:
              k8s-app: node-exporter
            name: node-exporter
          spec:
            ports:
            - name: https
              port: 9100
              protocol: TCP
              targetPort: https
            selector:
              app: node-exporter
            type: ClusterIP
  • 14. Creating Insights: Dashboarding with Grafana
      - The same tool you were probably already using.
      - The central interface for cloud insights.
      - Contains a specialized query editor for Prometheus data sources.
      - Prometheus currently doesn’t store metrics older than one month for performance reasons.
      - Multiple solutions for long term metrics exist, but it’s a work in progress.
      - (Diagram: Prometheus, Grafana, HipChat, Remote Storage Adapter, InfluxDB.)
  • 15. Trouble in Paradise: Creating Alerts, choosing your weapon
      - WARNINGS: Notifications during work hours
          - No direct intervention is required.
          - Usually picked up by members of the team developing / maintaining a system.
          - Alert delivery is NOT guaranteed.
          - Use Grafana with HipChat or Email alerts.
      - CRITICALS: 24x7 Text Messages with Escalation
          - Actionable events that require immediate attention by an Engineer on Duty, who does not necessarily have intimate knowledge of your system.
          - Response is required to silence/end the alert.
          - Provisioned through RuleList (R2D2 / Operator).
          - Use AlertManager / Iris / OnCall.
  • 16. Alert Basics: Yes, it’s PromQL as well!
          %YAML 1.1
          ---
          kind: PrometheusAlertRule
          data:
            test.rules: |
              groups:
              - name: Load
                interval: 30s
                rules:
                - alert: HighLoad
                  expr: rate(web_http_responses_total[1m]) > 1
                  for: 1m
                  labels:
                    severity: attention
                  annotations:
                    description: The rate of HTTP requests is too high.
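The `for: 1m` clause in the rule above means the expression has to stay true for a full minute before the alert fires; until then it is only pending. A rough sketch of that state machine, assuming one evaluation every 30 seconds as in the rule's `interval`, could look like this. The function name `alert_states` is invented for this example and is not part of Prometheus.

```python
def alert_states(expr_results, interval_s, for_s):
    """Map successive rule evaluations (True/False) to alert states."""
    states = []
    active_since = None  # time at which the expression first became true
    for step, is_true in enumerate(expr_results):
        now = step * interval_s
        if not is_true:
            # Any false evaluation resets the alert completely.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires once it has been active for at least the hold time.
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Six evaluations, 30 seconds apart, with a one minute "for" duration.
states = alert_states(
    [False, True, True, True, False, True], interval_s=30, for_s=60
)
```

The single false evaluation in the middle drops the alert back to inactive, so short spikes that recover within the `for` window never page anyone.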
  • 17. Alert Perfection: Creating the perfect alert
      - Alerts should be actionable: Somebody has to do something, now.
      - They should be simple: Someone without intimate knowledge of the system should ideally be able to solve the alert.
      - They should be urgent and require human intervention: No point in waking someone up if they shouldn’t have to do something, or when tomorrow afternoon would be soon enough.
      - Provide accurate descriptions and a playbook where possible.
      - Basic system monitoring should be based on SLIs/SLOs rather than infra metrics.
      - Prefer AM/Iris/OnCall if you’re serious about your alert.
      - (Diagram: Prometheus, AlertManager (three instances), Grafana, IRIS, OnCall, SMS / Call Provider, HipChat.)
  • 18. Exporters - Additional sources of metrics: When artisanal endpoints don’t cut the cake
      - A long list of exporters is available at https://prometheus.io/docs/instrumenting/exporters/
      - A number of these come preconfigured with our Kubernetes clusters and provide additional metrics.
      - Databases: Aerospike exporter, ClickHouse exporter, Consul exporter (official), CouchDB exporter, ElasticSearch exporter, Memcached exporter (official), MongoDB exporter, MSSQL server exporter, MySQL server exporter (official), OpenTSDB Exporter, Oracle DB Exporter, PgBouncer exporter, PostgreSQL exporter, ProxySQL exporter, RavenDB exporter, Redis exporter, RethinkDB exporter, SQL exporter, Tarantool metric library
      - Hardware related: apcupsd exporter, Collins exporter, IoT Edison exporter, IPMI exporter, knxd exporter, Node/system metrics exporter (official), Ubiquiti UniFi exporter
      - Messaging systems: Beanstalkd exporter, Gearman exporter, Kafka exporter, NATS exporter, NSQ exporter, Mirth Connect exporter, MQTT blackbox exporter, RabbitMQ exporter, RabbitMQ Management Plugin exporter
      - Storage: Ceph exporter, Ceph RADOSGW exporter, Gluster exporter, Hadoop HDFS FSImage exporter, Lustre exporter, ScaleIO exporter
      - HTTP: Apache exporter, HAProxy exporter (official), Nginx metric library, Nginx VTS exporter, Passenger exporter, Tinyproxy exporter, Varnish exporter, WebDriver exporter
      - APIs: AWS ECS exporter, AWS Health exporter, AWS SQS exporter, Cloudflare exporter, DigitalOcean exporter, Docker Cloud exporter, Docker Hub exporter, GitHub exporter, InstaClustr exporter, Mozilla Observatory exporter, OpenWeatherMap exporter, Pagespeed exporter, Rancher exporter, Speedtest exporter
      - Logging: Fluentd exporter, Google's mtail log data extractor, Grok exporter
      - Other monitoring systems: Akamai Cloudmonitor exporter, AWS CloudWatch exporter (official), Cloud Foundry Firehose exporter, Collectd exporter (official), Google Stackdriver exporter, Graphite exporter (official), Heka dashboard exporter, Heka exporter, InfluxDB exporter (official), JavaMelody exporter, JMX exporter (official), Munin exporter, Nagios / Naemon exporter, New Relic exporter, NRPE exporter, Osquery exporter, Pingdom exporter, scollector exporter, Sensu exporter, SNMP exporter (official), StatsD exporter (official)
      - Miscellaneous: Bamboo exporter, BIG-IP exporter, BIND exporter, Bitbucket exporter, Blackbox exporter (official), BOSH exporter, cAdvisor, Confluence exporter, Dovecot exporter, eBPF exporter, Jenkins exporter, JIRA exporter, Kannel exporter, Kemp LoadBalancer exporter, Meteor JS web framework exporter, Minecraft exporter module, PHP-FPM exporter, PowerDNS exporter, Process exporter, rTorrent exporter, SABnzbd exporter, Script exporter, Shield exporter, SMTP/Maildir MDA blackbox prober, SoftEther exporter, Transmission exporter, Unbound exporter, Xen exporter
  • 19. Exporters - Highlights: The most commonly used
      - StackDriver Exporter: Get your GCP Project’s native metrics into Prometheus.
      - Blackbox Exporter: Monitor Golden Signals on any system, without knowledge about the inner workings.
      - Nginx exporter: Used in Ingresses.
      - SNMP Exporter: Bring your own MIBs.
      - Statsd Exporter: Push your statsd metrics to a sidecar container.
      - Node Exporter: Provides system metrics for VMs and physical systems (like Kubernetes nodes).
      - cAdvisor: Get generic container metrics.
      - Etcd
      - Kubernetes
      - Minio (Gitlab Runner Caching)
      - (Diagram: Exporter, K8s Service, ServiceMonitor, Prometheus Operator, Prometheus.)
  • 20. Metrics for ephemeral jobs: The Push Gateway
      - For situations where you are unable to serve an HTTP metrics page for a reliable period of time.
      - Ideal for short running tasks like Kubernetes CronJobs, Hadoop Jobs, Scripts, etc.
      - Allows you to Push (through an HTTP call) Metrics to a buffering service, which in turn exposes them to Prometheus.
      - Metrics will live forever on the Gateway, so be careful of what you push and how you name them.
      - Avoid this route if possible, since it scales very badly and is NOT redundant. Bring your own endpoint if and when possible.
      - PRO-Tip: If you have an ephemeral job, also push the timestamp of last successful job completion.
      - (Diagram: YOUR App!, Push Gateway, Prometheus.)
      - Example push:
          echo "ultimate_answer 42.0" | curl --data-binary @- http://gateway:9091/metrics/job/magrathea/instance/zaphod-001/group/vogon/opex/DPI
        Resulting metric:
          ultimate_answer{group="vogon",instance="zaphod-001",job="magrathea",opex="DPI"} 42.0
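The curl call on the slide can also be assembled from a script. The sketch below builds the same request with the Python standard library without sending it; `build_push_request` and the metric name `job_last_success_timestamp_seconds` are hypothetical names chosen to illustrate the PRO-Tip, and grouping values containing slashes would need extra handling that is left out here.

```python
from urllib.parse import quote
from urllib.request import Request


def build_push_request(gateway, job, grouping, samples):
    """Build (but do not send) an HTTP request that pushes samples to a Pushgateway."""
    # The job and every grouping label become part of the URL path.
    path = "/metrics/job/" + quote(job, safe="")
    for label, value in grouping.items():
        path += "/" + quote(label, safe="") + "/" + quote(value, safe="")
    # One "name value" sample per line; the body has to end with a newline.
    body = "".join(f"{name} {value}\n" for name, value in samples.items())
    return Request(gateway + path, data=body.encode("utf-8"), method="POST")


request = build_push_request(
    "http://gateway:9091",
    job="magrathea",
    grouping={"instance": "zaphod-001", "group": "vogon", "opex": "DPI"},
    samples={
        "ultimate_answer": 42.0,
        # Hypothetical metric recording when the job last succeeded.
        "job_last_success_timestamp_seconds": 1528279200,
    },
)
```

Passing the result to `urllib.request.urlopen(request)` would perform the actual push. Alerting on the age of the last-success timestamp is what makes a silently failing batch job visible.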
  • 22. Today’s Quick Demo: Getting started in 5 minutes
      - Kubernetes running on Docker for macOS.
      - Out of the box Prometheus on Kubernetes from https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus
      - Services are running without an Ingress, so we’re accessing them directly, using NodePorts.
      - We’re going to add our own Full Featured Axle Service by creating a Deployment and a Service to match it, adding a ServiceMonitor, watching Service Discovery do its thing, graphing one of the metrics and creating an alert for it.
      - Prometheus: http://localhost:30000/graph
      - AlertManager: http://localhost:31000/#/alerts
      - Grafana: http://localhost:32000/d/9dP_FHImz/pods
  • 23. Tips & Tricks: Getting the most out of your Prometheus Experience
  • 24. Metric Naming: Keep things running smoothly by not making a mess.
      - Metrics in Prometheus are multi-dimensional; they consist of names and labels.
      - Names are generic identifiers to tell WHAT you are measuring, in what format.
      - Metric Names SHOULD have a single (base!) unit, added as a suffix describing that unit (bytes, seconds, meters).
      - Labels describe characteristics, are usually used to identify WHERE those metrics are coming from, and can be multi-faceted.
      - Prometheus saves a separate Time Series for each name/labels combination, so you have to ensure label cardinality does not get too high, or you will kill Prometheus in the end. (Bad examples: usernames, internet IP addresses, hashes.)
      - Read https://prometheus.io/docs/practices/naming/ before you start making your own!
      - Examples:
          api_http_requests_total { type="create|update|delete", method="GET|POST|DELETE" }
          api_request_duration_seconds { stage="extract|transform|load" }
          api_errors_total { endpoint="listProducts|updatePricing", code="500|404|418 I'm a teapot" }
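The cardinality warning on the slide is easy to quantify: in the worst case, the number of time series for one metric name is the product of the number of distinct values per label. A small sketch follows; the helper `series_count` and the 10,000 usernames are assumptions for illustration only.

```python
from math import prod


def series_count(label_values):
    """Worst-case number of time series for one metric name."""
    return prod(len(values) for values in label_values.values())


# Bounded label values, as in the api_http_requests_total example.
bounded = series_count({
    "type": ["create", "update", "delete"],
    "method": ["GET", "POST", "DELETE"],
})

# The same metric with a hypothetical unbounded "username" label added.
with_usernames = series_count({
    "type": ["create", "update", "delete"],
    "method": ["GET", "POST", "DELETE"],
    "username": [f"user-{i}" for i in range(10_000)],
})
```

Nine series become ninety thousand by adding a single label, which is why usernames, IP addresses and hashes make poor label values.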
  • 25. The Golden Signals: What should you be monitoring?
      - An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service that is provided.
      - An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus [SLI ≤ target], or [lower bound ≤ SLI ≤ upper bound].
      - Symptoms vs Causes: Monitor things that users will notice when using your system.
      - Latency: The time it takes to service a request.
      - Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second.
      - Errors: The rate of requests that fail (like HTTP 500s).
      - Saturation: How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained.
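The SLO structures from the slide, [SLI ≤ target] and [lower bound ≤ SLI ≤ upper bound], translate directly into a check. In this sketch the SLI is an error ratio; the function names and the request numbers are invented for illustration.

```python
def error_ratio(total_requests, failed_requests):
    """SLI: the fraction of requests that failed."""
    if total_requests == 0:
        return 0.0  # no traffic means no observed failures
    return failed_requests / total_requests


def slo_met(sli, lower=None, upper=None):
    """Check lower <= SLI <= upper, where either bound may be omitted."""
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True


# Hypothetical numbers: 15 failures out of 20,000 requests.
sli = error_ratio(total_requests=20_000, failed_requests=15)

# SLO: at most 0.1% of requests may fail.
within_objective = slo_met(sli, upper=0.001)
```

The same `slo_met` check works for a latency SLI with an upper bound or for a throughput SLI with both bounds.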
  • 26. Bringing it all together: Combining Metric Sources for an unbiased view
      - BlackBox Exporter for periodic requests and their Metrics (Success, Latency, Errors).
      - Nginx Ingress Metrics for a man-in-the-middle view of your application (Flow, Latency, Errors).
      - Your own application’s Metrics for insights, details and an under-the-hood view.
      - (Diagram: Prometheus scrapes Poll Metrics from the Blackbox Exporter, Ingress Metrics from the Ingress and App Metrics from Your App.)
      - Example scrape config:
          - job_name: 'blackbox'
            metrics_path: /probe
            params:
              module: [http_2xx]  # Look for a HTTP 200 response.
            static_configs:
              - targets:
                - http://myapp.behindingress.io  # Target to probe with http
  • 27. So what about non-Cloud resources? My stuff runs in the DC and I want to keep it there.
      - Introducing the GenericServiceMonitor and DCServiceMonitor.
      - These types allow you to define endpoints outside of Kubernetes, and allow you to monitor on-premise services.
      - DCServiceMonitor works based on bol_applications and as such is bol.com specific:
          kind: Prometheus/DCServiceMonitor
          name: tst-sdd-app
          spec:
            port: 8080
            path: /internal/metrics
      - GenericServiceMonitor works on static endpoints:
          kind: Prometheus/GenericServiceMonitor
          name: dev-atscale-app
          spec:
            hosts:
            - ip: 1.2.3.4
              hostname: some.host.name
            port: 8080
            path: /internal/metrics
            opex: srt-bificsps
  • 28. Other tips: Things you’ll encounter once you start making queries
      - Always initialize your metrics at zero when possible, or you won’t know the significance of the first value.
      - How do you know if your application is OK when the metrics stopped working? The up metric might also disappear when Service Discovery no longer detects your service. Always use absent() to check for existence of up!
      - Apply (i)rate()/increase() first and sum() afterwards, never sum() followed by (i)rate()/increase(): the rate functions can only detect counter resets on individual series, and summing first hides those resets.
      - The rate function takes a time series over a time range and calculates the per-second average rate of increase, based on the first and last data points within that range (http://localhost:32000/d/h3RZO2Iik/rate-vs-irate?orgId=1).
      - By contrast irate is an instant rate. It only looks at the last two points within the range passed to it and calculates a per-second rate.
      - To complement the saturation signal, Prometheus has predict_linear() for Gauges.
      - All the metrics? http://localhost:30000/federate?match[]={__name__%3D~%22[a-z].*%22}
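Why the order of rate/increase and sum matters can be shown with two counters, one of which restarts. The sketch below uses a simplified `increase` that treats every drop as a reset to zero; the real PromQL functions also extrapolate to the edges of the time range, which is left out here, and the sample values are made up.

```python
def increase(samples):
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # After a reset the counter restarted from zero, so the whole
            # current value counts as new increase.
            total += current
    return total


pod_a = [100, 110, 120, 130]
pod_b = [200, 210, 5, 15]  # restarted between the second and third scrape

# Correct order: handle resets per series, then aggregate.
increase_then_sum = increase(pod_a) + increase(pod_b)

# Wrong order: the summed series drops from 320 to 125, which looks like
# a reset of the whole sum and inflates the result.
summed_first = [a + b for a, b in zip(pod_a, pod_b)]
sum_then_increase = increase(summed_first)
```

The per-series calculation yields 55, while summing first yields 165: the restart of one pod is misread as a reset of the combined counter, so the entire remaining total is counted again.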
  • 29. Questions? Don’t bother to ask me the Ultimate Question of Life, the Universe and Everything, because you already know the answer. (and yes, I know where my towel is.)