Infrastructure & System Monitoring using Prometheus

Marco Pas
Philips Lighting
Software geek, hands on
Developer/Architect/DevOps Engineer
@marcopas

Some stuff about me...
● Mostly doing cloud related stuff
○ Java, Groovy, Scala, Spring Boot, IOT, AWS, Terraform, Infrastructure
● Enjoying the good things
● Chief of doing fun things (“Chef leuke dingen doen”) == “trying out cool and new stuff”
● Currently involved in a big IOT project
● Wannabe chef, movie & Netflix addict

Agenda
● Monitoring
○ Introducing you to a Scary Movie
● Prometheus overview (demos)
○ Running Prometheus
○ Gathering host metrics
○ Introducing Grafana
○ Monitoring Docker containers
○ Alerting
○ Instrumenting your own code
○ Service Discovery (Consul) integration

I am going to introduce you to some bad movies

What do these movies have in common?

Our scary movie “The Happy Developer”
● Let’s push out features
● I can demo it, so it works :)
● It works with 1 user, so it will work with multiple
● Don’t worry about performance, we will just scale using multiple machines/processes
● Logging is in place

Disaster Strikes
Did anyone notice?

Logging != Monitoring
Logging: “recording to diagnose a system”
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Monitoring: “observation, checking and recording”
http_requests_total{method="post",code="200"} 1027 1395066363000

Why Monitoring?
● Know when things go wrong
○ Detection & Alerting
● Be able to debug and gain insight
● Detect changes over time and drive technical/business decisions
● Feed into other systems/processes (e.g. security, automation)

What to monitor?
● IT Network
● Operating System
● Services
● Applications
● Capture monitoring information (functional monitoring and operational monitoring) as metric data

Houston, we have a storage problem!
● Many sources send metric data to storage
● How do we store the massive amount of metrics and still make them easy to query?

Time Series - Database
● Time series data is a sequence of data points collected at regular intervals over a period of time (metrics)
○ Examples:
■ Device data
■ Weather data
■ Stock prices
■ Tide measurements
■ Solar flare tracking
● The data requires aggregation and analysis
A time series database for metric data provides:
● High write performance
● Data compaction
● Fast, easy range queries
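To make "fast, easy range queries" concrete, here is a minimal in-memory sketch. It is not how Prometheus or any real time series database stores data (those use compressed, on-disk structures); the class `TimeSeries` and its methods `append` and `range_query` are hypothetical names chosen for illustration. It only shows why keeping samples sorted by timestamp makes a time-window lookup cheap.

```python
from bisect import bisect_left, bisect_right


class TimeSeries:
    """Append-only series of (timestamp_ms, value) samples, kept in time order."""

    def __init__(self):
        self.timestamps = []
        self.values = []

    def append(self, timestamp_ms, value):
        # Samples must arrive in time order so the timestamp list stays sorted.
        if self.timestamps and timestamp_ms < self.timestamps[-1]:
            raise ValueError("out-of-order sample")
        self.timestamps.append(timestamp_ms)
        self.values.append(value)

    def range_query(self, start_ms, end_ms):
        # Binary search locates both ends of the window in O(log n).
        lo = bisect_left(self.timestamps, start_ms)
        hi = bisect_right(self.timestamps, end_ms)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))


# Illustrative data: one sample every 15 seconds.
series = TimeSeries()
base_ms = 1395066363000
for i in range(10):
    series.append(base_ms + i * 15000, 1027 + i)

# All samples between 30 s and 60 s after the first one (inclusive).
window = series.range_query(base_ms + 30000, base_ms + 60000)
print(window)
```

The append-only, time-ordered layout is also what makes high write performance plausible: new samples only ever go at the end.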

Time Series - Data format
A metric name and a set of key-value pairs, also known as labels:
<metric name>{<label name>=<label value>, ...} value [ timestamp ]
http_requests_total{method="post",code="200"} 1027 1395066363000
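As a rough illustration of the data format, the sketch below splits one sample line into metric name, labels, value and optional timestamp. It is a simplified parser written for this example, not the parser Prometheus uses: the function name `parse_sample` is hypothetical, and label values containing escaped quotes or a `}` are deliberately not handled.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
    r'(?:\s+(?P<timestamp>-?\d+))?\s*$'      # optional timestamp in milliseconds
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Parse one sample line into a dict (simplified, see assumptions above)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid sample line: {line!r}")
    timestamp = match.group("timestamp")
    return {
        "name": match.group("name"),
        "labels": dict(LABEL_RE.findall(match.group("labels") or "")),
        "value": float(match.group("value")),
        "timestamp_ms": int(timestamp) if timestamp is not None else None,
    }


sample = parse_sample('http_requests_total{method="post",code="200"} 1027 1395066363000')
print(sample["name"], sample["labels"], sample["value"], sample["timestamp_ms"])
```

Each distinct combination of metric name and label values identifies its own time series, which is why `method` and `code` are kept as a dictionary rather than folded into the name.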

Source: http://db-engines.com/en/ranking/time+series+dbms

Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit originally
built at SoundCloud. It is now a standalone open source project and maintained
independently of any company.
https://prometheus.io
Implemented using Go

Prometheus Components
● The main Prometheus server which scrapes and stores time series data
● Client libraries for instrumenting application code
● A push gateway for supporting short-lived jobs
● Special-purpose exporters (for HAProxy, StatsD, Graphite, etc.)
● An alertmanager
● Various support tools
● WhiteBox Monitoring instead of probing [aka BlackBox Monitoring]

List of Job Exporters
● Prometheus managed:
○ JMX
○ Node
○ Graphite
○ Blackbox
○ SNMP
○ HAProxy
○ Consul
○ Memcached
○ AWS Cloudwatch
○ InfluxDB
○ StatsD
○ ...
● Custom ones:
○ Database
○ Hardware related
○ Messaging systems
○ Storage
○ HTTP
○ APIs
○ Logging
○ …
https://prometheus.io/docs/instrumenting/exporters/

# file: prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  # some settings intentionally removed!!
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Code Demo
“Running Prometheus Native”

Demo: Run Prometheus using Docker

# file: docker-compose.yml
version: '2'
services:
  prometheus:
    image: prom/prometheus:latest   # Using the official Prometheus container
    volumes:
      - $PWD:/etc/prometheus        # Mount local directory used for config + data
    ports:
      - "9090:9090"                 # Port mapping used for this container, host:container
    command:
      # Prometheus 1.x flag syntax; Prometheus 2.x and later expect --config.file
      - "-config.file=/etc/prometheus/prometheus.yml"   # Prometheus configuration

Code Demo
“Running Prometheus Dockerized”

version: '2'
services:
  prometheus:                       # Running Prometheus as a Docker container
    image: prom/prometheus:latest   # Using the official Prometheus container
    volumes:
      - $PWD:/etc/prometheus        # Mount local directory used for config + data
    ports:
      - "9090:9090"                 # Same port mapping as in the previous example
    command:
      - "-config.file=/etc/prometheus/prometheus.yml"   # Prometheus configuration
  node-exporter:
    image: prom/node-exporter:latest   # Using the node exporter as an additional container
    ports:
      - '9100:9100'                 # Port mapping used for this container, host:container

global:
# A scrape configuration containing two endpoints to scrape:
# Prometheus itself and the node exporter.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

Code Demo
“Add host metrics”

version: '2'
services:
  # some code intentionally removed!!
  grafana:
    image: grafana/grafana:latest   # Using the official Grafana container
    ports:
      # You get the idea :)

Demo: Monitor Docker containers

Alerting Configuration
● Alert Rules
○ Which conditions do we need to alert on?
● Alert Manager
○ Where do we need to send the alert?

# file: alert.rules
ALERT serviceDownAlert
  IF absent(((time() - container_last_seen{name="<service_name>"}) < 5))
  FOR 5s
  # Setting the labels so we can use them in the AlertManager
  LABELS {
    severity = "critical",
    service = "backend"
  }
  # Information used in the alert event
  ANNOTATIONS {
    SUMMARY = "Container Instance down",
    DESCRIPTION = "Container Instance is down for more than 15 sec."
  }
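The `FOR` clause in the rule above means the condition has to stay true for that long before the alert actually fires; until then it is only pending. The sketch below imitates that state machine in plain Python. It is an approximation for illustration, not Prometheus code: the function name `alert_state`, the boolean history input and the 5-second evaluation interval are all assumptions made for this example.

```python
def alert_state(condition_history, for_seconds, interval_seconds):
    """Return 'inactive', 'pending' or 'firing' after the last evaluation.

    condition_history: one boolean per rule evaluation, oldest first.
    for_seconds: the FOR duration of the rule.
    interval_seconds: assumed time between two rule evaluations.
    """
    active_for = None  # seconds the condition has held without interruption
    for condition_true in condition_history:
        if not condition_true:
            active_for = None              # condition cleared: start over
        elif active_for is None:
            active_for = 0                 # first evaluation where it is true
        else:
            active_for += interval_seconds
    if active_for is None:
        return "inactive"
    return "firing" if active_for >= for_seconds else "pending"


# FOR 5s, evaluated every 5 seconds (hypothetical interval).
print(alert_state([True], 5, 5))               # condition just became true
print(alert_state([True, True], 5, 5))         # still true one interval later
print(alert_state([True, False, True], 5, 5))  # a gap resets the timer
```

Only alerts that reach the firing state are handed to the Alert Manager, so a short `FOR` trades fewer missed outages against more noise from brief blips.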

# file: alert-manager.yml
global:                               # Global settings
  smtp_smarthost: 'mailslurper:2500'
  smtp_from: 'alertmanager@example.org'
  smtp_require_tls: false
route:                                # Routing
  receiver: mail                      # Fallback if there is no match
  routes:
    - match:
        severity: critical            # Match on label!
      continue: true                  # Continue with other receivers if there is a match
      receiver: mail                  # Determine the receiver
    - match:
        severity: critical
      receiver: slack

# file: alert-manager.yml (continued)
receivers:
  - name: mail                        # mail receiver
    email_configs:
      - to: 'team-X+alerts@example.org'
  - name: slack                       # slack receiver
    slack_configs:
      - send_resolved: true
        username: 'AlertManager'
        channel: '#alert'
        api_url: 'THIS IS A VERY SECRET URL :)'
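The effect of `continue: true` in the routing section is easiest to see by walking through it. The sketch below is a flattened approximation of the routing above, not the Alert Manager implementation (which supports nested route trees, regex matchers and grouping); the function name `route_alert` and the dictionary layout are assumptions made for this example.

```python
def route_alert(labels, routes, fallback_receiver):
    """Return the receivers an alert with the given labels is sent to."""
    receivers = []
    for route in routes:
        # A route matches when every label in its `match` block is equal.
        if all(labels.get(key) == value for key, value in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless the route says `continue: true`
    # No route matched: use the receiver of the top-level route.
    return receivers or [fallback_receiver]


# Mirrors the two routes from alert-manager.yml.
routes = [
    {"match": {"severity": "critical"}, "receiver": "mail", "continue": True},
    {"match": {"severity": "critical"}, "receiver": "slack"},
]

critical = route_alert({"severity": "critical", "service": "backend"}, routes, "mail")
warning = route_alert({"severity": "warning"}, routes, "mail")
print(critical, warning)
```

Without `continue: true` on the first route, a critical alert would stop at `mail` and never reach `slack`.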

global:
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert.rules"

Code Demo
“Alerting -> The Alert Manager”

Instrumenting your own code!
● Counter
○ A cumulative metric that represents a single numerical value that only ever goes up
● Gauge
○ Single numerical value that can arbitrarily go up and down
● Histogram
○ Samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values
● Summary
○ Similar to a histogram: it also provides a total count of observations and a sum of all observed values, but it calculates configurable quantiles over a sliding time window
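The difference between the metric types is mostly a difference in which operations are allowed. The sketch below re-implements counter, gauge and histogram from scratch in a few lines of plain Python to show that; it is not the Prometheus client library, and the class and method names are assumptions chosen to resemble it. The summary type is left out because its sliding-window quantiles need considerably more machinery.

```python
import bisect


class Counter:
    """Cumulative value that only ever goes up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class Gauge:
    """Single value that can go up and down arbitrarily."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations in configurable buckets and tracks their sum."""

    def __init__(self, buckets):
        self.upper_bounds = sorted(buckets) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative_buckets(self):
        # Buckets are reported cumulatively: each one includes all smaller ones.
        total, result = 0, []
        for bound, bucket_count in zip(self.upper_bounds, self.bucket_counts):
            total += bucket_count
            result.append((bound, total))
        return result


requests = Counter()
requests.inc()

in_flight = Gauge()
in_flight.inc()
in_flight.dec()

latency = Histogram(buckets=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
print(requests.value, in_flight.value, latency.cumulative_buckets())
```

A rule of thumb that follows from this: use a counter for things you count (requests, errors), a gauge for things you measure right now (queue length, memory in use), and a histogram when the distribution matters (request latency).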

Available Languages
● Official
○ Go, Java or Scala, Python, Ruby
● Unofficial
○ Bash, C++, Common Lisp, Elixir, Erlang, Haskell, Lua for Nginx, Lua for Tarantool, .NET / C#, Node.js, PHP, Rust
// Spring Boot example -> file: build.gradle
dependencies {
    compile('org.springframework.boot:spring-boot-starter-web')
    testCompile('org.springframework.boot:spring-boot-starter-test')
    compile('io.prometheus:simpleclient_spring_boot:0.0.21') // Add the Prometheus client dependency
}

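Prometheus Client Libraries: Spring Boot Example
import io.prometheus.client.Counter;
import io.prometheus.client.spring.boot.EnablePrometheusEndpoint;
import io.prometheus.client.spring.boot.EnableSpringBootMetricsCollector;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@EnablePrometheusEndpoint
@EnableSpringBootMetricsCollector
@RestController
@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }

    // Create a metric of type counter, set its name and help text, and register it
    static final Counter requests = Counter.build()
            .name("helloworld_requests_total")
            .help("HelloWorld Total requests.")
            .register();

    @RequestMapping("/helloworld")
    String home() {
        requests.inc(); // increment the counter by 1 (helloworld_requests_total)
        return "Hello World!";
    }
}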

Code Demo
“Application metrics”

Service Discovery
(Consul) Integration

Demo: Consul integration
Register the services with Consul and monitor

Code Demo
“Consul to the rescue”

That’s a wrap!
Questions?
https://github.com/mpas/infrastructure-and-system-monitoring-using-prometheus
