Dip into Prometheus

hello!
I am Zaar Hai
Staff Cloud Architect at DoiT International
linkedin.com/in/zaar

Cloud Out of the box
✘ VM CPU/Network/Disk stats…
✘ But not memory
➢ requires vendor-specific agents

✘ Gets much better with GKE
➢ Memory usage for pods IF your YAMLs behave
✘ All of the above are metrics… But what about our app?

App Metrics
It’s all about app metrics

When logs are not enough
✘ Too detailed to see the big picture
✘ Hard to see KPIs / trends

Logs to metrics
✘ Log-based metrics in StackDriver
➢ Fragile configuration
➢ Disjoint from the app
➢ Vendor specific
✘ Smart log parsers, e.g. Coralogix
✘ An option when retrofitting monitoring onto an existing app

Metrics are just a tuple
✘ Timestamp
✘ Metric name
✘ Numeric value

Metric examples
Process A:
2020-07-28T02:32:06Z http_requests_total 1239
Process B:
Now we can aggregate across!
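A minimal Python sketch of the tuple, using the Process A sample from this slide. The `Sample` type and `parse_sample` helper are hypothetical names made up for illustration; they do not come from any Prometheus library.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Sample(NamedTuple):
    """One metric sample: timestamp, metric name, numeric value."""
    timestamp: datetime
    name: str
    value: float


def parse_sample(line: str) -> Sample:
    """Parse a 'TIMESTAMP NAME VALUE' line as written on the slide."""
    stamp, name, value = line.split()
    timestamp = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return Sample(timestamp, name, float(value))


sample = parse_sample("2020-07-28T02:32:06Z http_requests_total 1239")
```

Because every process reports the same three fields under the same metric name, values from different processes can simply be added up.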

Metric dimensions
Process A:
2020-07-28T02:32:06Z http_requests_total{code=200} 1227
Process B:

Metric dimensions
2020-07-28T02:32:06Z
http_requests_total{code=200, process=A} 1227
http_requests_total{code=404, process=A} 12
http_requests_total{code=200, process=B} 1177
http_requests_total{code=404, process=B} 8

Metric dimensions
2020-07-28T02:32:06Z
http_requests_total{code=200, process=A, path=/foo} 1107
http_requests_total{code=404, process=A, path=/foo} 12
http_requests_total{code=404, process=A, path=/bar} 120
http_requests_total{code=200, process=B, path=/foo} 1005
http_requests_total{code=404, process=B, path=/foo} 8
http_requests_total{code=200, process=B, path=/bar} 172
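To see what dimensions buy you, here is a small Python sketch that sums the six samples above along one label at a time. The `sum_by` helper and the list-of-pairs layout are assumptions made for illustration, not how Prometheus stores or queries data.

```python
from collections import defaultdict

# Samples of http_requests_total at 2020-07-28T02:32:06Z, copied from the slide.
SAMPLES = [
    ({"code": "200", "process": "A", "path": "/foo"}, 1107),
    ({"code": "404", "process": "A", "path": "/foo"}, 12),
    ({"code": "404", "process": "A", "path": "/bar"}, 120),
    ({"code": "200", "process": "B", "path": "/foo"}, 1005),
    ({"code": "404", "process": "B", "path": "/foo"}, 8),
    ({"code": "200", "process": "B", "path": "/bar"}, 172),
]


def sum_by(samples, label):
    """Sum sample values grouped by a single label (hypothetical helper)."""
    totals = defaultdict(int)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)


by_code = sum_by(SAMPLES, "code")        # requests per HTTP status code
by_process = sum_by(SAMPLES, "process")  # requests per process
by_path = sum_by(SAMPLES, "path")        # requests per URL path
```

Summing by `process` gives 1239 for process A, the same unlabelled total shown in the first metric example.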

1,000 metrics per microservice, x20 microservices
80,000 samples per minute
3,504,000,000 samples per month

Just for your app

Wait, there is more!
✘ 10-20k metrics per average K8s node
✘ That’s 1,200,000 samples/minute for a 15-node cluster
➢ Assuming 15s collection interval
✘ Or 52,560,000,000 samples a month!
✘ And that’s just for an average-sized app
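The sample counts on these slides can be reproduced with a few lines of Python. This sketch rests on two assumptions not stated outright in the deck: the app metrics are also collected every 15 seconds, and a month is one twelfth of a 365-day year. The cluster figure uses the upper bound of 20k metrics per node.

```python
SCRAPE_INTERVAL_S = 15                          # collection interval from the slide
SCRAPES_PER_MINUTE = 60 // SCRAPE_INTERVAL_S    # 4 scrapes per minute
MINUTES_PER_MONTH = 365 * 24 * 60 // 12         # 43,800 (assumed average month)


def sample_rates(metric_count):
    """Return (samples per minute, samples per month) for a number of metrics."""
    per_minute = metric_count * SCRAPES_PER_MINUTE
    return per_minute, per_minute * MINUTES_PER_MONTH


# App: 1,000 metrics per microservice, 20 microservices.
app_per_minute, app_per_month = sample_rates(1_000 * 20)

# Cluster: 20k metrics per node (upper bound of 10-20k), 15 nodes.
k8s_per_minute, k8s_per_month = sample_rates(20_000 * 15)
```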

“No matter what you choose, stay within a single pane of glass”

Option 1
Stick with the Vendor

Enhance the existing
✘ Both AWS and GCP give you so much for free already
✘ Just add your app metrics
➢ And they support that!
✘ No need to ship system metrics, e.g. K8s - they are already there

But it’s costly
✘ GCP StackDriver
➢ $84/month per 1k metrics
➢ Price drops after 300k metrics
✘ AWS CloudWatch
➢ $100/month per 1k metrics after the first 10k, $50 after the first 240k
✘ Logs are expensive too, btw
➢ GCP SD: $0.50/GB
➢ AWS CW: $0.60/GB + charge of $0.0057 per scanned GB for queries

But it’s costly
✘ 20 μSvc app with 1k metrics and 1GB logs/day per μSvc:
➢ 1*20*30 = 600GB/month
➢ 20k metrics/month
✘ Will cost you:
➢ GCP: $300 for logs + $1760 for metrics
➢ AWS: $360 for logs + $4000 for metrics
✘ It’s only half the story!
➢ With containers, metrics are short-lived
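The log half of the bill above is simple arithmetic, sketched here in Python. It uses only the per-GB ingestion prices quoted on the earlier slide, assumes a 30-day month as the slide does, and leaves out the AWS per-scanned-GB query charge. The metrics figures are not recomputed; they are taken from the slide as given.

```python
SERVICES = 20                      # microservices in the example app
LOG_GB_PER_SERVICE_PER_DAY = 1     # 1GB of logs per service per day
DAYS_PER_MONTH = 30                # as in the slide's 1*20*30

# Ingestion prices in USD per GB, as quoted on the slide.
LOG_PRICE_PER_GB = {"GCP": 0.50, "AWS": 0.60}

log_gb_per_month = SERVICES * LOG_GB_PER_SERVICE_PER_DAY * DAYS_PER_MONTH

# Monthly log cost per vendor, rounded to whole cents.
log_cost = {
    vendor: round(price * log_gb_per_month, 2)
    for vendor, price in LOG_PRICE_PER_GB.items()
}
```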

Further considerations
✘ Vendor specific APIs to ship
➢ Challenging for multi-cloud
➢ Gets better with K8s
✘ Limited to 1 minute resolution

There are many out there
✘ DataDog, Sysdig, NewRelic, Splunk, SumoLogic, Grafana (hosted)
✘ Once you see the pricing, GCP/AWS $-figures make sense :)
✘ Lots of added features though:
➢ AI-assisted anomaly detection, etc.
✘ Multi-cloud!
✘ But now you need to ship all your system metrics
➢ Can become expensive quickly

Simpler than it may sound
Instrument → Collect & Store → Display & Alert

Grafana to Display (and Alert)
De facto dashboarding software for DevOps and beyond

Grafana: multiple data sources
[Diagram: one Grafana dashboard with StackDriver, CloudWatch and Prometheus as data sources; components shown: GCP Pub/Sub, AWS SES, Email Parser]

Hybrid SaaS
As a hybrid SaaS, or “Option 2.5”, you can:
✘ Set up hosted Grafana on Grafana Labs
✘ Connect it to CloudWatch, StackDriver, etc.
✘ Ship your app-only metrics to Grafana Labs
➢ At $16/month per 1k metrics
✘ Still limited to 1-minute resolution for CloudWatch/StackDriver

Let’s Kollekt
Instrument → Collect & Store → Display & Alert

Where to?
✘ We have 3 billion app / 50 billion system metric samples per month
✘ Storage size per sample matters here
✘ MySQL
➢ ~50 bytes per sample (including indexing, etc)
➢ 2.3TB for 50b samples
✘ ElasticSearch
➢ ~20 bytes per sample
➢ 930GB for 50b samples
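A quick Python check of the storage sizes above. It assumes the slide rounds the monthly volume to 50 billion samples and reports sizes in binary units (GiB/TiB); under those assumptions the numbers line up with 2.3TB and 930GB.

```python
SAMPLES_PER_MONTH = 50_000_000_000   # ~50 billion system metric samples

# Approximate bytes per stored sample, as quoted on the slide.
BYTES_PER_SAMPLE = {"MySQL": 50, "ElasticSearch": 20}

GIB = 2 ** 30
TIB = 2 ** 40


def monthly_bytes(bytes_per_sample):
    """Raw storage needed for one month of samples."""
    return SAMPLES_PER_MONTH * bytes_per_sample


mysql_tib = monthly_bytes(BYTES_PER_SAMPLE["MySQL"]) / TIB
elastic_gib = monthly_bytes(BYTES_PER_SAMPLE["ElasticSearch"]) / GIB
```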

General purpose DBs are expensive
✘ MySQL
➢ $230 for 1 month retention
✘ ElasticSearch
✘ That’s just for storage! For one app!

But metrics data is unique
✘ Immutable (no updates)
✘ Write once
✘ Lots of metrics do not change often
✘ And this is why Time Series Databases were born!

Prometheus at a glance
✘ Not the first TSDB, but it became the gold standard
✘ 1-2 bytes per sample
➢ $30-$60 storage cost for 3 month retention as in the previous example
✘ Can process 1 million samples per minute on your laptop

Not just a TSDB
✘ Prometheus discovers:
➢ Your GCE VMs
➢ Your GKE pods
✘ Prometheus pulls metrics from targets
✘ Prometheus stores metrics and allows you to query them OR
✘ Federates them further to a central storage
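The pull model can be illustrated with the Python standard library alone. In this sketch a stand-in target exposes one sample over HTTP and a stand-in collector fetches it; Prometheus itself is not involved, and `MetricsHandler`, `scrape` and `demo_pull` are hypothetical names used only for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    """Stand-in target: answers every GET with one hard-coded sample."""

    def do_GET(self):
        body = b"http_requests_total 1239\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(url, timeout=2.0):
    """Stand-in collector: it initiates the request, the target only answers."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def demo_pull():
    """Start the target on a free local port, scrape it once, shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

The point of the design is that the target needs no knowledge of where its metrics end up: it only has to answer HTTP requests, while discovery and scheduling live in the collector.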

Collection at a glance
[Diagram, built up in three steps: a Prometheus (P8S) instance inside the GKE cluster alongside the pods; a second P8S alongside the VMs; a central P8S / Thanos / VictoriaMetrics etc., with Grafana on top]

Instrumentation
Instrument → Collect & Store → Display & Alert

Python example
from flask import Flask
from prometheus_client import start_http_server, Summary
app = Flask(__name__)
# A Summary records how many calls were observed and their total duration.
REQUEST_TIME = Summary("request_processing_seconds",
                       "Time spent processing request")
@app.route("/")
@REQUEST_TIME.time()  # time every call to this handler
def hello_world():
    return "Hello, World!\n"
if __name__ == "__main__":
    start_http_server(8081)  # metrics are served on a dedicated port!
    app.run(port=8080)       # the app itself

Python example - in action!
$ python app.py &
$ curl localhost:8081
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 247.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 60.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="3",version="3.8.3"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2.34852352e+08
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.6411008e+07
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.29000000000000004
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024.0
# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
request_processing_seconds_count 2.0
request_processing_seconds_sum 1.3547949492931366e-05
# HELP request_processing_seconds_created Time spent processing request
# TYPE request_processing_seconds_created gauge
request_processing_seconds_created 1.5959190974287152e+09
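Output like the above is line-oriented text, so it is easy to read back. The following Python sketch parses a few of the lines shown into (name, labels, value) tuples. It is a deliberately simplified parser written for illustration: it skips comment lines, ignores optional timestamps and does not handle escaped quotes in label values. `parse_metrics` is a hypothetical helper, not part of `prometheus_client`.

```python
import re

# metric_name, optional {label="value",...} block, then the numeric value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a plain 'name{labels} value' line
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


EXAMPLE = '''\
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 60.0
# TYPE process_open_fds gauge
process_open_fds 7.0
request_processing_seconds_count 2.0
'''

parsed = parse_metrics(EXAMPLE)
```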

Recap
[Diagram: GKE cluster with pods and a Prometheus (P8S) instance; VMs with their own P8S; a central P8S / Thanos / VictoriaMetrics etc.; Grafana]
