Dip Into P8s (Prometheus)
hello!
I am Zaar Hai
Staff Cloud Architect at DoiT International
linkedin.com/in/zaar
Why this talk?
Cloud Out of the box
✘ VM CPU/Network/Disk stats…
✘ But not memory
➢ requires vendor-specific agents
✘ Gets much better with GKE
➢ Memory usage for pods IF your YAMLs behave
✘ All of the above are metrics… But what about our app?
App Metrics
It’s all about app metrics
Why metrics at all?
When logs are not enough
✘ Too detailed to see the big picture
✘ Hard to see KPIs / trends
Logs to metrics
✘ Log-based metrics in StackDriver
➢ Fragile configuration
➢ Disjoint from the app
➢ Vendor specific
✘ Smart log parsers, e.g. Coralogix
Parsed 100 docs in 0.31 seconds
Parsed 100 docs in 0.38 seconds
Parsed 100 docs in 0.34 seconds
✘ An option when retrofitting monitoring onto an existing app
So what are metrics?
Metrics are just a tuple: (timestamp, metric name, numeric value)
Metric examples
Process A:
2020-07-28T02:32:06Z http_requests_total 1239
2020-07-28T02:32:07Z http_requests_total 1245
Process B:
2020-07-28T02:32:06Z http_requests_total 1185
2020-07-28T02:32:07Z http_requests_total 1185
Now we can aggregate across processes!
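As a rough illustration of what aggregating across processes means, the sketch below sums the two processes' counters at each timestamp and derives a request rate from the difference. The data layout and the helper name `total_per_timestamp` are hypothetical; only the sample values come from the slide.

```python
from collections import defaultdict

# (timestamp, metric name, value) tuples per process, copied from the slide
SAMPLES = {
    "A": [("2020-07-28T02:32:06Z", "http_requests_total", 1239),
          ("2020-07-28T02:32:07Z", "http_requests_total", 1245)],
    "B": [("2020-07-28T02:32:06Z", "http_requests_total", 1185),
          ("2020-07-28T02:32:07Z", "http_requests_total", 1185)],
}


def total_per_timestamp(samples):
    """Sum one metric across all processes for each timestamp."""
    totals = defaultdict(int)
    for series in samples.values():
        for timestamp, _name, value in series:
            totals[timestamp] += value
    return dict(totals)


totals = total_per_timestamp(SAMPLES)
first, second = sorted(totals)  # ISO timestamps sort chronologically
# The two samples are one second apart, so the counter difference
# is the aggregate request rate per second.
requests_per_second = totals[second] - totals[first]
print(totals, requests_per_second)
```

Process B served nothing in that second, so the whole increase comes from process A.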
Metric dimensions
Process A:
2020-07-28T02:32:06Z http_requests_total 1239
Process B:
2020-07-28T02:32:06Z http_requests_total 1185
Process A:
2020-07-28T02:32:06Z http_requests_total{code=200} 1227
2020-07-28T02:32:06Z http_requests_total{code=404} 12
Process B:
2020-07-28T02:32:06Z http_requests_total{code=200} 1177
2020-07-28T02:32:06Z http_requests_total{code=404} 8
2020-07-28T02:32:06Z
http_requests_total{code=200, process=A} 1227
http_requests_total{code=404, process=A} 12
http_requests_total{code=200, process=B} 1177
http_requests_total{code=404, process=B} 8
2020-07-28T02:32:06Z
http_requests_total{code=200, process=A, path=/foo} 1107
http_requests_total{code=404, process=A, path=/foo} 12
http_requests_total{code=404, process=A, path=/bar} 120
http_requests_total{code=200, process=B, path=/foo} 1005
http_requests_total{code=404, process=B, path=/foo} 8
http_requests_total{code=200, process=B, path=/bar} 172
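Dimensions pay off when you slice the same data in different ways. A minimal sketch, using a hypothetical `sum_by` helper over the samples from the slide, that groups the series by any one label:

```python
from collections import defaultdict

# (labels, value) pairs for http_requests_total at 2020-07-28T02:32:06Z
SERIES = [
    ({"code": "200", "process": "A", "path": "/foo"}, 1107),
    ({"code": "404", "process": "A", "path": "/foo"}, 12),
    ({"code": "404", "process": "A", "path": "/bar"}, 120),
    ({"code": "200", "process": "B", "path": "/foo"}, 1005),
    ({"code": "404", "process": "B", "path": "/foo"}, 8),
    ({"code": "200", "process": "B", "path": "/bar"}, 172),
]


def sum_by(series, label):
    """Aggregate sample values, keeping only one label as the grouping key."""
    totals = defaultdict(int)
    for labels, value in series:
        totals[labels[label]] += value
    return dict(totals)


by_code = sum_by(SERIES, "code")
by_path = sum_by(SERIES, "path")
by_process = sum_by(SERIES, "process")
print(by_code, by_path, by_process)
```

Grouping by `process` gives back the per-process totals from the first example (1239 and 1185), which is a quick sanity check that no samples were lost when the dimensions were added.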
Now we can graph that!
How many metrics?
✘ 1,000 metrics per microservice
✘ x20 microservices
✘ 80,000 samples per minute
✘ 3,504,000,000 samples per month
✘ Just for your app
Wait, there is more!
✘ 10-20k metrics per average K8s node
✘ That’s 1,200,000 samples/minute for a 15-node cluster
➢ Assuming 15s collection interval
✘ Or 52,560,000,000 samples a month!
✘ And that’s just for an average-sized app
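The sample counts follow from simple arithmetic. The sketch below reproduces them under two assumptions that the slides imply but do not spell out: the app metrics use the same 15s collection interval as the cluster, and the cluster figure uses the upper end of the range (20k metrics per node). A month is taken as one twelfth of a 365-day year.

```python
MINUTES_PER_MONTH = 365 * 24 * 60 // 12  # 43,800 minutes in an average month


def samples_per_minute(series_count, interval_seconds=15):
    """Each series yields one sample per collection interval."""
    return series_count * (60 // interval_seconds)


# App: 1,000 metrics per microservice, 20 microservices
app_per_minute = samples_per_minute(1_000 * 20)
app_per_month = app_per_minute * MINUTES_PER_MONTH

# System: 20k metrics per node (upper end of 10-20k), 15 nodes
system_per_minute = samples_per_minute(20_000 * 15)
system_per_month = system_per_minute * MINUTES_PER_MONTH

print(f"{app_per_minute:,} / {app_per_month:,}")
print(f"{system_per_minute:,} / {system_per_month:,}")
```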
What are my options?
“No matter what you choose, stay within a single pane of glass”
Option 1
Stick with the Vendor
Enhance the existing
✘ Both AWS and GCP give you so much for free already
✘ Just add your app metrics
➢ And they support that!
✘ No need to ship system metrics, e.g. K8s - they are already there
But it’s costly
✘ GCP StackDriver
➢ $84/month per 1k metrics
➢ Price drops after 300k metrics
✘ AWS CloudWatch
➢ $300/month per 1k metrics
➢ $100/month after first 10k, $50 after the first 240k
✘ Logs are expensive too, btw
➢ GCP SD: $0.50/GB
➢ AWS CW: $0.60/GB + charge of $0.0057 per scanned GB for queries
But it’s costly
✘ 20 μSvc app with 1k metrics and 1GB logs/day per μSvc:
➢ 1GB * 20 * 30 = 600GB of logs/month
➢ 20k metrics
✘ Will cost you:
➢ GCP: $300 for logs + $1760 for metrics
➢ AWS: $360 for logs + $4000 for metrics
✘ That’s only half the story!
➢ With containers, metrics are short-lived
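To make the tiered pricing concrete, here is a small sketch that reproduces the log costs and the AWS metrics figure from the prices quoted on the slides. The tier boundaries are one literal reading of "after first 10k" and "after the first 240k", and the function names are hypothetical. Prices are the ones from the talk and may not match current vendor pricing.

```python
import math

# AWS CloudWatch tiers as read from the slide:
# (upper bound of tier in metrics, $ per 1k metrics per month)
AWS_TIERS = [(10_000, 300), (240_000, 100), (math.inf, 50)]

# Log ingestion prices in cents per GB, to keep the arithmetic exact
LOG_CENTS_PER_GB = {"GCP": 50, "AWS": 60}


def tiered_metrics_cost(metrics, tiers):
    """Charge each slice of the metric count at its tier's price."""
    cost = 0.0
    lower = 0
    for upper, dollars_per_1k in tiers:
        in_tier = max(0, min(metrics, upper) - lower)
        cost += in_tier / 1000 * dollars_per_1k
        lower = upper
    return cost


def log_cost(gb_per_month, vendor):
    return gb_per_month * LOG_CENTS_PER_GB[vendor] / 100


log_gb = 1 * 20 * 30          # 1GB/day per service, 20 services, 30 days
aws_metrics = tiered_metrics_cost(20_000, AWS_TIERS)
print(log_cost(log_gb, "GCP"), log_cost(log_gb, "AWS"), aws_metrics)
```

The first 10k metrics cost $3,000 and the next 10k cost $1,000, which is where the $4,000 figure comes from.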
Further considerations
✘ Vendor specific APIs to ship
➢ Challenging for multi-cloud
➢ Gets better with K8s
✘ Limited to 1 minute resolution
Option 2
Dedicated SaaS
There are many out there
✘ DataDog, Sysdig, NewRelic, Splunk, SumoLogic, Grafana (hosted)
✘ Once you see the pricing, GCP/AWS $-figures make sense :)
✘ Lots of added features though:
➢ AI-assisted anomaly detection, etc.
✘ Multi-cloud!
✘ But now you need to ship all your system metrics
➢ Can become expensive quickly
Option 3
Host your own
Simpler than it may sound
Instrument → Collect & Store → Display & Alert
Grafana to Display (and Alert)
De-facto dashboarding software for DevOps and beyond
Grafana: Multiple Data sources
[Diagram labels: GCP Pub/Sub, AWS SES, Email Parser; StackDriver, CloudWatch, Prometheus; One Grafana Dashboard]
Hybrid SaaS
As a hybrid SaaS, or “Option 2.5”, you can:
✘ Set up hosted Grafana on Grafana Labs
✘ Connect it to CloudWatch, StackDriver, etc.
✘ Ship your app-only metrics to Grafana Labs
➢ At $16/month per 1k metrics
✘ Still limited to 1-minute resolution for CloudWatch/StackDriver
Let’s Kollekt
Instrument → Collect & Store → Display & Alert
Where to?
✘ We have 3 billion app / 50 billion system metric samples per month
✘ Storage size per sample matters here
✘ MySQL
➢ ~50 bytes per sample (including indexing, etc.)
➢ 2.3TB for 50 billion samples
✘ ElasticSearch
➢ ~20 bytes per sample
➢ 930GB for 50 billion samples
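The storage figures can be checked with a few lines of arithmetic. A minimal sketch, assuming the slide's TB and GB are binary units (TiB and GiB), which is the reading under which the numbers line up; the `storage_gib` helper is hypothetical.

```python
SAMPLES_PER_MONTH = 50_000_000_000  # ~50 billion system metric samples


def storage_gib(samples, bytes_per_sample):
    """Raw storage needed for the given number of samples, in GiB."""
    return samples * bytes_per_sample / 2**30


mysql_gib = storage_gib(SAMPLES_PER_MONTH, 50)
elastic_gib = storage_gib(SAMPLES_PER_MONTH, 20)

print(f"MySQL:         {mysql_gib / 1024:.1f} TiB")
print(f"ElasticSearch: {elastic_gib:.0f} GiB")
```

Because the size scales linearly with bytes per sample, every byte shaved off the per-sample footprint saves roughly 47 GiB per month at this volume.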
General purpose DBs are expensive
✘ MySQL
➢ $230 for 1 month retention
➢ $1380 for 3 month retention
✘ ElasticSearch
➢ $93 for 1 month retention
➢ $550 for 3 month retention
✘ That’s just for storage! For one app!
But metrics data is unique
✘ Immutable (no updates)
✘ Write once
✘ Lots of metrics do not change often
✘ And this is why Time Series Databases were born!
Prometheus
Finally!
Prometheus at a glance
✘ Not the first TSDB, but it became the gold standard
✘ 1-2 bytes per sample
➢ $30-$60 storage cost for 3 month retention as in the previous example
✘ Can process 1 million samples per minute on your laptop
Not just a TSDB
✘ Prometheus discovers:
➢ Your GCE VMs
➢ Your GKE pods
✘ Prometheus pulls metrics from targets
✘ Prometheus stores metrics and allows you to query them OR
✘ Federates them further to a central storage
Collection at a glance
[Diagram labels: GKE cluster with pods and a P8S instance; VMs with a second P8S instance; a central P8S / Thanos / VictoriaMetrics etc.; Grafana]
Instrumentation
Instrument → Collect & Store → Display & Alert
Python example

    from flask import Flask
    from prometheus_client import start_http_server, Summary

    app = Flask(__name__)

    # Summary tracks how many requests were handled and their total duration
    REQUEST_TIME = Summary("request_processing_seconds",
                           "Time spent processing request")


    @app.route("/")
    @REQUEST_TIME.time()  # times every call to the handler
    def hello_world():
        return "Hello, World!\n"


    if __name__ == "__main__":
        start_http_server(8081)  # Dedicated port for metrics!
        app.run(port=8080)       # The app itself
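The dimensions discussed earlier map to labels in the Python client. A minimal sketch, assuming the `prometheus_client` package used in the example above is installed; the `record_request` helper and the separate registry are illustrative choices, not something shown in the talk.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# A private registry keeps this sketch isolated from the default
# process and Python runtime metrics.
registry = CollectorRegistry()

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests served",
    ["code", "path"],  # label names, i.e. the metric's dimensions
    registry=registry,
)


def record_request(code, path):
    """Increment the counter for one (code, path) combination."""
    HTTP_REQUESTS.labels(code=str(code), path=path).inc()


record_request(200, "/foo")
record_request(200, "/foo")
record_request(404, "/bar")

# Render the registry in the text format that Prometheus scrapes
exposition = generate_latest(registry).decode()
print(exposition)
```

Each distinct label combination becomes its own time series, so labels with many possible values multiply the sample counts estimated earlier.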
Python example - in action!
$ python app.py &
$ curl localhost:8080
$ curl localhost:8080
$ curl localhost:8081
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 247.0
python_gc_objects_collected_total{generation="1"} 151.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 60.0
python_gc_collections_total{generation="1"} 5.0
python_gc_collections_total{generation="2"} 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="3",version="3.8.3"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 2.34852352e+08
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.6411008e+07
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.29000000000000004
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024.0
# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
request_processing_seconds_count 2.0
request_processing_seconds_sum 1.3547949492931366e-05
# HELP request_processing_seconds_created Time spent processing request
# TYPE request_processing_seconds_created gauge
request_processing_seconds_created 1.5959190974287152e+09
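A Summary exposes a `_count` and a `_sum`, and dividing one by the other gives the mean request time. The sketch below parses a few lines of the output above with a deliberately simple, hypothetical `parse_samples` helper. It ignores the optional trailing timestamp that the exposition format allows, so it is not a full parser.

```python
def parse_samples(text):
    """Parse 'series value' lines into a dict, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is everything after the last space on the line
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


EXPOSITION = """\
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 60.0
# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
request_processing_seconds_count 2.0
request_processing_seconds_sum 1.3547949492931366e-05
"""

samples = parse_samples(EXPOSITION)
mean_seconds = (samples["request_processing_seconds_sum"]
                / samples["request_processing_seconds_count"])
print(f"mean request time: {mean_seconds * 1e6:.2f} microseconds")
```

With two requests recorded, the mean comes out at a little under 7 microseconds per request.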
Recap
[Diagram labels: GKE cluster with pods and a P8S instance; VMs with a second P8S instance; a central P8S / Thanos / VictoriaMetrics etc.; Grafana]
More to come in Part II
thanks!
Any questions?

