Singapore Meetup, 2017-11-23
Speaker: Arseny Chernov
So The Story Goes Like…
Full story: https://www.slideshare.net/grobie/the-history-of-prometheus-at-soundcloud
• 2012 - Joined SoundCloud
• Left Google in 2012 after 5+ years
• Side project: an open-source monitoring system for "Not Only IT" (econometrics, biochemical etc.)
• Started LevelDB-backed Prometheus
• Server, client_golang
• Protocol Buffers
• 2012 - Joined SoundCloud
• Left Google in 2012 after 2+ years
• Configuration, query language
• 2013 - Joined SoundCloud
• Left Google in 2013 after 7+ years
• Storage rewrite (LevelDB to Chunks): March 2014
• Public release: January 2015
• Join Cloud Native Computing Foundation (CNCF): May 2016
• Prometheus 2.0 announced: November 08, 2017
• Singapore Meetup: 23 November, 2017
Motivation Behind - Google SRE Best Practices
Read book: https://landing.google.com/sre/book.html
• SRE: Have software engineers do operations
• Do the same work as an operations team, but with automation instead of manual labour
• 50% cap on the share of time spent on "ops" work
Google SLI, SLO, SLA
Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html
Service Level Indicators (SLIs)
• A carefully defined quantitative measure of some aspect of the level of service that is provided
• Examples: request latency, error rate (often expressed as % of all requests received), system throughput
Service Level Objectives (SLOs)
• Lower bound ≤ SLI ≤ upper bound
• Define the lowest acceptable level of reliability, and state that as your Service Level Objective (SLO).
Service Level Agreements (SLAs)
• The SLA is a looser objective than the SLO. Alternatively, the SLA might specify only a subset of the SLO metrics.
• E.g. an availability SLA of 99.9% over 1 month with an internal availability SLO of 99.95%
• A promise to someone using a service that its availability should meet a certain level over a certain period; if it fails to do so, some kind of penalty is paid (a partial refund of the subscription fee paid by customers for that period, or subscription time added for free)
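The relationship between the three terms can be sketched in a few lines of Python. This is only an illustration: the function names and the request counts are hypothetical, and the 99.95% / 99.9% thresholds are taken from the example on this slide.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: percentage of requests that were served successfully."""
    if total_requests == 0:
        return 100.0  # assumption: a window with no traffic counts as fully available
    return 100.0 * good_requests / total_requests


def evaluate(sli: float, slo: float = 99.95, sla: float = 99.9) -> str:
    """Compare a measured SLI with the internal SLO and the looser external SLA."""
    if sli >= slo:
        return "meets SLO"
    if sli >= sla:
        return "misses SLO, still within SLA"
    return "breaches SLA"


# Hypothetical month: 1,000,000 requests, 700 of them failed.
sli = availability_sli(good_requests=999_300, total_requests=1_000_000)
print(f"{sli:.2f}% -> {evaluate(sli)}")
```

The gap between the two thresholds is what gives the team room to react before customers are owed a penalty.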
Four Golden Signals
Latency
• The time it takes to service a request.
• Distinguish successful from failed requests
• A slow error is even worse than a fast error. Track error latency.
Traffic
• A measure of how much demand is being placed on your system
• Usually HTTP requests per second (static vs dynamic content)
• Streaming system - network I/O rate or concurrent sessions
• Key-value storage system - transactions per second (TPS)
Errors
• The rate of requests that fail (e.g. HTTP 500s, or HTTP 200 coupled with wrong content)
Saturation
• How "full" your service is: CPU, memory, I/O
• Can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives?
• Saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."
Error Budget = 100% - SLO
Full story: https://cloudplatform.googleblog.com/2017/01/availability-part-deux--CRE-life-lessons.html
Move fast without breaking SLO
• 100% is the wrong reliability target
• Error Budgets balance the goals of:
• Product development teams (KPI is feature velocity, incentive to push code often)
• SRE teams (KPI is reliability of a service, incentive to push back against change)
• Error budget can be spent on anything: launching features, etc.
• Error budget prompts discussion of phased rollouts and 1% experiments
Goal of SRE team isn’t “zero outages”
• SRE and product teams are incentive-aligned to spend the error budget and get maximum feature velocity
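As a worked example of "Error Budget = 100% - SLO", the sketch below converts an availability SLO into minutes of allowed downtime. The 30-day window, the function names and the 8 minutes of downtime already spent are assumptions for illustration; the 99.95% figure is the SLO quoted earlier in the deck.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget = 100% - SLO."""
    return 100.0 - slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Translate the error budget into minutes of full outage per window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100.0


def budget_remaining_minutes(slo_percent: float, downtime_so_far: float,
                             window_days: int = 30) -> float:
    """Budget left to spend on launches and experiments (negative means overspent)."""
    return allowed_downtime_minutes(slo_percent, window_days) - downtime_so_far


# A 99.95% SLO over an assumed 30-day window allows about 21.6 minutes of downtime.
print(f"{allowed_downtime_minutes(99.95):.1f} minutes allowed")
print(f"{budget_remaining_minutes(99.95, downtime_so_far=8.0):.1f} minutes left")
```

While the remaining budget is positive, the product team can keep shipping; once it is spent, the incentive shifts towards reliability work.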
Googlers use Borgmon (a.k.a. Borgmon rules)
Full story: https://landing.google.com/sre/book/chapters/practical-alerting.html
% curl http://webserver:80/varz
http_requests 37
errors_total 12
Each of the major languages used at Google has an implementation of the exported variable interface that automagically registers with the HTTP server built into every Google binary by default. This is called “collection via /varz”.
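To show how little a collector needs in order to consume such an endpoint, here is a minimal Python sketch that parses the two-line response shown above. It assumes one "name value" pair per line, as in the sample; `parse_varz` is a hypothetical helper, not part of Borgmon or Prometheus.

```python
def parse_varz(body: str) -> dict[str, float]:
    """Parse 'name value' lines, as in the /varz sample, into a dictionary."""
    metrics = {}
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, value = line.partition(" ")
        metrics[name] = float(value)
    return metrics


# The response body from the slide, used here as a literal instead of a live HTTP call.
sample = "http_requests 37\nerrors_total 12\n"
metrics = parse_varz(sample)
error_ratio = metrics["errors_total"] / metrics["http_requests"]
print(metrics, f"error ratio = {error_ratio:.3f}")
```

A monitoring system scrapes this repeatedly and stores each value with a timestamp, which is what turns plain counters into time series.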
…traditional monitoring in kube era
Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus
A lot of traffic to monitor
Way more targets to monitor
…and they constantly change
Need a fleet-wide view (e.g. what is my overall 99th percentile latency?)
Still need to be able to drill down for troubleshooting
Prometheus Relies on Exporters
Full story: https://www.slideshare.net/FabianReinartz/monitoring-a-kubernetes-backed-microservice-architecture-with-prometheus
Exporters: the endpoint that the Prometheus server polls, and that answers its GET requests, is typically called an exporter; e.g. the host-level metrics exporter is node-exporter.
https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exporters.md
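A toy exporter can be sketched with nothing but the Python standard library. This is an illustration of the pull model, not how node-exporter is implemented: the metric names, the `COUNTERS` dictionary and port 8000 are all made up, and real exporters would normally use an official client library instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counters; a real exporter would read these from the system it wraps.
COUNTERS = {"demo_http_requests_total": 37, "demo_errors_total": 12}


def render_metrics(counters: dict) -> str:
    """Render counters in the plain-text exposition format that Prometheus scrapes."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes /metrics by convention; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then running `curl http://localhost:8000/metrics` would return the rendered counters, which is all the server side of a scrape amounts to.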
Prometheus Architecture
Full story: https://jaxenter.com/prometheus-monitoring-pros-cons-136019.html
PromQL:
The 3 path-method combinations with the highest number of failing requests?
  topk(3,
    sum by(path, method) (
      rate(http_requests_total{status=~"5.."}[5m])
    )
  )
The 99th percentile request latency by request path?
  histogram_quantile(0.99,
    sum by(le, path) (
      rate(http_requests_duration_seconds_bucket[5m])
    )
  )
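To give a feel for what histogram_quantile does with the `le` buckets, here is a simplified Python sketch of quantile estimation by linear interpolation over cumulative buckets. It is not Prometheus's actual code: it assumes the first bucket starts at 0, that the last bucket is +Inf, and the bucket values in the example are invented.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    Simplified sketch: assumes the lowest bucket starts at 0 and that the
    highest upper bound is math.inf, mirroring the `le="+Inf"` bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Invented latency buckets in seconds: 50 requests <= 0.1s, 90 <= 0.5s, 99 <= 1s, 100 total.
latency = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.75, latency))
```

The estimate is only as fine-grained as the bucket boundaries, which is one reason the result should be read as an approximation rather than an exact percentile.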
Prometheus Storage Architecture
• A monitoring system must be more reliable than the systems it is monitoring
• Prometheus's local storage is not meant as durable long-term storage.
• Chunks of data are in RAM, with WAL on disk
  needed_disk_space =
    retention_time_seconds *
    ingested_samples_per_second *
    bytes_per_sample (1-2 bytes)
• Possible LVM solution if _really_ desperate
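The sizing formula above is easy to evaluate for a concrete case. In this sketch the 15-day retention and 100,000 samples per second are hypothetical inputs, and 2 bytes per sample is the upper end of the range quoted on the slide.

```python
def needed_disk_space_bytes(retention_days: float,
                            samples_per_second: float,
                            bytes_per_sample: float = 2.0) -> float:
    """needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical server: 15 days of retention, 100,000 samples ingested per second.
size = needed_disk_space_bytes(retention_days=15, samples_per_second=100_000)
print(f"{size / 1e9:.1f} GB")
```

Because the result scales linearly with each factor, halving the retention or the ingestion rate halves the disk requirement.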
At the time of writing (Nov. 2017), Prometheus can integrate via adapters with:
Chronix, Cortex, CrateDB, Graphite, InfluxDB, OpenTSDB, PostgreSQL/TimescaleDB, SignalFx, ClickHouse, etc.
This is primarily intended for long term storage. It is recommended that you perform careful evaluation of any
solution in this space to confirm it can handle your data volumes.
Full story: https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
What Prometheus Is Not & Best Practice
• Not 100% accurate
• No logs, only metrics
• Not durable long-term storage
• Not an anomaly detection system
• Not a dashboarding solution
Full story: https://prometheus.io/docs/introduction/overview/#when-does-it-not-fit
Run one Prometheus server (or HA pair) in each failure domain / zone / cluster, monitoring jobs only in that zone.
Have a set of global Prometheus servers that monitor (federate from) the per-cluster ones.

Introduction to Prometheus Monitoring (Singapore Meetup)
