I. Introduction to Prometheus
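Prometheus is an open-source monitoring and alerting system designed for collecting and storing time series data. It is widely used in modern cloud-native, microservice, and containerized environments, and is particularly well suited to monitoring Kubernetes and Docker. Prometheus offers an efficient way to monitor the health of applications, infrastructure, and services. Its alerts, together with visualization dashboards (typically built in Grafana), help developers and operators make decisions. Its main features are: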
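- Time series data storage
Prometheus stores data in a time series database. Each sample carries a timestamp and belongs to a series identified by a metric name and one or more labels. Labels let Prometheus query and organize data flexibly along different dimensions.
- Powerful query language (PromQL)
Prometheus provides PromQL (Prometheus Query Language). With PromQL, users can write expressive queries that aggregate, filter, and transform data.
- Pull-based scraping
Prometheus uses a pull model: it periodically scrapes metrics over HTTP from the monitored services or systems. Many other monitoring systems use a push model instead, in which the monitored services send their data to the monitoring system.
- Service discovery
Prometheus can discover services and instances automatically. This is especially important in Kubernetes and other dynamic environments, where new services are picked up and monitored without manual configuration changes.
- Integrated alerting
Prometheus evaluates alerting rules, which are PromQL expressions such as a threshold on a metric, and fires alerts when they match. Alertmanager handles routing, grouping, and notification, and supports email, Slack, webhooks, and other channels.
- Multi-dimensional data model
The Prometheus data model is built on labels, and each time series can carry several of them. For example, the same metric (such as CPU usage) can be labeled by host, region, application, and so on.
- Large-scale monitoring
A single Prometheus server is autonomous and does not rely on distributed storage. Even so, its efficient storage and query engines let it handle a large number of time series on one node. Larger deployments scale out by sharding scrape targets across several servers, by federation, or by remote write to long-term storage systems.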
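The PromQL and label features above can be illustrated with a minimal rule file containing one recording rule. This sketch assumes the targets run node_exporter, which exposes `node_cpu_seconds_total` with a `mode` label. The group name and the recorded metric name are illustrative, not taken from this setup.

```yaml
# Hypothetical rule file, e.g. cpu_rules.yaml
groups:
  - name: example-cpu-recording-rules   # illustrative group name
    rules:
      # Per-instance CPU utilisation over the last 5 minutes:
      # 1 minus the average per-second rate of idle CPU time across all cores,
      # aggregated by the `instance` label so each host yields one series.
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Here `{mode="idle"}` filters series by label, and `avg by (instance)` collapses the per-CPU series into one series per host. This is the multi-dimensional model at work.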
二、Prometheus架构
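Prometheus consists mainly of the following components:
- Prometheus server:
a. Pulls the metrics exposed by exporters, as well as the metrics that short-lived jobs have pushed to the Pushgateway.
b. Pushes alerts to Alertmanager.
c. Answers PromQL queries, so that the data can be displayed in Grafana, which uses Prometheus as a data source.
d. Stores time series data on local disk.
e. Finds scrape targets through service discovery.
- Alertmanager: receives the alerts pushed by Prometheus, renders them with notification templates, and sends them to email, WeCom, DingTalk, and other channels.
- Pushgateway: receives custom metrics pushed by clients (for example short-lived batch jobs) and holds them until Prometheus pulls them.
- Prometheus TSDB: the time series database built into the Prometheus server. It stores the scraped monitoring data and can query and aggregate it quickly by timestamp and label.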
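Short-lived jobs push to the Pushgateway, but Prometheus still has to pull from the Pushgateway like any other target. Below is a minimal scrape-job sketch for this. It assumes a Pushgateway reachable as `pushgateway:9091`; 9091 is the Pushgateway's default port, but the host name is hypothetical, since this article does not deploy a Pushgateway.

```yaml
scrape_configs:
  - job_name: 'pushgateway'
    # Keep the job/instance labels that each pushing client attached,
    # instead of overwriting them with this scrape target's own labels.
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']   # hypothetical in-cluster address
```

`honor_labels: true` matters here. Without it, Prometheus would replace the pushed `job` and `instance` labels with those of the Pushgateway target, renaming the pushed ones to `exported_job` and `exported_instance`.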
III. Deploying Prometheus on Kubernetes
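Every manifest in this section targets the `monitoring` namespace, so that namespace has to exist before the other resources are applied. A minimal manifest for it:

```yaml
# Namespace shared by Prometheus, Grafana and their supporting resources
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```

The same result can be obtained with `kubectl create namespace monitoring`.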
- Deploy Prometheus with a Deployment controller
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: prometheus
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      # ServiceAccount bound to the ClusterRole defined in the RBAC step
      serviceAccountName: prometheus
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: prometheus
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: node-rules
          configMap:
            name: prometheus-node-rules
      containers:
        - name: prometheus
          image: prom/prometheus:v2.51.1
          imagePullPolicy: Always
          env:
            - name: TZ
              value: Asia/Shanghai
          command:
            - "/bin/prometheus"
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            # Enables hot reloading of the configuration via HTTP POST /-/reload
            - "--web.enable-lifecycle"
            # Keep 24 hours of data (replaces the deprecated --storage.tsdb.retention flag)
            - "--storage.tsdb.retention.time=24h"
            #- "--web.listen-address=:30090"
          ports:
            - containerPort: 9090
              protocol: TCP
          resources:
            requests:
              cpu: 0.1
              memory: 1G
            limits:
              cpu: 0.5
              memory: 2G
          volumeMounts:
            - name: data
              mountPath: "/prometheus"
            - name: config-volume
              mountPath: "/etc/prometheus"
            - name: node-rules
              mountPath: "/etc/prometheus/rule"
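The Deployment above also mounts a ConfigMap named `prometheus-node-rules` at `/etc/prometheus/rule`, but this article does not define it. The Prometheus configuration loads rule files from that directory matching `*_rules.yaml`. Below is a minimal sketch of such a ConfigMap, assuming the alerts target the `prod-node-exporter` job. The file name, alert names, thresholds, and `severity` label are illustrative choices, not the author's rules.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-node-rules
  namespace: monitoring
data:
  # The key becomes the file name; it must match the *_rules.yaml glob in rule_files
  node_rules.yaml: |
    groups:
      - name: node-alerts   # illustrative group name
        rules:
          # Fires when a node_exporter target has failed its scrapes for 1 minute
          - alert: NodeExporterDown
            expr: up{job="prod-node-exporter"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "node_exporter {{ $labels.instance }} is down"
          # Fires when a host's non-idle CPU share stays above 90% for 5 minutes
          - alert: NodeHighCpuUsage
            expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "CPU usage on {{ $labels.instance }} is above 90%"
```

After either ConfigMap changes and the kubelet has synced the mounted files, the `--web.enable-lifecycle` flag lets you apply the change with an HTTP POST to `/-/reload` instead of restarting the pod.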
- Create a ConfigMap that holds the Prometheus startup configuration file
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
      - /etc/prometheus/rule/*_rules.yaml
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["alertmanager:9093"]
    scrape_configs:
      - job_name: 'prod-node-exporter'
        static_configs:
          - targets:
              - '172.16.8.123:30091'
              - '172.16.8.124:30091'
              - '172.16.8.125:30091'
              - '172.16.8.126:30091'
              - '172.16.8.177:30091'
              - '172.16.8.178:30091'
              - '172.16.8.179:30091'
              - '172.16.8.167:9100'
      - job_name: 'mysql-node-exporter'
        static_configs:
          - targets:
              - '172.16.8.45:9100'
      - job_name: 'redis-node-exporter'
        static_configs:
          - targets:
              - '172.16.8.71:9100'
      - job_name: 'docker-redis-rabbitmq-node-exporter'
        static_configs:
          - targets:
              - '172.16.8.164:9100'
      - job_name: 'minio-node-exporter'
        static_configs:
          - targets:
              - '172.16.8.165:9100'
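Prometheus attaches `job` and `instance` labels to every series it scrapes. To use the multi-dimensional data model more fully, a `static_configs` entry can also attach extra labels to all of its targets. Below is a sketch of how part of the first job above might look; the `env` and `role` labels and their values are hypothetical.

```yaml
scrape_configs:
  - job_name: 'prod-node-exporter'
    static_configs:
      - targets:
          - '172.16.8.123:30091'
          - '172.16.8.124:30091'
        # Added to every series scraped from the targets in this group
        labels:
          env: prod        # hypothetical label
          role: k8s-node   # hypothetical label
```

Queries can then select on these labels, for example `up{env="prod"}`.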
- Expose Prometheus through a NodePort Service
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: NodePort
  # ports is a list, so each entry starts with "-"
  ports:
    - port: 9090
      protocol: TCP
      targetPort: 9090
      nodePort: 30090
  selector:
    app: prometheus
- Configure RBAC to grant Prometheus the permissions it needs
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  # Ingresses are served from networking.k8s.io (extensions/v1beta1 was removed in Kubernetes 1.22)
  - apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
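The ClusterRole above grants read access to nodes, services, endpoints, and pods. However, the scrape configuration in this article uses only static targets, so those permissions go unused until Prometheus uses Kubernetes service discovery. Below is a minimal sketch of such a job. It assumes pods opt in with a `prometheus.io/scrape: "true"` annotation, which is a common convention rather than something Kubernetes or Prometheus enforces; the job name is illustrative.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      # Discover pods through the API server, authenticating with the pod's ServiceAccount
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy namespace and pod name from discovery metadata into regular labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```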
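- Request a persistent volume automatically through a dynamic StorageClass and a PVC. Note that the Prometheus documentation lists NFS as unsupported for its local TSDB storage, so a block-storage class is the safer choice in production.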
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: monitoring
spec:
  # Replaces the deprecated volume.beta.kubernetes.io/storage-class annotation
  storageClassName: nfs-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30G
IV. Deploying Grafana on Kubernetes
- Deploy Grafana with a Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: monitoring
  name: grafana
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - image: grafana/grafana-enterprise:11.5.3
          name: grafana
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
              protocol: TCP
          env:
            # Grafana reads GF_<SECTION>_<KEY> variables as overrides for grafana.ini;
            # these set root_url, domain and serve_from_sub_path in the [server] section
            - name: GF_SERVER_ROOT_URL
              value: '%(protocol)s://%(domain)s:%(http_port)s/grafana'
            - name: GF_SERVER_DOMAIN
              value: localhost
            - name: GF_SERVER_SERVE_FROM_SUB_PATH
              value: 'true'
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 200m
              memory: 512Mi
      restartPolicy: Always
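- Expose Grafana through a NodePort Service. Because Grafana is configured to serve from the `/grafana` sub-path, the UI is reached at `http://<node-IP>:30030/grafana/`.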
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
    - port: 3000
      protocol: TCP
      targetPort: 3000
      nodePort: 30030
  selector:
    app: grafana
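Grafana still needs Prometheus configured as a data source. Instead of adding it by hand in the UI, Grafana can provision data sources from files in `/etc/grafana/provisioning/datasources`. Below is a minimal sketch that assumes a hypothetical ConfigMap name `grafana-datasources`. It reaches Prometheus through the `prometheus` Service above, which lives in the same `monitoring` namespace.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources   # hypothetical name
  namespace: monitoring
data:
  # Grafana data source provisioning file
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy                  # the Grafana backend sends the PromQL queries
        url: http://prometheus:9090    # Service name and port from the Prometheus Service
        isDefault: true
```

For Grafana to read this file, the Deployment needs a `configMap` volume referencing `grafana-datasources`. The grafana container also needs a matching `volumeMount` at `/etc/grafana/provisioning/datasources`.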