Prometheus Monitoring k8s
I. Common Monitoring
Commonly used k8s resource monitoring queries; other monitoring parameters are the same as for Docker:
https://cloud.tencent.com/developer/article/2145203
If you also want to monitor containers, you need to install cAdvisor separately; see the deployment section below.
1. Monitoring nodes
(1) Monitoring queries
# CPU usage (%) of all nodes
100 - avg(irate(node_cpu_seconds_total{mode="idle",Type="NODE"}[5m])) by (instance) * 100
(2) Alerting
# CPU alert
100 - avg(irate(node_cpu_seconds_total{Type="NODE",mode="idle"}[1m])) by (instance) * 100 > 75
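Memory can be monitored the same way; a minimal sketch using standard node-exporter metrics (the Type="NODE" label is the custom label added by the relabeling configured in section II):
# Memory usage (%) of all nodes
(1 - node_memory_MemAvailable_bytes{Type="NODE"} / node_memory_MemTotal_bytes{Type="NODE"}) * 100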
2. Monitoring pods
# NIC receive (download) rate
irate(container_network_receive_bytes_total{image!="",namespace!="",interface="eth0"}[5m])
# CPU usage of all pods
sum(irate(container_cpu_usage_seconds_total{image!="",namespace!=""}[1m])) without (cpu) * 100
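A sketch for pod memory as well; container_memory_working_set_bytes is a standard cAdvisor metric (note the grouping label is pod on current cAdvisor versions, pod_name on very old ones):
# Memory working set per pod
sum(container_memory_working_set_bytes{image!="",namespace!=""}) by (namespace, pod)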
3. Monitoring containers
# Download rate. Pod name labels are usually longer than 20 characters, so matching name labels shorter than 20 characters selects containers
irate(container_network_receive_bytes_total{image!="",name=~"[\u4e00-\u9fa5_a-zA-Z0-9_]{1,19}",interface="eth0"}[5m])
# CPU usage
Pod name labels usually exceed 20 characters, so matching names made of Chinese characters, letters, digits, and underscores with a bounded length (here 4-10 characters) yields only containers:
sum(irate(container_cpu_usage_seconds_total{image!="",name=~"[\u4e00-\u9fa5_a-zA-Z0-9_]{4,10}"}[1m])) without (cpu) * 100
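The same name-length trick carries over to other container metrics; for example, a memory sketch under the same regex assumption:
# Memory working set per container
sum(container_memory_working_set_bytes{image!="",name=~"[\u4e00-\u9fa5_a-zA-Z0-9_]{4,10}"}) by (name)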
II. Deployment
• kubernetes 1.25.6
• A Prometheus running outside the cluster monitors k8s by accessing kube-apiserver, which requires RBAC authorization.
• node-exporter collects node-level data such as CPU, memory, NIC traffic, and disk.
• cAdvisor is built into the kubelet and collects pod data; to also collect container data, deploy cAdvisor separately.
• kube-state-metrics watches the API server and generates state metrics for resource objects such as Deployments, Nodes, and Pods.
References
https://www.cnblogs.com/cyh00001/p/16725312.html
https://blog.csdn.net/m0_48898914/article/details/128194358
1. First deploy prometheus + grafana + alertmanager via Docker or locally
If you want to monitor containers in addition to k8s, also deploy cAdvisor.
See [Deploying Prometheus with Docker].
2. Create an RBAC user and grant it permissions
(1) Create a namespace to hold the Prometheus monitoring resources
# cd /soft/prometheus-rbac/
# kubectl create ns monitoring
(2) Create a k8s ServiceAccount
# cat prometheus-sa.yaml
apiVersion: v1
kind: ServiceAccount # a ServiceAccount (a k8s built-in account type)
metadata:
  name: prometheus # SA account name
  namespace: monitoring # namespace
# kubectl apply -f prometheus-sa.yaml
# kubectl get sa -n monitoring
NAME SECRETS AGE
default 0 5m57s
prometheus 0 4m26s
(3) Prepare a token for the SA account (the equivalent of setting a password for it)
# cat prometheus-secret.yaml
apiVersion: v1
kind: Secret # a secret resource for storing credentials such as passwords/tokens
type: kubernetes.io/service-account-token
metadata:
  name: prometheus-token # token name
  namespace: monitoring
  annotations: # annotation: the account this token belongs to is prometheus
    kubernetes.io/service-account.name: "prometheus"
# kubectl apply -f prometheus-secret.yaml
# kubectl get secret -n monitoring
NAME TYPE DATA AGE
prometheus-token kubernetes.io/service-account-token 3 3m52s
(4) Create a ClusterRole and grant it permissions
# cat prometheus-ClusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole # cluster-scoped role
metadata:
  name: prometheus # ClusterRole name; ClusterRoles are cluster-scoped, so they take no namespace
rules: # granted permissions
- apiGroups:
  - ""
  resources: # authorized resources
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs: # allowed verbs
  - get
  - list
  - watch
- apiGroups:
  - "networking.k8s.io" # ingresses moved here from "extensions" in k8s 1.19+
  resources:
  - ingresses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
# kubectl apply -f prometheus-ClusterRole.yaml
# kubectl get ClusterRole |grep "prometheus"
(5) Bind the previously created SA account to the ClusterRole so that the SA gains the role's permissions
# cat prometheus-ClusterRoleBinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding # cluster role binding resource
metadata:
  name: prometheus # a name for this ClusterRoleBinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole # the kind being bound is a ClusterRole
  name: prometheus # the ClusterRole to bind
subjects:
- kind: ServiceAccount # the subject kind is an SA account
  name: prometheus # the SA account to bind
  namespace: monitoring
# kubectl apply -f prometheus-ClusterRoleBinding.yaml
# kubectl get ClusterRoleBinding prometheus
NAME (ClusterRoleBinding name)   ROLE (bound ClusterRole)   AGE
prometheus ClusterRole/prometheus 34s
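You can verify the binding took effect by impersonating the SA with kubectl auth can-i:
# kubectl auth can-i list nodes --as=system:serviceaccount:monitoring:prometheus
yes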
3. Retrieve the created token and save it to a file
# kubectl get secret -n monitoring
NAME TYPE DATA AGE
prometheus-token kubernetes.io/service-account-token 3 19m
# kubectl describe secret prometheus-token -n monitoring
Name: prometheus-token
Namespace: monitoring
Labels: <none>
Annotations: kubernetes.io/service-account.name: prometheus
kubernetes.io/service-account.uid: f38735e0-1b31-4573-bc78-2c8036591e7e
Type: kubernetes.io/service-account-token
Data
====
ca.crt: 1310 bytes
namespace: 10 bytes
token: eyJhbGciOiJSUzI1NiIsImtpZCI6ImUxTDlwaDdydDJHOHdCdERyVGVjbjZRc2VXSU9DaWVUUGR4YzZqaVkyOUkifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJtb25pdG9yaW5nIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6InByb21ldGhldXMtdG9rZW4iLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImYzODczNWUwLTFiMzEtNDU3My1iYzc4LTJjODAzNjU5MWU3ZSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDptb25pdG9yaW5nOnByb21ldGhldXMifQ.drWRtZWsYddF-abb1w4vxjW0Se9us2hZNNRNos4cgTF-9WktpuIg6kxkZjY06bUmersxn6tMC9fcHHbr_-ZzF8zcAwXSZIZn37ptexr7OZDp2uZaBtxRTtUy6KRK7826ZwZaElkApb9T6h2Kxwl4exuHK5NFNU0rSWRMN3LjfvFoQFNHxw2ISgA0zWcvwAm_OGG7M6fV7sRTLin5xmRoYVZhrcBDvTHk52tJAObJUJUBl5jo0TSAFZ_mJMOca-xA3L3Ylq_u2SbDjcSt7UUmN6XmmHjqzlNcLIVO1-S_9gz6R7C2bfQIrfpzUXY6raPDnW-9Er-Sm521gS0ZlWfLgg
# Save this token to /etc/prometheus/k8s_token
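Rather than copy-pasting from the describe output, the token can be extracted and decoded in one line (a sketch; run it wherever kubectl has cluster access, then copy the file to the Prometheus host):
# kubectl get secret prometheus-token -n monitoring -o jsonpath='{.data.token}' | base64 -d > /etc/prometheus/k8s_token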
4. Deploy node-exporter
• node-exporter collects server-level data from cluster nodes, such as CPU, memory, disk, and NIC traffic; its metrics URL is http://node-ip:9100/metrics
• node-exporter can also be deployed standalone on each server with Docker, but standalone deployment scales poorly: every new node needs a manual deployment plus a manual edit of the Prometheus server config. Deploying it as a DaemonSet in the k8s cluster, combined with Prometheus service discovery, is far more convenient.
# Copy the following content
# cat prometheus-node-exporter.yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: node-exporter
  namespace: monitoring
  annotations:
    prometheus.io/scrape: 'true' # used by prometheus auto-discovery
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
      - image: quay.io/prometheus/node-exporter:latest
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: node-exporter
      hostNetwork: true # use hostNetwork so the exporter is reachable directly via the node IP
      hostPID: true
      tolerations: # tolerate the control-plane taints so it also runs on master nodes
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "node-role.kubernetes.io/control-plane" # the taint used since k8s 1.24
        operator: "Exists"
        effect: "NoSchedule"
---
kind: Service
apiVersion: v1
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: node-exporter
    port: 9100
    protocol: TCP
  selector:
    app: node-exporter
# kubectl apply -f prometheus-node-exporter.yaml
# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
node-exporter-62cvk 1/1 Running 0 15s
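A quick sanity check once the pod is running; with hostNetwork the exporter answers on the node IP (replace node-ip with one of yours):
# curl -s http://node-ip:9100/metrics | head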
5. Deploy kube-state-metrics
My k8s is 1.25, so I use the kube-state-metrics v2.6.0 image.
# cat prometheus-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: registry.cn-hangzhou.aliyuncs.com/zhangshijie/kube-state-metrics:v2.6.0
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"] # daemonsets/deployments/replicasets moved here from "extensions" in k8s 1.16+
  resources: ["daemonsets", "deployments", "replicasets", "statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
spec:
  type: NodePort
  ports:
  - name: kube-state-metrics
    port: 8080
    targetPort: 8080
    nodePort: 31666
    protocol: TCP
  selector:
    app: kube-state-metrics
# kubectl apply -f prometheus-state-metrics.yaml
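Then verify it is serving metrics through the NodePort defined above:
# kubectl get pod -n monitoring | grep kube-state-metrics
# curl -s http://192.168.1.15:31666/metrics | head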
6. Configure cAdvisor
Pod metrics are collected by cAdvisor, which is built into the kubelet: when the kubelet starts, cAdvisor starts with it, and each cAdvisor instance monitors only its own node. cAdvisor automatically discovers every container on its node and collects CPU, memory, filesystem, and network usage statistics; through the node's root container it also collects and analyzes overall node usage.
The kubelet itself also exposes metrics, so pod monitoring data comes from both the kubelet and cAdvisor. The respective URLs are below; they cannot be accessed directly without credentials and will return a 401:
https://192.168.10.160:6443/api/v1/nodes/node-name:10250/proxy/metrics
https://192.168.10.160:6443/api/v1/nodes/node-name:10250/proxy/metrics/cadvisor
Since the kubelet is always present, these endpoints can be used directly; no extra configuration is needed.
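To inspect these endpoints by hand, pass the ServiceAccount token created earlier as a bearer token (a sketch; substitute a real node name from kubectl get nodes):
# TOKEN=$(cat /etc/prometheus/k8s_token)
# curl -sk -H "Authorization: Bearer $TOKEN" https://192.168.10.160:6443/api/v1/nodes/node-name:10250/proxy/metrics/cadvisor | head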
7. Configure prometheus.yml
# cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.1.15:9093"]
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# Alerting rule files for alertmanager
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/etc/prometheus/node_rules.yml"
  - "/etc/prometheus/pod_rules.yml"
  - "/etc/prometheus/container_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.1.15:9090"] # 1.15 is the k8s host, which is also the prometheus host
  # If you also need to monitor containers, configure a cadvisor job
  # container
  #- job_name: "cadvisor"
  #  static_configs:
  #    - targets: ["192.168.1.15:18104"]
  #      labels:
  #        rule: cadvisor
  # kube-state-metrics
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["192.168.1.15:31666"]
  # API server node metrics collection (the relabeling below rewrites the target to node-exporter on :9100 over http)
  - job_name: 'kubernetes-apiservers-monitor'
    kubernetes_sd_configs:
    - role: endpoints
      api_server: https://192.168.1.15:6443
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /etc/prometheus/k8s_token # the k8s_token file generated earlier, saved under /etc/prometheus
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /etc/prometheus/k8s_token
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https
    - source_labels: [__address__]
      regex: '(.*):6443'
      replacement: '${1}:9100'
      target_label: __address__
      action: replace
    - source_labels: [__scheme__]
      regex: https
      replacement: http
      target_label: __scheme__
      action: replace
  # Node metrics collection
  - job_name: 'kubernetes-nodes-monitor'
    scheme: http
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /etc/prometheus/k8s_token
    kubernetes_sd_configs:
    - role: node
      api_server: https://192.168.1.15:6443
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /etc/prometheus/k8s_token
    relabel_configs:
    - source_labels: [__address__]
      regex: '(.*):10250'
      replacement: '${1}:9100'
      target_label: __address__
      action: replace
    - source_labels: [__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_region]
      regex: '(.*)'
      replacement: '${1}'
      action: replace
      target_label: LOC
    - source_labels: [__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_region]
      regex: '(.*)'
      replacement: 'NODE'
      action: replace
      target_label: Type
    - source_labels: [__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_region]
      regex: '(.*)'
      replacement: 'K8S-test'
      action: replace
      target_label: Env
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
  # Pod metrics collection - can be copied as-is
  # kubelet
  - job_name: "kube-node-kubelet"
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /etc/prometheus/k8s_token
    kubernetes_sd_configs:
    - role: node
      api_server: "https://192.168.1.15:6443"
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /etc/prometheus/k8s_token
    relabel_configs:
    - target_label: __address__
      # replace the default __address__ with the replacement value
      replacement: 192.168.1.15:6443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      # replace the default __metrics_path__ with the replacement value
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}:10250/proxy/metrics
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: service_name
  # Container metrics collection - can be copied as-is
  # cadvisor
  - job_name: "kube-pod-cadvisor"
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /etc/prometheus/k8s_token
    kubernetes_sd_configs:
    - role: node
      api_server: "https://192.168.1.15:6443"
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /etc/prometheus/k8s_token
    relabel_configs:
    - target_label: __address__
      # replace the default __address__ with the replacement value
      replacement: 192.168.1.15:6443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      # replace the default __metrics_path__ with the replacement value
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}:10250/proxy/metrics/cadvisor
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: service_name
# Restart Prometheus to load the new configuration, then open the web UI
# docker restart prome
http://nodeip:9090
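You can validate the configuration with promtool before restarting, and check target health afterwards via the HTTP API; a sketch, assuming promtool is available on the Prometheus host:
# promtool check config /etc/prometheus/prometheus.yml
# curl -s http://nodeip:9090/api/v1/targets | grep -o '"health":"[a-z]*"' | sort | uniq -c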
8. Configure alerting rules
(1) Node alerts
# cat node_rules.yml
groups:
- name: node_alert-rule1 # first rule group
  rules:
  - alert: node_cpu_alert # alert name
    expr: 100 - avg(irate(node_cpu_seconds_total{Type="NODE",mode="idle"}[1m])) by (instance) * 100 > 75 # alert expression
    for: 1m # alert fires after the condition holds for 1 minute
    labels:
      level: warning # severity
      system: node # custom label matching a sub-route in alertmanager.yml
    annotations:
      description: "node ip: {{ $labels.instance }} value: {{ $value }}" # alert body
      summary: "node CPU usage too high!" # summary
(2) Pod alerts
# cat pod_rules.yml
groups:
- name: pod_alert-rule1 # first rule group
  rules:
  - alert: pod_cpu_alert # alert name
    expr: sum(irate(container_cpu_usage_seconds_total{image!="",namespace!=""}[1m])) without (cpu) * 100 > 75 # alert expression
    for: 1m # alert fires after the condition holds for 1 minute
    labels:
      level: warning # severity
      system: pod # custom label matching a sub-route in alertmanager.yml
    annotations:
      description: "pod: {{ $labels.pod }} value: {{ $value }}" # alert body
      summary: "pod CPU usage too high!" # summary
(3) Container alerts
# cat container_rules.yml
groups:
- name: container_alert-rule1 # first rule group
  rules:
  - alert: container_cpu_alert # alert name
    expr: sum(irate(container_cpu_usage_seconds_total{image!="",name=~"[\u4e00-\u9fa5_a-zA-Z0-9_]{4,10}"}[1m])) without (cpu) * 100 > 75 # alert expression
    for: 1m # alert fires after the condition holds for 1 minute
    labels:
      level: warning # severity
      system: container # custom label matching a sub-route in alertmanager.yml
    annotations:
      description: "container: {{ $labels.name }} value: {{ $value }}" # alert body
      summary: "container CPU usage too high!" # summary