https://awesome-prometheus-alerts.grep.to/
https://blog.csdn.net/weixin_43798031/article/details/127488164
Pull (active scraping)
The client installs one of the many existing exporters on the system (an exporter is an http_server that answers HTTP requests with key/value metric data) to collect data.
Push (passive pushing)
The client (the monitored host) installs the official pushgateway component; ops-written scripts organize the monitoring data into key/value pairs and send it in the metrics exposition format to the pushgateway, which Prometheus then scrapes.
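A minimal sketch of the push path (the metric name and pushgateway address below are made-up examples; the curl call is shown commented out because it needs a running pushgateway):

```shell
# Build one metric in the key/value exposition format the pushgateway expects.
metric="demo_boot_seconds"   # hypothetical metric name
value=42
payload="$metric $value"
echo "$payload"
# Push it (assumes a pushgateway listening on 127.0.0.1:9091):
# echo "$payload" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/demo
```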
Components in the K8S ecosystem all expose a /metrics endpoint for self-monitoring. The ones we are using:
cadvisor: integrated into the kubelet.
kubelet: port 10255 is unauthenticated, 10250 is authenticated.
apiserver: port 6443; watch request counts, latency, etc.
scheduler: port 10251.
controller-manager: port 10252.
etcd: e.g. etcd read/write latency, storage capacity.
docker: requires enabling the experimental feature and configuring metrics-addr; exposes metrics such as container creation time.
kube-proxy: listens on 127.0.0.1 by default, port 10249. For external scraping, change it to listen on 0.0.0.0; exposes metrics such as time spent writing iptables rules.
kube-state-metrics: official K8S project; collects metadata for pods, deployments, and other resources.
node-exporter: official Prometheus project; collects machine metrics such as CPU, memory, and disk.
blackbox_exporter: official Prometheus project; network probing (dns, ping, http monitoring).
process-exporter: collects per-process metrics.
nvidia exporter: we run GPU workloads and need GPU metrics.
node-problem-detector: a.k.a. npd; strictly speaking not an exporter, but it also monitors machine state and reports node problems by applying taints.
Application-level exporters: mysql, nginx, mq, etc., as business needs dictate.
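As a sketch, a static scrape config for a couple of the exporters above might look like this (the target address is an example, not one of our real hosts; ports are the exporters' defaults):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['172.16.11.198:9100']   # node-exporter default port
  - job_name: process
    static_configs:
      - targets: ['172.16.11.198:9256']   # process-exporter default port
```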
The four golden signals: latency, traffic, errors, saturation.
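In PromQL terms, the four signals map roughly onto queries like these (a hypothetical recording-rules sketch; it assumes node-exporter plus generic http_requests_total / http_request_duration_seconds metrics, which will differ per exporter):

```yaml
groups:
  - name: golden-signals
    rules:
      - record: job:request_latency_seconds:avg   # latency
        expr: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
      - record: job:requests:rate5m               # traffic
        expr: rate(http_requests_total[5m])
      - record: job:request_errors:rate5m         # errors
        expr: rate(http_requests_total{code=~"5.."}[5m])
      - record: instance:cpu_saturation:avg       # saturation
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```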
Download: https://prometheus.io/download/
Startup parameters: see below.
Check the config file:
./promtool check config ./prometheus.yml
Start:
./prometheus --config.file=./prometheus.yml --web.listen-address="0.0.0.0:9090" --web.enable-lifecycle --log.level=warn --web.enable-admin-api --storage.tsdb.wal-compression --storage.tsdb.path=./data --storage.tsdb.retention.time=15d --web.read-timeout=5m --web.max-connections=512
Parameter details:
--storage.tsdb.path=/prometheus # base path for metric (data) storage
--web.enable-lifecycle # allow reloading the configuration via HTTP request
--web.listen-address="0.0.0.0:9090"
--web.read-timeout=5m # timeout for idle connections, prevents too many idle connections from holding resources
--web.max-connections=512 # maximum number of connections
--web.external-url=<URL> # externally reachable URL for Prometheus, e.g. behind a reverse proxy
--web.cors.origin=".*"
--log.level=warn
--web.enable-admin-api # admin API, for deleting/cleaning data
--storage.tsdb.retention.time=15d # how long to retain data, default 15 days
--storage.tsdb.wal-compression # compress the TSDB WAL
--rules.alert.for-grace-period=10m # minimum duration between an alert and its restored "for" state
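The two enable flags above expose HTTP endpoints; a quick sketch of how they are used (host and port are examples, and the curl calls are commented out because they need a running Prometheus):

```shell
# Endpoints unlocked by --web.enable-lifecycle and --web.enable-admin-api.
prom="http://127.0.0.1:9090"
reload_url="$prom/-/reload"
delete_url="$prom/api/v1/admin/tsdb/delete_series"
echo "$reload_url"
# curl -X POST "$reload_url"                             # hot-reload prometheus.yml
# curl -X POST "$delete_url?match[]=up{job=\"node\"}"    # drop a series via the admin API
```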
version: '3.8'
services:
prom:
container_name: prometheus
image: prom/prometheus:v2.47.1
restart: on-failure:5
hostname: prometheus
ports:
- 9090:9090
command:
- "--config.file=/etc/prometheus/prometheus.yml"
# for login authentication
- "--web.config.file=/etc/prometheus/config.yml"
- "--web.enable-lifecycle"
- "--log.level=warn"
- "--web.enable-admin-api"
- "--web.console.libraries=/etc/prometheus/console_libraries"
- "--web.console.templates=/etc/prometheus/consoles"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=7d"
environment:
TZ: Asia/Shanghai
volumes:
- /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime
- ./config/prometheus.yml:/etc/prometheus/prometheus.yml
# for login authentication: htpasswd -nBC 12 '' | tr -d ':\n'
#cat > ./config.yml<<eof
#basic_auth_users:
# the username set here is admin; multiple users may be configured
# admin: $2y$12$BKFmICzKaeqjDCJOK/y9e./NcFso6XN10txEKwtzpguI3G.AvSwgS
- ./config/config.yml:/etc/prometheus/config.yml
- ./data:/prometheus
- ./rules:/rules
- ./config/targets:/etc/prometheus/targets
dingtalk-webhook:
container_name: dingtalk-webhook
image: zhangyudd/webhook-dingtalk:v6
restart: on-failure:5
hostname: dingtalk-webhook
ports:
- 9091:8080
command:
- "https://oapi.dingtalk.com/robot/send?access_token=18ab9ca9251d577c770b771c79f6166a95f069d4299f1c03743f878974025c83"
environment:
TZ: Asia/Shanghai
alert:
container_name: alert
image: prom/alertmanager:v0.26.0
restart: on-failure:5
hostname: alertmanager
ports:
- 9092:9093
- 9093:9094
environment:
TZ: Asia/Shanghai
command: --config.file=/etc/alertmanager/alertmanager.yml --log.level=debug
volumes:
- /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime
- ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
cadvisor:
image: zhangyudd/google_containers:cadvisor_v0.47.1
container_name: monitoring_cadvisor
restart: unless-stopped
privileged: true
environment:
- TZ=Asia/Shanghai
command: --http_auth_file /cadvisor.htpasswd --http_auth_realm localhost
volumes:
- /:/rootfs:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- ./config/cadvisor.htpasswd:/cadvisor.htpasswd
ports:
- 9094:8080
networks:
default:
external:
name: my_net
yum install httpd-tools -y
htpasswd -nBC 12 '' | tr -d ':\n'
$2y$12$wK39XcvCNgU89Cr/z3SYxupyJUIzKAHP5VAzTcnrOWU04t83PkDG6
vi web-config.yml
basic_auth_users:
admin: $2y$12$wK39XcvCNgU89Cr/z3SYxupyJUIzKAHP5VAzTcnrOWU04t83PkDG6
Add the startup parameter:
- "--web.config.file=/etc/prometheus/config.yml"
web-config.yaml (enables authentication)
basic_auth_users:
admin: $2y$12$wK39XcvCNgU89Cr/z3SYxupyJUIzKAHP5VAzTcnrOWU04t83PkDG6
node_export_start.sh
#!/bin/bash
basePath="/usr/local"
softVersion="node_exporter"
nohup $basePath/$softVersion/node_exporter --web.config.file="$basePath/$softVersion/web-config.yaml" > $basePath/$softVersion/node_exporter.log 2>&1 &
pushgateway uses a passive push model and runs standalone on any node.
Pros and cons
The main reasons to use it:
Prometheus uses the pull model; targets in a different subnet or behind a firewall may be unreachable for direct scraping.
When monitoring business data, metrics from different sources need to be aggregated and collected by Prometheus in one place.
Drawbacks:
- Data from many nodes is funneled into the pushgateway; if it goes down, the impact is larger than losing a single target.
- Prometheus's up status only reflects the pushgateway itself, not each individual node behind it.
- The pushgateway keeps all metrics ever pushed to it, so even after a target goes offline Prometheus keeps scraping the stale data; unwanted data must be cleaned up manually:
- curl -X DELETE http://127.0.0.1:9091/metrics/job/Ping_check/instance/y (delete a single instance)
- curl -X DELETE http://172.16.11.198:9091/metrics/job/Ping_check (delete a job group)
- curl -X PUT http://172.16.11.198:9091/api/v1/admin/wipe (delete everything)
Download from prometheus.io
Start: ./pushgateway --web.enable-admin-api
Prometheus configuration:
- job_name: 'pushgateway'
static_configs:
- targets:
- 127.0.0.1:9091
labels:
instance: pushgateway
#!/bin/bash
instance_name=$(hostname -f | cut -d '.' -f1) # local hostname, used later as a label
if [ "$instance_name" == "localhost" ];then # the hostname must not be localhost, otherwise the label cannot tell hosts apart
echo "Must FQDN hostname"
exit 1
fi
label="count_netstat_wait_connections" # define a new key
count_netstat_wait_connections=$(netstat -an | grep -i wait | wc -l) # define a value
# push a single metric
echo "$label $count_netstat_wait_connections"
echo "$label $count_netstat_wait_connections" | curl --data-binary @- http://172.16.11.198:9091/metrics/job/pushgateway1/instance/$instance_name
# custom labels and pushing multiple metrics
cat <<EOF | curl --data-binary @- http://172.16.11.198:9091/metrics/job/pushgateway1/instance/$instance_name
# TYPE test_metrics counter
test_metrics{label="app1",name="demo"} 100.00
# TYPE another_test_metrics gauge
# HELP another_test_metrics Just an example.
another_test_metrics 123.45
EOF
# /metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>} # push URL format
# Latency
timeout 5 ping -q -A -s 500 -W 1000 -c 10 127.0.0.1 | grep rtt | awk -F '/' '{print $5}'
-s packet size (default 64)
-W wait timeout
-c number of packets to send
# Packet loss
timeout 5 ping -q -A -s 500 -W 1000 -c 10 127.0.0.1 | grep transmitted | awk '{print $6}'
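The awk pipelines above can be checked against a canned ping summary, without any network access (the sample below mimics a typical Linux `ping -q` summary):

```shell
# Canned ping -q output: one "transmitted" line, one "rtt" line.
sample='10 packets transmitted, 10 received, 0% packet loss, time 9012ms
rtt min/avg/max/mdev = 0.031/0.045/0.060/0.008 ms'
# Same field extraction as the commands above:
loss=$(echo "$sample" | grep transmitted | awk '{print $6}')   # 6th whitespace field
avg=$(echo "$sample" | grep rtt | awk -F '/' '{print $5}')     # 5th '/'-separated field
echo "$loss $avg"
```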
#!/bin/bash
# Trap error/exit signals and clean up gracefully
trap 'rm -f tmp_push_prometheus.txt && curl -X PUT http://$pushgateway_addr/api/v1/admin/wipe && exit' ERR EXIT SIGINT
# Metrics
ping_lost_package="ping_lost_package" # label for packet loss
ping_rtt_avg="ping_rtt_avg" # label for latency
pushgateway_addr="172.16.11.198:9091"
# A target went offline: delete its metrics
del_value() {
curl -X DELETE http://$pushgateway_addr/metrics/job/Ping_check/instance/$1
}
# Push metrics to the pushgateway
push_prometheus(){
curl -X POST --data-binary @tmp_push_prometheus.txt http://$pushgateway_addr/metrics/job/Ping_check/instance/$1
}
# Collect metrics and upload them to the pushgateway
func() {
x=0 # counter (pairs the two arrays element by element)
for ip in ${ips[*]}; do
for label in ${ipLabels[*]}; do
echo $ip ${ipLabels[$x]}
ping_lost_package=$(timeout 5 ping -q -A -s 500 -W 1000 -c 10 $ip | grep transmitted | awk '{print $6}' | awk -F '%' '{print $1}')
ping_rtt_avg=$(timeout 5 ping -q -A -s 500 -W 1000 -c 10 $ip | grep rtt | awk -F '/' '{print $5}')
# if the value is empty, clean up the pushgateway data
if [ -z "$ping_lost_package" ]; then
del_value "$1"
break
fi
if [ -z "$ping_rtt_avg" ]; then
del_value "$1"
break
fi
echo "ping_lost_package{label=\"${ipLabels[$x]}\",env=\"pro\"} $ping_lost_package" >>tmp_push_prometheus.txt
echo "ping_rtt_avg{label=\"${ipLabels[$x]}\",env=\"pro\"} $ping_rtt_avg" >>tmp_push_prometheus.txt
break
done
x=$(expr $x + 1)
done
# Push collected metrics to the pushgateway
if [ -f tmp_push_prometheus.txt ]; then
push_prometheus "$1"
# remove the temporary file
rm -f tmp_push_prometheus.txt
else
echo "temporary file was not generated"
fi
# Delete historical data so stale values do not cause false readings
del_value "$1"
}
while true;do
ips=(172.16.11.198 127.0.0.1 baidu.com)
ipLabels=(test localhost baidu)
func x
ips=(102.932.115.447)
ipLabels=(law)
func y
sleep 15
done
docker run --name smokeping -d --rm -p 8888:80 -e PUID=1000 -e PGID=1000 -v /data/smokeping/data:/data/ -v /data/smokeping/config:/config -e TZ=Asia/Shanghai linuxserver/smokeping
cat config/Database
step = 60
pings = 60
cat config/Targets
*** Targets ***
probe = FPing
menu = Top
title = Network Latency Grapher
remark = Welcome to the SmokePing website of WORKS Company. \
Here you will learn all about the latency of our network.
+ targets
menu = Targets
++ baiduURL
menu = baidu URL
title = baidu URL server
host = www.baidu.com
++ test
menu = test
title = test
host = 172.16.11.198
Restart the smokeping container and check its logs to make sure it is running normally.
Fetch from the command line (not fully figured out yet):
rrdtool fetch ./data/targets/test.rrd AVERAGE
Fetch with a Python 3 script (the metric-push part has not been written yet)
collection_to_prometheus.py
#coding:utf-8
import rrdtool
import os
paras = {
'prometheus_gateway' : 'http://192.168.56.101:9091' ,
'data_dir' : '/etc/smokeping/data/' # rrdtool data directory; its subdirectories hold the rrd files, e.g. /etc/smokeping/data/targets/*.rrd
}
# fetch metrics via rrdtool
def getMonitorData(rrd_file):
rrd_info = rrdtool.info(rrd_file)
last_update = rrd_info['last_update'] - 60
args = '-s ' + str(last_update)
results = rrdtool.fetch(rrd_file , 'AVERAGE' , args )
lost_package_num = results[2][0][1]
average_rrt = 0 if not results[2][0][2] else results[2][0][2] * 1000
return lost_package_num , round(average_rrt , 4)
if __name__ == '__main__':
ISP_list = ['targets']
for ISP in ISP_list:
rrd_data_dir = os.path.join(paras['data_dir'] , ISP)
for filename in os.listdir(rrd_data_dir):
(instance , postfix) = os.path.splitext(filename)
if postfix == '.rrd' :
(lost_package_num , rrt) = getMonitorData(os.path.join(paras['data_dir'] , ISP , filename))
print(rrd_data_dir,instance,rrt,lost_package_num)
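Since the push step is not written yet, one possible shell-side sketch: turn the script's print output ("dir instance rtt lost") into exposition format for the pushgateway. The metric names here are made up, and the curl is commented out because it needs a live pushgateway:

```shell
# One line of the Python script's output, as an example:
line='/etc/smokeping/data/targets test 0.045 0'
set -- $line   # $1=dir $2=instance $3=avg rtt (ms) $4=lost packets
out=$(printf 'smokeping_rtt_avg_ms{instance="%s"} %s\nsmokeping_lost_packages{instance="%s"} %s\n' "$2" "$3" "$2" "$4")
echo "$out"
# echo "$out" | curl --data-binary @- http://192.168.56.101:9091/metrics/job/smokeping
```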
Reference: https://www.jianshu.com/p/fc5624a30580
The k8s cluster needs the metrics-server and kube-state-metrics services installed.
cat prom.rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups:
- ""
resources:
- nodes
- services
- endpoints
- pods
- nodes/proxy
verbs:
- get
- list
- watch
- apiGroups:
- "extensions"
resources:
- ingresses
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- configmaps
- nodes/metrics
verbs:
- get
- nonResourceURLs:
- /metrics
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: kube-system
kubectl get sa prometheus -n kube-system -o yaml
kubectl describe secret prometheus-token-wj7fb -n kube-system
Save the token into a file: k8s.token
- job_name: k8s-cadvisor
honor_timestamps: true
metrics_path: /metrics
scheme: https
kubernetes_sd_configs: # kubernetes auto-discovery
- api_server: https://172.16.11.198:6443 # apiserver address
role: node # node-type auto-discovery
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- separator: ;
regex: (.*)
target_label: __address__
replacement: 172.16.11.198:6443
action: replace
- source_labels: [__meta_kubernetes_node_name]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
action: replace
Configuration must be added on the scraped pods themselves to enable auto-discovery.
- job_name: kubernetes-pods
honor_timestamps: true
metrics_path: /metrics
scheme: https
kubernetes_sd_configs:
- api_server: https://172.16.11.198:6443
role: pod
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
separator: ;
regex: "true"
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: $1
action: replace
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
separator: ;
regex: ([^:]+)(?::\d+)?;(\d+)
target_label: __address__
replacement: $1:$2
action: replace
- separator: ;
regex: __meta_kubernetes_pod_label_(.+)
replacement: $1
action: labelmap
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_name]
separator: ;
regex: (.*)
target_label: pod_name
replacement: $1
action: replace
YAML configuration of the scraped pods
Auto-collection for a pod requires extra metadata in its deployment manifest.
Under template->metadata->annotations in the deployment, add:
prometheus.io/scrape: "true"
prometheus.io/port: "8081" # pod port
prometheus.io/path: "/actuator/prometheus"
Prometheus looks for pods annotated with prometheus.io/scrape=true; when the annotation is present, their metrics are scraped automatically.
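For example, the annotations sit in the pod template like this (the deployment name and port are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8081"
        prometheus.io/path: "/actuator/prometheus"
```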
- job_name: kubelet
honor_timestamps: true
metrics_path: /metrics
scheme: https
kubernetes_sd_configs: # kubernetes auto-discovery
- api_server: https://172.16.11.198:6443 # apiserver address
role: node # node-type auto-discovery
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
relabel_configs:
- action: replace
target_label: __address__
replacement: 172.16.11.198:6443
- action: replace
source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
Reference: https://www.cnblogs.com/zyyang1993/p/16621158.html
- job_name: 'prometheus'
file_sd_configs:
- files:
- targets/prometheus*.yaml
refresh_interval: 2m # re-read the targets defined in the files every 2 minutes (default 5 minutes)
[root@localhost]# cat prometheus_target.yaml
- targets:
- 172.1.1.1:9100
- 172.1.1.2:9100
labels:
app: node-exporter
job: node
DNS-based service discovery periodically queries a set of DNS names.
Prometheus automatic service discovery via consul (link)
Download: https://www.consul.io/downloads
Install: unzip consul_1.9.2_linux_amd64.zip -d /usr/local/bin
Start in dev mode: mkdir -pv /consul/data
consul agent -dev -ui -data-dir=/consul/data -config-dir=/etc/consul/ -client=0.0.0.0
(Production should use server mode.)
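Once consul is running, a service can be registered over its HTTP API and picked up via consul_sd_configs. Both the registration call and the job below are sketches (addresses and the service name are examples):

```yaml
# Register a service first, e.g.:
#   curl -X PUT -d '{"name":"node-exporter","address":"172.16.11.198","port":9100}' \
#        http://127.0.0.1:8500/v1/agent/service/register
- job_name: consul-sd
  consul_sd_configs:
    - server: 127.0.0.1:8500
  relabel_configs:
    - source_labels: [__meta_consul_service]
      regex: node-exporter
      action: keep
```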