https://awesome-prometheus-alerts.grep.to/
https://blog.csdn.net/weixin_43798031/article/details/127488164
Pull (active scraping)
The client installs one of the many existing exporters on the system (an exporter is an http_server that answers HTTP requests with key/value metric data) to collect data.
Push (passive pushing)
The client (the monitored host) installs the official pushgateway component; ops-written scripts organize the monitoring data into key/value pairs and send it in the metrics exposition format to the pushgateway, which Prometheus then scrapes.
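A minimal sketch of the push path (the metric name and pushgateway address below are made-up examples; the curl call is shown commented out because it needs a running pushgateway):

```shell
# Build one metric in the key/value exposition format the pushgateway expects.
metric="demo_boot_seconds"   # hypothetical metric name
value=42
payload="$metric $value"
echo "$payload"
# Push it (assumes a pushgateway listening on 127.0.0.1:9091):
# echo "$payload" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/demo
```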
Components in the K8S ecosystem all expose a /metrics endpoint for self-monitoring. The ones we are using:
cadvisor: integrated into the kubelet.
kubelet: port 10255 is unauthenticated, 10250 is authenticated.
apiserver: port 6443; watch request counts, latency, etc.
scheduler: port 10251.
controller-manager: port 10252.
etcd: e.g. etcd read/write latency, storage capacity.
docker: requires enabling the experimental feature and configuring metrics-addr; exposes metrics such as container creation time.
kube-proxy: listens on 127.0.0.1 by default, port 10249. For external scraping, change it to listen on 0.0.0.0; exposes metrics such as time spent writing iptables rules.
kube-state-metrics: official K8S project; collects metadata for pods, deployments, and other resources.
node-exporter: official Prometheus project; collects machine metrics such as CPU, memory, and disk.
blackbox_exporter: official Prometheus project; network probing (dns, ping, http monitoring).
process-exporter: collects per-process metrics.
nvidia exporter: we run GPU workloads and need GPU metrics.
node-problem-detector: a.k.a. npd; strictly speaking not an exporter, but it also monitors machine state and reports node problems by applying taints.
Application-level exporters: mysql, nginx, mq, etc., as business needs dictate.
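As a sketch, a static scrape config for a couple of the exporters above might look like this (the target address is an example, not one of our real hosts; ports are the exporters' defaults):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['172.16.11.198:9100']   # node-exporter default port
  - job_name: process
    static_configs:
      - targets: ['172.16.11.198:9256']   # process-exporter default port
```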
The four golden signals: latency, traffic, errors, saturation.
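In PromQL terms, the four signals map roughly onto queries like these (a hypothetical recording-rules sketch; it assumes node-exporter plus generic http_requests_total / http_request_duration_seconds metrics, which will differ per exporter):

```yaml
groups:
  - name: golden-signals
    rules:
      - record: job:request_latency_seconds:avg   # latency
        expr: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
      - record: job:requests:rate5m               # traffic
        expr: rate(http_requests_total[5m])
      - record: job:request_errors:rate5m         # errors
        expr: rate(http_requests_total{code=~"5.."}[5m])
      - record: instance:cpu_saturation:avg       # saturation
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```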
Download: https://prometheus.io/download/
Startup parameters: see below.
Check the config file:
./promtool check config ./prometheus.yml
Start:
./prometheus --config.file=./prometheus.yml --web.listen-address="0.0.0.0:9090" --web.enable-lifecycle --log.level=warn --web.enable-admin-api --storage.tsdb.wal-compression --storage.tsdb.path=./data --storage.tsdb.retention.time=15d --web.read-timeout=5m --web.max-connections=512
Parameter details:
--storage.tsdb.path=/prometheus # base path for metric (data) storage
--web.enable-lifecycle # allow reloading the configuration via HTTP request
--web.listen-address="0.0.0.0:9090"
--web.read-timeout=5m # timeout for idle connections, prevents too many idle connections from holding resources
--web.max-connections=512 # maximum number of connections
--web.external-url=<URL> # externally reachable URL for Prometheus, e.g. behind a reverse proxy
--web.cors.origin=".*"
--log.level=warn
--web.enable-admin-api # admin API, for deleting/cleaning data
--storage.tsdb.retention.time=15d # how long to retain data, default 15 days
--storage.tsdb.wal-compression # compress the TSDB WAL
--rules.alert.for-grace-period=10m # minimum duration between an alert and its restored "for" state
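The two enable flags above expose HTTP endpoints; a quick sketch of how they are used (host and port are examples, and the curl calls are commented out because they need a running Prometheus):

```shell
# Endpoints unlocked by --web.enable-lifecycle and --web.enable-admin-api.
prom="http://127.0.0.1:9090"
reload_url="$prom/-/reload"
delete_url="$prom/api/v1/admin/tsdb/delete_series"
echo "$reload_url"
# curl -X POST "$reload_url"                             # hot-reload prometheus.yml
# curl -X POST "$delete_url?match[]=up{job=\"node\"}"    # drop a series via the admin API
```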
version: '3.8'
services:
prom:
container_name: prometheus
image: prom/prometheus:v2.47.1
restart: on-failure:5
hostname: prometheus
ports:
- 9090:9090
command:
- "--config.file=/etc/prometheus/prometheus.yml"
# for login authentication
- "--web.config.file=/etc/prometheus/config.yml"
- "--web.enable-lifecycle"
- "--log.level=warn"
- "--web.enable-admin-api"
- "--web.console.libraries=/etc/prometheus/console_libraries"
- "--web.console.templates=/etc/prometheus/consoles"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=7d"
environment:
TZ: Asia/Shanghai
volumes:
- /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime
- ./config/prometheus.yml:/etc/prometheus/prometheus.yml
# for login authentication: htpasswd -nBC 12 '' | tr -d ':\n'
#cat > ./config.yml<<eof
#basic_auth_users:
# the username set here is admin; multiple users may be configured
# admin: $2y$12$BKFmICzKaeqjDCJOK/y9e./NcFso6XN10txEKwtzpguI3G.AvSwgS
- ./config/config.yml:/etc/prometheus/config.yml
- ./data:/prometheus
- ./rules:/rules
- ./config/targets:/etc/prometheus/targets
dingtalk-webhook:
container_name: dingtalk-webhook
image: zhangyudd/webhook-dingtalk:v6
restart: on-failure:5
hostname: dingtalk-webhook
ports:
- 9091:8080
command:
- "https://oapi.dingtalk.com/robot/send?access_token=18ab9ca9251d577c770b771c79f6166a95f069d4299f1c03743f878974025c83"
environment:
TZ: Asia/Shanghai
alert:
container_name: alert
image: prom/alertmanager:v0.26.0
restart: on-failure:5
hostname: alertmanager
ports:
- 9092:9093
- 9093:9094
environment:
TZ: Asia/Shanghai
command: --config.file=/etc/alertmanager/alertmanager.yml --log.level=debug
volumes:
- /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime
- ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
cadvisor:
image: zhangyudd/google_containers:cadvisor_v0.47.1
container_name: monitoring_cadvisor
restart: unless-stopped
privileged: true
environment:
- TZ=Asia/Shanghai
command: --http_auth_file /cadvisor.htpasswd --http_auth_realm localhost
volumes:
- /:/rootfs:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- ./config/cadvisor.htpasswd:/cadvisor.htpasswd
ports:
- 9094:8080
networks:
default:
external:
name: my_net
yum install httpd-tools -y
htpasswd -nBC 12 '' | tr -d ':\n'
$2y$12$wK39XcvCNgU89Cr/z3SYxupyJUIzKAHP5VAzTcnrOWU04t83PkDG6
vi web-config.yml
basic_auth_users:
admin: $2y$12$wK39XcvCNgU89Cr/z3SYxupyJUIzKAHP5VAzTcnrOWU04t83PkDG6
Add the startup parameter:
- "--web.config.file=/etc/prometheus/config.yml"
web-config.yaml (enables authentication)
basic_auth_users:
admin: $2y$12$wK39XcvCNgU89Cr/z3SYxupyJUIzKAHP5VAzTcnrOWU04t83PkDG6
node_export_start.sh
#!/bin/bash
basePath="/usr/local"
softVersion="node_exporter"
nohup $basePath/$softVersion/node_exporter --web.config.file="$basePath/$softVersion/web-config.yaml" > $basePath/$softVersion/node_exporter.log 2>&1 &
pushgateway uses a passive push model and runs standalone on any node.
Pros and cons
The main reasons to use it:
Prometheus uses the pull model; targets in a different subnet or behind a firewall may be unreachable for direct scraping.
When monitoring business data, metrics from different sources need to be aggregated and collected by Prometheus in one place.
Drawbacks:
- Data from many nodes is funneled into the pushgateway; if it goes down, the impact is larger than losing a single target.
- Prometheus's up status only reflects the pushgateway itself, not each individual node behind it.
- The pushgateway keeps all metrics ever pushed to it, so even after a target goes offline Prometheus keeps scraping the stale data; unwanted data must be cleaned up manually:
- curl -X DELETE http://127.0.0.1:9091/metrics/job/Ping_check/instance/y (delete a single instance)
- curl -X DELETE http://172.16.11.198:9091/metrics/job/Ping_check (delete a job group)
- curl -X PUT http://172.16.11.198:9091/api/v1/admin/wipe (delete everything)
Download from prometheus.io
Start: ./pushgateway --web.enable-admin-api
Prometheus configuration:
- job_name: 'pushgateway'
static_configs:
- targets:
- 127.0.0.1:9091
labels:
instance: pushgateway
#!/bin/bash
instance_name=$(hostname -f | cut -d '.' -f1) # local hostname, used later as a label
if [ "$instance_name" == "localhost" ];then # the hostname must not be localhost, otherwise the label cannot tell hosts apart
echo "Must FQDN hostname"
exit 1
fi
label="count_netstat_wait_connections" # define a new key
count_netstat_wait_connections=$(netstat -an | grep -i wait | wc -l) # define a value
# push a single metric
echo "$label $count_netstat_wait_connections"
echo "$label $count_netstat_wait_connections" | curl --data-binary @- http://172.16.11.198:9091/metrics/job/pushgateway1/instance/$instance_name
# custom labels and pushing multiple metrics
cat <<EOF | curl --data-binary @- http://172.16.11.198:9091/metrics/job/pushgateway1/instance/$instance_name
# TYPE test_metrics counter
test_metrics{label="app1",name="demo"} 100.00
# TYPE another_test_metrics gauge
# HELP another_test_metrics Just an example.
another_test_metrics 123.45
EOF
# /metrics/job/<JOB_NAME>{/<LABEL_NAME>/<LABEL_VALUE>} # push URL format
# Latency
timeout 5 ping -q -A -s 500 -W 1000 -c 10 127.0.0.1 | grep rtt | awk -F '/' '{print $5}'
-s packet size (default 64)
-W wait timeout
-c number of packets to send
# Packet loss
timeout 5 ping -q -A -s 500 -W 1000 -c 10 127.0.0.1 | grep transmitted | awk '{print $6}'
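The awk pipelines above can be checked against a canned ping summary, without any network access (the sample below mimics a typical Linux `ping -q` summary):

```shell
# Canned ping -q output: one "transmitted" line, one "rtt" line.
sample='10 packets transmitted, 10 received, 0% packet loss, time 9012ms
rtt min/avg/max/mdev = 0.031/0.045/0.060/0.008 ms'
# Same field extraction as the commands above:
loss=$(echo "$sample" | grep transmitted | awk '{print $6}')   # 6th whitespace field
avg=$(echo "$sample" | grep rtt | awk -F '/' '{print $5}')     # 5th '/'-separated field
echo "$loss $avg"
```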
#!/bin/bash
# Trap error/exit signals and clean up gracefully
trap 'rm -f tmp_push_prometheus.txt && curl -X PUT http://$pushgateway_addr/api/v1/admin/wipe && exit' ERR EXIT SIGINT
# Metrics
ping_lost_package="ping_lost_package" # label for packet loss
ping_rtt_avg="ping_rtt_avg" # label for latency
pushgateway_addr="172.16.11.198:9091"
# A target went offline: delete its metrics
del_value() {
curl -X DELETE http://$pushgateway_addr/metrics/job/Ping_check/instance/$1
}
# Push metrics to the pushgateway
push_prometheus(){
curl -X POST --data-binary @tmp_push_prometheus.txt http://$pushgateway_addr/metrics/job/Ping_check/instance/$1
}
# Collect metrics and upload them to the pushgateway
func() {
x=0 # counter (pairs the two arrays element by element)
for ip in ${ips[*]}; do
for label in ${ipLabels[*]}; do
echo $ip ${ipLabels[$x]}
ping_lost_package=$(timeout 5 ping -q -A -s 500 -W 1000 -c 10 $ip | grep transmitted | awk '{print $6}' | awk -F '%' '{print $1}')
ping_rtt_avg=$(timeout 5 ping -q -A -s 500 -W 1000 -c 10 $ip | grep rtt | awk -F '/' '{print $5}')
# if the value is empty, clean up the pushgateway data
if [ -z "$ping_lost_package" ]; then
del_value "$1"
break
fi
if [ -z "$ping_rtt_avg" ]; then
del_value "$1"
break
fi
echo "ping_lost_package{label=\"${ipLabels[$x]}\",env=\"pro\"} $ping_lost_package" >>tmp_push_prometheus.txt
echo "ping_rtt_avg{label=\"${ipLabels[$x]}\",env=\"pro\"} $ping_rtt_avg" >>tmp_push_prometheus.txt
break
done
x=$(expr $x + 1)
done
# Push collected metrics to the pushgateway
if [ -f tmp_push_prometheus.txt ]; then
push_prometheus "$1"
# remove the temporary file
rm -f tmp_push_prometheus.txt
else
echo "temporary file was not generated"
fi
# Delete historical data so stale values do not cause false readings
del_value "$1"
}
while true;do
ips=(172.16.11.198 127.0.0.1 baidu.com)
ipLabels=(test localhost baidu)
func x
ips=(102.932.115.447)
ipLabels=(law)
func y
sleep 15
done
docker run --name smokeping -d --rm -p 8888:80 -e PUID=1000 -e PGID=1000 -v /data/smokeping/data:/data/ -v /data/smokeping/config:/config -e TZ=Asia/Shanghai linuxserver/smokeping
cat config/Database
step = 60
pings = 60
cat config/Targets
*** Targets ***
probe = FPing
menu = Top
title = Network Latency Grapher
remark = Welcome to the SmokePing website of WORKS Company. \
Here you will learn all about the latency of our network.
+ targets
menu = Targets
++ baiduURL
menu = baidu URL
title = baidu URL server
host = www.baidu.com
++ test
menu = test
title = test
host = 172.16.11.198
Restart the smokeping container and check its logs to make sure it is running normally.
Fetch from the command line (not fully figured out yet):
rrdtool fetch ./data/targets/test.rrd AVERAGE
Fetch with a Python 3 script (the metric-push part has not been written yet)
collection_to_prometheus.py
#coding:utf-8
import rrdtool
import os
paras = {
'prometheus_gateway' : 'http://192.168.56.101:9091' ,
'data_dir' : '/etc/smokeping/data/' # rrdtool data directory; its subdirectories hold the rrd files, e.g. /etc/smokeping/data/targets/*.rrd
}
# fetch metrics via rrdtool
def getMonitorData(rrd_file):
rrd_info = rrdtool.info(rrd_file)
last_update = rrd_info['last_update'] - 60
args = '-s ' + str(last_update)
results = rrdtool.fetch(rrd_file , 'AVERAGE' , args )
lost_package_num = results[2][0][1]
average_rrt = 0 if not results[2][0][2] else results[2][0][2] * 1000
return lost_package_num , round(average_rrt , 4)
if __name__ == '__main__':
ISP_list = ['targets']
for ISP in ISP_list:
rrd_data_dir = os.path.join(paras['data_dir'] , ISP)
for filename in os.listdir(rrd_data_dir):
(instance , postfix) = os.path.splitext(filename)
if postfix == '.rrd' :
(lost_package_num , rrt) = getMonitorData(os.path.join(paras['data_dir'] , ISP , filename))
print(rrd_data_dir,instance,rrt,lost_package_num)
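Since the push step is not written yet, one possible shell-side sketch: turn the script's print output ("dir instance rtt lost") into exposition format for the pushgateway. The metric names here are made up, and the curl is commented out because it needs a live pushgateway:

```shell
# One line of the Python script's output, as an example:
line='/etc/smokeping/data/targets test 0.045 0'
set -- $line   # $1=dir $2=instance $3=avg rtt (ms) $4=lost packets
out=$(printf 'smokeping_rtt_avg_ms{instance="%s"} %s\nsmokeping_lost_packages{instance="%s"} %s\n' "$2" "$3" "$2" "$4")
echo "$out"
# echo "$out" | curl --data-binary @- http://192.168.56.101:9091/metrics/job/smokeping
```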
Reference: https://www.jianshu.com/p/fc5624a30580
The k8s cluster needs the metrics-server and kube-state-metrics services installed.
cat prom.rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: prometheus
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups:
- ""
resources:
- nodes
- services
- endpoints
- pods
- nodes/proxy
verbs:
- get
- list
- watch
- apiGroups:
- "extensions"
resources:
- ingresses
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- configmaps
- nodes/metrics
verbs:
- get
- nonResourceURLs:
- /metrics
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: kube-system
kubectl get sa prometheus -n kube-system -o yaml
kubectl describe secret prometheus-token-wj7fb -n kube-system
Save the token into a file: k8s.token
- job_name: k8s-cadvisor
honor_timestamps: true
metrics_path: /metrics
scheme: https
kubernetes_sd_configs: # kubernetes auto-discovery
- api_server: https://172.16.11.198:6443 # apiserver address
role: node # node-type auto-discovery
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- separator: ;
regex: (.*)
target_label: __address__
replacement: 172.16.11.198:6443
action: replace
- source_labels: [__meta_kubernetes_node_name]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
action: replace
Configuration must be added on the scraped pods themselves to enable auto-discovery.
- job_name: kubernetes-pods
honor_timestamps: true
metrics_path: /metrics
scheme: https
kubernetes_sd_configs:
- api_server: https://172.16.11.198:6443
role: pod
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
separator: ;
regex: "true"
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: $1
action: replace
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
separator: ;
regex: ([^:]+)(?::\d+)?;(\d+)
target_label: __address__
replacement: $1:$2
action: replace
- separator: ;
regex: __meta_kubernetes_pod_label_(.+)
replacement: $1
action: labelmap
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_name]
separator: ;
regex: (.*)
target_label: pod_name
replacement: $1
action: replace
YAML configuration of the scraped pods
Auto-collection for a pod requires extra metadata in its deployment manifest.
Under template->metadata->annotations in the deployment, add:
prometheus.io/scrape: "true"
prometheus.io/port: "8081" # pod port
prometheus.io/path: "/actuator/prometheus"
Prometheus looks for pods annotated with prometheus.io/scrape=true; when the annotation is present, their metrics are scraped automatically.
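For example, the annotations sit in the pod template like this (the deployment name and port are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8081"
        prometheus.io/path: "/actuator/prometheus"
```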
- job_name: kubelet
honor_timestamps: true
metrics_path: /metrics
scheme: https
kubernetes_sd_configs: # kubernetes auto-discovery
- api_server: https://172.16.11.198:6443 # apiserver address
role: node # node-type auto-discovery
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
bearer_token_file: k8s.token
tls_config:
insecure_skip_verify: true
relabel_configs:
- action: replace
target_label: __address__
replacement: 172.16.11.198:6443
- action: replace
source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
Reference: https://www.cnblogs.com/zyyang1993/p/16621158.html
- job_name: 'prometheus'
file_sd_configs:
- files:
- targets/prometheus*.yaml
refresh_interval: 2m # re-read the targets defined in the files every 2 minutes (default 5 minutes)
[root@localhost]# cat prometheus_target.yaml
- targets:
- 172.1.1.1:9100
- 172.1.1.2:9100
labels:
app: node-exporter
job: node
DNS-based service discovery periodically queries a set of DNS names.
Prometheus automatic service discovery via consul (link)
Download: https://www.consul.io/downloads
Install: unzip consul_1.9.2_linux_amd64.zip -d /usr/local/bin
Start in dev mode: mkdir -pv /consul/data
consul agent -dev -ui -data-dir=/consul/data -config-dir=/etc/consul/ -client=0.0.0.0
(Production should use server mode.)
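Once consul is running, a service can be registered over its HTTP API and picked up via consul_sd_configs. Both the registration call and the job below are sketches (addresses and the service name are examples):

```yaml
# Register a service first, e.g.:
#   curl -X PUT -d '{"name":"node-exporter","address":"172.16.11.198","port":9100}' \
#        http://127.0.0.1:8500/v1/agent/service/register
- job_name: consul-sd
  consul_sd_configs:
    - server: 127.0.0.1:8500
  relabel_configs:
    - source_labels: [__meta_consul_service]
      regex: node-exporter
      action: keep
```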