Notes and examples for the Prometheus configuration file prometheus.yml
Global configuration
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # Query log file
  query_log_file: "abc.log"
  # Attached as labels to every outgoing alert (and to federated/remote-write samples)
  external_labels:
    env: test
alerting:
  alertmanagers:
    - basic_auth:
        username:
        password:
      bearer_token:
      bearer_token_file:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'authed-example' # job names must be unique within one configuration
    # basic_auth credentials
    basic_auth:
      username: ""
      password: ""
      password_file: ""
    bearer_token: "<secret>"
    # service discovery configurations
    dns_sd_configs:
      - ...
    openstack_sd_configs:
    kubernetes_sd_configs:
    file_sd_configs:
    # statically configured targets
    static_configs:
      - targets: ['localhost:9090']
        # extra labels attached to these targets
        labels:
    # rewrite labels before scraping
    relabel_configs:
    # per-scrape cap on the number of samples accepted
    sample_limit:
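As a concrete instance of the fields sketched above, a minimal job combining per-job overrides, file-based credentials, and a scrape cap might look like this (the job name, credential path, target address, and limit are all illustrative):

```yaml
scrape_configs:
  - job_name: 'app-metrics'          # illustrative job name
    scrape_interval: 30s             # overrides the global interval for this job
    metrics_path: /metrics
    basic_auth:
      username: scraper              # illustrative credentials
      password_file: /etc/prometheus/secrets/app.pass
    static_configs:
      - targets: ['10.0.0.5:8080']   # illustrative target
        labels:
          env: test
    sample_limit: 10000              # the scrape fails if more samples are exposed
```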
Rules and alerting
rules configuration
Rule file layout:
$ cat /usr/local/prometheus/prometheus.yml
...
rule_files: # <file glob>
  - "rules/*.yml"
...
groups:
  # Group name; must be unique within a file
  - name: <string>
    [ interval: <duration> | default = global.evaluation_interval ]
    rules:
      - alert: <string>   # alert name
        expr: <string>    # alerting rule expression
        [ for: <duration> | default = 0s ] # fire only after the condition has held this long; until then the alert is pending
        labels:           # alert labels
          [ <labelname>: <tmpl_string> ] # override existing labels; used later for routing/filtering in Alertmanager
        annotations:      # alert annotations
          [ <labelname>: <tmpl_string> ] # usually the notification subject/body; common keys: summary, description
- Example rule file: cat rules/target.yml
groups:
  - name: targetdown
    rules:
      - alert: TargetDown # alert names must match [a-zA-Z_:][a-zA-Z0-9_:]*, so no spaces
        expr: up == 0
        for: 30s
        labels:
          level: warning
        annotations:
          summary: "target down"
          description: "target down"
  - name: node_resource
    rules:
      - alert: node_cpu_used_percent_alert_info
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by (host,instance,job) >= 0.8
        for: 3m
        labels:
          level: info
          type: cpu
        annotations:
          summary: "cpu used more than 80%"
          value: "{{ $value }}"
      - alert: node_cpu_used_percent_alert_warn
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by (host,instance,job) >= 0.8
        for: 20m
        labels:
          level: warn
          type: cpu
        annotations:
          summary: "cpu used more than 80%"
          value: "{{ $value }}"
  - name: node_base_info
    rules:
      - alert: node_unreachable # host unreachable
        expr: up{job="node"} == 0
        for: 10s
        labels:
          level: error
          type: node
        annotations:
          summary: "host unreachable"
          value: "{{ $value }}"
  - name: example
    rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      # Alert for any instance that has a median request latency >1s.
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
      # Predict whether disk space will run out within one hour
      - alert: DiskSpaceRunningOut
        expr: predict_linear(node_filesystem_avail_bytes[1h], 3600) < 0
        for: 5m
        labels:
          severity: critical
          team: infra
          environment: prod
        annotations:
          summary: "Disk space running out on {{ $labels.instance }}"
          description: "The available disk space on {{ $labels.instance }} will run out within 1 hour. Current available space: {{ $value }} bytes."
          runbook_url: "https://docs.example.com/runbooks/disk-space"
      # Anomaly detection
      - alert: HighMemoryUsageAnomaly
        expr: (node_memory_MemAvailable_bytes < avg_over_time(node_memory_MemAvailable_bytes[1h]) - 2 * stddev_over_time(node_memory_MemAvailable_bytes[1h]))
        for: 5m
        labels:
          severity: warning
          team: app
          environment: staging
        annotations:
          summary: "Memory usage anomaly on {{ $labels.instance }}"
          description: "The available memory on {{ $labels.instance }} has dropped significantly below the historical average."
          runbook_url: "https://docs.example.com/runbooks/memory-usage"
      # Adaptive CPU alert threshold: compare current usage to a 1h baseline (note the [1h:] subquery syntax)
      - alert: AdaptiveHighCPUUsage
        expr: (avg by (instance) (rate(node_cpu_seconds_total[5m]))) > (1.5 * avg_over_time((avg by (instance) (rate(node_cpu_seconds_total[5m])))[1h:]))
        for: 5m
        labels:
          severity: warning
          team: app
          environment: prod
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "The CPU usage on {{ $labels.instance }} is significantly higher than the historical baseline."
          runbook_url: "https://docs.example.com/runbooks/high-cpu-usage"
      # Kafka cluster-wide health alert
      - alert: KafkaClusterUnderReplicatedPartitions
        expr: sum(kafka_cluster_partition_under_replicated) > 0
        for: 3m
        labels:
          severity: critical
          team: infra
          environment: prod
        annotations:
          summary: "Under-replicated partitions in Kafka cluster"
          description: "The Kafka cluster has under-replicated partitions. Affected partitions: {{ $value }}"
          runbook_url: "https://docs.example.com/runbooks/kafka-under-replication"
      # Suppress alerts dynamically via a maintenance target label
      - alert: ServiceUnavailable
        expr: up{maintenance!="true"} == 0
        for: 1m
        labels:
          severity: critical
          team: app
          environment: prod
        annotations:
          summary: "Service {{ $labels.instance }} unavailable"
          description: "The service {{ $labels.instance }} is down for 1 minute."
      - alert: PrometheusOffline
        expr: |
          absent(up{cluster="cluster1", job="prometheus"})
          or
          absent(up{cluster="cluster2", job="prometheus"})
          or
          absent(up{cluster="cluster3", job="prometheus"})
        for: 2m
        labels:
          severity: warning
        annotations:
          description: "Prometheus in {{$labels.cluster}} has been offline for more than 2 minutes"
Internal labels
Prometheus provides some internal labels. Their names start with two underscores (__) and they are removed after all relabeling steps have been applied, which means they are unavailable unless explicitly kept.

| Label name | Description |
| --- | --- |
| __name__ | The scraped metric's name |
| __address__ | host:port of the scrape target |
| __scheme__ | URI scheme of the scrape target |
| __metrics_path__ | Metrics endpoint of the scrape target |
| __param_<name> | The value of the first URL parameter <name> passed to the target |
| __scrape_interval__ | The target's scrape interval (experimental) |
| __scrape_timeout__ | The target's scrape timeout (experimental) |
| __meta_ | Prefix for special labels set by the service discovery mechanism |
| __tmp | Special prefix used to temporarily store label values before discarding them |
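To see __address__ and __param_<name> in action, here is a common blackbox-exporter relabeling pattern: the configured target address is moved into the target URL parameter and the actual scrape is redirected to the exporter. The module name and the exporter address 127.0.0.1:9115 are assumptions for this sketch:

```yaml
- job_name: 'blackbox-http'
  metrics_path: /probe
  params:
    module: [http_2xx]               # assumed blackbox module name
  static_configs:
    - targets: ['https://example.com']
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target   # becomes the ?target= URL parameter
    - source_labels: [__param_target]
      target_label: instance         # keep the probed URL as the instance label
    - target_label: __address__
      replacement: 127.0.0.1:9115    # scrape the blackbox exporter itself (assumed address)
```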
relabel_configs
- Reference
  - source_labels: the source labels, i.e. label names before relabeling
  - target_label: the label that the result is written to
  - separator: separator used when joining the source label values. Default: ;
  - modulus: modulus applied to the hash of the source label values (used with hashmod)
  - regex: regular expression matched against the joined source label values. Default: (.*)
  - replacement: value written to target_label, referencing regex capture groups ($1, $2, …). Default: $1
  - action: action to perform on a regex match. Default: replace
    - replace: match regex against the source value and set target_label from replacement, expanding capture groups
    - keep: scrape only targets whose source_labels match regex; non-matching targets are dropped
    - drop: drop targets whose source_labels match regex; only non-matching targets are scraped
    - hashmod: set target_label to the hash of the source labels modulo modulus, e.g. to classify or shard targets
    - labelmap: match regex against all label names and copy the values of matching labels to new names built from replacement ($1, $2, …)
    - labeldrop: match regex against all label names and delete the matching labels
    - labelkeep: match regex against all label names and keep only the matching labels
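A sketch of hashmod-based sharding, assuming two Prometheus servers that each keep half of the targets (the modulus and the shard number kept per server are assumptions):

```yaml
relabel_configs:
  - source_labels: [__address__]
    modulus: 2                # total number of shards (assumption)
    target_label: __tmp_hash  # temporary label; __-prefixed labels are dropped after relabeling
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: "0"                # this server keeps shard 0; the second server would use "1"
    action: keep
```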
Usage examples
  - job_name: "nodes"
    static_configs:
      - targets:
          - 192.168.88.201:9100
        labels:
          __hostname__: node01
          __region_id__: "shanghai"
          __zone__: a
      - targets:
          - 192.168.88.202:9100
        labels:
          __hostname__: node02
          __region_id__: "beijing"
          __zone__: b
    relabel_configs:
      - source_labels: # add node_name
          - "__hostname__"
        regex: "(.*)"
        target_label: "node_name"
        action: replace
        replacement: $1
      - source_labels: # keep only node01
          - "__hostname__"
        regex: "node01"
        action: keep # drop would discard the matches instead
      - regex: "__(.*)__" # strip the leading/trailing __ from all label names
        action: labelmap
      - source_labels: # labelkeep keeps only the selected labels; labeldrop removes them
          - "__hostname__"
        regex: (.*)
        target_label: hostname
        action: replace
        replacement: $1
      - source_labels:
          - "__region_id__"
        regex: (.*)
        target_label: region_id
        action: replace
        replacement: $1
      - source_labels:
          - "__zone__"
        regex: (.*)
        target_label: zone
        action: replace
        replacement: $1
      - action: labelkeep # labeldrop
        regex: "__.*__|job"
- Add a label
  randomlabel: ip-192-168-64-30.multipass:9100-randomtext
  scrape_configs:
    - job_name: 'multipass-nodes'
      static_configs:
        - targets: ['ip-192-168-64-29.multipass:9100']
          labels:
            test: 3
        - targets: ['ip-192-168-64-30.multipass:9100']
          labels:
            test: 3
      relabel_configs:
        - source_labels: [__address__]
          regex: '(.+)'
          replacement: '${1}-randomtext'
          target_label: randomlabel
- Rewrite instance to the host part of __address__, e.g.
  instance: ip-192-168-64-30.multipass
  scrape_configs:
    - job_name: 'prometheus'
      scrape_interval: 15s
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'multipass-nodes'
      static_configs:
        - targets: ['ip-192-168-64-29.multipass:9100']
          labels:
            test: 4
        - targets: ['ip-192-168-64-30.multipass:9100']
          labels:
            test: 4
      relabel_configs:
        - source_labels: [__address__]
          separator: ':'
          regex: '(.*):(.*)'
          replacement: '${1}'
          target_label: instance
- promtool is a utility shipped with Prometheus, originally a configuration checker; it now also supports querying metrics, debugging a server, inspecting the TSDB, and more
promtool --help
promtool --version
# check the prometheus.yml configuration
promtool check config [<flags>] <config-files>...
promtool check config prometheus.yml
# check alerting and recording rules
promtool check rules [<flags>] <rule-files>...
promtool check rules rules/target.yml
# inspect the TSDB
promtool tsdb list /var/lib/prometheus/metrics2/
# check service discovery
promtool check service-discovery [<flags>] <config-file> <job>
# check a web config
promtool check web-config <web-config-files>...
# check metrics exposition
promtool check metrics
$ cat metrics.prom | promtool check metrics
$ curl -s http://localhost:9090/metrics | promtool check metrics
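promtool can also unit-test rules with promtool test rules <test-file>. A minimal test file for the InstanceDown rule from the example group might look like this (the rule-file path and series labels are illustrative):

```yaml
rule_files:
  - rules/target.yml             # illustrative path to the rule file under test
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="node01:9100"}'
        values: '0 0 0 0 0 0 0'  # the target is down for the whole window
    alert_rule_test:
      - eval_time: 6m            # past the 5m "for" duration of InstanceDown
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: node
              instance: node01:9100
            exp_annotations:
              summary: "Instance node01:9100 down"
              description: "node01:9100 of job node has been down for more than 5 minutes."
```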
Alert states
- Inactive: the alert condition is not met (also the state an alert returns to after it resolves)
- Pending: the condition is met but has not yet held for the duration given by for, so no alert is sent
- Firing: the alert is active
alertmanager configuration
alertmanager installation
- configure alerting in prometheus.yml
$ cat /usr/local/prometheus/prometheus.yml
...
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 100.80.0.128:9093
...
Reload the configuration
systemctl reload prometheus.service
# or (requires Prometheus to be started with --web.enable-lifecycle)
curl -XPOST 127.0.0.1:9090/-/reload
The configured Alertmanager endpoints can be checked at http://100.80.0.128:9090/status
Simulate a service failure
systemctl stop node-exporter.service
Alerts and silences can be viewed in Alertmanager at http://100.80.0.128:9093/#/alerts
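The Alertmanager side needs its own alertmanager.yml; a minimal sketch that routes everything to a single webhook receiver could look like this (receiver name, timings, and webhook URL are illustrative):

```yaml
route:
  receiver: default            # fallback receiver for all alerts
  group_by: ['alertname', 'instance']
  group_wait: 30s              # wait before sending the first notification of a group
  group_interval: 5m           # wait before notifying about new alerts in an existing group
  repeat_interval: 4h          # re-send interval for alerts that keep firing
receivers:
  - name: default
    webhook_configs:
      - url: http://127.0.0.1:5001/alert   # illustrative webhook endpoint
```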
Generating new series with recording rules
groups:
  - name: node
    rules:
      # number of CPUs
      - record: node_cpu_total
        expr: count(node_cpu_seconds_total{mode="system"}) by (instance)
      # CPU utilization
      - record: node_cpu_avg
        expr: avg(1-irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance,job)
      # memory utilization
      - record: node_memory_percent
        expr: (1 - node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)
  - name: node_alert
    rules:
      - alert: NodeCpuUsageHigh # demo alert: CPU usage above 0.1%
        expr: node_cpu_avg > 0.001
        for: 3m
        labels:
          level: warning
        annotations:
          summary: "CPU usage {{ $value }} on node {{ $labels.instance }} is above 0.1%"
          description: "CPU usage above 0.1%"
Tick Show annotations to see the details.
Service discovery
- Prometheus service discovery supports static files, Consul, DNS, Kubernetes, several public clouds, and more
- Dynamic service discovery mainly addresses frequently changing resources in large systems
File-based service discovery
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        # labels:
        #   app: prometheus
        #   job: xxx
file_sd_configs
Automatically load targets from YAML/JSON files; such files can also be generated programmatically from a CMDB:
scrape_configs:
  - job_name: "node2"
    file_sd_configs:
      - files:
          - sd/file/node/*.yml
        # refresh_interval: 2m # reload every 2 minutes; default is 5m
Contents of a target file matched by sd/file/node/*.yml:
- targets:
    - 100.80.0.128:9100
promtool check config prometheus.yml
DNS-based service discovery
Discovers targets by querying DNS names; relies on A, AAAA, and SRV records.
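A sketch of a DNS SD job; the SRV record name is an assumption:

```yaml
scrape_configs:
  - job_name: 'dns-nodes'
    dns_sd_configs:
      - names:
          - _node-exporter._tcp.example.com  # assumed SRV record
        type: SRV             # A/AAAA also work, but then a port must be specified
        refresh_interval: 30s
```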
Kubernetes service discovery
Native support for automatically discovering Kubernetes resources such as Node, Pod, Service, Endpoints, and Ingress (selected via role in Prometheus); node-exporter is commonly deployed as a DaemonSet to collect node metrics.
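A sketch of node discovery when Prometheus itself runs in the cluster (in-cluster credentials and suitable RBAC are assumed):

```yaml
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node            # other roles: pod, service, endpoints, ingress
    relabel_configs:
      - source_labels: [__meta_kubernetes_node_name]
        target_label: node    # promote the discovered node name to a regular label
```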