Prometheus Configuration and Examples


Notes and examples for the Prometheus configuration file prometheus.yml

Global configuration

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

  # query log file
  query_log_file: "abc.log"
  # these labels are attached to any alert or time series leaving this server
  external_labels:
    env: test

alerting:
  alertmanagers:   # list of Alertmanager configs; each entry may set:
    # basic_auth:
    # bearer_token:
    # bearer_token_file:
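A fuller sketch of an alerting block; the Alertmanager address and the credentials are placeholders:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example.com:9093']
      # optional authentication towards the Alertmanager
      basic_auth:
        username: "prometheus"
        password: "changeme"
```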


scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'node'  # job_name must be unique across scrape configs
    # basic_auth authentication
    basic_auth:
      username: ""
      password: ""
      # password_file: ""  # alternative to an inline password
    bearer_token: "<secret>"
    bearer_token_file: ""
    # service discovery configurations
    dns_sd_configs:
      - ...
    openstack_sd_configs:
    kubernetes_sd_configs:
    file_sd_configs:
    # statically configured targets
    static_configs:
    - targets: ['localhost:9090']
      # labels automatically attached to these targets
      labels:
    # rewrite labels before scraping
    relabel_configs:
    # per-scrape limit on the number of samples accepted
    sample_limit:

Rules and alerting

rules configuration

Rule types:

  • Alerting rules produce alerts

    • {{ $value }} expands to the value that triggered the rule
  • Recording rules generate new time series (alerts can in turn be based on them), avoiding the cost of slow real-time queries

  • Rule files are referenced in prometheus.yml

$ cat /usr/local/prometheus/prometheus.yml
...
rule_files: # <file glob>
  - "rules/*.yml"
...
  • Template
groups:
  # group name; must be unique within a file
  - name: <string>
    [ interval: <duration> | default = global.evaluation_interval ]
    rules:
      - alert: <string> # alert name
        expr: <string>  # alerting expression
        [ for: <duration> | default = 0s ] # the alert stays pending until the condition has held this long
        labels:         # alert labels
          [ <labelname>: <tmpl_string> ] # override existing labels; used later for routing in alertmanager
        annotations:    # alert annotations
          [ <labelname>: <tmpl_string> ] # typically the subject/body of notifications; common fields: summary, description
  • Rule file example: cat rules/target.yml
groups:
  - name: targetdown
    rules:
      - alert: TargetDown # alert names must match [a-zA-Z_:][a-zA-Z0-9_:]*
        expr: up == 0
        for: 30s
        labels:
          level: warning
        annotations:
          summary: "target down"
          description: "target down"

  - name: node_resource
    rules:
      - alert: node_cpu_used_percent_alert_info
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by (host,instance,job) >= 0.8
        labels:
          level: info
          type: cpu
        annotations:
          summary: "cpu used more than 80%"
          value: "{{ $value }}"
        for: 3m
      - alert: node_cpu_used_percent_alert_warn
        expr: 1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by (host,instance,job) >= 0.8
        labels:
          level: warn
          type: cpu
        annotations:
          summary: "cpu used more than 80%"
          value: "{{ $value }}"
        for: 20m
  - name: node_base_info
    rules:
      - alert: node_unreachable # host unreachable
        expr: up{job="node"} == 0
        labels:
          level: error
          type: node
        annotations:
          summary: "host unreachable"
          value: "{{ $value }}"
        for: 10s

  - name: example
    rules:

    # Alert for any instance that is unreachable for >5 minutes.
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

    # Alert for any instance that has a median request latency >1s.
    - alert: APIHighRequestLatency
      expr: api_http_request_latencies_second{quantile="0.5"} > 1
      for: 10m
      annotations:
        summary: "High request latency on {{ $labels.instance }}"
        description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

    # Predict whether disk space will run out within one hour
    - alert: DiskSpaceRunningOut
      expr: predict_linear(node_filesystem_avail_bytes[1h], 3600) < 0
      for: 5m
      labels:
        severity: critical
        team: infra
        environment: prod
      annotations:
        summary: "Disk space running out on {{ $labels.instance }}"
        description: "The available disk space on {{ $labels.instance }} will run out within 1 hour. Current available space: {{ $value }} bytes."
        runbook_url: "https://docs.example.com/runbooks/disk-space"

    # Anomaly detection
    - alert: HighMemoryUsageAnomaly
      expr: (node_memory_MemAvailable_bytes < avg_over_time(node_memory_MemAvailable_bytes[1h]) - 2 * stddev_over_time(node_memory_MemAvailable_bytes[1h]))
      for: 5m
      labels:
        severity: warning
        team: app
        environment: staging
      annotations:
        summary: "Memory usage anomaly on {{ $labels.instance }}"
        description: "The available memory on {{ $labels.instance }} has dropped significantly below the historical average."
        runbook_url: "https://docs.example.com/runbooks/memory-usage"

    # Dynamically adapt the CPU usage alert threshold
    - alert: AdaptiveHighCPUUsage
      # compares each instance's CPU rate to 1.5x its own 1h baseline;
      # note the subquery syntax [1h:] needed to apply avg_over_time to an expression
      expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 1.5 * avg_over_time((avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])))[1h:])
      for: 5m
      labels:
        severity: warning
        team: app
        environment: prod
      annotations:
        summary: "High CPU usage on {{ $labels.instance }}"
        description: "The CPU usage on {{ $labels.instance }} is significantly higher than the historical baseline."
        runbook_url: "https://docs.example.com/runbooks/high-cpu-usage"

    # Overall Kafka cluster health
    - alert: KafkaClusterUnderReplicatedPartitions
      expr: sum(kafka_cluster_partition_under_replicated) > 0
      for: 3m
      labels:
        severity: critical
        team: infra
        environment: prod
      annotations:
        summary: "Under-replicated partitions in Kafka cluster"
        description: "The Kafka cluster has under-replicated partitions. Affected partitions: {{ $value }}"
        runbook_url: "https://docs.example.com/runbooks/kafka-under-replication"

    # Suppress alerts dynamically via a maintenance marker metric
    - alert: ServiceUnavailable
      # node_maintenance is an example info metric set to 1 during maintenance windows
      expr: up == 0 unless on(instance) (node_maintenance == 1)
      for: 1m
      labels:
        severity: critical
        team: app
        environment: prod
      annotations:
        summary: "Service {{ $labels.instance }} unavailable"
        description: "The service {{ $labels.instance }} has been down for 1 minute."
    - alert: PrometheusOffline
      expr: |
        absent(up{cluster="cluster1", job="prometheus"})
        or
        absent(up{cluster="cluster2", job="prometheus"})
        or
        absent(up{cluster="cluster3", job="prometheus"})
      for: 2m
      labels:
        severity: warning
      annotations:
        description: "prometheus node {{$labels.cluster}} has been offline for more than 2 minutes"

Internal labels

Prometheus provides a number of internal labels. They start with two underscores (__) and are removed after all relabeling steps have been applied, which means they are unavailable unless explicitly kept.

Label name           Description
__name__             The scraped metric's name
__address__          host:port of the scrape target
__scheme__           URI scheme of the scrape target
__metrics_path__     Metrics endpoint of the scrape target
__param_<name>       Value of the first URL parameter named <name> passed to the target
__scrape_interval__  The target's scrape interval (experimental)
__scrape_timeout__   The target's scrape timeout (experimental)
__meta_*             Special labels set by the service discovery mechanism
__tmp                Special prefix used to temporarily store label values before discarding them

relabel_configs

  • Reference
  • source_labels: source labels, i.e. the label names before relabeling
  • target_label: the label written to by relabeling
  • separator: separator used to join the source label values. Default is ;
  • modulus: modulus applied to the hash of the source label values (used with hashmod)
  • regex: regular expression matched against the joined source label values. Default is (.*)
  • replacement: value written to target_label, may reference capture groups. Default is $1
  • action: the action performed on a regex match. Default is replace
    • replace: match regex against the source label values and write replacement (with capture-group references) to target_label
    • keep: scrape only targets whose source label values match regex; all non-matching targets are dropped
    • drop: drop targets whose source label values match regex; only non-matching targets are scraped
    • hashmod: set target_label to the hash of the source label values modulo modulus, useful for classifying or sharding targets
    • labelmap: match regex against all label names, then copy the matching labels' values to new label names built from replacement via group references ($1, $2, …)
    • labelkeep: match regex against all label names and keep only the matching labels
    • labeldrop: match regex against all label names and remove the matching labels
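As an illustration of hashmod sharding (the modulus 4 and shard number 0 are arbitrary choices), each Prometheus server keeps only the targets whose address hashes into its bucket:

```yaml
relabel_configs:
  # hash the target address into one of 4 buckets, stored in a temporary label
  - source_labels: [__address__]
    modulus: 4
    target_label: __tmp_shard
    action: hashmod
  # this server is shard 0: keep only targets that hashed to bucket 0
  - source_labels: [__tmp_shard]
    regex: "0"
    action: keep
```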

Usage examples

  • Adding labels
  - job_name: "nodes"
    static_configs:
      - targets:
        - 192.168.88.201:9100
        labels:
          __hostname__: node01
          __region_id__: "shanghai"
          __zone__: a
      - targets:
        - 192.168.88.202:9100
        labels:
          __hostname__: node02
          __region_id__: "beijing"
          __zone__: b
    relabel_configs:
    - source_labels:  # add node_name
      - "__hostname__"
      regex: "(.*)"
      target_label: "node_name"
      action: replace
      replacement: $1

    - source_labels: # keep only targets whose __hostname__ is node01
      - "__hostname__"
      regex: "node01"
      action: keep  # drop would discard the matches instead

    - regex: "__(.*)__"  # strip the surrounding __ from all label names
      action: labelmap

    - source_labels:  # labelkeep keeps only the selected labels, labeldrop removes them
      - "__hostname__"
      regex: (.*)
      target_label: hostname
      action: replace
      replacement: $1
    - source_labels:
      - "__region_id__"
      regex: (.*)
      target_label: region_id
      action: replace
      replacement: $1
    - source_labels:
      - "__zone__"
      regex: (.*)
      target_label: zone
      action: replace
      replacement: $1
    - action: labelkeep  # labeldrop
      regex: "__.*__|job"
  • Adding randomlabel: ip-192-168-64-30.multipass:9100-randomtext
scrape_configs:
  - job_name: 'multipass-nodes'
    static_configs:
    - targets: ['ip-192-168-64-29.multipass:9100']
      labels:
        test: 3
    - targets: ['ip-192-168-64-30.multipass:9100']
      labels:
        test: 3
    relabel_configs:
    - source_labels: [__address__]
      regex: '(.+)'
      replacement: '${1}-randomtext'
      target_label: randomlabel
  • Setting instance to the host part of __address__, e.g. ip-192-168-64-30.multipass
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 15s
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'multipass-nodes'
    static_configs:
    - targets: ['ip-192-168-64-29.multipass:9100']
      labels:
        test: 4
    - targets: ['ip-192-168-64-30.multipass:9100']
      labels:
        test: 4
    relabel_configs:
    - source_labels: [__address__]
      separator: ':'
      regex: '(.*):(.*)'
      replacement: '${1}'
      target_label: instance

Checking configuration with promtool

  • promtool is a config-checking tool shipped with Prometheus; beyond checking configs it now also supports querying metrics, debugging the server, inspecting the TSDB, and more
promtool --help
promtool --version

# check the prometheus.yml configuration
promtool check config [<flags>] <config-files>...
promtool check config prometheus.yml

# check alerting and recording rules
promtool check rules [<flags>] <rule-files>...
promtool check rules rules/target.yml

# inspect the TSDB
promtool tsdb list /var/lib/prometheus/metrics2/

# check service discovery
promtool check service-discovery [<flags>] <config-file> <job>

# check web-config
promtool check web-config <web-config-files>...

# check metrics
promtool check metrics
$ cat metrics.prom | promtool check metrics
$ curl -s http://localhost:9090/metrics | promtool check metrics
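promtool can also unit-test rules against synthetic series. A minimal sketch, assuming a rule file rules/target.yml that defines an alert TargetDown on `up == 0` with `for: 30s` and label `level: warning` (the file name tests.yml is arbitrary):

```yaml
# tests.yml; run with: promtool test rules tests.yml
rule_files:
  - rules/target.yml
evaluation_interval: 15s
tests:
  - interval: 15s
    # up stays 0 long enough to outlast the 30s `for` clause
    input_series:
      - series: 'up{job="node", instance="host1:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 1m
        alertname: TargetDown
        exp_alerts:
          - exp_labels:
              level: warning
              job: node
              instance: host1:9100
```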

Alert states

  • Inactive: the alert condition is not met (an alert returns to inactive after it resolves)
  • Pending: the condition is met, but the `for` duration has not yet elapsed; nothing is sent yet
  • Firing: the alert is active and notifications are being sent
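Pending and firing alerts are also exposed as the synthetic ALERTS metric, which can be queried like any other series:

```promql
# one series per active alert, labeled alertstate="pending" or "firing"
ALERTS{alertstate="firing"}

# count firing alerts per alert name
count by (alertname) (ALERTS{alertstate="firing"})
```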

alertmanager configuration

alertmanager setup

  • Configure alerting in prometheus.yml
$ cat /usr/local/prometheus/prometheus.yml
...
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 100.80.0.128:9093
...

Reloading the configuration

systemctl reload prometheus.service
# or (requires starting prometheus with --web.enable-lifecycle)
curl -XPOST 127.0.0.1:9090/-/reload

You can check the Alertmanager endpoints at http://100.80.0.128:9090/status

Simulating a service failure

systemctl stop node-exporter.service

Alerts and silences can be viewed in Alertmanager at http://100.80.0.128:9093/#/alerts

Generating new series with recording rules

  • rules/node.yml
groups:
  - name: node
    rules:
      # number of CPUs
      - record: node_cpu_total
        expr: count(node_cpu_seconds_total{mode="system"}) by (instance)
      # CPU usage ratio
      - record: node_cpu_avg
        expr: avg(1-irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance,job)
      # memory usage ratio
      - record: node_memory_percent
        expr: (1- node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)

  - name: node_alert
    rules:
      - alert: NodeCPUUsageHigh # alert names must be valid identifiers
        expr: node_cpu_avg > 0.001
        for: 3m
        labels:
          level: warning
        annotations:
          summary: "CPU usage on node {{ $labels.instance }} is {{ $value }}, above 0.1%"
          description: "CPU usage above 0.1%"

Select Show annotations to see the details

Service discovery

  • Prometheus service discovery supports static files, Consul, DNS, Kubernetes, several public clouds, and more
  • Dynamic service discovery mainly addresses frequent resource changes in large systems

File-based service discovery

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
    # labels:
    #   app: prometheus
    #   job: xxx

file_sd_configs

Targets are loaded automatically from YAML/JSON files; these files can also be generated by a program that pulls data from a CMDB:

scrape_configs:
  - job_name: "node2"
    file_sd_configs:
      - files:
        - sd/file/node/*.yml
        # refresh_interval: 2m # reload every 2 minutes; default is 5m
  • sd/file/node/name.yml
- targets:
  - 100.80.0.128:9100
  • Validate
promtool check config prometheus.yml

DNS-based service discovery

Targets are discovered by querying DNS names, relying on A, AAAA, and SRV records
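A sketch of a DNS SD scrape config; the SRV record name _node._tcp.example.com is a placeholder:

```yaml
scrape_configs:
  - job_name: "dns-nodes"
    dns_sd_configs:
      - names:
          - "_node._tcp.example.com"  # SRV records resolve directly to host:port targets
        type: SRV
        refresh_interval: 30s
```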

Kubernetes service discovery

Prometheus natively auto-discovers Kubernetes resources such as Node, Pod, Service, Endpoints, and Ingress (selected via role in the prometheus config), and node-exporter can be deployed as a DaemonSet to collect node metrics.
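A sketch of node discovery, assuming Prometheus runs in-cluster with a service account allowed to list nodes:

```yaml
scrape_configs:
  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node   # other roles: pod, service, endpoints, ingress
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # promote all Kubernetes node labels to target labels
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
```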

References

  1. https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule
  2. https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
  3. https://grafana.com/blog/2022/03/21/how-relabeling-in-prometheus-works/#internal-labels
  4. https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/best-practices-for-configuring-alert-rules-in-prometheus#p-zyd-s4g-kem
  5. https://blog.csdn.net/qq_28513801/article/details/144402379