Prometheus Recording Rules and Alerting Rules

Post author:xfxia
Post published: October 16, 2023
Post category: Other

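Prometheus supports two types of rules: recording rules and alerting rules. To include rules in Prometheus, create a file containing the necessary rule statements and have Prometheus load it through the rule_files field in the Prometheus configuration.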
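As a minimal sketch of that configuration, the fragment below shows how a rule file might be referenced from prometheus.yml. The file name prometheus.rules.yml matches the directory listing later in this post; the 15s evaluation interval is an assumed value chosen only for illustration.

```yaml
# prometheus.yml (fragment)
global:
  # Assumed value: how often rules are evaluated unless a group overrides it.
  evaluation_interval: 15s

# Rule files to load; paths are resolved relative to this configuration file.
rule_files:
  - "prometheus.rules.yml"
```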

Rule files can be reloaded at runtime by sending SIGHUP to the Prometheus process. The changes are applied only if all rule files are well-formed.

Syntax-checking rules

To quickly check whether a rule file is syntactically correct without starting a Prometheus server, install and run Prometheus's promtool command-line tool:

go get github.com/prometheus/prometheus/cmd/promtool

Usage example

[root@fabric-cli prometheus-2.2.1.linux-amd64]# ls -l
total 108104
drwxrwxr-x 2 1000 1000       38 Mar 14 22:14 console_libraries
drwxrwxr-x 2 1000 1000      173 Mar 14 22:14 consoles
drwxr-xr-x 5 root root       85 May 12 00:05 data
-rw-rw-r-- 1 1000 1000    11357 Mar 14 22:14 LICENSE
-rw-rw-r-- 1 1000 1000     2769 Mar 14 22:14 NOTICE
-rwxr-xr-x 1 1000 1000 66176282 Mar 14 22:17 prometheus
-rw-r--r-- 1 root root      167 May  4 10:47 prometheus.rules.yml
-rw-rw-r-- 1 1000 1000      879 May  4 10:49 prometheus.yml
-rwxr-xr-x 1 1000 1000 44492910 Mar 14 22:18 promtool

[root@fabric-cli prometheus-2.2.1.linux-amd64]# ./promtool check rules prometheus.rules.yml 
Checking prometheus.rules.yml
  SUCCESS: 1 rules found

Rule syntax:

groups:
  [ - <rule_group> ]

Syntax of <rule_group>
# The name of the group. Must be unique within a file.
name: <string>

# How often rules in the group are evaluated.
[ interval: <duration> | default = global.evaluation_interval ]

rules:
  [ - <rule> ... ]

Syntax of <rule> (recording rule)
# The name of the time series to output to. Must be a valid metric name.
record: <string>

# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and the result recorded as a new set of
# time series with the metric name as given by 'record'.
expr: <string>

# Labels to add or overwrite before storing the result.
labels:
  [ <labelname>: <labelvalue> ]

Example

groups:
  - name: example
    rules:
    - record: job:http_inprogress_requests:sum
      expr: sum(http_inprogress_requests) by (job)
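The optional interval and labels fields from the syntax above do not appear in that example. The following sketch shows one way they might be used. The group name, the 30s interval, and the team label are hypothetical values chosen for illustration.

```yaml
groups:
  - name: example_with_interval   # hypothetical group name
    # Overrides global.evaluation_interval for the rules in this group only.
    interval: 30s
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
        # Labels added to (or overwritten on) every resulting time series.
        labels:
          team: backend           # hypothetical label
```

The recorded series can then be queried by its new name, job:http_inprogress_requests:sum, instead of re-evaluating the aggregation each time.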

The syntax for alerting rules is as follows:

# The name of the alert. Must be a valid metric name.
alert: <string>

# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and all resultant time series become
# pending/firing alerts.
expr: <string>

# Alerts are considered firing once they have been returned for this long.
# Alerts which have not yet fired for long enough are considered pending.
[ for: <duration> | default = 0s ]

# Labels to add or overwrite for each alert.
labels:
  [ <labelname>: <tmpl_string> ]

# Annotations to add to each alert.
annotations:
  [ <labelname>: <tmpl_string> ]

Alerting rule example

groups:
- name: example
  rules:
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency

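for: 10m means the alert fires only after the expression has held continuously for 10 minutes, that is, after the 5-minute mean request latency of myjob has stayed above 0.5 seconds for 10 minutes. Until then the alert is in the pending state.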
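The alert expression above reads a series whose name follows the recording-rule naming pattern, so the two rule types are often defined together. The sketch below is one possible arrangement, not the original author's configuration. It assumes request_latency_seconds is a histogram or summary exposing _sum and _count series, and the annotation texts are illustrative. The template variables $labels and $value are the ones Prometheus makes available in alert labels and annotations.

```yaml
groups:
  - name: example
    rules:
      # Recording rule: precompute the 5-minute mean latency per job.
      # Assumes request_latency_seconds_sum and _count exist (hypothetical metric).
      - record: job:request_latency_seconds:mean5m
        expr: |
          sum by (job) (rate(request_latency_seconds_sum[5m]))
          /
          sum by (job) (rate(request_latency_seconds_count[5m]))

      # Alerting rule: fires after the recorded mean stays above 0.5s for 10 minutes.
      - alert: HighRequestLatency
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          # Templates are expanded per alert instance at evaluation time.
          summary: "High request latency for job {{ $labels.job }}"
          description: "5m mean latency is {{ $value }}s (threshold 0.5s)."
```

A file like this can be validated with the same promtool check rules command shown earlier before it is loaded.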

References

recording_rules

alerting_rules
