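Prometheus supports two types of rules: recording rules and alerting rules. To include rules in Prometheus, create a file containing the necessary rule statements and have Prometheus load it through the rule_files field in the Prometheus configuration.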
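As a minimal sketch of what that configuration might look like, the fragment below lists one rule file under rule_files. The file name matches the one in the directory listing later in this post; the evaluation_interval value and the commented-out glob path are assumptions for illustration only.

```yaml
# prometheus.yml (fragment)
global:
  # Assumed value: how often rule groups are evaluated unless a group
  # sets its own interval.
  evaluation_interval: 15s

rule_files:
  # Rule file from the directory listing in this post.
  - "prometheus.rules.yml"
  # Entries may also be globs; this path is hypothetical.
  # - "rules/*.yml"
```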
Rule files can be reloaded at runtime by sending SIGHUP to the Prometheus process. The changes are applied only if all rule files are well-formed.
Syntax-checking rules
To quickly check whether a rule file is syntactically correct without starting a Prometheus server, install and run Prometheus's promtool command-line tool:
go get github.com/prometheus/prometheus/cmd/promtool
Usage example
[root@fabric-cli prometheus-2.2.1.linux-amd64]# ls -l
total 108104
drwxrwxr-x 2 1000 1000       38 Mar 14 22:14 console_libraries
drwxrwxr-x 2 1000 1000      173 Mar 14 22:14 consoles
drwxr-xr-x 5 root root       85 May 12 00:05 data
-rw-rw-r-- 1 1000 1000    11357 Mar 14 22:14 LICENSE
-rw-rw-r-- 1 1000 1000     2769 Mar 14 22:14 NOTICE
-rwxr-xr-x 1 1000 1000 66176282 Mar 14 22:17 prometheus
-rw-r--r-- 1 root root      167 May  4 10:47 prometheus.rules.yml
-rw-rw-r-- 1 1000 1000      879 May  4 10:49 prometheus.yml
-rwxr-xr-x 1 1000 1000 44492910 Mar 14 22:18 promtool
[root@fabric-cli prometheus-2.2.1.linux-amd64]# ./promtool check rules prometheus.rules.yml
Checking prometheus.rules.yml
SUCCESS: 1 rules found
Rule syntax:
groups:
  [ - <rule_group> ]
Syntax of <rule_group>
# The name of the group. Must be unique within a file.
name: <string>
# How often rules in the group are evaluated.
[ interval: <duration> | default = global.evaluation_interval ]
rules:
  [ - <rule> ... ]
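Syntax of <rule> for recording rules
# The name of the time series to output to. Must be a valid metric name.
record: <string>
# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and the result recorded as a new set of
# time series with the metric name as given by 'record'.
expr: <string>
# Labels to add or overwrite before storing the result.
labels:
  [ <labelname>: <labelvalue> ]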
Example
groups:
  - name: example
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
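The optional interval and labels fields from the syntax above can be added to the same rule. The following is only a sketch: the group name, the 30s interval, and the team label are hypothetical values chosen for illustration.

```yaml
groups:
  - name: example_with_options        # hypothetical group name
    # Assumed value: overrides global.evaluation_interval for this group only.
    interval: 30s
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum(http_inprogress_requests) by (job)
        labels:
          # Hypothetical label, added to (or overwriting on) every
          # resulting time series before it is stored.
          team: backend
```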
The syntax for alerting rules is as follows:
# The name of the alert. Must be a valid metric name.
alert: <string>
# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and all resultant time series become
# pending/firing alerts.
expr: <string>
# Alerts are considered firing once they have been returned for this long.
# Alerts which have not yet fired for long enough are considered pending.
[ for: <duration> | default = 0s ]
# Labels to add or overwrite for each alert.
labels:
  [ <labelname>: <tmpl_string> ]
# Annotations to add to each alert.
annotations:
  [ <labelname>: <tmpl_string> ]
Alerting rule example
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
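for: 10m
means the alert fires only after the expression has been true continuously for 10 minutes (here, mean request latency for myjob above 0.5 seconds). Until then the alert is in the pending state.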
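Because label and annotation values are template strings (tmpl_string in the syntax above), they can reference the labels and value of the series that triggered the alert. The sketch below assumes a renamed alert, HighRequestLatency, and a hypothetical description annotation; the {{ $labels.<name> }} and {{ $value }} template variables are provided by Prometheus.

```yaml
groups:
  - name: example
    rules:
      - alert: HighRequestLatency     # hypothetical name for this sketch
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          # $labels holds the label set of the alerting series.
          summary: "High request latency for job {{ $labels.job }}"
          # $value is the value of the expression at evaluation time.
          description: "Mean 5m latency is {{ $value }}s (threshold 0.5s)."
```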
References
recording_rules
alerting_rules