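In Alertmanager, the repeat_interval option sets how long to wait before re-sending a notification for an alert that has already been notified and is still firing. Time-based escalation can be built by adding a second alerting rule that starts firing only after the condition has persisted for a longer period, and by routing it so that further notifications keep going out until the alert is resolved. Example configuration: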
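# Prometheus rule file
groups:
  - name: example
    rules:
      # Base alert rule
      - alert: ServiceFailed
        expr: job_failed{job="example-service"} > 0
        for: 5m
        labels:
          severity: page
      # Time-based escalation: same expression, longer "for", higher severity
      - alert: ServiceFailedEscalation
        expr: job_failed{job="example-service"} > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          message: "Service failed for more than 10 minutes!"

# Alertmanager configuration (repeat_interval and routes belong here, not in the rule file)
route:
  receiver: default-receiver  # placeholder name; must be defined under "receivers"
  routes:
    # Send a new notification every 2 minutes until resolved
    - match:
        severity: page
      repeat_interval: 2m
    # Escalated alerts (firing after 10 minutes) are re-sent every 10 minutes thereafter
    - match:
        severity: critical
      repeat_interval: 10m
      # keep evaluating any sibling routes that follow this one
      continue: true
receivers:
  - name: default-receiver  # placeholder; add the actual notification integrations here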
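In the example above, "ServiceFailed" is the base alert rule. Its "for: 5m" clause does not mean the rule triggers every 5 minutes; it means the expression on "job_failed" must hold continuously for 5 minutes before the alert fires. "ServiceFailedEscalation" evaluates the same expression independently with "for: 10m", so it starts firing only if the problem is still unresolved after 10 minutes. repeat_interval is an Alertmanager routing option rather than a property of the Prometheus rule: with a 2-minute interval, the notification for a still-firing alert is re-sent every 2 minutes until the alert is resolved, while the escalated alert is routed at "critical" severity and re-sent at 10-minute intervals.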
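The timing described above can be sketched with a small simulation. This is only a simplified model, not Alertmanager's actual scheduling: it assumes the first notification goes out the moment the alert starts firing, that the base alert is re-sent every 2 minutes and the escalated alert every 10 minutes, and it ignores group_wait, group_interval and rule evaluation delays. The Rule class and notification_times function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    for_minutes: int      # Prometheus "for" duration (assumed value from the example)
    repeat_minutes: int   # Alertmanager repeat_interval on the matching route (assumed)


def notification_times(rule, failure_minutes):
    """Return the minutes (since the failure began) at which notifications go out.

    Simplified model: the alert fires once the condition has held for
    rule.for_minutes, a notification is sent immediately, and then one more
    every rule.repeat_minutes while the failure lasts. The failure ends at
    failure_minutes, so no notification is sent at that exact minute.
    """
    if failure_minutes <= rule.for_minutes:
        return []  # condition cleared before the alert could fire
    return list(range(rule.for_minutes, failure_minutes, rule.repeat_minutes))


base = Rule("ServiceFailed", for_minutes=5, repeat_minutes=2)
escalation = Rule("ServiceFailedEscalation", for_minutes=10, repeat_minutes=10)

# Print the notification timeline for a failure lasting 30 minutes.
for rule in (base, escalation):
    print(rule.name, notification_times(rule, failure_minutes=30))
```

In a real deployment the observed intervals may differ from this model. In particular, notifications for a group are only re-evaluated on group_interval ticks, so a repeat_interval shorter than group_interval is unlikely to take effect as written.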