Prometheus Alerting Rule Reference
Prometheus alerts are defined in a PrometheusRule resource. This resource is part of the Prometheus Operator and defines the alerts that should be sent to Alertmanager.
Alertmanager is a component of the Prometheus project that handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct Slack channel.
Alert conditions are written in PromQL, the Prometheus query language. The PromQL expression specifies when an alert should fire, and the PrometheusRule resource specifies the alert and its properties.
When an alert fires, it is sent to the team's Slack channel configured in nais teams.
graph LR
alerts.yaml --> Prometheus
Prometheus --> Alertmanager
Alertmanager --> Slack
PrometheusRule
.nais/alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    team: my-team
  name: my-alerts
  namespace: my-team
spec:
  groups:
    - name: my-alert-rules
      rules:
        - alert: HighCPUUsage
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 10m
          labels:
            severity: warning
          annotations:
            consequence: The high CPU usage may impact the performance of the application.
            action: Investigate the cause of high CPU usage and optimize the application if necessary.
            summary: High CPU usage for more than 10 minutes.
groups[]
A PrometheusRule can contain multiple groups of rules. Each group can contain multiple alert rules.
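As a minimal sketch of that structure, the fragment below shows a spec with two groups. The group names, the alert names, and the `job="my-app"` selector are illustrative assumptions, not values required by nais.

```yaml
spec:
  groups:
    # First group: availability-related rules (names are hypothetical)
    - name: availability-rules
      rules:
        - alert: InstanceDown
          # `up` is 0 when Prometheus fails to scrape a target;
          # the job label value is an assumption
          expr: up{job="my-app"} == 0
          for: 5m
          labels:
            severity: critical
    # Second group: resource-related rules
    - name: resource-rules
      rules:
        - alert: HighCPUUsage
          expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 10m
          labels:
            severity: warning
```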
groups[].name
The name of the rule group. It must be unique within the file. Rules in the same group are evaluated together at a regular interval.
groups[].rules[]
groups[].rules[].alert
The name of the alert. This is used to identify the alert in the Alertmanager. Typically this is a short, descriptive name in CamelCase.
groups[].rules[].expr
A PromQL expression that defines when the alert should fire. The expression should evaluate to true, meaning it returns at least one time series, only while the alert condition holds.
We suggest using the Explore page in Grafana to build and test your PromQL expressions before creating a PrometheusRule.
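The rule below is a sketch of how a comparison in PromQL acts as the alert condition. The metric `http_requests_total`, its `status` label, and the 5% threshold are assumptions chosen for illustration; substitute the metrics your application actually exposes.

```yaml
- alert: HighErrorRate
  # Hypothetical metric and label names.
  # The ratio of 5xx requests to all requests over the last 5 minutes is
  # compared against 0.05. The comparison keeps the result only when the
  # ratio is above the threshold, so the expression returns nothing
  # (and the alert stays inactive) while the error rate is low.
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
```

Running the part of the expression before `> 0.05` in Grafana's Explore page first shows the raw ratio, which makes it easier to choose a sensible threshold.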
groups[].rules[].for
How long the expr must keep evaluating to true before the alert fires. This prevents flapping alerts and alerting on temporary spikes in metrics.
When the expr first evaluates to true, the alert enters the pending state and stays there for the duration specified. If the expr stops evaluating to true during that time, the alert does not fire.
Example values: 30s, 5m, 1h.
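The annotated fragment below sketches the pending-to-firing transition for a rule with a 10 minute duration. The clock times are made up for illustration, and exact timing also depends on how often the rule group is evaluated.

```yaml
- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  # Illustrative timeline (times are assumptions):
  #   12:00  expr first evaluates to true   -> alert is pending
  #   12:04  expr stops evaluating to true  -> alert is dropped, nothing is sent
  #   12:20  expr evaluates to true again   -> alert is pending, timer restarts
  #   12:30  expr has been true for 10m     -> alert is firing
  for: 10m
```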
groups[].rules[].labels
Labels to attach to the alert. These are used to group and filter alerts in the Alertmanager.
groups[].rules[].labels.severity
The severity determines the color of the notification. Possible values are critical (🔴), warning (🟡) and notice (🟢).
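One possible pattern, shown here as a sketch rather than a convention this reference prescribes, is to define two rules on the same expression with different thresholds and severities. The alert names and the 80 and 95 thresholds are assumptions.

```yaml
# Lower threshold, longer duration: yellow notification
- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 10m
  labels:
    severity: warning
# Higher threshold, shorter duration: red notification
- alert: CriticalCPUUsage
  expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
  for: 5m
  labels:
    severity: critical
```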
groups[].rules[].annotations
groups[].rules[].annotations.consequence (optional)
The consequence annotation describes the real-world impact of the condition that triggered the alert. It helps the person receiving the alert understand how serious the situation is.
groups[].rules[].annotations.action (optional)
The action annotation describes the best course of action to resolve the issue. Good alerts have a clear action that the recipient can take.
groups[].rules[].annotations.summary (optional)
The summary annotation gives a short description of the alert. It helps the person receiving the alert understand what the alert is about.