# Alerting

## Alert Rules

18 rules across 6 categories, defined in both Prometheus (`prometheus/alerts/`) and Grafana Unified Alerting.
### Node (Infrastructure)

| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| HighCpuUsage | > 80% for 5 min | warning | Sustained high CPU impacts all containers. 80% leaves headroom before saturation. |
| HighMemoryUsage | > 85% for 5 min | warning | OOM kills start above 90%. 85% gives time to react. |
| DiskSpaceLow | < 15% free for 5 min | critical | Below 15% risks database corruption (WAL writes fail). |
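For reference, the first of these rules might look like the following in `prometheus/alerts/`. This is a sketch, not the repository's actual file: the node_exporter-style metric, the rule group name, and the annotation text are assumptions.

```yaml
groups:
  - name: node  # group name is an assumption
    rules:
      - alert: HighCpuUsage
        # Average CPU busy percentage across cores, assuming node_exporter metrics
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
```

The other rules in this section follow the same shape: an `expr` threshold, a `for` hold duration matching the "Condition" column, and a `severity` label that drives routing.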
### Containers

| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| ContainerRestarted | New restart detected | warning | Catches crash loops and OOM kills early. |
| ContainerDown | Aletheia web/celery absent > 2 min | critical | Core services — immediate attention required. |
| ContainerHighMemory | > 90% of memory limit for 5 min | warning | Approaching OOM kill territory. |
### PostgreSQL

| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| PostgresConnectionsHigh | > 80% of max_connections for 5 min | warning | Connection exhaustion causes new requests to fail. |
| PostgresDown | Exporter reports down | critical | Database outage — all apps affected. |
### Application

| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| HealthCheckFailing | Non-2xx for 2 min | critical | App is unreachable from outside. |
| NginxHigh5xxRate | > 5% of requests are 5xx for 5 min | warning | Signals backend errors or overload. |
| CeleryQueueBacklog | > 50 pending messages for 10 min | warning | Tasks piling up — worker may be stuck or down. |
| CeleryWorkerDown | Worker offline > 2 min | critical | Background tasks stop processing entirely. |
| CeleryHighFailRate | > 10% failure rate over 5 min | warning | Systematic task failures (bad data, external service down). |
### SSL

| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| SslCertExpiringSoon | Expires in < 14 days | warning | Certbot should auto-renew at 30 days. If it hasn't by 14 days, renewal is broken. |
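A rule like this is typically built on the blackbox exporter's `probe_ssl_earliest_cert_expiry` metric (seconds since epoch of the earliest certificate expiry). A sketch under that assumption; the `for` duration and annotation text are guesses:

```yaml
- alert: SslCertExpiringSoon
  # probe_ssl_earliest_cert_expiry is emitted by the blackbox exporter's TLS check;
  # dividing the remaining seconds by 86400 converts to days
  expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
  for: 5m  # assumed hold duration
  labels:
    severity: warning
  annotations:
    summary: "Certificate for {{ $labels.instance }} expires in under 14 days"
```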
### Helios

| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| HeliosContainerDown | No web container > 2 min | critical | Practice websites are down. |
| HeliosContainerRestarting | > 2 restarts in 15 min | warning | Crash loop — investigate logs. |
| HeliosHealthCheckFailing | Practice site probes failing > 5 min | critical | Specific practice site unreachable. |
| HeliosSSLExpiringSoon | Practice domain SSL < 14 days to expiry | warning | Practice site will show browser warning. |
## Notification Routing
Alerts route through two channels:
| Channel | Type | Details |
|---|---|---|
| Email (primary) | SMTP via Brevo | alerts@groupe-suffren.com |
| Teams (secondary) | Webhook via Power Automate | Posts to configured Teams channel |
Timing:
- Group wait: 30 seconds (batch initial alerts)
- Group interval: 5 minutes (batch follow-up updates)
- Repeat interval: 4 hours (re-send if still unresolved)
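The three timing values map directly onto the standard grouping knobs used by both Alertmanager routes and Grafana notification policies. An Alertmanager-style sketch for illustration; the receiver names and the Power Automate URL are placeholders, not values from this deployment:

```yaml
route:
  receiver: email-primary   # receiver names here are illustrative
  group_wait: 30s           # batch initial alerts
  group_interval: 5m        # batch follow-up updates
  repeat_interval: 4h       # re-send if still unresolved

receivers:
  - name: email-primary
    email_configs:
      - to: alerts@groupe-suffren.com   # SMTP delivery via Brevo configured globally
  - name: teams-secondary
    webhook_configs:
      - url: https://example.invalid/power-automate-flow   # placeholder webhook URL
```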
## Blackbox Probes
External monitoring via HTTP health checks and SSL certificate probes:
Targets probed (30s interval for health, 15s for SSL):
- Aletheia: prod, staging, dev
- Helios practice sites: cabinet-dentaire-aubagne.fr, le-canet.chirurgiens-dentistes.fr, cabinet-bodin.fr, dr-david-simon.chirurgiens-dentistes.fr
- Helios subdomains: prod/staging/dev variants on groupe-suffren.com
- Monitoring: monitoring.groupe-suffren.com
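Blackbox probing is usually wired up with the standard `/probe` relabeling pattern, where Prometheus passes each target URL to the exporter as a query parameter. A sketch of what the health-check job could look like; the module name, exporter address, and target list shown are assumptions:

```yaml
scrape_configs:
  - job_name: blackbox-http       # job name is an assumption
    metrics_path: /probe
    params:
      module: [http_2xx]          # blackbox module name is an assumption
    scrape_interval: 30s          # health checks every 30s (SSL job would use 15s)
    static_configs:
      - targets:
          - https://monitoring.groupe-suffren.com
          - https://cabinet-dentaire-aubagne.fr
    relabel_configs:
      # Standard blackbox pattern: move the target URL into ?target=,
      # keep it as the instance label, and point the scrape at the exporter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # exporter address is an assumption
```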
## Modifying Alert Rules

- Edit the relevant file in `monitoring/prometheus/alerts/`
- Run `make deploy` — this copies files and reloads Prometheus via `/-/reload`
- For Grafana-side rules: edit `monitoring/grafana/provisioning/alerting/rules.yml` — Grafana auto-reloads alerting rules on file change
**Keep rules in sync:** Alert rules are duplicated in Prometheus (for evaluation) and Grafana (for UI display and routing). When modifying thresholds, update both files.