Alerting
Alert Rules
19 rules across 6 categories, defined in both Prometheus (prometheus/alerts/) and Grafana Unified Alerting.
Node (Infrastructure)
| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| `HighCpuUsage` | > 80% for 5 min | warning | Sustained high CPU impacts all containers. 80% leaves headroom before saturation. |
| `HighMemoryUsage` | > 85% for 5 min | warning | OOM kills start above 90%. 85% gives time to react. |
| `DiskSpaceLow` | < 15% free for 5 min | critical | Below 15% risks database corruption (WAL writes fail). |
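
Each condition above pairs a threshold with a hold duration ("for 5 min"). As a rough illustration of that semantics, the sketch below fires only when the value has stayed above the threshold continuously for the whole hold period. The function name, the sample format, and the 60-second spacing in the usage note are assumptions for illustration; they are not taken from the actual rule files.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix_seconds, value), sorted by time ascending


def fires(samples: Sequence[Sample], threshold: float, hold_seconds: float) -> bool:
    """Return True if the value has been above `threshold` without interruption
    for at least `hold_seconds`, measured up to the newest sample."""
    if not samples:
        return False
    breach_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current uninterrupted breach
        else:
            breach_start = None  # any dip below the threshold resets the timer
    newest_ts = samples[-1][0]
    return breach_start is not None and newest_ts - breach_start >= hold_seconds
```

With one sample per minute, six consecutive readings of 85% CPU satisfy "> 80% for 5 min", while a single dip to 70% in the middle resets the timer and keeps the alert from firing.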
Containers
| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| `ContainerRestarted` | New restart detected | warning | Catches crash loops and OOM kills early. |
| `ContainerDown` | Aletheia web/celery absent > 2 min | critical | Core services; immediate attention required. |
| `ContainerHighMemory` | > 90% of memory limit for 5 min | warning | Approaching OOM kill territory. |
| `ContainerHighCpu` | > 150% (1.5 cores) for 5 min | warning | Catches single-container runaways (e.g. self-queueing Celery loops) that `HighCpuUsage` misses because they don't move the host-wide average. |
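
The rationale for `ContainerHighCpu` is easier to see with numbers. The sketch below converts a container's core usage into both conventions: "percent of one core" (the per-container threshold) and "percent of the whole host" (the `HighCpuUsage` threshold). The 8-core host is a hypothetical figure chosen for the example; the real host size is not stated here.

```python
def container_cpu_percent(cores_used: float) -> float:
    """CPU usage in the 'percent of one core' convention (150% = 1.5 cores)."""
    return 100.0 * cores_used


def host_cpu_percent(cores_used: float, host_cores: int) -> float:
    """Share of total host CPU that the same usage represents, in percent."""
    return 100.0 * cores_used / host_cores


HOST_CORES = 8  # assumption for illustration only

runaway_cores = 2.0  # a single container spinning on two cores
per_container = container_cpu_percent(runaway_cores)       # above the 150% threshold
host_share = host_cpu_percent(runaway_cores, HOST_CORES)   # far below the 80% host threshold
```

Under that assumption a two-core runaway reads as 200% at container level but adds only 25% to the host-wide figure, so the host-level alert alone would stay silent.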
PostgreSQL
| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| `PostgresConnectionsHigh` | > 80% of `max_connections` for 5 min | warning | Connection exhaustion causes new requests to fail. |
| `PostgresDown` | Exporter reports down | critical | Database outage; all apps affected. |
Application
| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| `HealthCheckFailing` | Any `blackbox-health` (Aletheia-only) probe non-200 for 2 min | critical | Aletheia backend is unreachable, or its API surface (`/api/v1/websites/.../config/`) is broken. Frontend roots are a separate job and do not drive this alert. |
| `NginxHigh5xxRate` | > 5% of requests are 5xx for 5 min | warning | Signals backend errors or overload. |
| `CeleryQueueBacklog` | > 50 pending messages for 10 min | warning | Tasks piling up; worker may be stuck or down. |
| `CeleryWorkerDown` | Worker offline > 2 min | critical | Background tasks stop processing entirely. |
| `CeleryHighFailRate` | > 10% failure rate over 5 min | warning | Systematic task failures (bad data, external service down). |
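
The two ratio-based rules above (`NginxHigh5xxRate` and `CeleryHighFailRate`) both compare "bad events / all events" against a threshold. A minimal sketch of that check follows; the function names are illustrative, and treating a window with zero traffic as "no alert" is an assumption about the desired behaviour rather than something the rule files are known to do.

```python
def error_ratio(errors: float, total: float) -> float:
    """Fraction of events that failed; 0.0 when there was no traffic at all."""
    return errors / total if total > 0 else 0.0


def exceeds(errors: float, total: float, threshold: float) -> bool:
    """True when the failure fraction is strictly above `threshold`."""
    return error_ratio(errors, total) > threshold


NGINX_5XX_THRESHOLD = 0.05    # > 5% of requests are 5xx
CELERY_FAIL_THRESHOLD = 0.10  # > 10% of tasks failed
```

For example, 6 server errors out of 100 requests crosses the Nginx threshold, while 4 out of 100 does not.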
SSL
| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| `SslCertExpiringSoon` | Expires in < 14 days | warning | Certbot should auto-renew at 30 days. If it hasn't by 14 days, renewal is broken. |
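
The 14-day threshold sits well inside Certbot's 30-day renewal window, so a healthy certificate should never reach it. The sketch below shows the arithmetic on a certificate's expiry time; the function names and the fixed dates in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone

WARN_DAYS = 14  # alert threshold from the table above


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Days remaining before the certificate's notAfter time."""
    return (not_after - now).total_seconds() / 86400


def expiring_soon(not_after: datetime, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True when fewer than `warn_days` days remain."""
    return days_until_expiry(not_after, now) < warn_days
```

A certificate with 29 days left is merely inside the renewal window and does not alert; one with 13 days left means renewal has been failing for over two weeks.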
Helios
| Alert | Condition | Severity | Rationale |
|---|---|---|---|
| `HeliosContainerDown` | No web container > 2 min | critical | Practice websites are down. |
| `HeliosContainerRestarting` | > 2 restarts in 15 min | warning | Crash loop; investigate logs. |
| `HeliosHealthCheckFailing` | Practice site root probe non-200 > 5 min | critical | Frontend reachability only (DNS/cert/process/nginx). Does not detect Aletheia outages; see the soft-404 caveat below. |
| `HeliosSSLExpiringSoon` | Practice domain SSL < 14 days to expiry | warning | Practice site will show a browser warning. |
Notification Routing
Alerts route through two channels:
| Channel | Type | Details |
|---|---|---|
| Email (primary) | SMTP via Brevo | alerts@groupe-suffren.com |
| Teams (secondary) | Webhook via Power Automate | Posts to configured Teams channel |
Timing:
- Group wait: 30 seconds (batch initial alerts)
- Group interval: 5 minutes (batch follow-up updates)
- Repeat interval: 4 hours (re-send if still unresolved)
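
To make the three timers concrete, here is a simplified model of when notifications go out for a single alert group. It assumes the usual grouping behaviour (first notification after the group wait, changes batched on group-interval ticks, reminders after the repeat interval) and ignores resolution, silences, and inhibition. All function names are illustrative.

```python
import math
from typing import List

GROUP_WAIT = 30             # seconds
GROUP_INTERVAL = 5 * 60     # seconds
REPEAT_INTERVAL = 4 * 3600  # seconds


def first_notification(first_alert_at: float) -> float:
    """The first alert of a group is held for GROUP_WAIT so that related alerts can join it."""
    return first_alert_at + GROUP_WAIT


def next_update(first_sent_at: float, new_alert_at: float) -> float:
    """When a new alert joining an already-notified group is sent:
    the first group-interval tick at or after its arrival."""
    if new_alert_at <= first_sent_at:
        return first_sent_at  # arrived during the group wait, rides along with the first batch
    ticks = math.ceil((new_alert_at - first_sent_at) / GROUP_INTERVAL)
    return first_sent_at + ticks * GROUP_INTERVAL


def repeat_times(first_sent_at: float, until: float) -> List[float]:
    """Reminder times for a group that stays unresolved and unchanged."""
    times = []
    t = first_sent_at + REPEAT_INTERVAL
    while t <= until:
        times.append(t)
        t += REPEAT_INTERVAL
    return times
```

In this model an alert firing at t=0 is first sent at t=30 s; a second alert joining the group at t=100 s is sent at t=330 s; an unchanged, unresolved group is re-sent at 30 s + 4 h and 30 s + 8 h.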
Blackbox Probes
External monitoring via HTTP health checks and SSL certificate probes:
SSL certificate monitoring:
- https://aletheia.groupe-suffren.com
- https://aletheia-staging.groupe-suffren.com
- https://aletheia-dev.groupe-suffren.com
- https://monitoring.groupe-suffren.com
- https://cabinet-dentaire-aubagne.fr
- https://le-canet.chirurgiens-dentistes.fr
- https://cabinet-bodin.fr
- https://dr-david-simon.chirurgiens-dentistes.fr
- https://cda.groupe-suffren.com
- https://vsm.groupe-suffren.com
- https://pds.groupe-suffren.com
- https://ths.groupe-suffren.com
- https://cda-staging.groupe-suffren.com
- https://vsm-staging.groupe-suffren.com
- https://cda-dev.groupe-suffren.com
- https://vsm-dev.groupe-suffren.com
Aletheia backend health monitoring (`HealthCheckFailing`):
- https://aletheia.groupe-suffren.com/health/
- https://aletheia-staging.groupe-suffren.com/health/
- https://aletheia-dev.groupe-suffren.com/health/
- https://aletheia-staging.groupe-suffren.com/api/v1/websites/sites/cabinet-dentaire-aubagne.fr/config/
Helios frontend reachability (`HeliosHealthCheckFailing`; does not detect backend outages):
- https://cabinet-dentaire-aubagne.fr
- https://le-canet.chirurgiens-dentistes.fr
Helios masks backend outages: the authoritative Aletheia signal is the /health/ + API config probes
Helios (the Next.js frontend) returns HTTP 200 with a generic soft-404 error UI when the Aletheia API throws on its home/blog/team pages. A 200 from a Helios practice-site root (cabinet-dentaire-aubagne.fr, le-canet.chirurgiens-dentistes.fr) is therefore a false green during an Aletheia outage. Those root probes are treated as frontend-reachability checks only: they catch a fully down site (DNS/cert/process/nginx), which shows up as a failed probe (a value of 0 or a timeout). They live in a dedicated `blackbox-helios-frontend` scrape job, separate from the Aletheia `blackbox-health` job, so they drive `HeliosHealthCheckFailing` alone and never the backend `HealthCheckFailing`.
The independent, authoritative backend signal is the blackbox-health job's Aletheia targets:
- `…/health/`: deep-checks DB + Redis + Celery (200/503).
- `…/api/v1/websites/sites/<domain>/config/`: exercises the `apps/websites` DRF view/serializer layer end-to-end. `/health/` stays 200 if only that API layer breaks, so this probe closes the gap. The `https_2xx` module accepts only HTTP 200, so any non-200 (404/500/503) drives `probe_success` to 0, which fires `HealthCheckFailing` (critical).
Coverage caveat (2026-06-03): the API config probe runs against staging only — staging is seeded with the 4 practice SiteConfigs. Prod has no SiteConfig seeded yet (practice sites unlaunched; helios-prod in maintenance), so the prod config probe is committed but commented out in prometheus.yml; enable it at practice-site launch. Staging runs the same image, so it still catches API-layer regressions before they reach prod.
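
The split between the frontend and backend probe jobs described above can be sketched as a small decision function. This is an illustration of the logic only: it ignores the "for" hold durations, and the function names and the use of `None` for a timeout are assumptions made for the example.

```python
from typing import Optional, Set


def probe_success(status_code: Optional[int]) -> int:
    """Mimic a probe that accepts only HTTP 200. None stands for a timeout or connection failure."""
    return 1 if status_code == 200 else 0


def firing_alerts(
    helios_root_status: Optional[int],
    aletheia_health_status: Optional[int],
    aletheia_config_status: Optional[int],
) -> Set[str]:
    """Map probe results to the alerts they would drive."""
    alerts = set()
    # Frontend job: root reachability only.
    if probe_success(helios_root_status) == 0:
        alerts.add("HeliosHealthCheckFailing")
    # Backend job: any failing Aletheia probe is enough.
    backend = (probe_success(aletheia_health_status), probe_success(aletheia_config_status))
    if 0 in backend:
        alerts.add("HealthCheckFailing")
    return alerts
```

During an Aletheia outage the Helios root still answers 200 (the soft-404 page), so `firing_alerts(200, 503, 500)` yields only `HealthCheckFailing`. If just the API layer breaks, `firing_alerts(200, 200, 500)` still yields `HealthCheckFailing`, which is the gap the config probe closes.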
Modifying Alert Rules
- Edit the relevant file in `monitoring/prometheus/alerts/`
- Run `make deploy`; this copies files and reloads Prometheus via `/-/reload`
- For Grafana-side rules: edit `monitoring/grafana/provisioning/alerting/rules.yml`
- Grafana auto-reloads alerting rules on file change
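
For reference, the reload step amounts to an HTTP POST to the `/-/reload` endpoint. The sketch below shows what such a call could look like from Python's standard library. The base URL is an assumption (substitute the real Prometheus address), and how `make deploy` actually issues the request is not specified here. Note that Prometheus only serves this endpoint when started with `--web.enable-lifecycle`.

```python
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: adjust to the real address


def build_reload_request(base_url: str = PROMETHEUS_URL) -> urllib.request.Request:
    """Build the POST request that asks Prometheus to re-read its configuration and rules."""
    url = f"{base_url.rstrip('/')}/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str = PROMETHEUS_URL, timeout: float = 10.0) -> int:
    """Send the reload request and return the HTTP status code.
    urllib raises HTTPError on 4xx/5xx responses, e.g. when a rule file fails to load."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Separating request construction from sending keeps the URL handling checkable without a running Prometheus instance.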
Keep rules in sync
Alert rules are duplicated in Prometheus (for evaluation) and Grafana (for UI display and routing). When modifying thresholds, update both files.
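
Because the two rule sets are maintained by hand, a small consistency check can catch drift. The sketch below compares rule names across already-parsed YAML documents. It assumes the standard Prometheus rule-file layout (`groups[].rules[].alert`) and the Grafana provisioning layout (`groups[].rules[].title`), and further assumes that Grafana rule titles match the Prometheus alert names, which may not hold for this repository. It checks names only, not thresholds.

```python
from typing import Dict, List, Set


def prometheus_alert_names(rule_file: dict) -> Set[str]:
    """Collect `alert` names from a parsed Prometheus rule file (recording rules are skipped)."""
    return {
        rule["alert"]
        for group in rule_file.get("groups", [])
        for rule in group.get("rules", [])
        if "alert" in rule
    }


def grafana_rule_titles(provisioning: dict) -> Set[str]:
    """Collect rule `title` values from a parsed Grafana alerting provisioning file."""
    return {
        rule["title"]
        for group in provisioning.get("groups", [])
        for rule in group.get("rules", [])
        if "title" in rule
    }


def sync_report(prom: dict, graf: dict) -> Dict[str, List[str]]:
    """List rules present on one side but not the other."""
    p, g = prometheus_alert_names(prom), grafana_rule_titles(graf)
    return {
        "missing_in_grafana": sorted(p - g),
        "missing_in_prometheus": sorted(g - p),
    }
```

The inputs are plain dictionaries, so any YAML loader can feed it; an empty list on both sides means the rule names line up, after which thresholds still need a manual comparison.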