Alerting¶

Alert Rules¶

17 rules across 6 categories, defined in both Prometheus (prometheus/alerts/) and Grafana Unified Alerting.

Node (Infrastructure)¶

Alert	Condition	Severity	Rationale
`HighCpuUsage`	> 80% for 5 min	warning	Sustained high CPU impacts all containers. 80% leaves headroom before saturation.
`HighMemoryUsage`	> 85% for 5 min	warning	OOM kills start above 90%. 85% gives time to react.
`DiskSpaceLow`	< 15% free for 5 min	critical	Below 15% risks database corruption (WAL writes fail).

Containers¶

Alert	Condition	Severity	Rationale
`ContainerRestarted`	New restart detected	warning	Catches crash loops and OOM kills early.
`ContainerDown`	Aletheia web/celery absent > 2 min	critical	Core services — immediate attention required.
`ContainerHighMemory`	> 90% of memory limit for 5 min	warning	Approaching OOM kill territory.
`ContainerHighCpu`	> 150% (1.5 cores) for 5 min	warning	Catches single-container runaways (e.g. self-queueing Celery loops) that `HighCpuUsage` misses because they don't move the host-wide average.

PostgreSQL¶

Alert	Condition	Severity	Rationale
`PostgresConnectionsHigh`	> 80% of `max_connections` for 5 min	warning	Connection exhaustion causes new requests to fail.
`PostgresDown`	Exporter reports down	critical	Database outage — all apps affected.

Application¶

Alert	Condition	Severity	Rationale
`HealthCheckFailing`	Any `blackbox-health` (Aletheia-only) probe non-200 for 2 min	critical	Aletheia backend is unreachable, or its API surface (`/api/v1/websites/.../config/`) is broken. Frontend roots are a separate job — not this alert.
`NginxHigh5xxRate`	> 5% of requests are 5xx for 5 min	warning	Signals backend errors or overload.
`CeleryQueueBacklog`	> 50 pending messages for 10 min	warning	Tasks piling up — worker may be stuck or down.
`CeleryWorkerDown`	Worker offline > 2 min	critical	Background tasks stop processing entirely.
`CeleryHighFailRate`	> 10% failure rate over 5 min	warning	Systematic task failures (bad data, external service down).

SSL¶

Alert	Condition	Severity	Rationale
`SslCertExpiringSoon`	Expires in < 14 days	warning	Certbot should auto-renew at 30 days. If it hasn't by 14 days, renewal is broken.

Helios¶

Alert	Condition	Severity	Rationale
`HeliosContainerDown`	No web container > 2 min	critical	Practice websites are down.
`HeliosContainerRestarting`	> 2 restarts in 15 min	warning	Crash loop — investigate logs.
`HeliosHealthCheckFailing`	Practice site root probe non-200 > 5 min	critical	Frontend reachability only (DNS/cert/process/nginx). Does not detect Aletheia outages — see the soft-404 caveat below.
`HeliosSSLExpiringSoon`	Practice domain SSL < 14 days to expiry	warning	Practice site will show browser warning.

Notification Routing¶

Alerts route through two channels:

Channel	Type	Details
Email (primary)	SMTP via Brevo	`alerts@groupe-suffren.com`
Teams (secondary)	Webhook via Power Automate	Posts to configured Teams channel

Timing:

Group wait: 30 seconds (batch initial alerts)
Group interval: 5 minutes (batch follow-up updates)
Repeat interval: 4 hours (re-send if still unresolved)

Blackbox Probes¶

External monitoring via HTTP health checks and SSL certificate probes:

SSL certificate monitoring:

https://aletheia.groupe-suffren.com
https://aletheia-staging.groupe-suffren.com
https://aletheia-dev.groupe-suffren.com
https://monitoring.groupe-suffren.com
https://cabinet-dentaire-aubagne.fr
https://le-canet.chirurgiens-dentistes.fr
https://cabinet-bodin.fr
https://dr-david-simon.chirurgiens-dentistes.fr
https://cda.groupe-suffren.com
https://vsm.groupe-suffren.com
https://pds.groupe-suffren.com
https://ths.groupe-suffren.com
https://cda-staging.groupe-suffren.com
https://vsm-staging.groupe-suffren.com
https://cda-dev.groupe-suffren.com
https://vsm-dev.groupe-suffren.com

Aletheia backend health monitoring (HealthCheckFailing):

https://aletheia.groupe-suffren.com/health/
https://aletheia-staging.groupe-suffren.com/health/
https://aletheia-dev.groupe-suffren.com/health/
https://aletheia-staging.groupe-suffren.com/api/v1/websites/sites/cabinet-dentaire-aubagne.fr/config/

Helios frontend reachability (HeliosHealthCheckFailing — does not detect backend outages):

https://cabinet-dentaire-aubagne.fr
https://le-canet.chirurgiens-dentistes.fr

Helios masks backend outages — the authoritative Aletheia signal is the /health/ + API config probes

Helios (the Next.js frontend) returns HTTP 200 with a generic soft-404 error UI when the Aletheia API throws on its home/blog/team pages. So a 200 from a Helios practice-site root (cabinet-dentaire-aubagne.fr, le-canet.chirurgiens-dentistes.fr) is a false green during an Aletheia outage. Those root probes are therefore treated as frontend-reachability only (they catch a fully-down site — DNS/cert/process/nginx — e.g. a probe value of 0/timeout). They live in a dedicated blackbox-helios-frontend scrape job, separate from the Aletheia blackbox-health job, so they drive HeliosHealthCheckFailing alone and never the backend HealthCheckFailing.

The independent, authoritative backend signal is the blackbox-health job's Aletheia targets:

…/health/ — deep-checks DB + Redis + Celery (200/503).
…/api/v1/websites/sites/<domain>/config/ — exercises the apps/websites DRF view/serializer layer end-to-end. /health/ stays 200 if only that API layer breaks, so this probe closes the gap. The https_2xx module accepts only HTTP 200, so any non-200 (404/500/503) trips probe_success → HealthCheckFailing (critical).

Coverage caveat (2026-06-03): the API config probe runs against staging only — staging is seeded with the 4 practice SiteConfigs. Prod has no SiteConfig seeded yet (practice sites unlaunched; helios-prod in maintenance), so the prod config probe is committed but commented out in prometheus.yml; enable it at practice-site launch. Staging runs the same image, so it still catches API-layer regressions before they reach prod.

Modifying Alert Rules¶

Edit the relevant file in monitoring/prometheus/alerts/
Run make deploy — this copies files and reloads Prometheus via /-/reload
For Grafana-side rules: edit monitoring/grafana/provisioning/alerting/rules.yml
Grafana auto-reloads alerting rules on file change

Keep rules in sync

Alert rules are duplicated in Prometheus (for evaluation) and Grafana (for UI display and routing). When modifying thresholds, update both files.