Aller au contenu

Alerting

Alert Rules

17 rules across 6 categories, defined in both Prometheus (prometheus/alerts/) and Grafana Unified Alerting.

Node (Infrastructure)

Alert Condition Severity Rationale
HighCpuUsage > 80% for 5 min warning Sustained high CPU impacts all containers. 80% leaves headroom before saturation.
HighMemoryUsage > 85% for 5 min warning OOM kills start above 90%. 85% gives time to react.
DiskSpaceLow < 15% free for 5 min critical Below 15% risks database corruption (WAL writes fail).

Containers

Alert Condition Severity Rationale
ContainerRestarted New restart detected warning Catches crash loops and OOM kills early.
ContainerDown Aletheia web/celery absent > 2 min critical Core services — immediate attention required.
ContainerHighMemory > 90% of memory limit for 5 min warning Approaching OOM kill territory.
ContainerHighCpu > 150% (1.5 cores) for 5 min warning Catches single-container runaways (e.g. self-queueing Celery loops) that HighCpuUsage misses because they don't move the host-wide average.

PostgreSQL

Alert Condition Severity Rationale
PostgresConnectionsHigh > 80% of max_connections for 5 min warning Connection exhaustion causes new requests to fail.
PostgresDown Exporter reports down critical Database outage — all apps affected.

Application

Alert Condition Severity Rationale
HealthCheckFailing Any blackbox-health (Aletheia-only) probe non-200 for 2 min critical Aletheia backend is unreachable, or its API surface (/api/v1/websites/.../config/) is broken. Frontend roots are a separate job — not this alert.
NginxHigh5xxRate > 5% of requests are 5xx for 5 min warning Signals backend errors or overload.
CeleryQueueBacklog > 50 pending messages for 10 min warning Tasks piling up — worker may be stuck or down.
CeleryWorkerDown Worker offline > 2 min critical Background tasks stop processing entirely.
CeleryHighFailRate > 10% failure rate over 5 min warning Systematic task failures (bad data, external service down).

SSL

Alert Condition Severity Rationale
SslCertExpiringSoon Expires in < 14 days warning Certbot should auto-renew at 30 days. If it hasn't by 14 days, renewal is broken.

Helios

Alert Condition Severity Rationale
HeliosContainerDown No web container > 2 min critical Practice websites are down.
HeliosContainerRestarting > 2 restarts in 15 min warning Crash loop — investigate logs.
HeliosHealthCheckFailing Practice site root probe non-200 > 5 min critical Frontend reachability only (DNS/cert/process/nginx). Does not detect Aletheia outages — see the soft-404 caveat below.
HeliosSSLExpiringSoon Practice domain SSL < 14 days to expiry warning Practice site will show browser warning.

Notification Routing

Alerts route through two channels:

Channel Type Details
Email (primary) SMTP via Brevo alerts@groupe-suffren.com
Teams (secondary) Webhook via Power Automate Posts to configured Teams channel

Timing:

  • Group wait: 30 seconds (batch initial alerts)
  • Group interval: 5 minutes (batch follow-up updates)
  • Repeat interval: 4 hours (re-send if still unresolved)

Blackbox Probes

External monitoring via HTTP health checks and SSL certificate probes:

SSL certificate monitoring:

  • https://aletheia.groupe-suffren.com
  • https://aletheia-staging.groupe-suffren.com
  • https://aletheia-dev.groupe-suffren.com
  • https://monitoring.groupe-suffren.com
  • https://cabinet-dentaire-aubagne.fr
  • https://le-canet.chirurgiens-dentistes.fr
  • https://cabinet-bodin.fr
  • https://dr-david-simon.chirurgiens-dentistes.fr
  • https://cda.groupe-suffren.com
  • https://vsm.groupe-suffren.com
  • https://pds.groupe-suffren.com
  • https://ths.groupe-suffren.com
  • https://cda-staging.groupe-suffren.com
  • https://vsm-staging.groupe-suffren.com
  • https://cda-dev.groupe-suffren.com
  • https://vsm-dev.groupe-suffren.com

Aletheia backend health monitoring (HealthCheckFailing):

  • https://aletheia.groupe-suffren.com/health/
  • https://aletheia-staging.groupe-suffren.com/health/
  • https://aletheia-dev.groupe-suffren.com/health/
  • https://aletheia-staging.groupe-suffren.com/api/v1/websites/sites/cabinet-dentaire-aubagne.fr/config/

Helios frontend reachability (HeliosHealthCheckFailing — does not detect backend outages):

  • https://cabinet-dentaire-aubagne.fr
  • https://le-canet.chirurgiens-dentistes.fr

Helios masks backend outages — the authoritative Aletheia signal is the /health/ + API config probes

Helios (the Next.js frontend) returns HTTP 200 with a generic soft-404 error UI when the Aletheia API throws on its home/blog/team pages. So a 200 from a Helios practice-site root (cabinet-dentaire-aubagne.fr, le-canet.chirurgiens-dentistes.fr) is a false green during an Aletheia outage. Those root probes are therefore treated as frontend-reachability only (they catch a fully-down site — DNS/cert/process/nginx — e.g. a probe value of 0/timeout). They live in a dedicated blackbox-helios-frontend scrape job, separate from the Aletheia blackbox-health job, so they drive HeliosHealthCheckFailing alone and never the backend HealthCheckFailing.

The independent, authoritative backend signal is the blackbox-health job's Aletheia targets:

  • …/health/ — deep-checks DB + Redis + Celery (200/503).
  • …/api/v1/websites/sites/<domain>/config/ — exercises the apps/websites DRF view/serializer layer end-to-end. /health/ stays 200 if only that API layer breaks, so this probe closes the gap. The https_2xx module accepts only HTTP 200, so any non-200 (404/500/503) trips probe_successHealthCheckFailing (critical).

Coverage caveat (2026-06-03): the API config probe runs against staging only — staging is seeded with the 4 practice SiteConfigs. Prod has no SiteConfig seeded yet (practice sites unlaunched; helios-prod in maintenance), so the prod config probe is committed but commented out in prometheus.yml; enable it at practice-site launch. Staging runs the same image, so it still catches API-layer regressions before they reach prod.

Modifying Alert Rules

  1. Edit the relevant file in monitoring/prometheus/alerts/
  2. Run make deploy — this copies files and reloads Prometheus via /-/reload
  3. For Grafana-side rules: edit monitoring/grafana/provisioning/alerting/rules.yml
  4. Grafana auto-reloads alerting rules on file change

Keep rules in sync

Alert rules are duplicated in Prometheus (for evaluation) and Grafana (for UI display and routing). When modifying thresholds, update both files.