Alerting

Alert Rules

18 rules across 6 categories, defined in both Prometheus (prometheus/alerts/) and Grafana Unified Alerting.

Node (Infrastructure)

| Alert | Condition | Severity | Rationale |
| --- | --- | --- | --- |
| HighCpuUsage | > 80% for 5 min | warning | Sustained high CPU impacts all containers. 80% leaves headroom before saturation. |
| HighMemoryUsage | > 85% for 5 min | warning | OOM kills start above 90%. 85% gives time to react. |
| DiskSpaceLow | < 15% free for 5 min | critical | Below 15% risks database corruption (WAL writes fail). |
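
As an illustration, the node thresholds above could be written as Prometheus rules along these lines. This is a minimal sketch, not the project's actual rule file: it assumes the standard node_exporter metric names, and the group name and the `mountpoint="/"` selector are hypothetical.

```yaml
groups:
  - name: node  # hypothetical group name
    rules:
      - alert: HighCpuUsage
        # Non-idle CPU percentage, averaged over all cores of each instance.
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 5m
        labels:
          severity: warning
      - alert: HighMemoryUsage
        # Used memory as a percentage of total, based on MemAvailable.
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 85
        for: 5m
        labels:
          severity: warning
      - alert: DiskSpaceLow
        # Free-space percentage; the root mountpoint is an assumption.
        expr: 100 * node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 15
        for: 5m
        labels:
          severity: critical
```

The `for:` clause is what turns a momentary spike into a sustained condition: the expression must stay true for the full 5 minutes before the alert fires.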

Containers

| Alert | Condition | Severity | Rationale |
| --- | --- | --- | --- |
| ContainerRestarted | New restart detected | warning | Catches crash loops and OOM kills early. |
| ContainerDown | Aletheia web/celery absent > 2 min | critical | Core services: immediate attention required. |
| ContainerHighMemory | > 90% of memory limit for 5 min | warning | Approaching OOM kill territory. |
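
A possible shape for two of these container rules, assuming the metrics come from cAdvisor. The container name pattern is a hypothetical placeholder; the real container names are not given in this document.

```yaml
groups:
  - name: containers  # hypothetical group name
    rules:
      - alert: ContainerDown
        # absent() returns 1 only when no series matches the selector.
        # A separate rule with its own pattern would cover the celery container.
        expr: absent(container_last_seen{name=~".*aletheia.*web.*"})
        for: 2m
        labels:
          severity: critical
      - alert: ContainerHighMemory
        # The "> 0" filter drops containers with no memory limit (limit = 0),
        # which would otherwise cause a division by zero.
        expr: |
          100 * container_memory_usage_bytes{name!=""}
            / (container_spec_memory_limit_bytes{name!=""} > 0)
            > 90
        for: 5m
        labels:
          severity: warning
```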

PostgreSQL

| Alert | Condition | Severity | Rationale |
| --- | --- | --- | --- |
| PostgresConnectionsHigh | > 80% of max_connections for 5 min | warning | Connection exhaustion causes new requests to fail. |
| PostgresDown | Exporter reports down | critical | Database outage: all apps affected. |
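
The PostgreSQL rules might look like the following sketch, assuming the community postgres_exporter and its default metric names (`pg_up`, `pg_stat_activity_count`, `pg_settings_max_connections`). The table gives no duration for PostgresDown, so no `for:` clause is shown for it.

```yaml
groups:
  - name: postgres  # hypothetical group name
    rules:
      - alert: PostgresDown
        # pg_up is 0 when the exporter cannot reach the database.
        expr: pg_up == 0
        labels:
          severity: critical
      - alert: PostgresConnectionsHigh
        # Sum connections across databases and states, then compare to the limit.
        expr: 100 * sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 80
        for: 5m
        labels:
          severity: warning
```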

Application

| Alert | Condition | Severity | Rationale |
| --- | --- | --- | --- |
| HealthCheckFailing | Non-2xx for 2 min | critical | App is unreachable from outside. |
| NginxHigh5xxRate | > 5% of requests are 5xx for 5 min | warning | Signals backend errors or overload. |
| CeleryQueueBacklog | > 50 pending messages for 10 min | warning | Tasks are piling up; a worker may be stuck or down. |
| CeleryWorkerDown | Worker offline > 2 min | critical | Background tasks stop processing entirely. |
| CeleryHighFailRate | > 10% failure rate over 5 min | warning | Systematic task failures (bad data, external service down). |

SSL

| Alert | Condition | Severity | Rationale |
| --- | --- | --- | --- |
| SslCertExpiringSoon | Expires in < 14 days | warning | Certbot should auto-renew when 30 days remain. If the certificate still has not been renewed with 14 days left, renewal is broken. |
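
The 14-day threshold can be expressed against the expiry timestamp exported by blackbox_exporter. A minimal sketch, assuming certificates are probed through that exporter (the group name is hypothetical):

```yaml
groups:
  - name: ssl  # hypothetical group name
    rules:
      - alert: SslCertExpiringSoon
        # probe_ssl_earliest_cert_expiry is a Unix timestamp, so subtracting
        # time() gives the seconds remaining before expiry.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        labels:
          severity: warning
```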

Helios

| Alert | Condition | Severity | Rationale |
| --- | --- | --- | --- |
| HeliosContainerDown | No web container > 2 min | critical | Practice websites are down. |
| HeliosContainerRestarting | > 2 restarts in 15 min | warning | Crash loop: investigate logs. |
| HeliosHealthCheckFailing | Practice site probes failing > 5 min | critical | Specific practice site unreachable. |
| HeliosSSLExpiringSoon | Practice domain SSL < 14 days to expiry | warning | Practice site will show browser warning. |

Notification Routing

Alerts route through two channels:

| Channel | Type | Details |
| --- | --- | --- |
| Email (primary) | SMTP via Brevo | alerts@groupe-suffren.com |
| Teams (secondary) | Webhook via Power Automate | Posts to configured Teams channel |

Timing:

  • Group wait: 30 seconds (batch initial alerts)
  • Group interval: 5 minutes (batch follow-up updates)
  • Repeat interval: 4 hours (re-send if still unresolved)
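
These three timers map directly onto notification policy settings. The sketch below assumes routing is provisioned through a Grafana notification policy file; the receiver name and the `group_by` label are hypothetical. An Alertmanager `route` block uses the same three keys.

```yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: email-and-teams   # hypothetical contact point name
    group_by: ["alertname"]     # hypothetical grouping label
    group_wait: 30s             # wait before sending the first notification for a new group
    group_interval: 5m          # wait before sending updates about an existing group
    repeat_interval: 4h         # re-send while the alert remains unresolved
```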

Blackbox Probes

External monitoring relies on HTTP health checks and SSL certificate probes.

Targets probed (30s interval for health, 15s for SSL):

  • Aletheia: prod, staging, dev
  • Helios practice sites: cabinet-dentaire-aubagne.fr, le-canet.chirurgiens-dentistes.fr, cabinet-bodin.fr, dr-david-simon.chirurgiens-dentistes.fr
  • Helios subdomains: prod/staging/dev variants on groupe-suffren.com
  • Monitoring: monitoring.groupe-suffren.com
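
A scrape job for the health probes could follow the usual blackbox_exporter relabeling pattern, sketched below. The exporter address, the job name, the `http_2xx` module name, and the `https://` scheme on the target are assumptions; only the hostname comes from the list above.

```yaml
scrape_configs:
  - job_name: blackbox-http          # hypothetical job name
    metrics_path: /probe
    scrape_interval: 30s             # health-check interval stated above
    params:
      module: [http_2xx]             # assumes a module of this name in blackbox.yml
    static_configs:
      - targets:
          - https://monitoring.groupe-suffren.com
    relabel_configs:
      # Pass the listed URL to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on resulting series.
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the exporter (address is hypothetical).
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

With this pattern, adding a site means adding one line under `targets`; the relabeling stays unchanged.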

Modifying Alert Rules

  1. Edit the relevant file in monitoring/prometheus/alerts/
  2. Run make deploy, which copies the files and reloads Prometheus via the /-/reload endpoint
  3. For Grafana-side rules: edit monitoring/grafana/provisioning/alerting/rules.yml
  4. Grafana auto-reloads alerting rules on file change

Keep rules in sync

Alert rules are duplicated in Prometheus (for evaluation) and Grafana (for UI display and routing). When modifying thresholds, update both files.