Runbooks
Common monitoring scenarios and how to handle them.
Container Down / Restarting
Alert: ContainerDown, ContainerRestarted, HeliosContainerDown, HeliosContainerRestarting
- Check which container is affected:
- Check container logs for the crash reason:
- Check if OOM killed:
- Restart if needed:
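The four steps above could be sketched as follows, assuming the containers are managed by plain Docker on this host. The function names are illustrative and not part of the runbook; the container name is whatever the alert labels report.

```shell
# Sketch only: function names are hypothetical. Pass the container name
# reported by the alert as the first argument.

# List containers that have exited or are stuck in a restart loop.
list_unhealthy() {
    docker ps -a --filter "status=exited" --filter "status=restarting" \
        --format 'table {{.Names}}\t{{.Status}}'
}

# Show the tail of a container's logs to find the crash reason.
crash_logs() {
    docker logs --tail "${2:-100}" --timestamps "$1"
}

# Report the OOM-kill flag, exit code, and restart count of a container.
oom_state() {
    docker inspect --format \
        'oom_killed={{.State.OOMKilled}} exit_code={{.State.ExitCode}} restarts={{.RestartCount}}' "$1"
}

# Translate common exit codes (128 + signal number) into a hint.
explain_exit_code() {
    case "$1" in
        0)   echo "clean exit" ;;
        137) echo "SIGKILL (possible OOM kill)" ;;
        143) echo "SIGTERM (stopped on request)" ;;
        *)   echo "see container logs" ;;
    esac
}
```

A typical pass would be `list_unhealthy`, then `crash_logs <name>` and `oom_state <name>`, and finally `docker restart <name>` once the cause is understood.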
Systemd Unit Failed
Alert: SystemdUnitFailed
A systemd unit is in the failed state. This catches silent service failures
like the Apr 2026 iptables outage where the firewall service failed for 10 days
undetected.
- Identify which unit failed (the name label on the alert):
- Get the failure reason:
- Check recent logs for context:
- Common causes:
  - ExecStart path no longer exists (like the iptables incident): check the unit file
  - Dependency unavailable, e.g. the service needs docker but docker is down
  - Config syntax error: the unit file itself is malformed
  - Resource limit hit: memory, tasks, or time limits
- Once fixed, restart the unit to clear the failed state:
- Verify no other units are failed: sudo systemctl --failed should show zero.
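A minimal sketch of the steps above, using standard systemctl and journalctl calls. The function names are hypothetical, and the unit name is the name label from the alert.

```shell
# Sketch only: function names are hypothetical. $1 is the unit name
# taken from the alert's name label.

# List every unit currently in the failed state.
failed_units() {
    systemctl list-units --state=failed --no-legend --no-pager
}

# Show status, exit code, and the last log lines of one unit.
failure_reason() {
    systemctl status "$1" --no-pager --lines=20
}

# Show recent journal entries for the unit (default window: 1 hour).
recent_logs() {
    journalctl -u "$1" --since "${2:-1 hour ago}" --no-pager
}

# Print the unit file, e.g. to check that the ExecStart path still exists.
show_unit_file() {
    systemctl cat "$1"
}

# Restart the unit, which also clears its failed state.
restart_unit() {
    sudo systemctl restart "$1"
}

# Count non-empty lines on stdin, e.g. `failed_units | count_units`.
count_units() {
    awk 'NF { n++ } END { print n + 0 }'
}
```

After a fix, `failed_units | count_units` printing 0 corresponds to the final verification step.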
High CPU Usage
Alert: HighCpuUsage, ContainerHighCpu
- Check which process/container is consuming CPU:
- Check the Node Exporter Grafana dashboard for CPU breakdown (user, system, iowait)
- If iowait is high: disk I/O bottleneck — check for large database queries or backup running
- If user CPU is high: identify the container and check its logs for processing-heavy operations
- For Celery workers: check for CPU-intensive tasks in the Celery dashboard
ContainerHighCpu specifically: a single container is pinned >1.5 cores while the host is otherwise fine. Almost always a self-feeding task loop or stuck process; check the alerting container's logs for the same task name / endpoint repeating. The 2026-05-13 MediaFile signal loop is the canonical example: process_media_variants re-firing post_save indefinitely.
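One way to find the CPU consumer and spot a repeating task is sketched below. The function names are hypothetical, and `repeating_tasks` assumes the worker logs use Celery's usual `Task <name>[<id>] received` line format.

```shell
# Sketch only: function names are hypothetical.

# Snapshot of per-container CPU usage, highest first.
top_cpu_containers() {
    docker stats --no-stream --format '{{.CPUPerc}} {{.Name}}' \
        | LC_ALL=C sort -rn | head -n "${1:-5}"
}

# Host processes by CPU usage (procps `ps`, Linux only).
top_cpu_processes() {
    ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n "$(( ${1:-10} + 1 ))"
}

# Read log lines on stdin and count how often each task name appears.
# Assumes Celery-style lines such as: "Task some.module.name[<uuid>] received".
repeating_tasks() {
    grep -o 'Task [A-Za-z0-9_.]*\[' \
        | sed -e 's/^Task //' -e 's/\[$//' \
        | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

For a ContainerHighCpu alert, `docker logs --since 15m <container> 2>&1 | repeating_tasks | head` shows whether one task dominates the recent log, as it did in the MediaFile loop.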
High Memory Usage
Alert: HighMemoryUsage, ContainerHighMemory
- Check per-container memory in the Docker Grafana dashboard
- Identify the offending container:
- For Celery workers: check for memory leaks in long-running tasks
- For PostgreSQL: check work_mem and active query count in the PostgreSQL dashboard
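To identify the offending container from the command line rather than the dashboard, a sketch along these lines may help. The function names are hypothetical; the PostgreSQL container and user names are assumptions and must be supplied.

```shell
# Sketch only: function names are hypothetical.

# Keep the N highest lines of "<percent>% <name> ..." input (default 5).
rank_by_percent() {
    LC_ALL=C sort -rn | head -n "${1:-5}"
}

# Snapshot of per-container memory usage, highest first.
top_mem_containers() {
    docker stats --no-stream --format '{{.MemPerc}} {{.Name}} {{.MemUsage}}' \
        | rank_by_percent "${1:-5}"
}

# Show the current work_mem setting. PG_CONTAINER and PG_USER are
# assumptions: set them to match the actual deployment.
pg_work_mem() {
    docker exec -i "${PG_CONTAINER:?set PG_CONTAINER}" \
        psql -X -U "${PG_USER:-postgres}" -c 'SHOW work_mem;'
}
```

`top_mem_containers 3` prints the three containers using the largest share of memory at the moment of the snapshot.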
Disk Space Low
Alert: DiskSpaceLow
- Check disk usage:
- Common space consumers:
  - Docker images: docker system df → docker system prune
  - Loki logs: check /opt/docker/monitoring/loki/volume
  - PostgreSQL WAL: check /var/lib/postgresql/18/docker/pg_wal/
  - Backup archives: check /opt/docker/backups/
- Clean Docker resources:
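The checks above can be sketched with standard tools. The function names and the 85% default threshold are assumptions for illustration; the directories are the ones listed in this section.

```shell
# Sketch only: function names and the default threshold are hypothetical.

# Read `df -P` output on stdin and print mount points at or above a
# usage threshold in percent (default 85).
over_threshold() {
    awk -v limit="${1:-85}" \
        'NR > 1 { pct = $5; sub(/%/, "", pct); if (pct + 0 >= limit + 0) print $6, $5 }'
}

# Summarise the size of the usual space consumers named in this runbook.
consumer_sizes() {
    sudo du -sh /opt/docker/monitoring/loki \
        /var/lib/postgresql/18/docker/pg_wal/ \
        /opt/docker/backups/ 2>/dev/null
}

# Show Docker's own accounting of images, containers, volumes, and cache.
docker_usage() {
    docker system df
}

# Remove stopped containers, unused networks, dangling images, and build
# cache. Docker asks for confirmation before deleting anything.
docker_cleanup() {
    docker system prune
}
```

`df -P | over_threshold 85` lists only the filesystems that need attention; `df -h` gives the full picture.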
PostgreSQL Issues
Alert: PostgresConnectionsHigh, PostgresDown
- Check connection count:
- Find long-running queries:
- Kill a stuck query (last resort):
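A sketch of the three steps above using the pg_stat_activity view. It assumes PostgreSQL runs in a container reachable through docker exec; PG_CONTAINER, PG_USER, and the function names are hypothetical.

```shell
# Sketch only: PG_CONTAINER, PG_USER, and the function names are assumptions.

# Run one SQL statement through psql inside the PostgreSQL container.
pg_query() {
    docker exec -i "${PG_CONTAINER:?set PG_CONTAINER}" \
        psql -X -U "${PG_USER:-postgres}" -c "$1"
}

# Current number of connections, to compare against max_connections.
pg_connections() {
    pg_query 'SELECT count(*) AS connections FROM pg_stat_activity;'
    pg_query 'SHOW max_connections;'
}

# Build the query that lists non-idle sessions running longer than N minutes.
long_running_sql() {
    printf "SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query FROM pg_stat_activity WHERE state <> 'idle' AND now() - query_start > interval '%d minutes' ORDER BY runtime DESC;" "$1"
}

# Build the statements that stop a backend. Cancel only interrupts the
# running query; terminate closes the whole session.
cancel_sql()    { printf 'SELECT pg_cancel_backend(%d);' "$1"; }
terminate_sql() { printf 'SELECT pg_terminate_backend(%d);' "$1"; }
```

For example, `pg_query "$(long_running_sql 5)"` lists candidates, and `pg_query "$(cancel_sql <pid>)"` is the gentler option to try before `terminate_sql`.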
Celery Queue Backlog
Alert: CeleryQueueBacklog, CeleryWorkerDown, CeleryHighFailRate
- Check queue depth in the Celery Grafana dashboard
- Check worker status:
- Check for stuck tasks: look for tasks running longer than expected in the dashboard (p99 runtime panel)
- Restart workers if needed:
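Worker status can also be checked with Celery's own inspect commands, as in this sketch. CELERY_CONTAINER (the worker container) and CELERY_APP (the module passed to -A) are assumptions that must be set to the real values, and the function names are hypothetical.

```shell
# Sketch only: CELERY_CONTAINER, CELERY_APP, and the function names are
# assumptions, not values taken from this runbook.

# Run a celery subcommand inside the worker container.
celery_cmd() {
    docker exec "${CELERY_CONTAINER:?set CELERY_CONTAINER}" \
        celery -A "${CELERY_APP:?set CELERY_APP}" "$@"
}

# Ask every worker to reply; a missing reply means the worker is down.
worker_ping() { celery_cmd inspect ping; }

# Tasks currently executing on each worker (look for long-running ones).
active_tasks() { celery_cmd inspect active; }

# Tasks prefetched by workers but not yet started.
reserved_tasks() { celery_cmd inspect reserved; }

# Restart the worker container once stuck tasks have been identified.
restart_worker() {
    docker restart "${CELERY_CONTAINER:?set CELERY_CONTAINER}"
}
```

Running `worker_ping` before and after `restart_worker` confirms that the workers came back.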
SSL Certificate Expiring
Alert: SslCertExpiringSoon, HeliosSSLExpiringSoon
Certificates should auto-renew via certbot. If they're not:
- Check certbot logs:
- Force renewal:
- Reload nginx after renewal:
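The renewal steps might look like the sketch below. It assumes certbot is installed on the host and logs to its default location, which this runbook does not confirm; the function names are hypothetical. The nginx-proxy container name is the one used elsewhere in this runbook.

```shell
# Sketch only: assumes certbot runs on the host with its default log path.
# Function names are hypothetical.

# Convert days to seconds for `openssl x509 -checkend`.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Succeed if the certificate served by host $1 stays valid for $2 more days.
cert_valid_for_days() {
    echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
        | openssl x509 -noout -checkend "$(days_to_seconds "$2")"
}

# Show the most recent certbot log entries.
certbot_log_tail() {
    sudo tail -n "${1:-50}" /var/log/letsencrypt/letsencrypt.log
}

# List known certificates with their expiry dates.
certbot_status() {
    sudo certbot certificates
}

# Renew even if certbot considers the certificates not yet due.
force_renew() {
    sudo certbot renew --force-renewal
}

# Make nginx pick up the renewed certificates without dropping connections.
reload_nginx() {
    docker exec nginx-proxy nginx -s reload
}
```

`cert_valid_for_days monitoring.groupe-suffren.com 14` returns non-zero when the served certificate expires within two weeks, which is a quick check after a reload.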
Health Check Failing
Alert: HealthCheckFailing, HeliosHealthCheckFailing
First decide which signal fired, since they mean different things:
- HealthCheckFailing (any blackbox-health target non-200) is the authoritative backend signal. That job is Aletheia-only: /health/ (DB + Redis + Celery) and the API config probe …/api/v1/websites/sites/<domain>/config/ (the apps/websites DRF layer). A 200 from /health/ with a failing config probe means DB/Redis/Celery are fine but the websites view/serializer layer is broken. The Helios frontend roots are not in this job, so this alert never fires on a frontend-only outage.
- HeliosHealthCheckFailing is frontend reachability only (its own blackbox-helios-frontend job). Helios returns HTTP 200 with a soft-404 error UI when Aletheia is down, so this alert does not detect an Aletheia outage; it only fires when the practice site is fully unreachable (DNS/cert/process/nginx). For backend health, look at HealthCheckFailing, not this.
- Check from the server itself:
    curl -sI https://aletheia.groupe-suffren.com/health/   # backend deep health (200/503)
    curl -s -o /dev/null -w '%{http_code}\n' \
      https://aletheia-staging.groupe-suffren.com/api/v1/websites/sites/cabinet-dentaire-aubagne.fr/config/   # API/DRF layer (expect 200)
    curl -sI https://cabinet-dentaire-aubagne.fr/   # Helios frontend (200 even if backend is down!)
- If /health/ is 200 but the config probe is non-200: the DRF layer is broken (view/serializer/migration); check Aletheia web logs, not the DB. The live config probe targets staging (aletheia-staging…), so a 404 there is a real regression: the probed SiteConfig.domain (Aubagne) was unseeded/deleted on staging; reseed it (restore_website_seed). (A 404 is only expected for the prod config probe, which is why that target stays commented out in prometheus.yml until practice-site launch.)
- If the app responds locally but not externally: check nginx config and DNS
- If the app doesn't respond locally: check container status and logs (see "Container Down" above)
- Check the blackbox targets in Prometheus UI (http://localhost:9090/targets) for specific probe failures
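The decision logic described in this section can be condensed into a small helper, sketched below. The function names and the result labels are hypothetical; the URLs are the ones given above.

```shell
# Sketch only: function names and result labels are hypothetical.

# Fetch only the HTTP status code of a URL (curl prints 000 if the
# connection itself fails).
http_code() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Map the three probe results to the likely fault domain.
#   $1 = /health/ status, $2 = config probe status, $3 = frontend status
diagnose() {
    if [ "$1" != "200" ]; then
        echo "backend-unhealthy"      # DB, Redis, or Celery; check containers
    elif [ "$2" = "404" ]; then
        echo "config-missing"         # probed SiteConfig unseeded; reseed it
    elif [ "$2" != "200" ]; then
        echo "drf-layer-broken"       # view/serializer/migration; check web logs
    elif [ "$3" != "200" ]; then
        echo "frontend-unreachable"   # DNS, cert, process, or nginx
    else
        echo "ok"
    fi
}

# Example invocation (not run automatically):
#   diagnose \
#     "$(http_code https://aletheia.groupe-suffren.com/health/)" \
#     "$(http_code https://aletheia-staging.groupe-suffren.com/api/v1/websites/sites/cabinet-dentaire-aubagne.fr/config/)" \
#     "$(http_code https://cabinet-dentaire-aubagne.fr/)"
```

The order of the checks matters: the frontend result is only consulted once both backend probes pass, because Helios returns 200 even when Aletheia is down.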
High 5xx Error Rate
Alert: NginxHigh5xxRate
- Check nginx error log for the failing upstream:
- Identify which backend is returning errors:
- If a specific app is failing: check that app's container logs and health
- If all backends are failing: check shared services (PostgreSQL, Redis); a database outage causes 500s across all apps
- If nginx itself is the issue: run docker exec nginx-proxy nginx -t to validate the config
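To see which upstream is failing, the error log can be grouped by upstream address, as sketched here. This assumes the nginx-proxy container writes its error log to the container's output (so that docker logs shows it) and uses nginx's standard error-line format; the function names are hypothetical.

```shell
# Sketch only: assumes nginx error lines reach `docker logs` and contain
# the standard `upstream: "..."` field. Function names are hypothetical.

# Recent error-level lines from the proxy (default window: 15 minutes).
proxy_errors() {
    docker logs --since "${1:-15m}" nginx-proxy 2>&1 | grep '\[error\]'
}

# Read error lines on stdin and count them per upstream address.
count_by_upstream() {
    grep -o 'upstream: "[^"]*"' \
        | sed -e 's/^upstream: "//' -e 's/"$//' \
        | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

`proxy_errors 30m | count_by_upstream` shows at a glance whether one backend or all of them are failing; errors spread evenly across upstreams point at a shared service.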
Redis Issues
Alert: RedisDown (if configured), or detected via app errors
- Check Redis container status:
- Test connectivity:
- Check memory usage:
- Check per-database key counts (prod=0/1, staging=2/3, dev=4/5):
- If Redis is unresponsive, restart:
Note
Redis is used as the Celery broker and the Django cache. A Redis outage will cause Celery tasks to stop processing and may degrade app response times.
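The Redis checks above could be run as in the following sketch. REDIS_CONTAINER and the function names are assumptions; the database-to-environment mapping is the one stated in this section.

```shell
# Sketch only: REDIS_CONTAINER and the function names are assumptions.

# Run redis-cli inside the Redis container.
rcli() {
    docker exec "${REDIS_CONTAINER:?set REDIS_CONTAINER}" redis-cli "$@"
}

# Connectivity test; a healthy server answers PONG.
redis_ping() { rcli ping; }

# Memory usage and eviction settings.
redis_memory() {
    rcli info memory | grep -E '^(used_memory_human|maxmemory_human|maxmemory_policy):'
}

# Map a database index to its environment (prod=0/1, staging=2/3, dev=4/5).
db_env() {
    case "$1" in
        0|1) echo "prod" ;;
        2|3) echo "staging" ;;
        4|5) echo "dev" ;;
        *)   echo "unknown" ;;
    esac
}

# Read `INFO keyspace` output on stdin and prefix each database line
# with its environment.
annotate_keyspace() {
    tr -d '\r' | grep '^db' | while IFS=: read -r db stats; do
        echo "$(db_env "${db#db}") $db $stats"
    done
}
```

`rcli info keyspace | annotate_keyspace` gives the per-database key counts with the environment spelled out; `docker restart "$REDIS_CONTAINER"` covers the last step.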
Nginx Routing Issues
Symptom: 502 Bad Gateway, 504 Gateway Timeout, or requests reaching the wrong service
- Test nginx config syntax:
- Check which config is active (.conf.full = HTTPS, .conf.temp = maintenance):
- Check nginx error log for upstream failures:
- Verify the upstream container is running and on the correct network:
- After config changes, reload (not restart) nginx:
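A possible command sequence for these checks is sketched below, using the nginx-proxy container name that appears elsewhere in this runbook. The function names are hypothetical, and listing loaded files through nginx -T is one way to see the active config, not necessarily how the .conf.full/.conf.temp switch is managed here.

```shell
# Sketch only: function names are hypothetical.

# Validate the configuration syntax.
nginx_test() {
    docker exec nginx-proxy nginx -t
}

# List the configuration files nginx has actually loaded.
loaded_config_files() {
    docker exec nginx-proxy nginx -T 2>/dev/null | grep '^# configuration file'
}

# Print the Docker networks a container is attached to.
networks_of() {
    docker inspect --format \
        '{{range $name, $cfg := .NetworkSettings.Networks}}{{$name}} {{end}}' "$1"
}

# Print the first network two space-separated lists have in common;
# fail if they share none.
common_network() {
    for a in $1; do
        for b in $2; do
            if [ "$a" = "$b" ]; then
                echo "$a"
                return 0
            fi
        done
    done
    return 1
}

# Reload only when the config test passes, so a bad config never goes live.
nginx_reload() {
    docker exec nginx-proxy nginx -t && docker exec nginx-proxy nginx -s reload
}
```

`common_network "$(networks_of nginx-proxy)" "$(networks_of <upstream>)"` failing means the proxy cannot reach that upstream by name, a common cause of 502 responses.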
Monitoring Stack Down
Symptom: Grafana unreachable, no alerts firing, Prometheus targets showing as down
- Check all monitoring containers:
- If Prometheus is down:
- If Loki is down (log gap risk):
- If Alloy (log collector) is down:
- Restart the full monitoring stack:
Warning
While the monitoring stack is down, no alerts will fire. Check containers
manually with docker ps -a until monitoring is restored.
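The manual checks could be sketched as follows. The container name patterns, the MONITORING_DIR default, and the use of Docker Compose for the stack are all assumptions; the function names are hypothetical.

```shell
# Sketch only: container name patterns, MONITORING_DIR, and the use of
# Docker Compose are assumptions. Function names are hypothetical.
MONITORING_DIR="${MONITORING_DIR:-/opt/docker/monitoring}"

# Name and state of every monitoring container, including stopped ones.
monitoring_status() {
    docker ps -a --format '{{.Names}} {{.State}}' \
        | grep -Ei 'prometheus|loki|alloy|grafana'
}

# Read "<name> <state>" lines on stdin and print the names not running.
not_running() {
    awk '$2 != "running" { print $1 }'
}

# Tail the logs of one component to see why it stopped.
component_logs() {
    docker logs --tail "${2:-50}" "$1"
}

# Restart the whole stack from its compose project directory.
restart_stack() {
    ( cd "$MONITORING_DIR" && docker compose restart )
}
```

`monitoring_status | not_running` names the components to inspect with `component_logs` before falling back to `restart_stack`.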
Grafana Access Issues
If Grafana is unreachable at monitoring.groupe-suffren.com:
- Check the container:
- Check nginx proxy config:
- Check Grafana logs:
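These three checks might be run as in the sketch below. The Grafana container name is an assumption (the default here is simply "grafana"), and the function names are hypothetical; nginx-proxy and the hostname are taken from this runbook.

```shell
# Sketch only: the "grafana" container name and the function names are
# assumptions.

# Status of any container whose name contains "grafana".
grafana_container() {
    docker ps -a --filter "name=grafana" --format 'table {{.Names}}\t{{.Status}}'
}

# Find where the Grafana hostname appears in the loaded nginx config.
proxy_vhost() {
    docker exec nginx-proxy nginx -T 2>/dev/null \
        | grep -n 'monitoring.groupe-suffren.com'
}

# Tail the Grafana logs.
grafana_logs() {
    docker logs --tail "${2:-50}" "${1:-grafana}"
}

# Treat 2xx and 3xx as reachable; an unauthenticated request to Grafana
# is normally redirected to its login page.
is_reachable() {
    case "$1" in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}
```

For an end-to-end check, `is_reachable "$(curl -s -o /dev/null -w '%{http_code}' https://monitoring.groupe-suffren.com/)"` succeeds once the proxy and the container are both healthy.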