Runbooks

Common monitoring scenarios and how to handle them.

Container Down / Restarting

Alert: ContainerDown, ContainerRestarted, HeliosContainerDown, HeliosContainerRestarting

Check which container is affected:

docker ps -a --filter "status=exited" --filter "status=restarting" --format "{{.Names}}\t{{.Status}}"

Check container logs for the crash reason:

docker logs --tail 100 <container_name>

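Check whether the container was OOM-killed (true means the kernel's OOM killer terminated it) and what its exit code was:

docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container_name>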
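The OOM flag and the exit code together usually narrow down why a container stopped. The sketch below shows one way to interpret them. `explain_exit` is a hypothetical helper name, not part of any existing tooling, and the exit-code meanings follow the usual 128 + signal number convention.

```shell
# Hypothetical helper: interpret a container's OOMKilled flag ($1) and exit code ($2).
# Usage (on the host):
#   explain_exit $(docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container_name>)
explain_exit() {
  oom="$1"
  code="$2"
  if [ "$oom" = "true" ]; then
    echo "OOM killed by the kernel (exit code $code)"
  elif [ "$code" -eq 137 ]; then
    echo "SIGKILL (137) without the OOM flag: killed externally, e.g. docker kill or a stop timeout"
  elif [ "$code" -eq 143 ]; then
    echo "SIGTERM (143): stopped on request"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit (0)"
  else
    echo "application error (exit code $code)"
  fi
}
```

Anything reported as an application error still needs the container logs to find the actual cause.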

Restart if needed:

# For app containers — use the app's Makefile
cd /opt/docker/aletheia/repo && make restart ENV=prod

# For infra containers — use Aether
cd /opt/docker/aether/repo && make restart-monitoring

Systemd Unit Failed

Alert: SystemdUnitFailed

A systemd unit is in the failed state. This alert catches silent service failures such as the April 2026 iptables outage, where the firewall service stayed failed for 10 days undetected.

Identify which unit failed (the name label on the alert):
```
sudo systemctl --failed
```

Get the failure reason:

sudo systemctl status <unit-name> --no-pager

Check recent logs for context:

sudo journalctl -u <unit-name> --since "1 hour ago" --no-pager

Common causes:
- ExecStart path no longer exists (like the iptables incident) — check the unit file
- Dependency unavailable — e.g. service needs docker but docker is down
- Config syntax error — unit file itself is malformed
- Resource limit hit — memory, tasks, or time limits

Once fixed, restart the unit to clear the failed state:

sudo systemctl restart <unit-name>
sudo systemctl status <unit-name>

Verify that no other units have failed: sudo systemctl --failed should list 0 units.
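When several units fail at once, walking through them one by one gets tedious. A minimal sketch of a loop that prints the recent journal for every failed unit follows. The function names are hypothetical, and `journalctl` may need `sudo` depending on group membership.

```shell
# List failed units by name only, one per line.
failed_units() {
  systemctl list-units --state=failed --no-legend --plain | awk '{print $1}'
}

# Print the last 20 journal lines for every failed unit.
show_failed_logs() {
  for unit in $(failed_units); do
    echo "== $unit =="
    journalctl -u "$unit" -n 20 --no-pager
  done
}
```

An empty result from `failed_units` corresponds to the "0 units" state expected after the fix.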

High CPU Usage

Alert: HighCpuUsage, ContainerHighCpu

Check which process/container is consuming CPU:

docker stats --no-stream --format "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" | sort -k2 -rn

- Check the Node Exporter Grafana dashboard for the CPU breakdown (user, system, iowait).
- If iowait is high: disk I/O is the bottleneck. Check for large database queries or a running backup.
- If user CPU is high: identify the container and check its logs for processing-heavy operations.
- For Celery workers: check for CPU-intensive tasks in the Celery dashboard.
- ContainerHighCpu specifically: a single container is pinned above 1.5 cores while the host is otherwise fine. This is almost always a self-feeding task loop or a stuck process, so check the alerting container's logs for the same task name or endpoint repeating. The 2026-05-13 MediaFile signal loop is the canonical example: process_media_variants re-fired post_save indefinitely.
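Spotting a repeating task by eye is hard in a busy log. The sketch below counts task names in log lines read from stdin. It assumes Celery's default log wording ("Task <name>[<id>] received"), and `top_repeated_tasks` is a hypothetical helper name.

```shell
# Count how often each Celery task name appears in log lines read from stdin.
# Assumes the default Celery wording: "Task <name>[<id>] received".
top_repeated_tasks() {
  grep -oE 'Task [A-Za-z0-9_.]+\[' \
    | sed -e 's/^Task //' -e 's/\[$//' \
    | sort | uniq -c | sort -rn | head -n 5
}
# Usage:
#   docker logs --since 10m <container_name> 2>&1 | top_repeated_tasks
```

A task whose count is far above the rest within a short window is a likely candidate for a loop.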

High Memory Usage

Alert: HighMemoryUsage, ContainerHighMemory

Check per-container memory in the Docker Grafana dashboard

Identify the offending container:

docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"

For Celery workers: check for memory leaks in long-running tasks
For PostgreSQL: check work_mem and active query count in the PostgreSQL dashboard

Disk Space Low

Alert: DiskSpaceLow

Check disk usage:

df -h /
du -sh /opt/docker/*/  | sort -rh | head -20

Common space consumers:
- Docker images: inspect with docker system df, reclaim with docker system prune
- Loki logs: check the /opt/docker/monitoring/loki/ volume
- PostgreSQL WAL: check /var/lib/postgresql/18/docker/pg_wal/
- Backup archives: check /opt/docker/backups/

Clean Docker resources:

docker system prune -f          # stopped containers, unused networks, dangling images, build cache
docker image prune -a -f        # all images not used by any container (use with caution)
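To avoid pruning when it is not needed, the cleanup can be gated on actual disk usage. This is a minimal sketch only. The function names and the default threshold of 85 are assumptions, not values taken from the alert rule.

```shell
# Print the used percentage of a filesystem as a bare integer (e.g. 83).
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Prune only when usage of / is at or above a threshold (default 85, an assumed value).
prune_if_needed() {
  threshold="${1:-85}"
  used="$(disk_used_pct /)"
  if [ "$used" -ge "$threshold" ]; then
    echo "disk at ${used}%: pruning"
    docker system prune -f
  else
    echo "disk at ${used}%: below ${threshold}%, nothing pruned"
  fi
}
```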

PostgreSQL Issues

Alert: PostgresConnectionsHigh, PostgresDown

Check connection count:

docker exec shared_postgres psql -U admin -c \
  "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"

Find long-running queries:

docker exec shared_postgres psql -U admin -c \
  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;"

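Kill a stuck query (last resort). pg_terminate_backend closes the whole connection; pg_cancel_backend(<pid>) cancels only the running query and is worth trying first: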

docker exec shared_postgres psql -U admin -c "SELECT pg_terminate_backend(<pid>);"
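For PostgresConnectionsHigh, raw counts matter less than how close they are to the server limit. The sketch below reports client connections as a percentage of max_connections. It reuses the container name and user from the commands above; the function names and the default threshold of 80 are assumptions.

```shell
# Print client connections as an integer percentage of max_connections.
pg_conn_pct() {
  docker exec shared_postgres psql -U admin -At -c \
    "SELECT count(*) * 100 / current_setting('max_connections')::int
     FROM pg_stat_activity WHERE backend_type = 'client backend';"
}

# Compare that percentage with a threshold (default 80, an assumed value).
pg_conn_check() {
  pct="$(pg_conn_pct)"
  if [ "$pct" -ge "${1:-80}" ]; then
    echo "HIGH: ${pct}% of max_connections in use"
  else
    echo "ok: ${pct}% of max_connections in use"
  fi
}
```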

Celery Queue Backlog

Alert: CeleryQueueBacklog, CeleryWorkerDown, CeleryHighFailRate

Check queue depth in the Celery Grafana dashboard

Check worker status:

cd /opt/docker/aletheia/repo && make celery-logs ENV=prod

Check for stuck tasks: look for tasks running longer than expected in the dashboard (p99 runtime panel)

Restart workers if needed:

cd /opt/docker/aletheia/repo && make restart ENV=prod
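If Grafana is unavailable, queue depth can also be read straight from Redis, since Celery keeps each queue as a Redis list when Redis is the broker. This is a sketch under assumptions: which Redis DB index holds the broker and the queue name must be checked against the app settings, and `celery_queue_depth` is a hypothetical helper name.

```shell
# Print the number of pending messages in a Celery queue stored in Redis.
# $1 = Redis DB index (assumption: verify which index is the broker in the app settings)
# $2 = queue name (Celery's default queue is "celery")
celery_queue_depth() {
  docker exec shared_redis redis-cli -n "${1:-0}" llen "${2:-celery}"
}
# Usage:
#   celery_queue_depth 0 celery
```

The number covers only messages still waiting in the list; tasks already reserved by a worker are not included.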

SSL Certificate Expiring

Alert: SslCertExpiringSoon, HeliosSSLExpiringSoon

Certificates should auto-renew via certbot. If they're not:

Check certbot logs:
```
docker logs certbot --tail 50
```

Force renewal:

docker exec certbot certbot renew --force-renewal

Reload nginx after renewal:

docker exec nginx-proxy nginx -s reload
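After the reload it is worth confirming that the served certificate actually changed. The sketch below reads the expiry date of the certificate a host presents and converts it to days remaining. The function names are hypothetical, and parsing the date relies on GNU date.

```shell
# Print the notAfter date of the certificate served by host $1 on port 443.
cert_end_date() {
  echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2
}

# Print whole days between two epoch timestamps: $1 = expiry, $2 = now.
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}
# Usage (GNU date is needed to parse the certificate date):
#   end="$(date -d "$(cert_end_date aletheia.groupe-suffren.com)" +%s)"
#   days_left "$end" "$(date +%s)"
```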

Health Check Failing

Alert: HealthCheckFailing, HeliosHealthCheckFailing

First decide which signal fired — they mean different things:

- HealthCheckFailing (any blackbox-health target returning non-200) is the authoritative backend signal. That job is Aletheia-only: /health/ (DB + Redis + Celery) and the API config probe …/api/v1/websites/sites/<domain>/config/ (the apps/websites DRF layer). A 200 from /health/ with a failing config probe means DB, Redis, and Celery are fine but the websites view/serializer layer is broken. The Helios frontend roots are not in this job, so this alert never fires on a frontend-only outage.
- HeliosHealthCheckFailing covers frontend reachability only (its own blackbox-helios-frontend job). Helios returns HTTP 200 with a soft-404 error UI when Aletheia is down, so this alert does not detect an Aletheia outage. It fires only when the practice site is fully unreachable (DNS/cert/process/nginx). For backend health, look at HealthCheckFailing instead.

Check from the server itself:

curl -sI https://aletheia.groupe-suffren.com/health/                                              # backend deep health (200/503)
curl -s  -o /dev/null -w '%{http_code}\n' \
  https://aletheia-staging.groupe-suffren.com/api/v1/websites/sites/cabinet-dentaire-aubagne.fr/config/   # API/DRF layer (expect 200)
curl -sI https://cabinet-dentaire-aubagne.fr/                                                     # Helios frontend (200 even if backend is down!)

- If /health/ is 200 but the config probe is non-200: the DRF layer is broken (view/serializer/migration). Check the Aletheia web logs, not the DB. The live config probe targets staging (aletheia-staging…), so a 404 there is a real regression: the SiteConfig for the probed domain (Aubagne) was unseeded or deleted on staging and must be reseeded (restore_website_seed). A 404 is expected only from the prod config probe, which is why that target stays commented out in prometheus.yml until the practice-site launch.
- If the app responds locally but not externally: check the nginx config and DNS.
- If the app doesn't respond locally: check container status and logs (see "Container Down" above).
- Check the blackbox targets in the Prometheus UI (http://localhost:9090/targets) for specific probe failures.
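The decision rules for the two backend probes can be condensed into a small lookup. The sketch below only restates the guidance in this section; `diagnose_backend` is a hypothetical helper, and the 404 branch applies to the staging config probe.

```shell
# Map the two backend probe results to the diagnosis described in this section.
# $1 = HTTP status of /health/, $2 = HTTP status of the config probe.
diagnose_backend() {
  if [ "$1" != "200" ]; then
    echo "backend unhealthy: check DB, Redis, Celery and the container itself"
  elif [ "$2" = "404" ]; then
    echo "config probe 404: SiteConfig for the probed domain is missing, reseed it"
  elif [ "$2" != "200" ]; then
    echo "DRF layer broken: check Aletheia web logs (view/serializer/migration)"
  else
    echo "backend healthy: if an alert fired, look at nginx, DNS or the frontend"
  fi
}
# Usage: pass the codes returned by curl -s -o /dev/null -w '%{http_code}' for each URL.
```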

High 5xx Error Rate

Alert: NginxHigh5xxRate

Check nginx error log for the failing upstream:

docker logs --tail 200 nginx-proxy 2>&1 | grep -E " 5[0-9]{2} |error|upstream"

Identify which backend is returning errors:

# Check per-service in Grafana → Nginx dashboard, or:
docker logs --tail 100 nginx-proxy 2>&1 | grep " 50[0-9] " | awk '{print $7}' | sort | uniq -c | sort -rn

- If a specific app is failing: check that app's container logs and health.
- If all backends are failing: check the shared services (PostgreSQL, Redis). A database outage causes 500s across all apps.
- If nginx itself is the issue: run docker exec nginx-proxy nginx -t to validate the config.
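To tell a single failing app from a global failure, it helps to see which paths and status codes dominate. A minimal sketch follows. It assumes the access log uses nginx's default "combined" format (field 7 is the path, field 9 the status), and `count_5xx` is a hypothetical helper name.

```shell
# Read access-log lines on stdin and count 5xx responses per status and path.
# Assumes the "combined" log format: field 7 = request path, field 9 = status code.
count_5xx() {
  awk '$9 ~ /^5[0-9][0-9]$/ { print $9, $7 }' | sort | uniq -c | sort -rn
}
# Usage:
#   docker logs --tail 1000 nginx-proxy 2>&1 | count_5xx | head
```

Errors concentrated on one path prefix point at a single app; errors spread across all paths point at shared services or nginx.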

Redis Issues

Alert: RedisDown (if configured), or detected via app errors

Check Redis container status:

docker ps --filter name=shared_redis
docker logs --tail 50 shared_redis

Test connectivity:

docker exec shared_redis redis-cli ping
# Expected: PONG

Check memory usage:

docker exec shared_redis redis-cli info memory | grep used_memory_human

Check per-database key counts (prod=0/1, staging=2/3, dev=4/5):
```
docker exec shared_redis redis-cli info keyspace
```

If Redis is unresponsive, restart:

cd /opt/docker/aether/repo && make restart-shared

Note

Redis serves as both the Celery broker and the Django cache. A Redis outage stops Celery task processing and may degrade app response times.

Nginx Routing Issues

Symptom: 502 Bad Gateway, 504 Gateway Timeout, or requests reaching the wrong service

Test nginx config syntax:
```
docker exec nginx-proxy nginx -t
```
Check which config is active (.conf.full = HTTPS, .conf.temp = maintenance):
```
ls -la /opt/docker/nginx/conf.d/*.conf
```

Check nginx error log for upstream failures:

docker logs --tail 100 nginx-proxy 2>&1 | grep -E "error|upstream"

Verify the upstream container is running and on the correct network:

# Check the container exists and is on the backend network
docker inspect <container_name> --format '{{range $net,$conf := .NetworkSettings.Networks}}{{$net}} {{end}}'

After config changes, reload (not restart) nginx:
```
docker exec nginx-proxy nginx -s reload
```
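Combining the syntax test with the reload keeps a broken config from ever being applied. A minimal sketch, using the container name from the commands above; `safe_reload` is a hypothetical helper name.

```shell
# Reload nginx only if the config test passes.
safe_reload() {
  if docker exec nginx-proxy nginx -t; then
    docker exec nginx-proxy nginx -s reload
  else
    echo "nginx config invalid: not reloading" >&2
    return 1
  fi
}
```

On a failed test nginx keeps serving with the previously loaded config, so nothing is lost by refusing to reload.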

Monitoring Stack Down

Symptom: Grafana unreachable, no alerts firing, Prometheus targets showing as down

Check all monitoring containers:

docker ps -a --filter "name=monitoring_" --format "table {{.Names}}\t{{.Status}}"
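Since nothing alerts while the stack is down, a quick filter over the listing above makes the broken containers stand out. The sketch below prints only the monitoring containers whose status does not start with "Up"; `monitoring_not_up` is a hypothetical helper name.

```shell
# Print monitoring containers that are not currently up (exited, restarting, created).
monitoring_not_up() {
  docker ps -a --filter "name=monitoring_" --format '{{.Names}} {{.Status}}' \
    | awk '$2 != "Up" { print $1 }'
}
```

A container that is up but marked unhealthy still counts as up here, so an empty result does not replace the per-service log checks.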

If Prometheus is down:

docker logs --tail 50 monitoring_prometheus
# Common cause: bad config syntax after editing alert rules
cd /opt/docker/aether/repo && make restart-monitoring

If Loki is down (log gap risk):

docker logs --tail 50 monitoring_loki
# Check disk space — Loki will stop ingesting if disk is full
df -h /
du -sh /opt/docker/monitoring/loki/

If Alloy (log collector) is down:

docker logs --tail 50 monitoring_alloy
# Alloy needs access to the Docker socket
ls -la /var/run/docker.sock

Restart the full monitoring stack:

cd /opt/docker/aether/repo && make restart-monitoring

Warning

While the monitoring stack is down, no alerts will fire. Check containers manually with docker ps -a until monitoring is restored.

Grafana Access Issues

If Grafana is unreachable at monitoring.groupe-suffren.com:

Check the container:

docker ps --filter name=monitoring_grafana

Check nginx proxy config:

docker exec nginx-proxy nginx -t
cat /opt/docker/nginx/conf.d/monitoring.conf

Check Grafana logs:

docker logs monitoring_grafana --tail 50