Runbooks
Common monitoring scenarios and how to handle them.
Container Down / Restarting
Alert: ContainerDown, ContainerRestarted, HeliosContainerDown
- Check which container is affected:
- Check container logs for the crash reason:
- Check if OOM killed:
- Restart if needed:
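The four checks above could be wrapped in small shell helpers like the ones below. This is a minimal sketch, assuming the services run as Docker containers on the host; the function names are illustrative, and the container name passed as the first argument is whichever one the alert reports.

```shell
# Sketch only: function names are illustrative, "$1" is the affected container.

# List every container, including exited ones, with its current status.
list_containers() {
    docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}'
}

# Show the last lines of a container's logs (default: 100 lines).
crash_logs() {
    docker logs --tail "${2:-100}" "$1" 2>&1
}

# Print the OOM flag and exit code recorded in the container's state.
oom_status() {
    docker inspect --format 'oom_killed={{.State.OOMKilled}} exit_code={{.State.ExitCode}}' "$1"
}

# Restart the container once the cause is understood.
restart_container() {
    docker restart "$1"
}
```

A typical sequence would be `list_containers`, then `crash_logs <container>` and `oom_status <container>`, and only then `restart_container <container>`.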
High Memory Usage
Alert: HighMemoryUsage, ContainerHighMemory
- Check per-container memory in the Docker Grafana dashboard
- Identify the offending container:
- For Celery workers: check for memory leaks in long-running tasks
- For PostgreSQL: check work_mem and active query count in the PostgreSQL dashboard
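To identify the offending container from the command line rather than the dashboard, something like the following may help. It is a sketch that assumes Docker is the container runtime; `top_memory` is an illustrative name, not an existing tool.

```shell
# Sketch only: prints the N containers using the most memory (default: 5).
# Each output line is "<memory %> <container name> <usage / limit>".
top_memory() {
    docker stats --no-stream --format '{{.MemPerc}} {{.Name}} {{.MemUsage}}' \
        | sort -rn \
        | head -n "${1:-5}"
}
```

Running `top_memory 3` lists the three heaviest containers, which can then be cross-checked against the Grafana panel.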
Disk Space Low
Alert: DiskSpaceLow
- Check disk usage:
- Common space consumers:
    - Docker images: docker system df → docker system prune
    - Loki logs: check /opt/docker/monitoring/loki/volume
    - PostgreSQL WAL: check /var/lib/postgresql/18/docker/pg_wal/
    - Backup archives: check /opt/docker/backups/
- Clean Docker resources:
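The disk checks above could be scripted roughly as follows. This is a sketch, not the original commands from this runbook; the function names are illustrative, and the paths are the ones listed above, passed in as arguments.

```shell
# Sketch only: function names are illustrative.

# Overall filesystem usage in human-readable units.
disk_overview() {
    df -h
}

# Print the total size of each path given; missing paths are reported, not fatal.
consumer_sizes() {
    for target in "$@"; do
        if [ -e "$target" ]; then
            du -sh "$target" 2>/dev/null
        else
            echo "missing: $target"
        fi
    done
}

# Docker's own breakdown of images, containers, volumes, and build cache.
docker_usage() {
    docker system df
}

# Remove stopped containers, unused networks, dangling images, and build cache.
# docker asks for confirmation before deleting anything.
docker_cleanup() {
    docker system prune
}
```

For example, `consumer_sizes /opt/docker/monitoring/loki/volume /var/lib/postgresql/18/docker/pg_wal/ /opt/docker/backups/` shows which of the usual suspects has grown. Reading the WAL directory may require root.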
PostgreSQL Issues
Alert: PostgresConnectionsHigh, PostgresDown
- Check connection count:
- Find long-running queries:
- Kill a stuck query (last resort):
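One way to run these three checks is through `psql` inside the database container, as sketched below. The container name and role (`PG_CONTAINER`, `PG_USER`) and their defaults are assumptions, not values from this deployment, and the function names are illustrative. The queries rely only on the standard `pg_stat_activity` view and `pg_terminate_backend()`.

```shell
# Sketch only. ASSUMPTIONS: container name and role below are placeholders.
PG_CONTAINER="${PG_CONTAINER:-postgres}"
PG_USER="${PG_USER:-postgres}"

# Run one SQL statement inside the database container.
pg() {
    docker exec "$PG_CONTAINER" psql -U "$PG_USER" -c "$1"
}

# Number of sessions currently known to the server.
connection_count() {
    pg "SELECT count(*) FROM pg_stat_activity;"
}

# Non-idle sessions whose current query has run longer than "$1" (default: 5 minutes).
long_queries() {
    pg "SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
        FROM pg_stat_activity
        WHERE state <> 'idle' AND now() - query_start > interval '${1:-5 minutes}'
        ORDER BY runtime DESC;"
}

# Terminate the backend with the given pid (last resort); rejects non-numeric input.
kill_query() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: kill_query <pid>" >&2; return 1 ;;
    esac
    pg "SELECT pg_terminate_backend($1);"
}
```

`pg_terminate_backend` closes the whole session; `pg_cancel_backend(pid)` only cancels the running query and is usually worth trying first.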
Celery Queue Backlog
Alert: CeleryQueueBacklog, CeleryWorkerDown, CeleryHighFailRate
- Check queue depth in the Celery Grafana dashboard
- Check worker status:
- Check for stuck tasks: look for tasks running longer than expected in the dashboard (p99 runtime panel)
- Restart workers if needed:
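The worker checks above might be run with Celery's `inspect` subcommand, as in this sketch. The worker container name and the Celery app module (`CELERY_CONTAINER`, `CELERY_APP`) are placeholders, since neither is given in this runbook, and the function names are illustrative.

```shell
# Sketch only. ASSUMPTIONS: container name and app module below are placeholders.
CELERY_CONTAINER="${CELERY_CONTAINER:-celery-worker}"
CELERY_APP="${CELERY_APP:-app}"

# Run a "celery inspect" subcommand inside the worker container.
celery_inspect() {
    docker exec "$CELERY_CONTAINER" celery -A "$CELERY_APP" inspect "$@"
}

# Ask every worker to reply; workers that are down do not answer.
worker_status() {
    celery_inspect ping
}

# List the tasks each worker is executing right now.
active_tasks() {
    celery_inspect active
}

# Restart the worker container.
restart_workers() {
    docker restart "$CELERY_CONTAINER"
}
```

Checking `active_tasks` before `restart_workers` shows which tasks a restart would interrupt.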
SSL Certificate Expiring
Alert: SslCertExpiringSoon, HeliosSSLExpiringSoon
Certificates should auto-renew via certbot. If they don't:
- Check certbot logs:
- Force renewal:
- Reload nginx after renewal:
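The renewal steps above could look like the following. This sketch assumes certbot and nginx run directly on the host and that certbot writes to its default log file; if they run in containers here, the same commands would need to go through `docker exec`. Function names are illustrative.

```shell
# Sketch only. ASSUMPTION: certbot logs to its default location on the host.
CERTBOT_LOG="${CERTBOT_LOG:-/var/log/letsencrypt/letsencrypt.log}"

# Show the most recent certbot log lines (default: 50).
certbot_recent_log() {
    tail -n "${1:-50}" "$CERTBOT_LOG"
}

# List managed certificates with their expiry dates.
cert_status() {
    certbot certificates
}

# Renew all certificates even if they are not yet due.
force_renew() {
    certbot renew --force-renewal
}

# Validate the nginx configuration, then reload only if it is valid.
reload_nginx() {
    nginx -t && nginx -s reload
}
```

Forced renewals count against the certificate authority's rate limits, so `cert_status` and the log are worth reading before `force_renew`.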
Health Check Failing
Alert: HealthCheckFailing, HeliosHealthCheckFailing
- Check from the server itself:
- If the app responds locally but not externally: check nginx config and DNS
- If the app doesn't respond locally: check container status and logs (see "Container Down" above)
- Check the blackbox targets in Prometheus UI (http://localhost:9090/targets) for specific probe failures
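The local-versus-external comparison above can be done with a small `curl` wrapper such as this one. It is a sketch: `probe` is an illustrative name, and the health endpoint URL must be supplied by the operator because this runbook does not state the port or path.

```shell
# Sketch only: prints "<url> -> <status code>" and succeeds only on 2xx/3xx.
# "$2" is the timeout in seconds (default: 5). A code of 000 means no HTTP response.
probe() {
    code=$(curl -s -o /dev/null --max-time "${2:-5}" -w '%{http_code}' "$1")
    echo "$1 -> $code"
    case "$code" in
        2??|3??) return 0 ;;
        *) return 1 ;;
    esac
}
```

Running `probe` once against the app's local address on the server and once against its public URL shows on which side of nginx the failure sits.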
Grafana Access Issues
If Grafana is unreachable at monitoring.groupe-suffren.com:
- Check the container:
- Check nginx proxy config:
- Check Grafana logs:
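These three checks might be run as below. The sketch assumes the Grafana container is named `grafana` (overridable through `GRAFANA_CONTAINER`) and that nginx runs on the host; both are assumptions, and the function names are illustrative.

```shell
# Sketch only. ASSUMPTION: the container name below is a placeholder.
GRAFANA_CONTAINER="${GRAFANA_CONTAINER:-grafana}"

# Show whether the Grafana container exists and what state it is in.
grafana_container_status() {
    docker ps -a --filter "name=$GRAFANA_CONTAINER" --format '{{.Names}}: {{.Status}}'
}

# Dump the full nginx configuration and show the lines mentioning a hostname.
nginx_vhost_lines() {
    nginx -T 2>/dev/null | grep -n "$1"
}

# Show the last lines of the Grafana container's logs (default: 100).
grafana_logs() {
    docker logs --tail "${1:-100}" "$GRAFANA_CONTAINER" 2>&1
}
```

For instance, `nginx_vhost_lines monitoring.groupe-suffren.com` confirms that a server block for the monitoring hostname is actually loaded.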