Runbooks

Common monitoring scenarios and how to handle them.

Container Down / Restarting

Alert: ContainerDown, ContainerRestarted, HeliosContainerDown, HeliosContainerRestarting

  1. Check which container is affected:

    docker ps -a --filter "status=exited" --filter "status=restarting" --format "{{.Names}}\t{{.Status}}"
    

  2. Check container logs for the crash reason:

    docker logs --tail 100 <container_name>
    

  3. Check if OOM killed:

    docker inspect <container_name> | grep -A5 "State"
    

  4. Restart if needed:

    # For app containers — use the app's Makefile
    cd /opt/docker/aletheia/repo && make restart ENV=prod
    
    # For infra containers — use Aether
    cd /opt/docker/aether/repo && make restart-monitoring
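
As a rough aid for step 3, the two relevant state fields can be read directly and mapped to a likely cause. This is a minimal sketch: `classify_exit` is a hypothetical helper name, it assumes the standard `.State.OOMKilled` and `.State.ExitCode` fields of docker inspect, and the exit-code reading follows the common 128 + signal convention rather than anything specific to these containers.

```shell
# classify_exit OOM_KILLED EXIT_CODE
# Hypothetical helper: maps two docker inspect state fields to a likely crash cause.
classify_exit() {
  oom_killed="$1"
  exit_code="$2"
  if [ "$oom_killed" = "true" ]; then
    echo "oom-killed"     # the kernel killed the container for exceeding its memory limit
  elif [ "$exit_code" -eq 0 ]; then
    echo "clean-exit"     # process ended normally; check why it was not restarted
  elif [ "$exit_code" -eq 137 ]; then
    echo "sigkill"        # 128 + 9: killed from outside (docker kill, host OOM killer)
  elif [ "$exit_code" -eq 143 ]; then
    echo "sigterm"        # 128 + 15: asked to stop (docker stop, deploy, shutdown)
  else
    echo "app-error"      # the application exited non-zero by itself: read its logs
  fi
}

# Usage (replace <container_name> as in the steps above):
#   classify_exit $(docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container_name>)
```

The result is only a hint about where to look next; the container logs from step 2 remain the primary source for the crash reason.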
    

Systemd Unit Failed

Alert: SystemdUnitFailed

A systemd unit is in the failed state. This catches silent service failures like the Apr 2026 iptables outage where the firewall service failed for 10 days undetected.

  1. Identify which unit failed (the name label on the alert):

    sudo systemctl --failed
    

  2. Get the failure reason:

    sudo systemctl status <unit-name> --no-pager
    

  3. Check recent logs for context:

    sudo journalctl -u <unit-name> --since "1 hour ago" --no-pager
    

  4. Common causes:

    • ExecStart path no longer exists (like the iptables incident) — check the unit file
    • Dependency unavailable — e.g. service needs docker but docker is down
    • Config syntax error — unit file itself is malformed
    • Resource limit hit — memory, tasks, or time limits
  5. Once fixed, restart the unit to clear the failed state:

    sudo systemctl restart <unit-name>
    sudo systemctl status <unit-name>
    

  6. Verify no other units are failed: sudo systemctl --failed should show zero.
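
The final verification in step 6 can be scripted. The sketch below assumes the output shape of `systemctl --failed --no-legend --plain` (one unit per line, unit name in the first column); the function names are hypothetical.

```shell
# failed_unit_names
# Reads `systemctl --failed --no-legend --plain` output on stdin and prints
# one failed unit name per line (the first column).
failed_unit_names() {
  awk 'NF > 0 { print $1 }'
}

# failed_unit_summary
# Prints "ok" when no unit is failed, otherwise "<count> failed".
failed_unit_summary() {
  count=$(failed_unit_names | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "ok"
  else
    echo "$count failed"
  fi
}

# Usage:
#   systemctl --failed --no-legend --plain | failed_unit_summary
```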

High CPU Usage

Alert: HighCpuUsage, ContainerHighCpu

  1. Check which process/container is consuming CPU:
    docker stats --no-stream --format "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" | sort -k2,2 -rn
    
  2. Check the Node Exporter Grafana dashboard for CPU breakdown (user, system, iowait)
  3. If iowait is high: disk I/O bottleneck — check for large database queries or backup running
  4. If user CPU is high: identify the container and check its logs for processing-heavy operations
  5. For Celery workers: check for CPU-intensive tasks in the Celery dashboard
  6. ContainerHighCpu specifically: a single container is pinned >1.5 cores while the host is otherwise fine. Almost always a self-feeding task loop or stuck process — check the alerting container's logs for the same task name / endpoint repeating. The 2026-05-13 MediaFile signal loop is the canonical example: process_media_variants re-firing post_save indefinitely.
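
To spot a task that keeps re-firing (step 6), it can help to count task names in the recent logs of the alerting container. This is a sketch under an assumption: it expects Celery-style log lines of the form `Task <name>[<id>] received`, which may not match the log format actually configured here. `top_tasks` is a hypothetical helper name.

```shell
# top_tasks [N]
# Reads worker log lines on stdin and prints the N most frequent task names
# as "<count> <task_name>", most frequent first (default N = 5).
top_tasks() {
  grep -oE 'Task [A-Za-z0-9_.]+\[' \
    | sed -E 's/^Task //; s/\[$//' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }' \
    | head -n "${1:-5}"
}

# Usage (replace <container_name> with the alerting container):
#   docker logs --since 15m <container_name> 2>&1 | top_tasks
```

A single task name with a count far above the others is the pattern to look for; the counts are relative, since each run of a task usually produces more than one matching line.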

High Memory Usage

Alert: HighMemoryUsage, ContainerHighMemory

  1. Check per-container memory in the Docker Grafana dashboard
  2. Identify the offending container:
    docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
    
  3. For Celery workers: check for memory leaks in long-running tasks
  4. For PostgreSQL: check work_mem and active query count in the PostgreSQL dashboard

Disk Space Low

Alert: DiskSpaceLow

  1. Check disk usage:

    df -h /
    du -sh /opt/docker/*/  | sort -rh | head -20
    

  2. Common space consumers:

    • Docker images: check with docker system df, reclaim with docker system prune
    • Loki logs: check /opt/docker/monitoring/loki/ volume
    • PostgreSQL WAL: check /var/lib/postgresql/18/docker/pg_wal/
    • Backup archives: check /opt/docker/backups/
  3. Clean Docker resources:

    docker system prune -f          # dangling images, stopped containers
    docker image prune -a -f        # unused images (use with caution)
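
Before pruning, the usage check from step 1 can be reduced to a single number. The sketch below parses POSIX-format `df -P` output; the function names and the 85 percent threshold in the usage line are illustrative assumptions, not values taken from the alert rule.

```shell
# disk_use_pct
# Reads `df -P <mountpoint>` output on stdin and prints the used percentage
# of the first filesystem line as a bare integer.
disk_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# needs_cleanup THRESHOLD
# Succeeds (exit 0) when the usage read from stdin is at or above THRESHOLD percent.
needs_cleanup() {
  [ "$(disk_use_pct)" -ge "$1" ]
}

# Usage (85 is an arbitrary example threshold):
#   df -P / | needs_cleanup 85 && docker system df
```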
    

PostgreSQL Issues

Alert: PostgresConnectionsHigh, PostgresDown

  1. Check connection count:

    docker exec shared_postgres psql -U admin -c \
      "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
    

  2. Find long-running queries:

    docker exec shared_postgres psql -U admin -c \
      "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
       FROM pg_stat_activity
       WHERE state != 'idle'
       ORDER BY duration DESC
       LIMIT 10;"
    

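  3. Cancel a stuck query; terminate its backend only as a last resort:

    docker exec shared_postgres psql -U admin -c "SELECT pg_cancel_backend(<pid>);"      # cancel the running query, keep the connection
    docker exec shared_postgres psql -U admin -c "SELECT pg_terminate_backend(<pid>);"   # last resort: close the whole connection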
    

Celery Queue Backlog

Alert: CeleryQueueBacklog, CeleryWorkerDown, CeleryHighFailRate

  1. Check queue depth in the Celery Grafana dashboard
  2. Check worker status:
    cd /opt/docker/aletheia/repo && make celery-logs ENV=prod
    
  3. Check for stuck tasks: look for tasks running longer than expected in the dashboard (p99 runtime panel)
  4. Restart workers if needed:
    cd /opt/docker/aletheia/repo && make restart ENV=prod
    

SSL Certificate Expiring

Alert: SslCertExpiringSoon, HeliosSSLExpiringSoon

Certificates should auto-renew via certbot. If a certificate has not renewed:

  1. Check certbot logs:

    docker logs certbot --tail 50
    

  2. Force renewal:

    docker exec certbot certbot renew --force-renewal
    

  3. Reload nginx after renewal:

    docker exec nginx-proxy nginx -s reload
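
To confirm that the renewed certificate is the one actually being served, the expiry date can be read from the live endpoint and converted to days remaining. This is a minimal sketch: `cert_days_left` is a hypothetical helper, it relies on GNU date for parsing, and the hostname in the usage comment is the Aletheia host used elsewhere in this runbook.

```shell
# cert_days_left "NOT_AFTER" [NOW_EPOCH]
# NOT_AFTER is the date printed by `openssl x509 -noout -enddate` after "notAfter=".
# Prints whole days remaining; zero or negative means expired or expiring today.
# NOW_EPOCH is optional and defaults to the current time. Requires GNU date.
cert_days_left() {
  expiry_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Usage:
#   end=$(echo | openssl s_client -connect aletheia.groupe-suffren.com:443 \
#           -servername aletheia.groupe-suffren.com 2>/dev/null \
#         | openssl x509 -noout -enddate | cut -d= -f2)
#   cert_days_left "$end"
```

If the number has not jumped after a forced renewal and an nginx reload, nginx is probably still serving the old certificate file.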
    

Health Check Failing

Alert: HealthCheckFailing, HeliosHealthCheckFailing

First decide which signal fired — they mean different things:

  • HealthCheckFailing (any blackbox-health target non-200) is the authoritative backend signal. That job is Aletheia-only: /health/ (DB + Redis + Celery) and the API config probe …/api/v1/websites/sites/<domain>/config/ (the apps/websites DRF layer). A 200 from /health/ with a failing config probe means DB/Redis/Celery are fine but the websites view/serializer layer is broken. The Helios frontend roots are not in this job, so this alert never fires on a frontend-only outage.
  • HeliosHealthCheckFailing is frontend reachability only (its own blackbox-helios-frontend job). Helios returns HTTP 200 with a soft-404 error UI when Aletheia is down, so this alert does not detect an Aletheia outage — it only fires when the practice site is fully unreachable (DNS/cert/process/nginx). For backend health, look at HealthCheckFailing, not this.

  • Check from the server itself:

    curl -sI https://aletheia.groupe-suffren.com/health/                                              # backend deep health (200/503)
    curl -s  -o /dev/null -w '%{http_code}\n' \
      https://aletheia-staging.groupe-suffren.com/api/v1/websites/sites/cabinet-dentaire-aubagne.fr/config/   # API/DRF layer (expect 200)
    curl -sI https://cabinet-dentaire-aubagne.fr/                                                     # Helios frontend (200 even if backend is down!)
    

  • If /health/ is 200 but the config probe is non-200: the DRF layer is broken (view/serializer/migration). Check the Aletheia web logs, not the DB. The live config probe targets staging (aletheia-staging…), so a 404 there is a real regression: it means the probed SiteConfig.domain (Aubagne) was unseeded or deleted on staging, and the fix is to reseed it (restore_website_seed). A 404 is expected only from the prod config probe, which is why that target stays commented out in prometheus.yml until practice-site launch.

  • If the app responds locally but not externally: check nginx config and DNS
  • If the app doesn't respond locally: check container status and logs (see "Container Down" above)
  • Check the blackbox targets in Prometheus UI (http://localhost:9090/targets) for specific probe failures
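
The decision logic above can be condensed into a small helper that takes the two backend status codes and names the layer to investigate. This is a sketch: `diagnose_health` is a hypothetical function, and it deliberately ignores the Helios frontend status, since that returns 200 even when the backend is down.

```shell
# diagnose_health HEALTH_CODE CONFIG_CODE
# HEALTH_CODE: HTTP status of /health/ (DB + Redis + Celery).
# CONFIG_CODE: HTTP status of the API config probe (DRF layer).
diagnose_health() {
  health="$1"
  config="$2"
  if [ "$health" != "200" ]; then
    echo "backend unhealthy: check DB, Redis, Celery and container status"
  elif [ "$config" != "200" ]; then
    echo "DRF layer broken: check Aletheia web logs, not the DB"
  else
    echo "backend healthy"
  fi
}

# Usage, with the same URLs as the curl commands above:
#   h=$(curl -s -o /dev/null -w '%{http_code}' https://aletheia.groupe-suffren.com/health/)
#   c=$(curl -s -o /dev/null -w '%{http_code}' https://aletheia-staging.groupe-suffren.com/api/v1/websites/sites/cabinet-dentaire-aubagne.fr/config/)
#   diagnose_health "$h" "$c"
```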

High 5xx Error Rate

Alert: NginxHigh5xxRate

  1. Check nginx error log for the failing upstream:

    docker logs --tail 200 nginx-proxy 2>&1 | grep -E "5[0-9]{2}|error|upstream"
    

  2. Identify which backend is returning errors:

    # Check per-service in Grafana → Nginx dashboard, or count 50x responses by request path
    # (field 7, assuming nginx's default combined log format):
    docker logs --tail 100 nginx-proxy 2>&1 | grep " 50[0-9] " | awk '{print $7}' | sort | uniq -c | sort -rn
    

  3. If a specific app is failing: check that app's container logs and health

  4. If all backends are failing: check shared services (PostgreSQL, Redis) — a database outage causes 500s across all apps
  5. If nginx itself is the issue: docker exec nginx-proxy nginx -t to validate config
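
A stricter variant of the count in step 2 matches only the status field, so a response size or URL fragment that happens to look like a status code is not counted. The sketch assumes nginx's default combined log format (request path in field 7, status in field 9); if nginx-proxy uses a custom log_format, the field numbers need adjusting. `count_5xx_by_path` is a hypothetical name.

```shell
# count_5xx_by_path
# Reads access-log lines on stdin and prints "<count> <path>" for every path
# that returned a 5xx status, highest count first.
# Assumes combined log format: field 7 = request path, field 9 = status code.
count_5xx_by_path() {
  awk '$9 ~ /^5[0-9][0-9]$/ { hits[$7]++ }
       END { for (path in hits) print hits[path], path }' \
    | sort -k1,1nr -k2,2
}

# Usage:
#   docker logs --tail 1000 nginx-proxy 2>&1 | count_5xx_by_path | head
```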

Redis Issues

Alert: RedisDown (if configured), or detected via app errors

  1. Check Redis container status:

    docker ps --filter name=shared_redis
    docker logs --tail 50 shared_redis
    

  2. Test connectivity:

    docker exec shared_redis redis-cli ping
    # Expected: PONG
    

  3. Check memory usage:

    docker exec shared_redis redis-cli info memory | grep used_memory_human
    

  4. Check per-database key counts (prod=0/1, staging=2/3, dev=4/5):

    docker exec shared_redis redis-cli info keyspace
    

  5. If Redis is unresponsive, restart:

    cd /opt/docker/aether/repo && make restart-shared
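
The keyspace output from step 4 can be condensed to one line per database, which makes the prod/staging/dev split easier to read. This sketch assumes the usual `dbN:keys=...,expires=...` line format of `info keyspace` and strips the carriage returns that redis-cli passes through; `keys_per_db` is a hypothetical helper name.

```shell
# keys_per_db
# Reads `redis-cli info keyspace` output on stdin and prints "<db> <key_count>"
# for each database that holds at least one key.
keys_per_db() {
  tr -d '\r' \
    | awk -F'[:=,]' '/^db[0-9]+:/ { print $1, $3 }'
}

# Usage:
#   docker exec shared_redis redis-cli info keyspace | keys_per_db
```

Databases with no keys do not appear in the keyspace section at all, so a missing line means an empty database rather than an error.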
    

Note

Redis serves as the Celery broker and the Django cache. A Redis outage stops Celery task processing and may degrade app response times.

Nginx Routing Issues

Symptom: 502 Bad Gateway, 504 Gateway Timeout, or requests reaching the wrong service

  1. Test nginx config syntax:

    docker exec nginx-proxy nginx -t
    

  2. Check which config is active (.conf.full = HTTPS, .conf.temp = maintenance):

    ls -la /opt/docker/nginx/conf.d/*.conf
    

  3. Check nginx error log for upstream failures:

    docker logs --tail 100 nginx-proxy 2>&1 | grep -E "error|upstream"
    

  4. Verify the upstream container is running and on the correct network:

    # Print the container state and its networks (expect "running" and the backend network)
    docker inspect <container_name> --format '{{.State.Status}}: {{range $net,$conf := .NetworkSettings.Networks}}{{$net}} {{end}}'
    

  5. After config changes, reload (not restart) nginx:

    docker exec nginx-proxy nginx -s reload
    

Monitoring Stack Down

Symptom: Grafana unreachable, no alerts firing, Prometheus targets showing as down

  1. Check all monitoring containers:

    docker ps -a --filter "name=monitoring_" --format "table {{.Names}}\t{{.Status}}"
    

  2. If Prometheus is down:

    docker logs --tail 50 monitoring_prometheus
    # Common cause: bad config syntax after editing alert rules
    cd /opt/docker/aether/repo && make restart-monitoring
    

  3. If Loki is down (log gap risk):

    docker logs --tail 50 monitoring_loki
    # Check disk space — Loki will stop ingesting if disk is full
    df -h /
    du -sh /opt/docker/monitoring/loki/
    

  4. If Alloy (log collector) is down:

    docker logs --tail 50 monitoring_alloy
    # Alloy needs access to the Docker socket
    ls -la /var/run/docker.sock
    

  5. Restart the full monitoring stack:

    cd /opt/docker/aether/repo && make restart-monitoring
    

Warning

While the monitoring stack is down, no alerts will fire. Check containers manually with docker ps -a until monitoring is restored.
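
The manual check mentioned in the warning can be narrowed to the containers that need attention. The sketch below filters tab-separated `docker ps -a` output for anything whose status does not start with "Up"; `not_running` is a hypothetical helper name.

```shell
# not_running
# Reads "<name>\t<status>" lines on stdin, as produced by
#   docker ps -a --format "{{.Names}}\t{{.Status}}"
# and prints the names of containers whose status does not start with "Up".
not_running() {
  awk -F'\t' '$2 !~ /^Up/ { print $1 }'
}

# Usage:
#   docker ps -a --format "{{.Names}}\t{{.Status}}" | not_running
```

A container reported as "Up ... (unhealthy)" still counts as running here, so health status needs a separate look.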

Grafana Access Issues

If Grafana is unreachable at monitoring.groupe-suffren.com:

  1. Check the container:

    docker ps --filter name=monitoring_grafana
    

  2. Check nginx proxy config:

    docker exec nginx-proxy nginx -t
    cat /opt/docker/nginx/conf.d/monitoring.conf
    

  3. Check Grafana logs:

    docker logs monitoring_grafana --tail 50