Runbooks

Common monitoring scenarios and how to handle them.

Container Down / Restarting

Alert: ContainerDown, ContainerRestarted, HeliosContainerDown

  1. Check which container is affected:

    docker ps -a --filter "status=exited" --format "{{.Names}}\t{{.Status}}"
    

  2. Check container logs for the crash reason:

    docker logs --tail 100 <container_name>
    

  3. Check if OOM killed:

    docker inspect <container_name> | grep -A5 "State"
    

  4. Restart if needed:

    # For app containers — use the app's Makefile
    cd /opt/docker/aletheia/repo && make restart ENV=prod
    
    # For infra containers — use Aether
    cd /opt/docker/aether/repo && make restart-monitoring
    
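
The OOM check in step 3 can be made more direct: `docker inspect` accepts a Go template, so the relevant `State` fields can be printed without grepping. A sketch, with `<container_name>` as a placeholder:

```shell
# Print the OOM flag, exit code, and finish time from the container's State.
docker inspect --format \
  'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}} FinishedAt={{.State.FinishedAt}}' \
  <container_name>
```

`OOMKilled=true` means the kernel killed the process for exceeding its memory limit; exit code 137 (128 + SIGKILL) usually points the same way.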

High Memory Usage

Alert: HighMemoryUsage, ContainerHighMemory

  1. Check per-container memory in the Docker Grafana dashboard
  2. Identify the offending container:
    docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
    
  3. For Celery workers: check for memory leaks in long-running tasks
  4. For PostgreSQL: check work_mem and active query count in the PostgreSQL dashboard
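
For step 4, the same values the dashboard shows can be read straight from the database. A sketch using the shared_postgres container and admin user from the PostgreSQL section of this runbook:

```shell
# Current work_mem setting (per-sort/hash memory budget).
docker exec shared_postgres psql -U admin -c "SHOW work_mem;"

# Number of active (non-idle) backends, grouped by database.
docker exec shared_postgres psql -U admin -c \
  "SELECT datname, count(*) FROM pg_stat_activity WHERE state != 'idle' GROUP BY datname;"
```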

Disk Space Low

Alert: DiskSpaceLow

  1. Check disk usage:

    df -h /
    du -sh /opt/docker/*/  | sort -rh | head -20
    

  2. Common space consumers:

     - Docker images: check usage with docker system df, reclaim with docker system prune
     - Loki logs: check the /opt/docker/monitoring/loki/ volume
     - PostgreSQL WAL: check /var/lib/postgresql/18/docker/pg_wal/
     - Backup archives: check /opt/docker/backups/

  3. Clean Docker resources:

    docker system prune -f          # dangling images, stopped containers
    docker image prune -a -f        # unused images (use with caution)
    

PostgreSQL Issues

Alert: PostgresConnectionsHigh, PostgresDown

  1. Check connection count:

    docker exec shared_postgres psql -U admin -c \
      "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"
    

  2. Find long-running queries:

    docker exec shared_postgres psql -U admin -c \
      "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
       FROM pg_stat_activity
       WHERE state != 'idle'
       ORDER BY duration DESC
       LIMIT 10;"
    

  3. Kill a stuck query (last resort):

    docker exec shared_postgres psql -U admin -c "SELECT pg_terminate_backend(<pid>);"
    
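
A gentler first attempt is pg_cancel_backend, which cancels the backend's current query but keeps its connection open; reach for pg_terminate_backend only if the cancel has no effect. Same `<pid>` placeholder as above:

```shell
# Cancel the running query without dropping the client connection.
docker exec shared_postgres psql -U admin -c "SELECT pg_cancel_backend(<pid>);"
```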

Celery Queue Backlog

Alert: CeleryQueueBacklog, CeleryWorkerDown, CeleryHighFailRate

  1. Check queue depth in the Celery Grafana dashboard
  2. Check worker status:
    cd /opt/docker/aletheia/repo && make celery-logs ENV=prod
    
  3. Check for stuck tasks: look for tasks running longer than expected in the dashboard (p99 runtime panel)
  4. Restart workers if needed:
    cd /opt/docker/aletheia/repo && make restart ENV=prod
    
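
If the Celery CLI is usable inside the worker container, the currently executing tasks can also be listed directly, which helps confirm a stuck task. A sketch; `<worker_container>` and `<app_module>` are placeholders for this deployment:

```shell
# Show tasks currently executing on each worker, including their arguments.
docker exec <worker_container> celery -A <app_module> inspect active
```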

SSL Certificate Expiring

Alert: SslCertExpiringSoon, HeliosSSLExpiringSoon

Certificates should auto-renew via certbot. If they're not:

  1. Check certbot logs:

    docker logs certbot --tail 50
    

  2. Force renewal:

    docker exec certbot certbot renew --force-renewal
    

  3. Reload nginx after renewal:

    docker exec nginx-proxy nginx -s reload
    
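
To confirm nginx is actually serving the renewed certificate, check the expiry date of the live certificate from the server (domain taken from the health-check section as an example):

```shell
# Print the notAfter date of the certificate currently served on port 443.
echo | openssl s_client -servername aletheia.groupe-suffren.com \
  -connect aletheia.groupe-suffren.com:443 2>/dev/null \
  | openssl x509 -noout -enddate
```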

Health Check Failing

Alert: HealthCheckFailing, HeliosHealthCheckFailing

  1. Check from the server itself:

    curl -sI https://aletheia.groupe-suffren.com/health/
    curl -sI https://cabinet-dentaire-aubagne.fr/
    

  2. If the app responds locally but not externally: check nginx config and DNS

  3. If the app doesn't respond locally: check container status and logs (see "Container Down" above)
  4. Check the blackbox targets in Prometheus UI (http://localhost:9090/targets) for specific probe failures
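
The probe status shown in the Prometheus UI is also available over its HTTP API; assuming jq is installed on the host, the failing targets and their last errors can be listed in one line:

```shell
# List non-healthy scrape targets with their last error (requires jq).
curl -s http://localhost:9090/api/v1/targets | jq -r \
  '.data.activeTargets[] | select(.health != "up") | "\(.labels.instance)\t\(.lastError)"'
```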

Grafana Access Issues

If Grafana is unreachable at monitoring.groupe-suffren.com:

  1. Check the container:

    docker ps --filter name=monitoring_grafana
    

  2. Check nginx proxy config:

    docker exec nginx-proxy nginx -t
    cat /opt/docker/nginx/conf.d/monitoring.conf
    

  3. Check Grafana logs:

    docker logs monitoring_grafana --tail 50
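
If the container is up but the proxy path is in doubt, Grafana's health endpoint can be probed directly on the host, bypassing nginx. Port 3000 is an assumption (the Grafana default); adjust if the compose file maps it differently:

```shell
# A healthy Grafana returns JSON including "database": "ok".
curl -s http://localhost:3000/api/health
```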