Step-by-step playbooks for when things go sideways. Interactive checklists with copy-paste commands.
Diagnose memory pressure and OOM killer events on Linux. Covers identifying memory-hungry processes, leak detection, cache management, and service recovery.
Diagnose and fix a systemd service that fails to start or keeps crashing. Covers unit file inspection, dependency resolution, port conflicts, and manual testing.
Diagnose and fix Nginx 502 errors caused by unreachable or crashed backend services. Covers error log analysis, upstream health checks, and service recovery.
Troubleshoot PostgreSQL connection failures. Covers service status, listen address configuration, pg_hba.conf authentication rules, connection limits, and restart procedures.
Fix a certbot renewal that keeps failing. Covers HTTP-01 and DNS-01 challenges, rate limits, webroot and authenticator mismatches, and getting the renewed cert pushed to services that cached the old chain.
Diagnose why a Kubernetes service can't be reached. Walks through pod connectivity, service selectors, endpoints, network policies, and ingress configuration.