Step-by-step playbooks for when things go sideways. Interactive checklists with copy-paste commands.
Diagnose and resolve a full disk on Linux. Covers finding large files, cleaning logs, package caches, and reclaiming space from deleted-but-open files.
Identify and resolve processes causing high CPU utilization on Linux. Covers real-time monitoring, runaway process detection, and safe mitigation options.
Diagnose memory pressure and OOM killer events on Linux. Covers identifying memory-hungry processes, leak detection, cache management, and service recovery.
Diagnose and fix SSH connection failures. Covers network connectivity, sshd configuration, firewall rules, fail2ban, and daemon troubleshooting.
Diagnose and fix a systemd service that fails to start or keeps crashing. Covers unit file inspection, dependency resolution, port conflicts, and manual testing.
Diagnose and fix a Docker container that keeps restarting or exiting immediately. Covers logs, OOM detection, interactive debugging, volume permissions, and configuration errors.
Reclaim disk space consumed by unused Docker images, volumes, and build cache. Covers safe incremental cleanup and emergency full prune.
Troubleshoot Docker container networking failures. Covers DNS resolution, inter-container communication, network inspection, and firewall rules.
Diagnose and fix Nginx 502 errors caused by unreachable or crashed backend services. Covers error log analysis, upstream health checks, and service recovery.
Diagnose and fix Nginx 504 timeout errors caused by slow backend responses. Covers response time measurement, backend diagnostics, timeout tuning, and database query analysis.
Renew an expired or expiring SSL/TLS certificate. Covers Let's Encrypt/certbot renewal, manual certificate replacement, web server reload, and auto-renewal setup.
Troubleshoot PostgreSQL connection failures. Covers service status, listen address configuration, pg_hba.conf authentication rules, connection limits, and restart procedures.
Diagnose and fix replication lag between MySQL/MariaDB primary and replica servers. Covers IO/SQL thread status, binary log analysis, and safe error recovery.
Diagnose why a Kubernetes pod won't schedule. Covers insufficient resources, node conditions, taints/tolerations, affinity rules, and PVC binding issues.
Diagnose and fix a Kubernetes node in NotReady state. Covers kubelet health, container runtime, resource exhaustion, and node condition analysis.
Diagnose and fix Kubernetes pods killed by the OOM (Out Of Memory) killer. Covers memory limit analysis, resource tuning, memory leak detection, and node memory pressure.
Diagnose why Kubernetes can't pull a container image. Covers image name typos, registry auth, pull secrets, network issues, and rate limits.
Diagnose and recover an unhealthy etcd cluster. Covers health checks, disk I/O issues, compaction, defragmentation, member recovery, and backup/restore.
Diagnose TLS/SSL connection failures. Covers certificate issues, version mismatches, cipher incompatibility, chain problems, SNI, and mTLS failures.
Diagnose DNS resolution failures on Linux systems and Kubernetes clusters. Covers resolver config, upstream DNS, systemd-resolved, CoreDNS, and DNSSEC issues.
Diagnose high memory usage on Linux. Covers understanding free vs available memory, finding memory-hungry processes, detecting leaks, managing swap, and cache behavior.
Diagnose and fix a Docker daemon (dockerd) that won't start. Covers configuration errors, storage driver issues, disk space, socket problems, and containerd dependency.
Diagnose and fix OOM errors in Redis. Covers reading memory stats, finding big keys, choosing a sane eviction policy, and spotting fragmentation before it triggers a kernel OOM kill.
Fix a certbot renewal that's stuck failing. Covers HTTP-01 + DNS-01 challenges, rate limits, webroot + authenticator mismatches, and getting the cert pushed to services that cached the old chain.
Fix a target stuck in `DOWN` on Prometheus's /targets page. Covers network reachability, /metrics format, relabeling mistakes, TLS/auth, and scrape timing.
Recover from a stuck Terraform state lock (DynamoDB, Consul, or S3 backend) without corrupting state. Covers identifying the holder, verifying the prior run actually died, and when force-unlock is safe.
Fix a WireGuard tunnel where peers can't reach each other. Covers handshake failure, AllowedIPs + routing, UDP firewalling, IP forwarding, and NAT/masquerade for site-to-site setups.
Click through symptoms to diagnose why your Kubernetes pod won't start. Covers CrashLoopBackOff, ImagePullBackOff, Pending, and Error states with targeted fix commands.
Diagnose why a Kubernetes service can't be reached. Walks through pod connectivity, service selectors, endpoints, network policies, and ingress configuration.
Interactively diagnose and fix a full disk on Linux. Branches into /var issues, Docker cleanup, log rotation, inode exhaustion, and deleted-but-open files.
Diagnose high CPU or memory usage on Linux. Branches into process identification, OOM killer analysis, memory leaks, CPU steal, and swap management with targeted fix commands.
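
The full-disk playbooks above separate two failure modes that look alike from the outside: running out of block space and running out of inodes. A minimal Python sketch of that first triage step is shown here. It is an illustration, not part of any playbook: the names `FsUsage`, `usage_from_counts`, `fs_usage`, and `diagnose` are hypothetical, and the 90% threshold is an assumed cutoff. The percentages count root-reserved blocks as used, so they can read slightly higher than `df` does.

```python
import os
from typing import NamedTuple


class FsUsage(NamedTuple):
    block_pct: float  # percent of data blocks not available to unprivileged users
    inode_pct: float  # percent of inodes not available to unprivileged users


def usage_from_counts(blocks: int, blocks_avail: int,
                      inodes: int, inodes_avail: int) -> FsUsage:
    """Turn raw statvfs-style counters into usage percentages."""
    def pct(total: int, avail: int) -> float:
        # Some filesystems report zero total inodes; treat that as 0% used.
        return 0.0 if total == 0 else 100.0 * (total - avail) / total
    return FsUsage(pct(blocks, blocks_avail), pct(inodes, inodes_avail))


def fs_usage(path: str = "/") -> FsUsage:
    """Read the counters for the filesystem containing `path` (Unix only)."""
    st = os.statvfs(path)
    return usage_from_counts(st.f_blocks, st.f_bavail, st.f_files, st.f_favail)


def diagnose(usage: FsUsage, threshold: float = 90.0) -> str:
    """Classify the filesystem; the threshold is an assumed cutoff."""
    if usage.inode_pct >= threshold and usage.block_pct < threshold:
        return "inode exhaustion"
    if usage.block_pct >= threshold:
        return "block space exhaustion"
    return "ok"


if __name__ == "__main__":
    usage = fs_usage("/")
    print(f"blocks {usage.block_pct:.1f}%  inodes {usage.inode_pct:.1f}%  "
          f"-> {diagnose(usage)}")
```

A result of "inode exhaustion" suggests looking for directories holding very many small files rather than for large files. "Block space exhaustion" points toward the large-file, log, and cache cleanup steps.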