Step-by-step playbooks for when things go sideways. Interactive checklists with copy-paste commands.
Diagnose memory pressure and OOM killer events on Linux. Covers identifying memory-hungry processes, leak detection, cache management, and service recovery.
Diagnose and fix Nginx 502 errors caused by unreachable or crashed backend services. Covers error log analysis, upstream health checks, and service recovery.
Diagnose and fix replication lag between MySQL/MariaDB primary and replica servers. Covers IO/SQL thread status, binary log analysis, and safe error recovery.
Diagnose and recover an unhealthy etcd cluster. Covers health checks, disk I/O issues, compaction, defragmentation, member recovery, and backup/restore.