etcd Unhealthy — Kubernetes Cluster Recovery Guide
Diagnose and recover an unhealthy etcd cluster. Covers health checks, disk I/O issues, compaction, defragmentation, member recovery, and backup/restore.
20 min, 8 steps
Verify the health status of all etcd endpoints.
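# If using kubeadm (etcd runs as a static pod named etcd-<node-name>; this assumes the node name matches the hostname).
# --cluster checks every member in the member list, not only the local endpoint.
kubectl -n kube-system exec etcd-$(hostname) -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster --write-out=table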
Expected: Each endpoint shows 'true' for healthy. If any show 'false', note which member is unhealthy.
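On a cluster with many endpoints, it is faster to filter the table down to the failing rows. This is a minimal sketch, not an etcdctl feature. unhealthy_endpoints is a hypothetical helper, and it assumes the table layout printed by recent etcdctl versions, with ENDPOINT and HEALTH column headers.

```shell
# unhealthy_endpoints (hypothetical helper): reads the output of
# `etcdctl endpoint health --write-out=table` on stdin and prints every
# endpoint whose HEALTH column is not "true".
# Columns are located by header name rather than by position.
unhealthy_endpoints() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # The first row containing "|" is the header: record column positions.
    NF > 1 && !seen_header {
      for (i = 1; i <= NF; i++) {
        h = trim($i)
        if (h == "ENDPOINT") ep = i
        if (h == "HEALTH") hc = i
      }
      seen_header = 1
      next
    }
    # Data rows: skip wrapped continuation rows (empty ENDPOINT) and
    # report endpoints that are not healthy.
    NF > 1 && ep && hc && trim($ep) != "" && trim($hc) != "true" { print trim($ep) }
  '
}
```

For example: etcdctl endpoint health --cluster --write-out=table | unhealthy_endpoints. Because the function matches columns by header name, extra columns added in other etcdctl versions should not break it.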
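etcd requires a quorum: a majority of members (floor(N/2) + 1 of N) must be available. A 3-member cluster tolerates 1 failure, and a 5-member cluster tolerates 2. If quorum is lost, etcd rejects writes and linearizable reads, so the Kubernetes API server cannot persist changes. At most, serializable reads from a member's local data may still succeed.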
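The quorum rule is simple arithmetic. An N-member cluster needs floor(N/2) + 1 members, so it tolerates floor((N-1)/2) failures. The sketch below uses hypothetical helper names and prints the numbers for common cluster sizes.

```shell
# quorum_size N: members that must be available to form a majority.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# fault_tolerance N: members that can fail without losing quorum.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n")"
done
```

The output shows that 4 members tolerate no more failures than 3. This is why odd-sized clusters are the usual recommendation.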
Identify all cluster members and which is the current leader.
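# These commands assume the endpoint and TLS flags from the previous step, or the equivalent
# ETCDCTL_ENDPOINTS, ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY environment variables.
etcdctl member list --write-out=table
etcdctl endpoint status --cluster --write-out=table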
Expected: The member list shows each member's ID, status, name, peer URLs, and client URLs. The status table shows which member is the leader (IS LEADER column), plus each member's DB size and raft index.
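A healthy cluster reports exactly one leader. The following hedged sketch checks this automatically. The helper names are hypothetical, and the parsing assumes the ENDPOINT and IS LEADER column headers printed by recent etcdctl versions. Feed it the status of every member, not just one endpoint.

```shell
# leader_endpoints (hypothetical helper): reads
# `etcdctl endpoint status --write-out=table` on stdin and prints the
# endpoint(s) whose IS LEADER column is "true".
leader_endpoints() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Header row: remember which columns hold ENDPOINT and IS LEADER.
    NF > 1 && !seen_header {
      for (i = 1; i <= NF; i++) {
        h = trim($i)
        if (h == "ENDPOINT") ep = i
        if (h == "IS LEADER") ld = i
      }
      seen_header = 1
      next
    }
    NF > 1 && ep && ld && trim($ld) == "true" { print trim($ep) }
  '
}

# check_single_leader (hypothetical helper): succeed only when exactly one
# endpoint in the table on stdin reports itself as leader.
check_single_leader() {
  n=$(leader_endpoints | awk 'END { print NR }')
  if [ "$n" -eq 1 ]; then
    echo "ok: one leader"
  else
    echo "warning: $n leaders reported" >&2
    return 1
  fi
}
```

For example: etcdctl endpoint status --cluster --write-out=table | check_single_leader. Zero leaders usually points to lost quorum or a network partition.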
Look at etcd logs for error patterns.
# kubeadm:
kubectl -n kube-system logs etcd-$(hostname) --tail=50
# systemd:
journalctl -u etcd --since '30 minutes ago' --no-pager | tail -50
Expected: Common errors: 'request timed out' (slow disk), 'database space exceeded' (needs compaction), 'raft: lost leader' (network partition).
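To see which pattern dominates, tally the log lines. This rough sketch reads the log on stdin and matches only the three substrings named above. Exact message wording varies between etcd versions, so treat the counts as a hint rather than a diagnosis. The function name is hypothetical.

```shell
# triage_etcd_log (hypothetical helper): count log lines on stdin that
# contain each of the three error patterns and print one line per pattern.
triage_etcd_log() {
  awk '
    /request timed out/       { slow_disk++ }
    /database space exceeded/ { no_space++ }
    /lost leader/             { lost_leader++ }
    END {
      # Unset counters print as 0 with %d.
      printf "request timed out (slow disk?): %d\n", slow_disk
      printf "database space exceeded (compact + defrag): %d\n", no_space
      printf "lost leader (network partition?): %d\n", lost_leader
    }
  '
}
```

For example: kubectl -n kube-system logs etcd-$(hostname) --tail=500 | triage_etcd_log.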
etcd is very sensitive to disk latency: the 99th percentile of fsync (fdatasync) latency on its write-ahead log should stay below 10 ms.
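# Find the device that holds the etcd data directory, then watch its write latency (w_await, or await on older sysstat):
df /var/lib/etcd
iostat -x 1 3
# Benchmark etcd-style writes (fdatasync after every write) on the same disk; -A 7 keeps the percentile lines:
fio --name=etcd-bench --filename=/var/lib/etcd/bench --size=22m --bs=2300 \
  --rw=write --ioengine=sync --fdatasync=1 --runtime=10 2>&1 | grep -A 7 'fsync/fdatasync'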
Expected: The 99th percentile of fdatasync latency (the 99.00th entry under 'sync percentiles') should be below 10 ms. If it is higher, move the etcd data directory to an SSD or a dedicated disk.
Remove the benchmark file after testing: rm /var/lib/etcd/bench
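Reading the percentile table by eye is error-prone. This hedged sketch pulls the 99th percentile out of fio's 'sync percentiles' block and converts it to milliseconds. It assumes fio's usual text output, a 'sync percentiles (usec):' header followed by '99.00th=[ N]', which may differ between fio versions. Both function names are hypothetical.

```shell
# p99_fdatasync_ms (hypothetical helper): read fio text output on stdin and
# print the 99th-percentile fdatasync latency in milliseconds.
p99_fdatasync_ms() {
  awk '
    # Header such as "sync percentiles (usec):" - remember the unit.
    /sync percentiles/ {
      in_sync = 1
      unit = $0
      sub(/.*\(/, "", unit)
      sub(/\).*/, "", unit)
      next
    }
    # The first 99.00th entry after that header belongs to fdatasync.
    in_sync && /99\.00th=\[/ {
      v = $0
      sub(/.*99\.00th=\[ */, "", v)
      sub(/].*/, "", v)
      div = (unit == "nsec") ? 1000000 : ((unit == "usec") ? 1000 : 1)
      printf "%.3f\n", v / div
      exit
    }
  '
}

# p99_ok: exit status 0 when the given p99 value (ms) is below 10 ms.
p99_ok() { awk -v ms="$1" 'BEGIN { exit !(ms + 0 < 10) }'; }
```

For example, pipe the fio command's output (with or without the grep) into p99_fdatasync_ms, then pass the result to p99_ok.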
The etcd database grows as keys accumulate revisions. Compaction discards revisions older than a given revision, freeing space inside the database file for reuse.
# Check the current DB size (in bytes) on each endpoint:
etcdctl endpoint status --write-out=json | jq '.[] | {endpoint: .Endpoint, dbSize: .Status.dbSize}'
# Get the current revision:
REV=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
# Compact all revisions older than the current one:
etcdctl compact "$REV"
Expected: The DB size should stay well below the backend quota (2 GiB by default, set with --quota-backend-bytes). After compaction, old revisions are removed.
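To put the DB size in context, compare it with the backend quota. This small sketch assumes the default quota of 2 GiB (2147483648 bytes) unless a different --quota-backend-bytes value is passed as the second argument. The helper name and the 80% warning threshold are assumptions, not etcd defaults.

```shell
# quota_used_pct (hypothetical helper): integer percentage of the backend
# quota used by a DB of the given size.
# $1 = DB size in bytes, $2 = quota in bytes (defaults to 2 GiB).
quota_used_pct() {
  quota=${2:-2147483648}
  pct=$(( $1 * 100 / quota ))
  echo "$pct"
  # Assumed threshold: warn once 80% of the quota is used.
  if [ "$pct" -ge 80 ]; then
    echo "warning: DB at ${pct}% of quota; compact and defrag" >&2
  fi
}
```

For example: quota_used_pct "$(etcdctl endpoint status --write-out=json | jq '.[0].Status.dbSize')".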
After compaction, defrag reclaims the actual disk space.
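# Defragment one member at a time; order the commands so the current leader comes last:
etcdctl defrag --endpoints=https://ENDPOINT1:2379
etcdctl defrag --endpoints=https://ENDPOINT2:2379
etcdctl defrag --endpoints=https://ENDPOINT3:2379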
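Expected: Each endpoint reports that defragmentation finished, and the DB file size on disk should decrease. If etcd had raised a NOSPACE alarm ('database space exceeded'), clear it with 'etcdctl alarm disarm' once the DB size is back under the quota. The cluster keeps rejecting writes until the alarm is cleared.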
Defragmenting a member blocks reads and writes on that member until it finishes. Defragment followers first and the leader last, one member at a time. Do NOT defragment all members simultaneously.
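Following this rule by hand means working out the order each time. This minimal sketch prints the endpoints with the leader last. defrag_order is a hypothetical helper, not an etcdctl subcommand. The leader endpoint is assumed to come from the IS LEADER column of 'etcdctl endpoint status --write-out=table'. The commented loop assumes etcdctl's TLS settings come from ETCDCTL_* environment variables.

```shell
# defrag_order (hypothetical helper): print endpoints one per line with the
# leader last. $1 = leader endpoint, remaining args = all member endpoints.
defrag_order() {
  leader=$1
  shift
  for ep in "$@"; do
    if [ "$ep" != "$leader" ]; then
      echo "$ep"
    fi
  done
  echo "$leader"
}

# Example (leader assumed to be ENDPOINT2), stopping at the first failure:
#   defrag_order https://ENDPOINT2:2379 https://ENDPOINT1:2379 \
#     https://ENDPOINT2:2379 https://ENDPOINT3:2379 |
#   while read -r ep; do etcdctl defrag --endpoints="$ep" || break; done
```

Running the defrags strictly one after another, and stopping on the first error, avoids having more than one member blocked at a time.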
Confirm etcd is healthy and the Kubernetes API server is responsive.
etcdctl endpoint health --cluster --write-out=table
# 'kubectl get cs' (componentstatuses) is deprecated since Kubernetes v1.19; the readyz endpoint includes an etcd check:
kubectl get --raw='/readyz?verbose'
Expected: All endpoints healthy. API server responds. Cluster components are running.
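In the verbose readyz output, passing checks start with [+] and failing checks with [-]. This hedged sketch prints only the failing check names. When the check fails, kubectl may embed the response body in a one-line error message containing literal \n sequences, so the sketch expands those first. The helper name is hypothetical.

```shell
# failed_readyz_checks (hypothetical helper): read readyz?verbose output on
# stdin and print the name of every check reported as failed ("[-]name ...").
failed_readyz_checks() {
  awk '{
    # Expand literal backslash-n sequences into real line breaks.
    gsub(/\\n/, "\n")
    n = split($0, parts, "\n")
    for (i = 1; i <= n; i++) {
      p = index(parts[i], "[-]")
      if (p > 0) {
        name = substr(parts[i], p + 3)
        sub(/ .*/, "", name)   # keep only the check name
        print name
      }
    }
  }'
}
```

For example: kubectl get --raw='/readyz?verbose' 2>&1 | failed_readyz_checks. An empty result means no check reported a failure.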
Always keep a recent etcd snapshot for disaster recovery.
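# snapshot save must target a single endpoint (one URL in --endpoints).
SNAPSHOT=/tmp/etcd-backup-$(date +%Y%m%d-%H%M%S).db
etcdctl snapshot save "$SNAPSHOT"
# On etcd v3.5+, 'etcdutl snapshot status' replaces the deprecated etcdctl form below.
etcdctl snapshot status "$SNAPSHOT" --write-out=table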
Expected: Snapshot saved with hash, revision, total keys, and DB size. Store this backup off-cluster.
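"Recent" is easier to enforce if it is checked. This hedged sketch succeeds only if a directory holds an etcd-backup-*.db file modified within the last N hours. The helper name and the 24-hour default are assumptions. It relies on find's -maxdepth and -mmin options, which GNU and BSD find provide but POSIX does not require.

```shell
# latest_snapshot_fresh (hypothetical helper): succeed if directory $1
# contains an etcd-backup-*.db file modified within the last $2 hours
# (default 24); otherwise print a warning and return 1.
latest_snapshot_fresh() {
  dir=$1
  hours=${2:-24}
  recent=$(find "$dir" -maxdepth 1 -name 'etcd-backup-*.db' \
    -mmin -"$((hours * 60))" | head -n 1)
  if [ -n "$recent" ]; then
    echo "ok: recent snapshot $recent"
  else
    echo "warning: no snapshot newer than ${hours}h in $dir" >&2
    return 1
  fi
}
```

For example, run latest_snapshot_fresh /tmp 24 from a cron job or monitoring check. Apply the same check to the off-cluster copy as well.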