P1

etcd Unhealthy — Kubernetes Cluster Recovery Guide

Diagnose and recover an unhealthy etcd cluster. Covers health checks, disk I/O issues, compaction, defragmentation, member recovery, and backup/restore.

20 min, 8 steps

Verify the health status of all etcd endpoints.

# If using kubeadm (etcd runs as a static pod named etcd-<node name>, usually the hostname):
kubectl -n kube-system exec etcd-$(hostname) -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --write-out=table
Expected: Each endpoint reports 'true' in the HEALTH column. If any report 'false', note which member is unhealthy.
etcd needs a quorum (a majority of members) to commit writes. A 3-member cluster can tolerate 1 failure. Without quorum, etcd rejects writes and linearizable reads, so the Kubernetes API server can no longer persist changes.
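
The quorum rule is simple integer arithmetic, and a short sketch makes the failure budget for common cluster sizes explicit. The function name `quorum_info` is made up for this illustration.

```shell
# Print the quorum size and the number of member failures a cluster survives.
# quorum = floor(members / 2) + 1; tolerated failures = members - quorum.
quorum_info() {
  members=$1
  quorum=$(( members / 2 + 1 ))
  tolerated=$(( members - quorum ))
  echo "members=$members quorum=$quorum tolerated_failures=$tolerated"
}

for n in 1 3 4 5; do
  quorum_info "$n"
done
```

Note that 4 members tolerate the same single failure as 3, which is why odd cluster sizes are the usual choice.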

Identify all cluster members and which is the current leader.

etcdctl member list --write-out=table
etcdctl endpoint status --write-out=table
Expected: The member list shows member ID, name, peer URLs, and client URLs. The endpoint status table shows which member is the leader, plus DB size and Raft index for each endpoint.
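
The bare `etcdctl` commands in this guide only work if the client already knows where to connect and which certificates to present. One way to arrange that, assuming `etcdctl` is installed on the control-plane node and the kubeadm certificate paths from the first step apply, is to export the connection settings once per shell session:

```shell
# Connection settings picked up by etcdctl from the environment.
# The paths are the kubeadm defaults used earlier in this guide; adjust them
# for clusters that keep etcd certificates elsewhere.
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
```

To query every member at once, set `ETCDCTL_ENDPOINTS` to a comma-separated list of all client URLs instead of the single local endpoint.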

Look at etcd logs for error patterns.

# Kubeadm:
kubectl -n kube-system logs etcd-$(hostname) --tail=50

# Systemd:
journalctl -u etcd --since '30 minutes ago' --no-pager | tail -50
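Expected: Common errors include 'request timed out' (often a slow disk), 'database space exceeded' (the DB hit its quota and needs compaction and defragmentation), and 'lost leader' (a network partition, or a member too slow to keep up with heartbeats).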
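
When the log is long, counting how often each pattern appears can show which problem dominates. The following is a minimal sketch; the function name `summarize_etcd_log` is hypothetical, and the pattern list is just the three substrings named above.

```shell
# Read etcd log text on stdin and print how many lines contain each pattern.
summarize_etcd_log() {
  log=$(cat)
  for pattern in 'request timed out' 'database space exceeded' 'lost leader'; do
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    count=$(printf '%s\n' "$log" | grep -c -F -- "$pattern" || true)
    echo "$count  $pattern"
  done
}
```

Usage, assuming a kubeadm cluster: `kubectl -n kube-system logs etcd-$(hostname) --tail=500 | summarize_etcd_log`.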

etcd is very sensitive to disk latency: the 99th percentile of fdatasync on the data disk should stay under 10ms.

# Check disk latency on etcd data directory:
iostat -x 1 3 | grep -A1 Device

# Benchmark etcd disk:
fio --name=etcd-bench --filename=/var/lib/etcd/bench --size=22m --rw=write --ioengine=sync --fdatasync=1 --bs=2300 --runtime=10 2>&1 | grep -A 7 'fsync/fdatasync'
Expected: In the sync percentiles that follow the 'fsync/fdatasync' line, the 99.00th value should be under 10ms (10,000 usec; check the unit fio prints). If it is higher, move etcd data to an SSD or a dedicated disk.
Remove the benchmark file after testing: rm /var/lib/etcd/bench
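
A synthetic benchmark measures the disk, not what etcd actually experienced. etcd also exposes a Prometheus histogram, `etcd_disk_wal_fsync_duration_seconds`, that records real WAL fsync latencies. The sketch below reads metrics text on stdin and reports the share of fsyncs that finished within the 8 ms bucket, the histogram boundary closest to the 10ms target. The function name `wal_fsync_within_8ms` is hypothetical, and the metrics URL in the usage line is an assumption based on the kubeadm default.

```shell
# Read Prometheus text-format metrics on stdin and print the percentage of
# WAL fsyncs that completed within 8 ms.
wal_fsync_within_8ms() {
  awk '
    $1 == "etcd_disk_wal_fsync_duration_seconds_bucket{le=\"0.008\"}" { fast = $2 }
    $1 == "etcd_disk_wal_fsync_duration_seconds_count"               { total = $2 }
    END {
      if (total > 0) printf "%.2f\n", 100 * fast / total
      else print "no samples"
    }'
}
```

Usage, assuming etcd serves metrics on http://127.0.0.1:2381: `curl -s http://127.0.0.1:2381/metrics | wal_fsync_within_8ms`. A result well below 99 suggests the disk is missing the latency target under real load.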

etcd DB grows with revisions. Compaction removes old revisions to reclaim space.

# Check current DB size in bytes for each endpoint:
etcdctl endpoint status --write-out=json | jq -r '.[] | "\(.Endpoint) \(.Status.dbSize)"'

# Get current revision:
REV=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')

# Compact old revisions:
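etcdctl compact "$REV"
Expected: DB size should be under the quota (2GB by default, set with --quota-backend-bytes). Compaction discards old revisions inside the database, but the file on disk does not shrink until it is defragmented.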
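
To see how close each member is to the quota, the byte counts can be turned into percentages. This is a sketch that assumes the default 2GB quota; the function name `db_quota_usage` and the `QUOTA_BYTES` variable are made up for the example, and the input is expected as "endpoint bytes" pairs, one per line.

```shell
# Default etcd backend quota (2 GiB). Change this if --quota-backend-bytes
# is set to something else on the cluster.
QUOTA_BYTES=$(( 2 * 1024 * 1024 * 1024 ))

# Read "endpoint db_size_in_bytes" lines on stdin and print quota usage.
db_quota_usage() {
  while read -r endpoint bytes; do
    pct=$(( bytes * 100 / QUOTA_BYTES ))
    echo "$endpoint ${pct}%"
  done
}
```

Usage: `etcdctl endpoint status --write-out=json | jq -r '.[] | "\(.Endpoint) \(.Status.dbSize)"' | db_quota_usage`.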

After compaction, defragmentation releases the freed space back to the filesystem.

etcdctl defrag --endpoints=https://ENDPOINT1:2379,https://ENDPOINT2:2379,https://ENDPOINT3:2379
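Expected: Each endpoint reports that defragmentation finished. The DB size shown by endpoint status should decrease.
Defrag blocks reads and writes on a member while it runs. etcdctl processes the listed endpoints one at a time, in the order given, so list non-leader members first and the leader last. Do NOT start defrag on several members in parallel.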
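
The ordering advice can be made explicit with a loop that defragments one endpoint at a time and stops at the first failure. This is a sketch, not a hardened script; the function name `defrag_each` and the endpoint arguments are placeholders.

```shell
# Defragment the given endpoints strictly one after another.
# Pass non-leader endpoints first and the leader last.
defrag_each() {
  for ep in "$@"; do
    echo "defragmenting $ep"
    # Stop immediately if a member fails, so the rest stay untouched.
    etcdctl defrag --endpoints="$ep" || return 1
  done
}

# Example call (placeholder endpoints, leader listed last):
#   defrag_each https://ENDPOINT2:2379 https://ENDPOINT3:2379 https://ENDPOINT1:2379
#
# If the logs showed 'database space exceeded', etcd has raised a NOSPACE
# alarm that stays active after space is reclaimed. Check and clear it:
#   etcdctl alarm list
#   etcdctl alarm disarm
```

Stopping on the first error is a deliberate choice here: a failed defrag usually deserves a look before touching the remaining members.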

Confirm etcd is healthy and the Kubernetes API server is responsive.

etcdctl endpoint health --write-out=table
# 'kubectl get cs' is deprecated since Kubernetes v1.19; query the readiness endpoint instead:
kubectl get --raw='/readyz?verbose'
Expected: All endpoints healthy. API server responds. Cluster components are running.
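
The API server can take a little while to recover after etcd comes back, so a single check may fail even though recovery succeeded. A small retry loop, sketched below, polls the plain `/readyz` endpoint, which returns the body `ok` when the server is ready. The function name `wait_for_apiserver` and the `RETRY_DELAY` variable are hypothetical.

```shell
# Poll the API server readiness endpoint until it answers "ok".
# $1 = maximum number of attempts (default 30).
# RETRY_DELAY = seconds to sleep between attempts (default 2).
wait_for_apiserver() {
  attempts=${1:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(kubectl get --raw='/readyz' 2>/dev/null)" = "ok" ]; then
      echo "API server ready"
      return 0
    fi
    i=$(( i + 1 ))
    sleep "${RETRY_DELAY:-2}"
  done
  echo "API server not ready after $attempts attempts" >&2
  return 1
}
```

Usage: `wait_for_apiserver 30 && kubectl get nodes`.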

Always keep a recent etcd snapshot for disaster recovery.

SNAP=/tmp/etcd-backup-$(date +%Y%m%d-%H%M%S).db
etcdctl snapshot save "$SNAP" && etcdctl snapshot status "$SNAP" --write-out=table
Expected: The snapshot is saved, and the status table shows its hash, revision, total keys, and total size. Copy the file off the node and off the cluster; /tmp is not durable storage.