Kubernetes is the de facto standard for container orchestration, but its complexity means things break in non-obvious ways. When a deployment fails at 2 AM, you need a systematic approach -- not frantic Googling. This guide covers the troubleshooting mindset, essential debugging commands, the most common errors you will encounter, and when to escalate.
Effective Kubernetes debugging follows a bottom-up approach: pods, then nodes, then cluster. Most issues originate at the pod level, so start there before expanding your investigation.
Pod Level → Is the container starting? Crashing? OOMKilled?
↓
Node Level → Is the node healthy? Enough resources? Disk pressure?
↓
Cluster Level → Is the scheduler working? DNS resolving? Network policies blocking traffic?
The questions to answer first:
- What is the pod status? Running, Pending, CrashLoopBackOff, ImagePullBackOff, or Error?
- What do the events say? kubectl describe almost always reveals the root cause.
These are the commands you will use in almost every debugging session. Build them interactively with the kubectl Builder if you need help with flags and output formats.
# Quick cluster health check
kubectl get nodes -o wide
kubectl get pods --all-namespaces | grep -v Running
# All pods in a namespace with resource usage
kubectl top pods -n my-namespace --sort-by=memory
# Recent cluster events (sorted by time)
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -30
# Full pod details including events
kubectl describe pod my-pod -n my-namespace
# Current logs
kubectl logs my-pod -n my-namespace
# Logs from the previous crash (critical for CrashLoopBackOff)
kubectl logs my-pod -n my-namespace --previous
# Multi-container pod -- specify container
kubectl logs my-pod -n my-namespace -c sidecar-container
# Follow logs in real time
kubectl logs my-pod -n my-namespace -f --tail=100
# Node conditions and resource allocation
kubectl describe node my-node | grep -A10 "Conditions:"
kubectl describe node my-node | grep -A6 "Allocated resources:"
# Pods running on a specific node
kubectl get pods --all-namespaces --field-selector spec.nodeName=my-node
# Node resource usage
kubectl top node
# Open a shell in a running pod
kubectl exec -it my-pod -n my-namespace -- /bin/sh
# Run a one-off command
kubectl exec my-pod -n my-namespace -- cat /etc/resolv.conf
# Debug a pod with no shell (ephemeral container)
kubectl debug my-pod -n my-namespace -it --image=busybox
Tip: The kubectl Builder has 17 actions with all flags pre-configured. Use it to build complex commands like exec, port-forward, or rollout without memorizing syntax.
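The quick health check above filters with grep -v Running, which still lists Completed Jobs and hides pods that are Running but not ready. A minimal sketch of a stricter filter follows; the function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods --all-namespaces.

```shell
# Read `kubectl get pods --all-namespaces` output on stdin and keep the header
# plus any pod that is not Running/Completed, or is Running but not fully ready.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }            # keep the header row
    $4 == "Completed" { next }         # finished Jobs are not a problem
    $4 != "Running" { print; next }    # Pending, CrashLoopBackOff, ...
    { split($3, ready, "/"); if (ready[1] != ready[2]) print }  # Running but e.g. 0/1
  '
}
```

Usage: kubectl get pods --all-namespaces | unhealthy_pods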
The container starts, crashes, restarts, and crashes again with exponential backoff (10s, 20s, 40s, up to 5 minutes).
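To make the backoff schedule concrete, here is a small sketch that computes the delay before a given restart attempt. It assumes only what is stated above (start at 10s, double each time, cap at 5 minutes); the function name backoff_delay is hypothetical.

```shell
# Print the delay in seconds before the Nth restart attempt:
# 10s for the first, doubling each time, capped at 300s (5 minutes).
backoff_delay() {
  attempt=$1
  delay=10
  i=1
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then delay=300; fi
  echo "$delay"
}
```

Running it for attempts 1 through 6 gives 10, 20, 40, 80, 160, and 300, so a pod that has restarted more than five times can sit idle for several minutes between attempts.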
Quick check:
kubectl logs my-pod -n my-namespace --previous
kubectl describe pod my-pod -n my-namespace | grep -A5 "Last State"
Common causes: OOMKilled, missing config, bad image entrypoint, failed liveness probe, missing dependencies.
For a complete debugging guide, see How to Fix CrashLoopBackOff in Kubernetes.
Kubernetes cannot pull the container image.
kubectl describe pod my-pod -n my-namespace | grep -A10 "Events"
# Look for: "Failed to pull image" or "401 Unauthorized"
Fixes:
- Create an imagePullSecret and reference it in the pod spec
- Check the image name for typos: ngnix instead of nginx is more common than you think
The container exceeded its memory limit and was terminated by the kernel.
kubectl describe pod my-pod -n my-namespace | grep -i oom
kubectl get pod my-pod -n my-namespace -o jsonpath='{.status.containerStatuses[0].lastState}'
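The jsonpath query above only inspects the first container. For multi-container pods, a sketch along these lines lists the last termination of every container and picks out the OOMKilled ones. The function names are hypothetical, and the pod and namespace arguments are the same placeholders used throughout this guide.

```shell
# Print "<container>\t<reason>\t<exit code>" for each container's last termination.
# Containers that never terminated produce empty reason and exit code fields.
last_terminations() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Print only the names of containers whose last termination reason was OOMKilled.
oom_killed_containers() {
  last_terminations "$1" "$2" | awk -F '\t' '$2 == "OOMKilled" { print $1 }'
}
```

Usage: oom_killed_containers my-pod my-namespace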
Fixes:
- Increase resources.limits.memory in the pod spec
- Set resources.requests.memory to the application's steady-state usage
The pod is created but cannot be scheduled to any node.
kubectl describe pod my-pod -n my-namespace | tail -15
# Look at the Events section for scheduling failures
Common causes: insufficient resources, node selector mismatch, taints/tolerations, PVC not bound.
For a deep dive, see Kubernetes Pod Stuck in Pending: Complete Debugging Guide. There is also an interactive Pod Stuck Pending Runbook with step-by-step commands.
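As a rough illustration of reading scheduling failures, the sketch below maps a FailedScheduling event message to the common causes listed above. The function name explain_pending is hypothetical, and because scheduler message wording varies between Kubernetes versions, it matches on keywords only and reports the first cause it recognizes.

```shell
# Map a scheduling failure message to a likely cause and next step.
# Keyword matching only: exact scheduler wording differs across versions.
explain_pending() {
  case "$1" in
    *Insufficient*)          echo "insufficient resources: scale nodes or reduce requests" ;;
    *selector*|*affinity*)   echo "node selector mismatch: fix nodeSelector or label nodes" ;;
    *taint*)                 echo "taints: add tolerations or remove taints" ;;
    *PersistentVolumeClaim*) echo "PVC not bound: check StorageClass and PV availability" ;;
    *)                       echo "unrecognized: read the full Events section" ;;
  esac
}
```

Usage: explain_pending "0/3 nodes are available: 3 Insufficient cpu." (the message shown here is an illustrative example, not captured output).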
The container cannot be created because of a configuration problem.
kubectl describe pod my-pod -n my-namespace | grep -A5 "Warning"
Causes:
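- Missing ConfigMap: verify with kubectl get configmap my-config -n my-namespace
- Missing Secret: verify with kubectl get secret my-secret -n my-namespace
- Wrong key: the referenced key does not exist in the resource's data field
A node has stopped communicating with the control plane.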
kubectl get nodes
kubectl describe node my-node | grep -A5 "Conditions"
Common causes: kubelet crashed, container runtime (containerd/Docker) down, disk pressure, memory exhaustion, network partition.
See the interactive Node NotReady Runbook for step-by-step diagnosis and resolution.
The application is running but other pods or external clients cannot connect.
# Verify the service exists and has endpoints
kubectl get svc my-service -n my-namespace
kubectl get endpoints my-service -n my-namespace
# If endpoints is empty, the selector does not match any pods
kubectl get pods -n my-namespace --show-labels
# Test connectivity from inside the cluster
kubectl run debug --rm -it --image=busybox -- wget -qO- http://my-service.my-namespace.svc.cluster.local:8080
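A Service only routes to a pod when every key=value pair in its selector is present in the pod's labels. The sketch below checks that rule for one selector and one label set, both comma-separated as they appear in the SELECTOR column of kubectl get svc -o wide and the LABELS column of --show-labels. The function name selector_matches is hypothetical.

```shell
# Return success if every key=value pair in the selector ($1) appears in the
# pod labels ($2). Both arguments are comma-separated key=value lists.
selector_matches() {
  selector=$1
  labels=",$2,"
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                # this pair is present on the pod
      *) IFS=$old_ifs; return 1 ;;   # one missing pair excludes the pod
    esac
  done
  IFS=$old_ifs
  return 0
}
```

For example, selector_matches "app=web" "app=web,pod-template-hash=5d4f" succeeds, while selector_matches "app=web,tier=api" "app=web" fails because the pod has no tier label.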
Fixes:
The ingress controller returns 502 Bad Gateway or 504 Gateway Timeout.
# Check ingress configuration
kubectl describe ingress my-ingress -n my-namespace
# Check ingress controller logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=50
# Verify backend service is healthy
kubectl get endpoints my-service -n my-namespace
502 (Bad Gateway): The backend pod is not responding. Check if the pod is running and the service port matches.
504 (Gateway Timeout): The backend is too slow. Increase timeout annotations on the ingress, or investigate application performance.
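As one way to raise the timeouts mentioned for 504 errors, the sketch below sets the proxy read and send timeout annotations on an Ingress. It assumes the ingress-nginx controller (the one whose logs are checked above); other controllers use different annotations. The function name set_ingress_timeouts is hypothetical.

```shell
# Set ingress-nginx proxy timeouts (in seconds) on an Ingress.
# Assumes the ingress-nginx controller; --overwrite replaces existing values.
set_ingress_timeouts() {
  ingress=$1
  namespace=$2
  seconds=$3
  kubectl annotate ingress "$ingress" -n "$namespace" --overwrite \
    "nginx.ingress.kubernetes.io/proxy-read-timeout=$seconds" \
    "nginx.ingress.kubernetes.io/proxy-send-timeout=$seconds"
}
```

Usage: set_ingress_timeouts my-ingress my-namespace 120. Raising timeouts only hides a slow backend, so treat it as a stopgap while investigating application performance.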
For networking-specific debugging, see Kubernetes Networking Troubleshooting.
When a Helm release goes wrong, fast rollback is your best friend. Use the Helm CLI Builder to construct these commands interactively.
# Check release status
helm status my-release -n my-namespace
# View release history (find the last working revision)
helm history my-release -n my-namespace
# Rollback to a specific revision
helm rollback my-release 3 -n my-namespace
# Rollback to the previous revision
helm rollback my-release 0 -n my-namespace
# Debug a failed install -- render templates without installing
helm template my-release my-chart/ --debug
# Diff between current and new values (requires helm-diff plugin)
helm diff upgrade my-release my-chart/ -f values.yaml
Best practices for Helm troubleshooting:
- Use --wait with helm upgrade so failures are caught immediately
- Keep --history-max set (default 10) to avoid unbounded release history consuming etcd
- Run helm get values my-release to see what values are actually applied
- Run helm get manifest my-release to see the rendered YAML that Kubernetes received
For a detailed guide, see Helm Charts: Install, Upgrade, and Rollback.
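The first two practices can be combined in a small wrapper. This is a sketch under stated assumptions: the function name helm_upgrade_checked is hypothetical, and the release, chart, and namespace arguments are the placeholders used above.

```shell
# Upgrade a release with --wait and a bounded history. On failure, print the
# release history so the last working revision is easy to find for a rollback.
helm_upgrade_checked() {
  release=$1
  chart=$2
  namespace=$3
  if helm upgrade "$release" "$chart" -n "$namespace" --wait --history-max 10; then
    echo "upgrade succeeded"
  else
    echo "upgrade failed; recent revisions:"
    helm history "$release" -n "$namespace"
    return 1
  fi
}
```

Usage: helm_upgrade_checked my-release my-chart/ my-namespace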
YAML errors are a frequent source of deployment failures. A misplaced indent or wrong type can cause cryptic errors.
# Validate YAML before applying
kubectl apply -f deployment.yaml --dry-run=client
# Server-side validation (catches more issues)
kubectl apply -f deployment.yaml --dry-run=server
# Diff against the live cluster state
kubectl diff -f deployment.yaml
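The three checks above can be chained so that validation stops at the first failure. One detail worth handling: kubectl diff exits with 0 when there are no differences, 1 when differences exist, and a higher code on error, so an exit code of 1 is not a failure. The function name validate_manifest below is hypothetical.

```shell
# Run client-side validation, then server-side validation, then a diff.
# kubectl diff exit codes: 0 = no changes, 1 = changes found, >1 = error.
validate_manifest() {
  file=$1
  kubectl apply -f "$file" --dry-run=client >/dev/null || { echo "client validation failed"; return 1; }
  kubectl apply -f "$file" --dry-run=server >/dev/null || { echo "server validation failed"; return 1; }
  status=0
  kubectl diff -f "$file" || status=$?
  case "$status" in
    0) echo "valid, no changes" ;;
    1) echo "valid, changes pending" ;;
    *) echo "diff failed"; return "$status" ;;
  esac
}
```

Usage: validate_manifest deployment.yaml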
Use the YAML/JSON Converter to validate YAML syntax and convert between YAML and JSON when you need to inspect a manifest programmatically.
Kubernetes secrets are Base64-encoded, not encrypted. Decoding them is a common debugging step.
# List secrets in a namespace
kubectl get secrets -n my-namespace
# View secret data (base64-encoded)
kubectl get secret my-secret -n my-namespace -o yaml
# Decode a specific key
kubectl get secret my-secret -n my-namespace -o jsonpath='{.data.password}' | base64 -d
# Decode all keys at once
kubectl get secret my-secret -n my-namespace -o json | jq -r '.data | to_entries[] | "\(.key): \(.value | @base64d)"'
Tip: Paste the entire kubectl get secret -o yaml output into the Base64 Encoder/Decoder with batch mode enabled. It decodes every key-value pair at once.
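If jq is not available, decoding can also be done with plain shell. The sketch below decodes lines of the form "key: <base64>", such as the indented entries under data: in the YAML output. The function name decode_secret_lines is hypothetical, and it expects only the data entries as input, not the whole manifest.

```shell
# Decode "key: <base64>" lines from stdin and print "key: <plaintext>".
# Feed it only the entries under `data:`; other YAML lines are not valid Base64.
decode_secret_lines() {
  while IFS=': ' read -r key value; do
    [ -n "$value" ] || continue
    printf '%s: %s\n' "$key" "$(printf '%s' "$value" | base64 -d)"
  done
}
```

For example, the input line "password: czNjcjN0" is printed as "password: s3cr3t". Like the jq variant, this prints secrets in plain text, so avoid running it where terminal output is logged.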
Pod not starting?
├── Status: Pending
│ ├── "Insufficient cpu/memory" → Scale nodes or reduce requests
│ ├── "MatchNodeSelector" → Fix nodeSelector or add labels to nodes
│ ├── "PodToleratesNodeTaints" → Add tolerations or remove taints
│ └── PVC Pending → Check StorageClass, PV availability
├── Status: ImagePullBackOff
│ ├── 401 Unauthorized → Create/fix imagePullSecret
│ ├── 404 Not Found → Check image name and tag
│ └── Rate limited → Use registry mirror or authenticated pull
├── Status: CrashLoopBackOff
│ ├── OOMKilled → Increase memory limits
│ ├── Exit code 1 → Check logs (--previous) for application error
│ ├── Exit code 137 → OOM or SIGKILL (check node memory)
│ └── Exit code 127 → Command not found (wrong entrypoint)
├── Status: CreateContainerConfigError
│ └── Missing ConfigMap/Secret → Create the referenced resource
└── Status: Running but not working
  ├── Service has no endpoints → Fix label selector
  ├── Readiness probe failing → Fix probe or application
  └── NetworkPolicy blocking → Check policy rules
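The exit codes in the CrashLoopBackOff branch follow a general convention: codes above 128 mean the process was killed by a signal, with the signal number equal to the code minus 128 (so 137 is SIGKILL, signal 9). A minimal sketch of that mapping, using the hints from the tree above; the function name explain_exit_code is hypothetical.

```shell
# Translate a container exit code into the hint from the decision tree.
# Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly" ;;
    1)   echo "application error: check logs with --previous" ;;
    127) echo "command not found: wrong entrypoint" ;;
    137) echo "SIGKILL (128 + 9): OOM kill or forced stop, check memory" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code $code"
      fi ;;
  esac
}
```

Usage: explain_exit_code 143 prints "killed by signal 15", which is SIGTERM.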
Not every issue can be resolved at the application level. Escalate to cluster administrators when:
Before escalating, gather:
kubectl cluster-info dump (full diagnostic bundle)