Kubernetes Troubleshooting Guide — Debug Pods, Services & Networking

Kubernetes is the de facto standard for container orchestration, but its complexity means things break in non-obvious ways. When a deployment fails at 2 AM, you need a systematic approach -- not frantic Googling. This guide covers the troubleshooting mindset, essential debugging commands, the most common errors you will encounter, and when to escalate.

The Kubernetes Troubleshooting Mindset

Effective Kubernetes debugging follows a bottom-up approach: pods, then nodes, then cluster. Most issues originate at the pod level, so start there before expanding your investigation.

Pod Level          → Is the container starting? Crashing? OOMKilled?
  ↓
Node Level         → Is the node healthy? Enough resources? Disk pressure?
  ↓
Cluster Level      → Is the scheduler working? DNS resolving? Network policies blocking traffic?

The three questions to answer first:

What is the pod's status? -- Running, Pending, CrashLoopBackOff, ImagePullBackOff, or Error?
What do the events say? -- kubectl describe almost always reveals the root cause
What do the logs say? -- If the container started at all, the application logs tell you why it crashed

Essential kubectl Commands for Debugging

These are the commands you will use in almost every debugging session. Build them interactively with the kubectl Builder if you need help with flags and output formats.

Get an Overview

# Quick cluster health check
kubectl get nodes -o wide
kubectl get pods --all-namespaces | grep -v Running

# Resource usage for all pods in a namespace (requires metrics-server)
kubectl top pods -n my-namespace --sort-by=memory

# Recent cluster events (sorted by time)
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -30
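
One caveat with grep -v Running: it hides pods that are Running but not Ready (for example 0/1), and it still lists Completed job pods. A minimal sketch of a stricter filter follows. It assumes the default column layout of kubectl get pods --all-namespaces --no-headers (NAMESPACE, NAME, READY, STATUS, ...), and the function name unhealthy_pods is hypothetical:

```shell
# Read `kubectl get pods --all-namespaces --no-headers` output on stdin and
# print pods that are neither Completed nor fully Ready and Running.
# Assumes column 3 is READY (ready/total) and column 4 is STATUS.
unhealthy_pods() {
  awk '{
    split($3, ready, "/")
    if ($4 == "Completed") next
    if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2 ": " $4 " (" $3 ")"
  }'
}

# Usage against a live cluster:
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

Each output line has the form namespace/pod: STATUS (READY), which makes half-ready pods stand out.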

Investigate a Specific Pod

# Full pod details including events
kubectl describe pod my-pod -n my-namespace

# Current logs
kubectl logs my-pod -n my-namespace

# Logs from the previous crash (critical for CrashLoopBackOff)
kubectl logs my-pod -n my-namespace --previous

# Multi-container pod -- specify container
kubectl logs my-pod -n my-namespace -c sidecar-container

# Follow logs in real time
kubectl logs my-pod -n my-namespace -f --tail=100

Investigate a Node

# Node conditions and resource allocation
kubectl describe node my-node | grep -A10 "Conditions:"
kubectl describe node my-node | grep -A6 "Allocated resources:"

# Pods running on a specific node
kubectl get pods --all-namespaces --field-selector spec.nodeName=my-node

# Node resource usage
kubectl top node

Execute Commands in a Running Container

# Open a shell in a running pod
kubectl exec -it my-pod -n my-namespace -- /bin/sh

# Run a one-off command
kubectl exec my-pod -n my-namespace -- cat /etc/resolv.conf

# Debug a pod with no shell (ephemeral container)
kubectl debug my-pod -n my-namespace -it --image=busybox

Tip: The kubectl Builder has 17 actions with all flags pre-configured. Use it to build complex commands like exec, port-forward, or rollout without memorizing syntax.

The 8 Most Common Kubernetes Errors

1. CrashLoopBackOff

The container starts, crashes, restarts, and crashes again with exponential backoff (10s, 20s, 40s, up to 5 minutes).
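
To make the backoff concrete, here is a small sketch that prints the delay sequence described above (start at 10s, double, cap at 5 minutes). It only illustrates the arithmetic and does not talk to a cluster; the function name backoff_schedule is hypothetical:

```shell
# Print the first N restart delays: start at 10s, double each time, cap at 300s.
# The numbers come from the description above; N is a hypothetical parameter.
backoff_schedule() {
  n="${1:-6}"
  delay=10
  cap=300
  i=0
  while [ "$i" -lt "$n" ]; do
    echo "${delay}s"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
    i=$((i + 1))
  done
}

# Prints 10s, 20s, 40s, 80s, 160s, 300s, 300s (one per line)
backoff_schedule 7
```

After about six crashes the pod is retried only every five minutes, which is why a pod can take a while to restart even after you have fixed the underlying problem.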

Quick check:

kubectl logs my-pod -n my-namespace --previous
kubectl describe pod my-pod -n my-namespace | grep -A5 "Last State"

Common causes: OOMKilled, missing config, bad image entrypoint, failed liveness probe, missing dependencies.

For a complete debugging guide, see How to Fix CrashLoopBackOff in Kubernetes.

2. ImagePullBackOff

Kubernetes cannot pull the container image.

kubectl describe pod my-pod -n my-namespace | grep -A10 "Events:"
# Look for: "Failed to pull image" or "401 Unauthorized"

Fixes:

Wrong image tag: verify the tag exists in your registry
Private registry: create an imagePullSecret and reference it in the pod spec
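Rate limiting: Docker Hub rate-limits anonymous pulls per IP address (100 per 6 hours at the time of writing; the limits have changed over time, so check Docker's current documentation). Authenticate your pulls, use a paid account, or use a registry mirror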
Typo in image name: ngnix instead of nginx is more common than you think

3. OOMKilled

The container exceeded its memory limit and was terminated by the kernel.

kubectl describe pod my-pod -n my-namespace | grep -i oom
kubectl get pod my-pod -n my-namespace -o jsonpath='{.status.containerStatuses[0].lastState}'

Fixes:

Increase resources.limits.memory in the pod spec
Profile your application's actual memory usage before setting limits
Check for memory leaks (Java heap, Go goroutine leaks, Node.js event listeners)
Set resources.requests.memory to the application's steady-state usage

4. Pod Pending

The pod is created but cannot be scheduled to any node.

kubectl describe pod my-pod -n my-namespace | tail -15
# Look at the Events section for scheduling failures

Common causes: insufficient resources, node selector mismatch, taints/tolerations, PVC not bound.
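
The causes above usually show up as recognizable fragments in the FailedScheduling event message. The sketch below maps a message to a likely cause. The matched fragments are assumptions based on the wording used by recent scheduler versions (older versions phrase them differently), and the function name classify_pending is hypothetical:

```shell
# Map a FailedScheduling event message to a likely cause.
# The matched fragments are assumptions; scheduler wording varies by version.
classify_pending() {
  case "$1" in
    *"Insufficient "*)                            echo "insufficient resources" ;;
    *"node affinity/selector"*)                   echo "node selector mismatch" ;;
    *"untolerated taint"*|*"didn't tolerate"*)    echo "taints/tolerations" ;;
    *"unbound immediate PersistentVolumeClaims"*) echo "PVC not bound" ;;
    *)                                            echo "unknown: read the full event" ;;
  esac
}

# Illustrative message, not captured from a real cluster
classify_pending "0/3 nodes are available: 3 Insufficient cpu."
```

Treat the result as a starting point only; the full event text from kubectl describe remains the source of truth.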

For a deep dive, see Kubernetes Pod Stuck in Pending: Complete Debugging Guide. There is also an interactive Pod Stuck Pending Runbook with step-by-step commands.

5. CreateContainerConfigError

The container cannot be created because of a configuration problem.

kubectl describe pod my-pod -n my-namespace | grep -A5 "Warning"

Causes:

Referenced ConfigMap does not exist: kubectl get configmap my-config -n my-namespace
Referenced Secret does not exist: kubectl get secret my-secret -n my-namespace
Key missing from ConfigMap/Secret: check the data field
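ServiceAccount not found (a missing ServiceAccount usually blocks pod creation outright, so also check for FailedCreate events on the ReplicaSet: kubectl describe rs -n my-namespace)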

6. Node NotReady

A node has stopped communicating with the control plane.

kubectl get nodes
kubectl describe node my-node | grep -A5 "Conditions"

Common causes: kubelet crashed, container runtime (containerd/Docker) down, disk pressure, memory exhaustion, network partition.

See the interactive Node NotReady Runbook for step-by-step diagnosis and resolution.

7. Service Not Reachable

The application is running but other pods or external clients cannot connect.

# Verify the service exists and has endpoints
kubectl get svc my-service -n my-namespace
kubectl get endpoints my-service -n my-namespace

# If ENDPOINTS is empty, no Ready pod matches the service selector
kubectl get pods -n my-namespace --show-labels

# Test connectivity from inside the cluster
kubectl run debug --rm -it --image=busybox -- wget -qO- http://my-service.my-namespace.svc.cluster.local:8080

Common causes:

Service selector does not match pod labels (most common)
Target port does not match the container's listening port
NetworkPolicy blocking traffic
Pod readiness probe failing (endpoint not registered)
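
Since a selector mismatch is the most common cause, it helps to remember the rule: every key=value pair in the service selector must be present in the pod's labels (extra pod labels are fine). A minimal sketch of that subset check follows. It assumes both sides are written as comma-separated key=value lists, the format that --show-labels prints; the function name selector_matches is hypothetical:

```shell
# Succeed (exit 0) if every key=value pair in the selector ($1) also appears
# in the pod labels ($2). Both are comma-separated, e.g. "app=web,tier=frontend".
# The body runs in a subshell so the IFS and globbing changes stay local.
selector_matches() (
  set -f
  labels=",$2,"
  IFS=','
  for pair in $1; do
    case "$labels" in
      *",$pair,"*) ;;
      *) exit 1 ;;
    esac
  done
  exit 0
)

# Illustrative values: selector app=web against a pod labelled app=web,tier=frontend
if selector_matches "app=web" "app=web,tier=frontend"; then
  echo "match"
else
  echo "no match"
fi
```

The match is exact, so app=web does not match a pod labelled app=web-v2; a single wrong character in a label value leaves the service without endpoints.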

8. Ingress 502/504 Errors

The ingress controller returns 502 Bad Gateway or 504 Gateway Timeout.

# Check ingress configuration
kubectl describe ingress my-ingress -n my-namespace

# Check ingress controller logs (assumes ingress-nginx with its default namespace and deployment name)
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=50

# Verify backend service is healthy
kubectl get endpoints my-service -n my-namespace

502 (Bad Gateway): The backend pod is not responding. Check if the pod is running and the service port matches.

504 (Gateway Timeout): The backend is too slow. Increase timeout annotations on the ingress, or investigate application performance.

For networking-specific debugging, see Kubernetes Networking Troubleshooting.

Using Helm for Release Management and Rollbacks

When a Helm release goes wrong, fast rollback is your best friend. Use the Helm CLI Builder to construct these commands interactively.

# Check release status
helm status my-release -n my-namespace

# View release history (find the last working revision)
helm history my-release -n my-namespace

# Rollback to a specific revision
helm rollback my-release 3 -n my-namespace

# Rollback to the previous revision
helm rollback my-release 0 -n my-namespace

# Debug a failed install -- render templates without installing
helm template my-release my-chart/ --debug

# Diff between current and new values (requires helm-diff plugin)
helm diff upgrade my-release my-chart/ -f values.yaml

Best practices for Helm troubleshooting:

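Use --wait (with a sensible --timeout) with helm upgrade so the command fails when resources do not become ready, instead of reporting success as soon as the manifests are submitted; in Helm 3, --atomic also rolls the release back automatically on failure
Keep --history-max set (default 10) so release history does not grow without bound and consume etcd storage
Use helm get values my-release to see the user-supplied values for the release; add --all to include the chart defaults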
Use helm get manifest my-release to see the rendered YAML that Kubernetes received

For a detailed guide, see Helm Charts: Install, Upgrade, and Rollback.

Working with Kubernetes Manifests

YAML errors are a frequent source of deployment failures. A misplaced indent or wrong type can cause cryptic errors.

# Validate YAML before applying
kubectl apply -f deployment.yaml --dry-run=client

# Server-side validation (catches more issues)
kubectl apply -f deployment.yaml --dry-run=server

# Diff against the live cluster state
kubectl diff -f deployment.yaml

Use the YAML/JSON Converter to validate YAML syntax and convert between YAML and JSON when you need to inspect a manifest programmatically.

Decoding Kubernetes Secrets

Kubernetes Secret values are only Base64-encoded in the API, not encrypted, so anyone who can read a Secret can decode it. Decoding them is a common debugging step.

# List secrets in a namespace
kubectl get secrets -n my-namespace

# View secret data (base64-encoded)
kubectl get secret my-secret -n my-namespace -o yaml

# Decode a specific key
kubectl get secret my-secret -n my-namespace -o jsonpath='{.data.password}' | base64 -d

# Decode all keys at once
kubectl get secret my-secret -n my-namespace -o json | jq -r '.data | to_entries[] | "\(.key): \(.value | @base64d)"'

Tip: Paste the entire kubectl get secret -o yaml output into the Base64 Encoder/Decoder with batch mode enabled. It decodes every key-value pair at once.
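
To see that Base64 is an encoding rather than encryption, the sketch below decodes a value without any key. The sample value and the function name decode_secret_value are made up for illustration, and the sketch assumes a base64 binary that accepts -d (GNU coreutils and recent macOS versions do):

```shell
# Decode one Base64 value, as found under .data in a Secret.
# Assumes a base64 binary that accepts -d.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# Sample value only: "aHVudGVyMg==" is the Base64 encoding of "hunter2"
decode_secret_value 'aHVudGVyMg=='
```

For the same reason, treat the output of kubectl get secret -o yaml as sensitive: it contains the plaintext in a trivially reversible form.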

Quick Reference: Troubleshooting Decision Tree

Pod not starting?
├── Status: Pending
│   ├── "Insufficient cpu/memory" → Scale nodes or reduce requests
│   ├── "didn't match Pod's node affinity/selector" (older: "MatchNodeSelector") → Fix nodeSelector or add labels to nodes
│   ├── "had untolerated taint" (older: "PodToleratesNodeTaints") → Add tolerations or remove taints
│   └── PVC Pending → Check StorageClass, PV availability
├── Status: ImagePullBackOff
│   ├── 401 Unauthorized → Create/fix imagePullSecret
│   ├── 404 Not Found → Check image name and tag
│   └── Rate limited → Use registry mirror or authenticated pull
├── Status: CrashLoopBackOff
│   ├── OOMKilled → Increase memory limits
│   ├── Exit code 1 → Check logs (--previous) for application error
│   ├── Exit code 137 → OOM or SIGKILL (check node memory)
│   └── Exit code 127 → Command not found (wrong entrypoint)
├── Status: CreateContainerConfigError
│   └── Missing ConfigMap/Secret → Create the referenced resource
└── Status: Running but not working
    ├── Service has no endpoints → Fix label selector
    ├── Readiness probe failing → Fix probe or application
    └── NetworkPolicy blocking → Check policy rules
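
The exit codes in the tree follow the usual shell convention: a value above 128 means the process was terminated by a signal, and the signal number is the exit code minus 128 (137 is 128 + 9, SIGKILL). A minimal sketch of that mapping follows; the function name explain_exit_code and the wording of its messages are hypothetical:

```shell
# Translate a container exit code into a first hint.
# Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited cleanly" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found (check entrypoint/command)" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error (check logs --previous)"
      fi
      ;;
  esac
}

# Prints: killed by signal 9
explain_exit_code 137

# Usage against a live cluster:
#   explain_exit_code "$(kubectl get pod my-pod -n my-namespace \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```

Signal 9 (SIGKILL) is what the OOM killer sends, and signal 15 (SIGTERM, exit code 143) usually means the container was asked to stop and did so.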

When to Escalate

Not every issue can be resolved at the application level. Escalate to cluster administrators when:

All nodes are NotReady -- likely a control plane or network issue
etcd is unhealthy -- data consistency at risk, do not attempt fixes without backup
Certificate expiration -- API server or kubelet certs expired, cluster is degraded
Cloud provider issues -- load balancer not provisioning, PV stuck in provisioning, IAM permissions
CNI plugin failures -- pod-to-pod networking broken across all pods, not just one application
Resource exhaustion at cluster level -- all nodes maxed out, autoscaler not scaling

Before escalating, gather:

kubectl cluster-info dump (full diagnostic bundle)
Relevant pod descriptions and logs
Timeline of when the issue started
Any recent changes (deployments, config changes, cluster upgrades)

Tools for Your Kubernetes Workflow

kubectl Builder -- Build kubectl commands interactively with 17 actions, 25 resource types, and 15 ready-made recipes
Helm CLI Builder -- Construct Helm commands for install, upgrade, rollback, and debugging
Docker CLI Builder -- Debug container images locally before deploying to Kubernetes
YAML/JSON Converter -- Validate and convert Kubernetes manifests
Base64 Encoder/Decoder -- Decode Kubernetes secrets in batch mode
