Fix CrashLoopBackOff in Kubernetes — Systematic Debugging Guide

The CrashLoopBackOff status is one of the most common and most frustrating Kubernetes errors. It means your container starts, crashes, Kubernetes restarts it, it crashes again, and the restart delay grows exponentially. Understanding the backoff cycle and the most common root causes will help you resolve it in minutes instead of hours.

What CrashLoopBackOff Actually Means

When a container exits, Kubernetes restarts it according to the pod's restartPolicy. Deployments only allow Always, so the container is restarted after any exit, including a clean exit code 0. Each restart adds an increasing delay:

Restart 1: wait 10 seconds
Restart 2: wait 20 seconds
Restart 3: wait 40 seconds
Restart 4: wait 80 seconds
Restart 5: wait 160 seconds
Restart 6+: wait 300 seconds (5 minute cap)

The "BackOff" part is this exponential delay. The pod status shows CrashLoopBackOff when Kubernetes is waiting between restarts. During the brief moments when the container is actually running (and then crashing), the status shows Error or Running.
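The schedule above can be reproduced with a few lines of shell. This is a minimal sketch, assuming the delay starts at 10 seconds, doubles on each restart, and is capped at 300 seconds as listed. `backoff_delay` is a hypothetical helper name, not a kubectl feature.

```shell
# Hypothetical helper: approximate wait before the Nth restart, in seconds.
# Assumes a 10s base delay that doubles each time, capped at 300s.
backoff_delay() (
  restart=$1
  delay=10
  i=1
  while [ "$i" -lt "$restart" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
)
```

For example, `backoff_delay 4` prints 80, and any restart number from 6 upward prints 300.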

# You will see something like this
kubectl get pods -n my-namespace
# NAME          READY   STATUS             RESTARTS      AGE
# my-app-abc    0/1     CrashLoopBackOff   7 (3m ago)    15m

The restart count and time-since-last-restart give you clues. A high restart count (50+) with a recent restart means the container has been crashing for a while. The backoff timer resets after the container runs successfully for 10 minutes.
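To pull just the crash-looping pods and their restart counts out of that listing, a small filter is enough. This sketch assumes the default column layout shown above (NAME, READY, STATUS, RESTARTS). `list_crashlooping` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# "<pod-name> <restart-count>" for every pod whose STATUS is CrashLoopBackOff.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
list_crashlooping() {
  awk '$3 == "CrashLoopBackOff" { print $1, $4 }'
}
```

Usage: `kubectl get pods -n my-namespace --no-headers | list_crashlooping`.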

6 Common Causes and How to Fix Them

1. OOMKilled -- Memory Limit Exceeded

One of the most common causes. The container used more memory than its resources.limits.memory allows, and the kernel's OOM killer terminated it with SIGKILL, which is why the exit code is 137 (128 + 9).

Diagnose:

# Check if the last termination was OOMKilled
kubectl describe pod my-pod -n my-namespace | grep -A3 "Last State"
#     Last State:  Terminated
#       Reason:    OOMKilled
#       Exit Code: 137

# Check current memory limits
kubectl get pod my-pod -n my-namespace -o jsonpath='{.spec.containers[0].resources}' | jq .

Fix:

# Increase the memory limit in your deployment spec
resources:
  requests:
    memory: "256Mi"    # What the scheduler uses for placement
  limits:
    memory: "512Mi"    # Hard ceiling -- OOMKilled if exceeded

# Apply the change
kubectl apply -f deployment.yaml
# Or patch directly
kubectl set resources deployment/my-app -n my-namespace --limits=memory=512Mi

How to find the right limit: Run the application under load and observe actual usage with kubectl top pod (this requires metrics-server in the cluster). Set the limit to 1.5-2x the peak observed usage.
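The 1.5-2x rule of thumb is simple arithmetic. Here is a minimal sketch, assuming you have already read the peak usage in Mi from kubectl top pod. `suggest_memory_limit` is a hypothetical helper name.

```shell
# Hypothetical helper: given peak observed memory in Mi, print the
# suggested limit range (1.5x to 2x the peak) using integer arithmetic.
suggest_memory_limit() (
  peak_mi=$1
  low=$((peak_mi * 3 / 2))
  high=$((peak_mi * 2))
  echo "${low}Mi-${high}Mi"
)
```

For a peak of 300Mi, `suggest_memory_limit 300` prints `450Mi-600Mi`.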

2. Missing ConfigMap or Secret

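The container depends on a ConfigMap or Secret that does not exist, is in the wrong namespace, or is missing a required key. Note that Kubernetes often reports this as CreateContainerConfigError rather than CrashLoopBackOff, because the container cannot be created at all.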

Diagnose:

kubectl describe pod my-pod -n my-namespace | grep -A10 "Events"
# Warning  Failed  ... Error: configmap "my-config" not found
# Warning  Failed  ... Error: secret "my-secret" not found

Fix:

# Check if the configmap/secret exists
kubectl get configmap my-config -n my-namespace
kubectl get secret my-secret -n my-namespace

# If missing, create it
kubectl create configmap my-config --from-file=config.yaml -n my-namespace
kubectl create secret generic my-secret --from-literal=password=mypass -n my-namespace
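The commands above confirm that the resource exists, but not that it contains the key your application expects. The sketch below checks for a single key. It assumes the key name contains no dots (jsonpath would need them escaped) and treats an empty value as missing. `configmap_has_key` and the `DB_HOST` key in the usage line are hypothetical.

```shell
# Hypothetical helper: succeeds only if the key exists with a non-empty
# value in the ConfigMap.
# Arguments: $1 = namespace, $2 = ConfigMap name, $3 = key (no dots).
configmap_has_key() (
  value=$(kubectl get configmap "$2" -n "$1" -o "jsonpath={.data.$3}" 2>/dev/null)
  [ -n "$value" ]
)
```

Usage: `configmap_has_key my-namespace my-config DB_HOST || echo "key missing"`.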

Tip: To inspect the actual values in a secret, use the Base64 Encoder/Decoder with batch mode. Paste the kubectl get secret -o yaml output and it decodes every field.

3. Bad Container Image

The image exists and pulls successfully, but the entrypoint or command is wrong.

Diagnose:

# Check the exit code
kubectl describe pod my-pod -n my-namespace | grep "Exit Code"
# Exit Code 127 = command not found
# Exit Code 126 = permission denied on executable
# Exit Code 1   = application error
# Exit Code 2   = shell builtin misuse

# Check the logs
kubectl logs my-pod -n my-namespace --previous
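The exit codes listed above can be turned into a quick lookup. This is a sketch under the usual shell convention that codes above 128 mean "terminated by signal (code - 128)", so 137 is SIGKILL (9) and 143 is SIGTERM (15). `explain_exit_code` is a hypothetical helper name.

```shell
# Hypothetical helper: translate a container exit code into a short hint.
# Codes above 128 follow the 128 + signal-number convention.
explain_exit_code() (
  code=$1
  case "$code" in
    0)   echo "exited cleanly" ;;
    1)   echo "application error" ;;
    2)   echo "shell builtin misuse" ;;
    126) echo "permission denied on executable" ;;
    127) echo "command not found" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
)
```

For example, `explain_exit_code 137` prints `killed by signal 9`.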

Common issues:

Wrong command or args in the pod spec overriding the image's entrypoint
Image tag pointing to a version with a breaking change
Alpine-based image missing required libraries (glibc vs musl)

Fix:

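# Test the image locally first (override the entrypoint so you get a shell
# even when the image defines its own ENTRYPOINT)
docker run --rm -it --entrypoint /bin/sh my-image:tag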

# Verify the entrypoint
docker inspect my-image:tag --format='{{.Config.Entrypoint}} {{.Config.Cmd}}'

Use the Docker CLI Builder to construct docker run and docker inspect commands with the right flags.

4. Failed Liveness Probe

The liveness probe is configured to check the application's health, but it fails -- so Kubernetes kills and restarts the container.

Diagnose:

kubectl describe pod my-pod -n my-namespace | grep -A5 "Liveness"
# Liveness:    http-get http://:8080/healthz delay=10s timeout=1s period=10s #success=1 #failure=3

kubectl describe pod my-pod -n my-namespace | grep "Unhealthy"
# Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 503

Common fixes:

Increase initialDelaySeconds: application needs more time to start
Increase timeoutSeconds: probe times out before the app responds
Fix the health endpoint: application is returning 5xx on the health check
Check the port: liveness probe port does not match the application's listening port

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30    # Give the app time to start
  periodSeconds: 10
  timeoutSeconds: 5          # Allow slow responses
  failureThreshold: 3        # Fail 3 times before restarting

5. Permission Denied

The container process cannot access files, sockets, or ports due to SecurityContext restrictions or file permission mismatches.

Diagnose:

kubectl logs my-pod -n my-namespace --previous
# Look for: "Permission denied", "EACCES", "Operation not permitted"

# Check the security context
kubectl get pod my-pod -n my-namespace -o jsonpath='{.spec.containers[0].securityContext}' | jq .
kubectl get pod my-pod -n my-namespace -o jsonpath='{.spec.securityContext}' | jq .

Common causes:

runAsNonRoot: true but the image's entrypoint requires root
readOnlyRootFilesystem: true but the app writes to the filesystem
Volume mounted with wrong fsGroup -- files not readable by the container user
Port below 1024 requires root (or NET_BIND_SERVICE capability)

Fix:

# Pod-level securityContext (spec.securityContext); fsGroup is not available per container
securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000              # Set group ownership on mounted volumes

6. Missing Dependency

The container starts but immediately exits because it cannot reach a required service -- database, message queue, external API.

Diagnose:

kubectl logs my-pod -n my-namespace --previous
# Look for: "connection refused", "ECONNREFUSED", "no such host", "timeout"

Fix: Use init containers to wait for dependencies before starting the main container:

initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z postgres-service 5432; do echo waiting for db; sleep 2; done']

Or use readiness probes on the dependency's own pods so that its Service only routes traffic to backends that are ready.

Step-by-Step Debugging Workflow

When you encounter CrashLoopBackOff, follow this sequence:

# Step 1: Get the pod status and restart count
kubectl get pods -n my-namespace | grep my-app

# Step 2: Check events for scheduling or config errors
kubectl describe pod my-pod -n my-namespace | tail -20

# Step 3: Check the previous crash logs
kubectl logs my-pod -n my-namespace --previous

# Step 4: If no --previous logs, the container is dying too fast
# Check the exit code:
kubectl get pod my-pod -n my-namespace -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Step 5: Check cluster events for context
kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp | tail -20

# Step 6: If you can catch it while running, exec in
kubectl exec -it my-pod -n my-namespace -- /bin/sh
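The non-interactive steps in this sequence can be bundled into one function so that you capture everything in a single pass. This is a minimal sketch that reuses only the commands shown above, plus the standard `--tail` flag of kubectl logs. `crashloop_triage` is a hypothetical name, and it assumes a single-container pod (index 0).

```shell
# Hypothetical helper: run the read-only triage steps for one pod.
# Arguments: $1 = namespace, $2 = pod name. Assumes a single container.
crashloop_triage() (
  ns=$1
  pod=$2
  echo "== Status and restart count =="
  kubectl get pod "$pod" -n "$ns"
  echo "== Last exit code =="
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
  echo
  echo "== Previous crash logs (last 50 lines) =="
  kubectl logs "$pod" -n "$ns" --previous --tail=50 \
    || echo "(no previous logs available)"
  echo "== Recent namespace events =="
  kubectl get events -n "$ns" --sort-by=.metadata.creationTimestamp | tail -20
)
```

Usage: `crashloop_triage my-namespace my-pod > triage.txt` saves the output so you can attach it to an incident ticket.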

Tip: Build these commands quickly using the kubectl Builder -- select the action (logs, describe, get, exec), set the namespace and resource name, and copy the result.

Prevention Strategies

Set Proper Resource Limits

resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"        # Optional -- some teams skip CPU limits
    memory: "256Mi"    # Always set memory limits

Use Readiness and Liveness Probes Correctly

Readiness probe: controls whether the pod receives traffic. Failing readiness does NOT restart the container.
Liveness probe: controls whether the container is restarted. Only use it for situations where restarting actually helps.
Startup probe: gives slow-starting containers extra time. Replaces overly generous initialDelaySeconds.

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10          # Up to 300s to start
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
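The startup budget in the example above is failureThreshold x periodSeconds (30 x 10 = 300 seconds). Going the other way, the sketch below computes the failureThreshold needed for a given worst-case startup time, rounding up. `startup_failure_threshold` is a hypothetical helper name.

```shell
# Hypothetical helper: smallest failureThreshold such that
# failureThreshold * periodSeconds covers the worst-case startup time.
# Arguments: $1 = worst-case startup seconds, $2 = periodSeconds.
startup_failure_threshold() (
  startup_seconds=$1
  period_seconds=$2
  echo $(( (startup_seconds + period_seconds - 1) / period_seconds ))
)
```

For example, `startup_failure_threshold 300 10` prints 30, matching the probe above, and `startup_failure_threshold 95 10` prints 10.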

Use Init Containers for Dependencies

Do not let your main container crash because a database is not ready. Use init containers to wait for dependencies, keeping the main container's restart count clean.

Roll Back with Helm

When a new release causes CrashLoopBackOff, roll back immediately. Revision 0 (or omitting the revision) rolls back to the previous release:

helm rollback my-release 0 -n my-namespace

Use the Helm CLI Builder to construct rollback commands with the right flags.

Quick Reference Table

Cause	Exit Code	Key Symptom	Fix
OOMKilled	137	`Reason: OOMKilled` in describe	Increase memory limit
Missing ConfigMap/Secret	--	`CreateContainerConfigError`	Create the missing resource
Bad image/entrypoint	127 or 126	`command not found` in logs	Fix command/args or rebuild image
Failed liveness probe	137 or 143	`Unhealthy` events	Increase delay/timeout or fix health endpoint
Permission denied	1 or 126	`EACCES` in logs	Fix securityContext or file permissions
Missing dependency	1	`connection refused` in logs	Add init containers or retry logic

Next Steps

Kubernetes Troubleshooting Guide -- The complete debugging framework for all Kubernetes issues
kubectl Builder -- Build the debugging commands used in this article interactively
Pod Stuck in Pending -- If your pod never reaches the starting phase at all
