A DevOps guide to troubleshooting the errors that keep you up at night, from CrashLoopBackOff to networking black holes.
Kubernetes is the undisputed king of container orchestration. It’s powerful, scalable, and the de facto standard for modern cloud-native applications. But with great power comes great complexity.
If you’re in DevOps, you know that a significant part of your job is playing detective. When a pod turns red, a deployment gets stuck, or services can’t talk to each other, you’re on the clock. The good news is that most K8s issues fall into a few common categories.
Let’s walk through the top 5 technical issues you’ll face in Kubernetes and the practical, command-line solutions to fix them.
1. The Dreaded: CrashLoopBackOff
This is the most infamous error in the Kubernetes world. It means a pod starts, its container(s) run, and then they immediately crash. Kubernetes tries to be helpful by restarting it, but it just crashes again. And again. And again.
Why it happens:
- Application Error: Your app is panicking on startup. This is the most common reason — a null pointer, a bad config file read, a failed database connection.
- Misconfigured Probes: Your Liveness Probe is failing. Maybe the app is too slow to start, and the probe kills it before it’s ready.
- Missing Dependencies: The container is trying to access a file, config map, or secret that doesn’t exist or isn’t mounted.
- Resource Exhaustion: The container is hitting its memory limit and being OOMKilled (Out of Memory Killed) by the system.
How to fix it:
Step 1: Check the logs. This is your primary weapon. You need to see the application’s output from the last container execution (the one that crashed).
# Get the logs from the pod
kubectl logs <pod-name>
# If the pod is crashing, you need the logs from the *previous* failed container
kubectl logs <pod-name> --previous
Step 2: Describe the pod. This tells you why Kubernetes is doing what it’s doing. Look at the “Events” section at the bottom.
kubectl describe pod <pod-name>
You’ll often see messages like Back-off restarting failed container and, more importantly, you might see the Exit Code of the container. For example, Exit Code 137 almost always means it was OOMKilled.
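If you just want the exit code without scanning the full describe output, you can pull it straight from the pod status with jsonpath (a quick sketch; the pod name is a placeholder and the first container is assumed):
# Print the exit code from the last terminated state of the first container
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'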
Step 3: Check your probes. If logs are clean but the pod is restarting, check your Liveness Probe.
# A common mistake
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5 # <-- This might be too low!
  periodSeconds: 10
- Solution: Increase the initialDelaySeconds to give your app time to boot.
- Better Solution: For slow-starting apps, use a Startup Probe, which runs only at the beginning and has a more generous timeout. Once it passes, the Liveness Probe takes over (see the sketch below).
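A minimal sketch of that combination, assuming the same /healthz endpoint and port as above (tune the thresholds to your app's real startup time):
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30  # allow up to 30 * 10s = 5 minutes to start
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10     # only starts running once the startup probe has passed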
2. The Waiting Game: Pending Pods
You run kubectl apply, and your new pod just sits there, taunting you with the Pending status. It’s not running, it’s not crashing—it’s just… waiting.
Why it happens:
- Insufficient Resources: The cluster is full! The scheduler can’t find a node that has enough free CPU or memory to run your pod.
- Node Taints/Tolerations: Your pod doesn’t “tolerate” a “taint” on the available nodes (e.g., you’re trying to schedule on a control-plane node).
- Failed Volume Mount: The pod is trying to mount a Persistent Volume (PV) that isn’t available or is already in use (e.g., a “ReadWriteOnce” PV already attached to a different node).
How to fix it:
Again, kubectl describe is your best friend.
kubectl describe pod <pod-name>
Check the Events section. You will almost always see a clear message from the scheduler:
- The Fix (Resources): If you see 0/3 nodes are available: 3 Insufficient cpu, you have your answer. You either need to add more nodes (Cluster Autoscaler) or, more likely, you need to adjust your pod’s resources.requests.
- The Fix (Taints): If you see 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate, you need to either add a toleration to your pod’s spec or remove the taint from the node (a sketch of both fixes follows this list).
- The Fix (Volumes): If you see FailedMount or PersistentVolumeClaim not found, you need to debug your storage. Check that the PVC exists and is in the same namespace.
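As a rough sketch, the resource and taint fixes both live in the pod spec (the numbers and the taint key/value are placeholders for your own setup):
spec:
  containers:
  - name: my-app
    image: my-app:1.0.0
    resources:
      requests:
        cpu: "250m"     # lower this if the scheduler can't find room
        memory: "256Mi"
  tolerations:
  - key: "dedicated"    # placeholder: must match the taint key on the node
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"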
3. The Typo’s Revenge: ImagePullBackOff / ErrImagePull
This is a close cousin to CrashLoopBackOff, but it’s much simpler: Kubernetes can’t even pull the image it needs to start your container.
Why it happens:
- Typo: You spelled the image name or tag wrong (e.g., my-app:ltest instead of my-app:latest).
- Private Registry: The image is in a private registry (like Docker Hub private, ECR, GCR, etc.), and Kubernetes doesn’t have the credentials.
- Invalid Tag: The tag you specified simply doesn’t exist.
How to fix it:
Step 1: Describe the pod.
kubectl describe pod <pod-name>
In the Events section, you’ll see the exact error: Failed to pull image "my-repo/my-app:wrngtag": rpc error: code = Unknown desc = ... not found.
Step 2: Fix the cause.
- Typo: Correct the image name/tag in your Deployment YAML and re-apply.
- Private Registry: You need to create an imagePullSecret. This is a Kubernetes secret that holds your registry credentials.
# 1. Create the secret (example for Docker Hub)
kubectl create secret docker-registry regcred \
--docker-server=https://index.docker.io/v1/ \
--docker-username=<your-username> \
--docker-password=<your-password> \
--namespace=<your-namespace>
# 2. Add the secret to your pod/deployment spec
spec:
  containers:
  - name: my-app
    image: my-private-repo/my-app:1.0.0
  imagePullSecrets:
  - name: regcred # <-- This tells K8s to use your secret
4. The Resource Hog: OOMKilled (and Missing Limits)
Your pod runs for a while… then suddenly disappears and is replaced by a new one. No CrashLoopBackOff, just a silent restart. If you describe the pod, you may see the Last State was OOMKilled.
This means your application’s memory usage spiked, exceeded its allowed limit, and the node’s kernel terminated it to protect itself.
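A quick way to confirm it (a sketch; the pod name is a placeholder):
# Look for "Reason: OOMKilled" and "Exit Code: 137" in the last termination state
kubectl describe pod <pod-name> | grep -A 5 "Last State"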
Why it happens:
- No Limits Set: You didn’t specify a memory limit in your manifest. This is dangerous! Your pod could try to use all the memory on the node.
- Memory Leak: Your application has a memory leak.
- Under-provisioned Limit: You set a limit, but it’s just too low for your app’s normal operation (e.g., a Java app with a 100Mi limit).
How to fix it:
Always set requests and limits! This is the single most important thing you can do for cluster stability.
- Requests: The amount of resources K8s guarantees for your pod. This is used for scheduling.
- Limits: The maximum amount of resources your pod is allowed to use.
spec:
  containers:
  - name: my-app
    image: my-app:1.0.0
    resources:
      requests:
        memory: "256Mi" # <-- Guarantee this much
        cpu: "250m"     # (0.25 of a core)
      limits:
        memory: "512Mi" # <-- Kill if it uses more than this
        cpu: "500m"     # (0.5 of a core)
By setting a limit, you are making the OOMKill predictable. Instead of a random pod dying, your pod will be terminated, and you’ll know why. Your job is then to either optimize your application (fix the leak) or increase the limit to a reasonable-but-safe value.
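To pick a sensible limit, watch the pod’s real memory usage over time. If your cluster has the metrics-server installed, kubectl top gives you a quick read (a sketch, not a replacement for proper monitoring):
# Current CPU and memory usage for the pod (requires metrics-server)
kubectl top pod <pod-name>
# Break it down per container inside the pod
kubectl top pod <pod-name> --containers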
5. The Black Hole: Networking & Service Discovery
This is the most frustrating category. Your pods are all Running. Your app logs look fine. But your microservices can’t talk to each other. Your frontend can’t reach your backend.
Why it happens:
- Service Selector Mismatch: Your Service object isn’t “selecting” any pods. This is the #1 cause.
- Wrong targetPort: Your Service is forwarding traffic to the wrong port on the pod.
- Network Policy: A NetworkPolicy is in place that blocks the traffic (for example, a default-deny policy with no rule allowing it).
- DNS Issues: The cluster’s internal DNS (CoreDNS) is having problems.
How to fix it:
Step 1: Check the Service and Endpoints.
First, check your Service and make sure its selector matches the labels on your pods.
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-backend-svc
spec:
  selector:
    app: my-backend # <-- This MUST match the pod's label
  ports:
  - protocol: TCP
    port: 80         # Port the Service exposes
    targetPort: 8080 # <-- Port the container is *actually* listening on
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-backend-deployment
spec:
  selector:
    matchLabels:
      app: my-backend # <-- Required in apps/v1; must match the template labels
  template:
    metadata:
      labels:
        app: my-backend # <-- This label
    spec:
      containers:
      - name: my-backend
        image: my-backend-image
        ports:
        - containerPort: 8080 # <-- This port
Step 2: Check the Endpoints object.
Kubernetes automatically creates an Endpoints object for every Service that has a selector. If the selector is correct, this object will list the IPs of the matching Running pods.
kubectl get endpoints my-backend-svc
- If you see IPs listed: Your Service is configured correctly. The problem is likely a Network Policy or something inside your app.
- If you see <none>: Your selector is wrong! Go back to Step 1 and fix your labels (the commands below make the comparison easy).
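A quick way to spot a mismatch is to put the Service’s selector next to the pod labels (standard kubectl output options, nothing exotic):
# Show the selector the Service is using (SELECTOR column)
kubectl get svc my-backend-svc -o wide
# Show the labels on your pods and compare
kubectl get pods --show-labels
# Or filter by the exact selector; no results means no pods match
kubectl get pods -l app=my-backend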
Step 3: Test from inside the cluster.
Use kubectl exec to get a shell inside a running pod (like your frontend pod) and test connectivity from there.
# Get a shell in your frontend-pod
kubectl exec -it <frontend-pod-name> -- /bin/sh
# Once inside, test the Service name. K8s DNS should resolve it.
# Use curl or wget (whichever your container has)
curl http://my-backend-svc
If this curl fails, you know you have a core networking problem. If it succeeds, but your app still can’t connect, the problem is in your application’s code (e.g., a bad connection string).
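Two quick follow-up checks narrow it down further: confirm that cluster DNS resolves the Service name, and look for NetworkPolicies that might be dropping the traffic (a sketch; nslookup is only there if your image ships it, e.g. busybox-based images):
# Still inside the frontend pod: does cluster DNS resolve the Service?
nslookup my-backend-svc
# Back on your machine: any NetworkPolicies that could block the traffic?
kubectl get networkpolicy --all-namespaces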
Final Thoughts
Kubernetes troubleshooting is a skill built on experience. The key is to learn the “Kubernetes way” of thinking:
- Start with kubectl get pods.
- Move to kubectl describe pod to read the Events.
- Check kubectl logs for application errors.
- For networking, check Service, Endpoints, and NetworkPolicy.
By mastering these few commands, you can solve 90% of the daily issues you’ll encounter and keep your cluster healthy.