🛑 The Troubleshooting Playbook: Resolving Common Kubernetes Node Failures


Kubernetes is the engine of modern cloud infrastructure, but even the best engines sometimes sputter. When a node (the worker machine running your containers) fails, your pods get evicted, and your application availability plummets.

Mastering node troubleshooting means quickly identifying the issue behind the cryptic status messages. Here is a playbook covering the most common Kubernetes node errors, their root causes, and practical, production-level resolutions.


1. Node Status: NotReady

This is the most common and broadest node error. When a node is marked NotReady, the Kubernetes control plane (specifically the Controller Manager) hasn’t received a heartbeat from the node’s Kubelet agent for a certain duration (the default grace period is around 40 seconds). The scheduler will not place any new pods on this node.

The Error Message

When you run kubectl get nodes, you see:

NAME          STATUS     ROLES    AGE   VERSION
node-worker-1 NotReady   <none>   3d    v1.28.3
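
To see exactly when the control plane last heard from this Kubelet, you can read the Ready condition’s heartbeat timestamp directly. A minimal check (swap in your own node name):

kubectl get node node-worker-1 -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}'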

🔍 Root Causes & Resolutions

A. Kubelet Failure (the node’s primary agent has crashed or stopped)
  • Diagnostic check: SSH into the node and check the Kubelet status: sudo systemctl status kubelet or journalctl -u kubelet -f.
  • Resolution: Restart the Kubelet: sudo systemctl restart kubelet. If it keeps crashing, review the Kubelet logs for misconfiguration or OOM issues.

B. Resource Pressure (the node is running critically low on disk, memory, or process IDs)
  • Diagnostic check: Check node conditions: kubectl describe node <node-name>. Look for DiskPressure: True or MemoryPressure: True.
  • Resolution: Free resources by deleting unused images and containers, adjust resource requests/limits on existing Pods, or scale up the node size.

C. Network Connectivity (the node cannot communicate with the API server)
  • Diagnostic check: Ping the API server IP from the node: ping <api-server-ip>. Check firewall rules/Security Groups.
  • Resolution: Ensure the Container Network Interface (CNI) plugin (e.g., Calico, Flannel) is running correctly and that firewall rules are not blocking traffic to port 6443 (the API server default).

D. kube-proxy Failure (the network routing agent is down, disrupting inter-pod communication)
  • Diagnostic check: Check the status of the kube-proxy pod in the kube-system namespace.
  • Resolution: If the pod is crashing, check its logs: kubectl logs -n kube-system <kube-proxy-pod>. Restarting the pod (by deleting it) usually resolves temporary issues.
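
When the Kubelet or disk pressure is the suspect, a short on-node pass usually narrows it down. A minimal sketch, assuming a systemd-based host running containerd (paths and service names differ for Docker or other runtimes):

sudo systemctl status kubelet                                 # is the Kubelet service running?
sudo journalctl -u kubelet --since "15 min ago" --no-pager    # recent Kubelet errors (OOM, config, certs)
df -h /var/lib/containerd /var/lib/kubelet                    # disk pressure is a frequent culprit
sudo crictl rmi --prune                                       # remove unused images to reclaim space
sudo systemctl restart kubelet                                # restart once the underlying cause is fixed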


2. Node Condition: NetworkUnavailable

While related to NotReady, this condition is a specific signal that the network configuration on the node is incomplete or broken, preventing Pods from getting reachable IP addresses.

The Error Message

When you run kubectl get node <node-name> -o yaml (or check the Conditions section of kubectl describe node <node-name>), you see:

- type: NetworkUnavailable
  status: "True"
  reason: CalicoIsDown
  message: Calico is not running on this node.
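
To list every condition currently firing on the node without scrolling through the full output, kubectl’s JSONPath support helps. A minimal sketch:

kubectl get node <node-name> -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}{": "}{.reason}{"\n"}{end}'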

🔍 Root Cause & Resolution

CNI Plugin Error (the Container Network Interface (CNI) DaemonSet failed to provision or configure the network layer)
  • Diagnostic check: Check CNI pod logs: kubectl logs -n kube-system <cni-pod-name>. Check host logs for CNI service failures.
  • Resolution: Inspect the CNI config: verify the CNI manifest (YAML) is correctly deployed, especially the IP range configuration. Restart: delete the CNI pod to force a restart by the DaemonSet.
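
For the common case where the CNI DaemonSet pod on the node is at fault, the checks look roughly like this. The sketch assumes Calico (the k8s-app=calico-node label and pod names will differ for Flannel, Cilium, and others):

kubectl get pods -n kube-system -l k8s-app=calico-node -o wide   # find the CNI pod running on the affected node
kubectl logs -n kube-system <cni-pod-name> --previous            # logs from the last crashed container, if any
kubectl delete pod -n kube-system <cni-pod-name>                 # the DaemonSet recreates it, forcing a clean restart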


3. Container Runtime Error: PLEG is not healthy

This error often appears in the Kubelet logs and is a critical failure that directly leads to the NotReady status. PLEG (Pod Lifecycle Event Generator) is Kubelet’s mechanism for monitoring container changes. When it’s unhealthy, Kubelet can’t reliably report on the state of its running Pods.

The Error Message

Found in Kubelet logs (journalctl -u kubelet):

"PLEG is not healthy: PLEG is not running any longer"

or

"Failed to find network plugin 'cni' in path"

🔍 Root Cause & Resolution

Container Runtime Issues (the runtime, e.g., containerd or Docker, has crashed, hung, or is starved of resources)
  • Diagnostic check: SSH into the node and check the container runtime service: sudo systemctl status containerd (or docker). Check the runtime logs.
  • Resolution: Restart the runtime: sudo systemctl restart containerd. If disk pressure is the issue, clear logs/images to free up space; a full disk often makes the runtime unresponsive.

CRI Connection Failure (the Kubelet cannot communicate with the runtime via its CRI socket)
  • Diagnostic check: Verify the socket path the Kubelet is using (e.g., /var/run/containerd/containerd.sock).
  • Resolution: Ensure the Kubelet’s container runtime endpoint (the --container-runtime-endpoint flag or the containerRuntimeEndpoint field in the Kubelet configuration) points to the correct socket path.
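
On the node itself, the runtime checks typically look like the following. This is a sketch assuming containerd with its default socket and data directory; substitute the Docker service and socket if that is your runtime:

sudo systemctl status containerd                                               # is the runtime service up?
sudo journalctl -u containerd --since "30 min ago" --no-pager                  # recent runtime errors
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a    # can the CRI socket be reached?
df -h /var/lib/containerd                                                      # a full disk often hangs the runtime
sudo systemctl restart containerd && sudo systemctl restart kubelet            # restart runtime, then Kubelet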


4. Eviction Error: OOMKilled (Out-Of-Memory)

While OOMKilled is technically a Pod failure rather than a node error, it is often a symptom of poor resource management that leads to node instability. When a container exceeds its memory limit, the kernel’s Out-Of-Memory (OOM) killer steps in, resulting in an OOMKilled event.

The Error Message

Found by describing the Pod: kubectl describe pod <pod-name>

State:          Terminated
  Reason:       OOMKilled
  Exit Code:    137
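
If the container has already restarted, the OOMKilled reason moves from State to Last State. A quick way to pull it without scanning the full describe output (assumes a single-container Pod):

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'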

🔍 Root Cause & Resolution

Insufficient Memory Limits (the application actually needs more memory than is defined in the Pod spec)
  • Diagnostic check: Check the container’s memory usage history via monitoring tools. Review the resource limits in the Pod YAML.
  • Resolution: Increase limits: adjust resources.limits.memory in the Pod spec. Optimize the application: identify and fix memory leaks or inefficient code.

Node-level Resource Starvation (too many Pods are requesting resources, leading to overallocation on the node)
  • Diagnostic check: Check node resource usage: kubectl top node <node-name>.
  • Resolution: Set requests: ensure all Pods have explicit resource requests (requests.memory), allowing the scheduler to make better placement decisions and prevent resource contention.
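
Putting those fixes into commands, a minimal sketch might look like this (the memory values and the deployment name are illustrative only; size them from observed usage, not guesswork):

kubectl top pod <pod-name> --containers                      # current usage per container; requires metrics-server
kubectl set resources deployment <deployment-name> \
  --requests=memory=256Mi --limits=memory=512Mi              # example values only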


🛠️ The Node Troubleshooting Workflow

When a node goes NotReady, follow this simple, escalating process:

  • Check Status: kubectl get nodes
  • Gather Details: kubectl describe node <node-name> (Look at Conditions and Events.)
  • Check Control Plane Connection: Try running a simple command against the API server from the control plane machine.
  • SSH and Check Kubelet: SSH into the problematic node and check the Kubelet status and logs: sudo systemctl status kubelet and journalctl -u kubelet -f.
  • Check Container Runtime: Check the container runtime status (e.g., containerd/Docker).
  • Resolve and Restart: Apply the fix (e.g., clear disk space, adjust config, restart service).
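
Condensed into commands, that pass looks roughly like this (containerd assumed as the runtime; adjust service names for Docker):

kubectl get nodes
kubectl describe node <node-name> | grep -A 12 Conditions
kubectl get --raw='/readyz?verbose' | head                                    # API server health, run from the control plane
ssh <node-name>
sudo systemctl status kubelet && sudo journalctl -u kubelet -n 100 --no-pager
sudo systemctl status containerd
sudo systemctl restart containerd && sudo systemctl restart kubelet           # after applying the fix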

By systematically checking Kubelet health, resource conditions, and network connectivity, you can cut through the noise and restore your nodes to a healthy, Ready state.
