Introduction
Kubernetes troubleshooting can be challenging due to its distributed nature and multiple abstraction layers. This guide covers the most common issues and systematic approaches to diagnosing and fixing them.
Pod Crash Loops
Understanding CrashLoopBackOff
What it means: The pod starts, crashes, restarts, and repeats in an exponential backoff pattern.
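You can watch the backoff in action: the restart count climbs while the delay between restart attempts doubles, up to a five-minute cap (the pod name below is illustrative):
# Watch status and RESTARTS change in real time
kubectl get pod myapp-7d8f9c6b5-xyz12 -n production -w
# Inspect the current container state (waiting reason, last exit code)
kubectl get pod myapp-7d8f9c6b5-xyz12 -n production \
  -o jsonpath='{.status.containerStatuses[0].state}{"\n"}{.status.containerStatuses[0].lastState}'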
Diagnostic Process
Step 1: Check pod status
kubectl get pods -n production
# Output:
# NAME READY STATUS RESTARTS AGE
# myapp-7d8f9c6b5-xyz12 0/1 CrashLoopBackOff 5 10m
Step 2: Describe the pod
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Look for:
# - Last State: Terminated (exit code)
# - Events: Recent errors or warnings
Step 3: Check logs
# Current logs (if pod is running)
kubectl logs myapp-7d8f9c6b5-xyz12 -n production
# Previous logs (after crash)
kubectl logs myapp-7d8f9c6b5-xyz12 -n production --previous
# Follow logs in real-time
kubectl logs myapp-7d8f9c6b5-xyz12 -n production -f
# Multiple containers in pod
kubectl logs myapp-7d8f9c6b5-xyz12 -n production -c container-name
Common Causes and Solutions
1. Application Error (Exit Code 1)
Symptoms:
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Last State: Terminated
# Exit Code: 1
Check logs:
kubectl logs myapp-7d8f9c6b5-xyz12 -n production --previous
# Common errors:
# - Uncaught exceptions
# - Missing environment variables
# - Failed database connections
# - Configuration errors
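If a missing environment variable is the suspect, confirm what the container actually receives before changing anything (deployment name myapp is assumed here):
# Env vars defined on the deployment
kubectl set env deployment/myapp -n production --list
# Env vars visible inside the running container
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- env | sort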
Solutions:
Missing environment variable:
apiVersion: v1
kind: Pod
metadata:
name: myapp
spec:
containers:
- name: myapp
image: myapp:latest
env:
- name: DATABASE_URL
value: "postgres://db:5432/mydb" # Add missing config
Failed health checks causing restart:
# Adjust probe timing
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60 # Increase if app takes time to start
periodSeconds: 10
failureThreshold: 3 # Allow more failures before restart
2. OOMKilled (Out of Memory)
Symptoms:
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Check resource usage:
# Current usage
kubectl top pod myapp-7d8f9c6b5-xyz12 -n production
# Output:
# NAME CPU(cores) MEMORY(bytes)
# myapp-7d8f9c6b5-xyz12 100m 512Mi
Solutions:
Increase memory limits:
apiVersion: v1
kind: Pod
metadata:
name: myapp
spec:
containers:
- name: myapp
image: myapp:latest
resources:
requests:
memory: "256Mi" # Guaranteed memory
limits:
memory: "1Gi" # Increased from 512Mi
Investigate memory leaks:
# Get heap dump (Java example)
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
jmap -dump:format=b,file=/tmp/heap.bin 1
# Copy dump locally
kubectl cp production/myapp-7d8f9c6b5-xyz12:/tmp/heap.bin ./heap.bin
# Analyze with tools like Eclipse MAT
3. Image Pull Errors
Symptoms:
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# Failed to pull image "myapp:v2.3.0": rpc error: code = Unknown
Common causes:
ImagePullBackOff - Wrong image name:
# Fix typo or version
spec:
containers:
- name: myapp
image: myregistry.com/myapp:v2.3.0 # Correct path
Authentication required:
# Create secret for private registry
kubectl create secret docker-registry regcred \
--docker-server=myregistry.com \
--docker-username=myuser \
--docker-password=mypassword \
--docker-email=[email protected] \
-n production
# Use secret in pod
kubectl patch serviceaccount default -n production \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'
Or in pod spec:
spec:
imagePullSecrets:
- name: regcred
containers:
- name: myapp
image: myregistry.com/myapp:v2.3.0
4. Failed Liveness/Readiness Probes
Symptoms:
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# Liveness probe failed: HTTP probe failed with statuscode: 500
# Killing container myapp
Debug probes:
# Test health endpoint manually
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
curl -v http://localhost:8080/health
# Check if app is ready
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
curl http://localhost:8080/ready
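It also helps to confirm which probe settings the pod is actually running with (first container assumed):
kubectl get pod myapp-7d8f9c6b5-xyz12 -n production \
  -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].readinessProbe}{"\n"}'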
Solutions:
Adjust probe configuration:
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30 # App startup time
periodSeconds: 10 # Check every 10s
timeoutSeconds: 5 # Wait 5s for response
failureThreshold: 3 # 3 failures before restart
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10 # Can be lower than liveness
periodSeconds: 5 # Check more frequently
failureThreshold: 2 # Remove from service faster
Separate liveness and readiness:
# Liveness: Is app alive? (restart if fails)
livenessProbe:
httpGet:
path: /health
port: 8080
# Readiness: Can app serve traffic? (remove from service if fails)
readinessProbe:
httpGet:
path: /ready
port: 8080
5. Missing ConfigMaps or Secrets
Symptoms:
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# Error: configmap "app-config" not found
Check dependencies:
# List configmaps
kubectl get configmap -n production
# List secrets
kubectl get secrets -n production
# Check specific configmap
kubectl get configmap app-config -n production -o yaml
Create missing resources:
# Create configmap
kubectl create configmap app-config \
--from-file=config.yaml \
-n production
# Create secret
kubectl create secret generic app-secret \
--from-literal=api-key=abc123 \
-n production
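If the ConfigMap and Secret exist but the pod still fails, check how the pod references them. A typical wiring, matching the resources created above (mount path and key names are illustrative), looks like:
spec:
  containers:
  - name: myapp
    image: myapp:latest
    env:
    - name: API_KEY
      valueFrom:
        secretKeyRef:
          name: app-secret
          key: api-key
    volumeMounts:
    - name: config
      mountPath: /etc/app
  volumes:
  - name: config
    configMap:
      name: app-config   # config.yaml appears as /etc/app/config.yaml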
Networking Issues
Pod-to-Pod Communication Problems
Diagnostic Commands
Step 1: Verify DNS resolution
# Get pod IPs
kubectl get pods -n production -o wide
# Test DNS from within a pod
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
nslookup kubernetes.default
# Test service DNS
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
nslookup myservice.production.svc.cluster.local
Step 2: Test connectivity
# Ping another pod (if ICMP allowed)
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
ping 10.244.1.5
# Test service endpoint
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
curl -v http://myservice:8080/health
# Test with wget
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
wget -O- http://myservice:8080/health
Step 3: Check network policies
# List network policies
kubectl get networkpolicy -n production
# Describe specific policy
kubectl describe networkpolicy allow-frontend -n production
Common Networking Issues
1. Service Not Reachable
Check service configuration:
# Get service details
kubectl get svc myservice -n production
# Describe service
kubectl describe svc myservice -n production
# Check endpoints
kubectl get endpoints myservice -n production
Verify selectors match:
# Service selector
kubectl get svc myservice -n production -o yaml | grep -A 5 selector
# Pod labels
kubectl get pods -n production --show-labels
Fix selector mismatch:
# Service
apiVersion: v1
kind: Service
metadata:
name: myservice
spec:
selector:
app: myapp # Must match pod labels
ports:
- port: 8080
targetPort: 8080
---
# Pod
apiVersion: v1
kind: Pod
metadata:
name: myapp
labels:
app: myapp # Matches service selector
spec:
containers:
- name: myapp
ports:
- containerPort: 8080
2. DNS Resolution Failures
Check CoreDNS:
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check CoreDNS config
kubectl get configmap coredns -n kube-system -o yaml
Test DNS from debug pod:
# Run debug pod
kubectl run debug --image=nicolaka/netshoot -i --tty --rm
# Inside debug pod:
nslookup kubernetes.default
nslookup myservice.production.svc.cluster.local
dig myservice.production.svc.cluster.local
# Check /etc/resolv.conf
cat /etc/resolv.conf
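In a healthy pod, resolv.conf points at the cluster DNS service IP and carries the cluster search domains. The exact values vary per cluster; the following is only a typical example:
nameserver 10.96.0.10
search production.svc.cluster.local svc.cluster.local cluster.local
options ndots:5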
Fix DNS issues:
# Ensure pod uses cluster DNS
apiVersion: v1
kind: Pod
metadata:
name: myapp
spec:
dnsPolicy: ClusterFirst # Use cluster DNS
containers:
- name: myapp
image: myapp:latest
3. Network Policy Blocking Traffic
Check if policies exist:
kubectl get networkpolicy -n production
# If policies exist, they might be blocking traffic
Allow pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-within-namespace
namespace: production
spec:
podSelector: {} # All pods in namespace
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector: {} # From any pod in namespace
egress:
- to:
- podSelector: {} # To any pod in namespace
Allow specific ingress:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-frontend-to-backend
namespace: production
spec:
podSelector:
matchLabels:
app: backend
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
4. External Service Access Issues
Test external connectivity:
# Test from pod
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
curl -v https://api.external.com
# Test DNS resolution
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
nslookup api.external.com
Allow egress to external services:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}   # Keep all in-cluster egress allowed
  - to:                       # External traffic: pod/namespace selectors only match
    - ipBlock:                # cluster pods, so external endpoints need an ipBlock
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 80
  - to:                       # DNS: kube-dns pods in any namespace
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
Resource Limit Issues
Understanding Requests and Limits
Requests: Guaranteed resources (used for scheduling)
Limits: Maximum resources allowed (enforced at runtime)
resources:
requests:
memory: "256Mi" # Guaranteed
cpu: "250m" # Guaranteed
limits:
memory: "512Mi" # Max allowed
cpu: "500m" # Max allowed (throttled if exceeded)
Diagnosing Resource Issues
Check resource usage:
# Current usage for all pods
kubectl top pods -n production
# Current usage for nodes
kubectl top nodes
# Detailed pod resource info
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production | grep -A 10 "Limits\|Requests"
Check resource quotas:
# List quotas
kubectl get resourcequota -n production
# Describe quota
kubectl describe resourcequota production-quota -n production
Common Resource Problems
1. Pod Stuck in Pending (Insufficient Resources)
Symptoms:
kubectl get pods -n production
# NAME READY STATUS RESTARTS AGE
# myapp-7d8f9c6b5-xyz12 0/1 Pending 0 5m
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# 0/5 nodes available: insufficient memory
Solutions:
Scale down resource requests:
resources:
requests:
memory: "128Mi" # Reduced from 512Mi
cpu: "100m" # Reduced from 500m
Add more nodes:
# AWS EKS example
eksctl scale nodegroup --cluster=my-cluster --name=my-nodegroup --nodes=5
Use autoscaling: the cluster autoscaler is a provider-specific add-on that adds nodes automatically when pods are unschedulable. A HorizontalPodAutoscaler (shown below) scales pod replicas rather than nodes and is commonly paired with it:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: myapp-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
2. CPU Throttling
Symptoms:
# High CPU usage near limit
kubectl top pod myapp-7d8f9c6b5-xyz12 -n production
# NAME CPU(cores) MEMORY(bytes)
# myapp-7d8f9c6b5-xyz12 495m 256Mi
# Application experiencing slowness
Check throttling metrics:
# Using metrics-server (if available)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/production/pods/myapp-7d8f9c6b5-xyz12" | jq
Solutions:
Increase CPU limits:
resources:
requests:
cpu: "250m"
limits:
cpu: "1000m" # Increased from 500m
Remove CPU limits (allows bursting):
resources:
requests:
cpu: "250m"
# No CPU limit - can burst to node capacity
limits:
memory: "512Mi" # Keep memory limit
3. Exceeding Resource Quota
Symptoms:
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# Error creating: pods "myapp-7d8f9c6b5-xyz12" is forbidden:
# exceeded quota: production-quota
Check quota usage:
kubectl describe resourcequota production-quota -n production
# Used:
# requests.cpu: 8
# requests.memory: 16Gi
# Hard:
# requests.cpu: 10
# requests.memory: 20Gi
Solutions:
Increase quota:
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
requests.cpu: "20" # Increased
requests.memory: "40Gi" # Increased
limits.cpu: "40"
limits.memory: "80Gi"
Or reduce resource requests in deployments.
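For example, requests can be lowered in place with kubectl set resources (deployment name is illustrative):
kubectl set resources deployment/myapp -n production \
  --requests=cpu=100m,memory=128Mi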
Troubleshooting Toolkit
Essential Commands
Pod debugging:
# Get pod details
kubectl get pod POD_NAME -n NAMESPACE -o yaml
# Describe pod (events are key)
kubectl describe pod POD_NAME -n NAMESPACE
# Logs
kubectl logs POD_NAME -n NAMESPACE
kubectl logs POD_NAME -n NAMESPACE --previous
kubectl logs POD_NAME -n NAMESPACE -c CONTAINER_NAME
# Execute commands in pod
kubectl exec POD_NAME -n NAMESPACE -- COMMAND
kubectl exec -it POD_NAME -n NAMESPACE -- /bin/bash
# Port forwarding
kubectl port-forward POD_NAME -n NAMESPACE 8080:8080
# Copy files
kubectl cp NAMESPACE/POD_NAME:/path/to/file ./local-file
Service debugging:
# Get service
kubectl get svc SERVICE_NAME -n NAMESPACE
# Describe service
kubectl describe svc SERVICE_NAME -n NAMESPACE
# Get endpoints
kubectl get endpoints SERVICE_NAME -n NAMESPACE
# Test service from within cluster
kubectl run debug --image=nicolaka/netshoot -i --tty --rm -- \
curl http://SERVICE_NAME.NAMESPACE.svc.cluster.local:PORT
Network debugging:
# Run debug pod
kubectl run debug --image=nicolaka/netshoot -i --tty --rm
# Or attach to existing pod
kubectl exec -it POD_NAME -- /bin/sh
# Inside pod:
ping IP_ADDRESS
curl http://service:port
nslookup service.namespace.svc.cluster.local
traceroute IP_ADDRESS
netstat -tulpn
Debug Pod Image
Use nicolaka/netshoot for network debugging:
kubectl run netshoot --image=nicolaka/netshoot -i --tty --rm
# Available tools:
# - curl, wget
# - dig, nslookup
# - ping, traceroute
# - netstat, ss
# - tcpdump
# - iperf3
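A quick tcpdump from the netshoot pod can confirm whether your test traffic goes out and answers come back. Note this captures the debug pod's own traffic; to capture an application pod's traffic, attach a debug container to that pod with kubectl debug (see below). The service name is illustrative:
# Inside the netshoot pod
tcpdump -i any -n port 8080 &
curl -v http://myservice.production.svc.cluster.local:8080/health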
Useful kubectl Plugins
After installing krew (the kubectl plugin manager), add useful plugins:
kubectl krew install debug
kubectl krew install tail
kubectl krew install ctx
kubectl krew install ns
kubectl debug (Kubernetes 1.20+):
# Create debug container in pod
kubectl debug myapp-7d8f9c6b5-xyz12 -n production -it --image=busybox
# Debug with custom image
kubectl debug myapp-7d8f9c6b5-xyz12 -n production -it \
--image=nicolaka/netshoot -- /bin/bash
Systematic Troubleshooting Approach
Step-by-Step Process
1. Identify the problem
- What is the symptom?
- When did it start?
- What changed recently?
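Recent changes can often be reconstructed from the cluster itself (deployment name is illustrative):
# What rolled out recently, and with which change-cause
kubectl rollout history deployment/myapp -n production
# Recent warnings across the namespace
kubectl get events -n production --field-selector type=Warning --sort-by='.lastTimestamp'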
2. Gather information
# Pod status
kubectl get pods -n production
# Events
kubectl get events -n production --sort-by='.lastTimestamp'
# Logs
kubectl logs POD_NAME -n production --tail=100
3. Form hypothesis
- Based on errors, what could be the cause?
- Similar issues in the past?
4. Test hypothesis
# Test specific scenarios
# Check configurations
# Verify connectivity
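For example, if the hypothesis is a Service selector mismatch, test it directly (names are illustrative):
# Does the service have endpoints?
kubectl get endpoints myservice -n production
# Can a pod reach it?
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- curl -sv http://myservice:8080/health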
5. Implement fix
- Make minimal changes
- Document changes
- Monitor results
6. Verify resolution
- Check pod is running
- Verify functionality
- Monitor for recurrence
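A quick verification pass might look like this (deployment and label are illustrative):
# Rollout finished and pods are Ready
kubectl rollout status deployment/myapp -n production
kubectl get pods -n production -l app=myapp
# Watch for new restarts or warnings over the next few minutes
kubectl get events -n production --watch --field-selector type=Warning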
Decision Tree
Pod not working?
└── Pod status?
    ├── Pending
    │   ├── Check events: kubectl describe pod
    │   ├── Insufficient resources? Add nodes or reduce requests
    │   └── Image pull error? Fix image name or credentials
    ├── CrashLoopBackOff
    │   ├── Check logs: kubectl logs --previous
    │   ├── Exit code 137? OOMKilled - increase memory
    │   ├── Exit code 1? Application error - fix code or config
    │   └── Probe failure? Adjust probe timing
    ├── Running but not working
    │   ├── Check logs: kubectl logs -f
    │   ├── Test connectivity: kubectl exec -- curl
    │   └── Check service: kubectl get endpoints
    └── ImagePullBackOff
        ├── Check image name
        ├── Check image exists in registry
        └── Add imagePullSecrets if private
Best Practices
1. Always Set Resource Requests and Limits
# Good
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
# Bad - no limits
spec:
containers:
- name: myapp
# Missing resources
2. Use Readiness and Liveness Probes
# Liveness: Restart unhealthy pods
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
# Readiness: Remove from service when not ready
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
3. Use Labels Consistently
metadata:
labels:
app: myapp
version: v1.2.3
environment: production
team: platform
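Consistent labels pay off when filtering resources during an incident, for example:
# Everything belonging to one app in a single view
kubectl get pods,svc,deploy -n production -l app=myapp
# Only pods of a specific version
kubectl get pods -n production -l app=myapp,version=v1.2.3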
4. Enable Resource Monitoring
# Install metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Use kubectl top
kubectl top nodes
kubectl top pods -n production
5. Implement Logging Strategy
# Use stdout/stderr
# Don't write to files in container
# Example: Python
import logging
import sys
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s %(levelname)s %(message)s',
handlers=[logging.StreamHandler(sys.stdout)]
)
Conclusion
Effective Kubernetes troubleshooting requires:
- Systematic approach - Follow consistent debugging process
- Understanding fundamentals - Know how pods, services, networking work
- Right tools - kubectl, debug containers, monitoring
- Good practices - Resource limits, probes, labels
- Documentation - Record solutions for future reference
Key commands to remember:
- kubectl describe pod - Events are crucial
- kubectl logs --previous - See why pod crashed
- kubectl exec - Debug from inside pod
- kubectl get events - Cluster-level issues
With practice, most Kubernetes issues can be diagnosed and resolved quickly using these techniques.