Introduction

Kubernetes troubleshooting can be challenging due to its distributed nature and multiple abstraction layers. This guide covers the most common issues and systematic approaches to diagnosing and fixing them.

Pod Crash Loops

Understanding CrashLoopBackOff

What it means: The container starts, crashes, and is restarted by the kubelet, with the delay between restart attempts doubling each time (capped at five minutes).
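
The restart delays follow a fixed schedule; a minimal shell sketch of the kubelet's backoff (10s base, doubling per crash, capped at 300s; the counter resets once the container runs cleanly for 10 minutes):

```shell
# Delay before restart attempt N: 10s, doubled per crash, capped at 300s.
backoff_delay() {
  delay=10
  i=1
  while [ "$i" -lt "$1" ]; do
    delay=$((delay * 2))
    [ "$delay" -gt 300 ] && delay=300
    i=$((i + 1))
  done
  echo "$delay"
}

for restart in 1 2 3 4 5 6; do
  echo "restart $restart: next attempt in $(backoff_delay "$restart")s"
done
```

This is why a crash-looping pod can sit "idle" for minutes between attempts: after the sixth crash every retry waits the full five minutes.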

Diagnostic Process

Step 1: Check pod status

kubectl get pods -n production

# Output:
# NAME                    READY   STATUS             RESTARTS   AGE
# myapp-7d8f9c6b5-xyz12   0/1     CrashLoopBackOff   5          10m

Step 2: Describe the pod

kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production

# Look for:
# - Last State: Terminated (exit code)
# - Events: Recent errors or warnings

Step 3: Check logs

# Current logs (if pod is running)
kubectl logs myapp-7d8f9c6b5-xyz12 -n production

# Previous logs (after crash)
kubectl logs myapp-7d8f9c6b5-xyz12 -n production --previous

# Follow logs in real-time
kubectl logs myapp-7d8f9c6b5-xyz12 -n production -f

# Multiple containers in pod
kubectl logs myapp-7d8f9c6b5-xyz12 -n production -c container-name

Common Causes and Solutions

1. Application Error (Exit Code 1)

Symptoms:

kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Last State: Terminated
# Exit Code: 1

Check logs:

kubectl logs myapp-7d8f9c6b5-xyz12 -n production --previous

# Common errors:
# - Uncaught exceptions
# - Missing environment variables
# - Failed database connections
# - Configuration errors

Solutions:

Missing environment variable:

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: myapp
    image: myapp:latest
    env:
    - name: DATABASE_URL
      value: "postgres://db:5432/mydb"  # Add missing config

Failed health checks causing restart:

# Adjust probe timing
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60  # Increase if app takes time to start
  periodSeconds: 10
  failureThreshold: 3  # Allow more failures before restart

2. OOMKilled (Out of Memory)

Symptoms:

kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
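
Exit codes above 128 mean the container was killed by a signal (code minus 128), so 137 is 128 + 9: SIGKILL, which is what the kernel OOM killer sends. A small helper to decode them:

```shell
# Decode a container exit code: >128 means "killed by signal (code - 128)".
decode_exit_code() {
  if [ "$1" -gt 128 ]; then
    echo "killed by signal $(( $1 - 128 ))"
  else
    echo "application exited with status $1"
  fi
}

decode_exit_code 137   # signal 9: SIGKILL (OOMKilled)
decode_exit_code 143   # signal 15: SIGTERM (graceful shutdown)
decode_exit_code 1     # application error
```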

Check resource usage:

# Current usage
kubectl top pod myapp-7d8f9c6b5-xyz12 -n production

# Output:
# NAME                    CPU(cores)   MEMORY(bytes)
# myapp-7d8f9c6b5-xyz12   100m         512Mi

Solutions:

Increase memory limits:

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: myapp
    image: myapp:latest
    resources:
      requests:
        memory: "256Mi"  # Guaranteed memory
      limits:
        memory: "1Gi"    # Increased from 512Mi

Investigate memory leaks:

# Get heap dump (Java example)
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  jmap -dump:format=b,file=/tmp/heap.bin 1

# Copy dump locally
kubectl cp production/myapp-7d8f9c6b5-xyz12:/tmp/heap.bin ./heap.bin

# Analyze with tools like Eclipse MAT

3. Image Pull Errors

Symptoms:

kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# Failed to pull image "myapp:v2.3.0": rpc error: code = Unknown

Common causes:

ImagePullBackOff - Wrong image name:

# Fix typo or version
spec:
  containers:
  - name: myapp
    image: myregistry.com/myapp:v2.3.0  # Correct path

Authentication required:

# Create secret for private registry
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.com \
  --docker-username=myuser \
  --docker-password=mypassword \
  --docker-email=[email protected] \
  -n production

# Use secret in pod
kubectl patch serviceaccount default -n production \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'

Or in pod spec:

spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: myapp
    image: myregistry.com/myapp:v2.3.0

4. Failed Liveness/Readiness Probes

Symptoms:

kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# Liveness probe failed: HTTP probe failed with statuscode: 500
# Killing container myapp

Debug probes:

# Test health endpoint manually
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  curl -v http://localhost:8080/health

# Check if app is ready
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  curl http://localhost:8080/ready

Solutions:

Adjust probe configuration:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30    # App startup time
  periodSeconds: 10          # Check every 10s
  timeoutSeconds: 5          # Wait 5s for response
  failureThreshold: 3        # 3 failures before restart

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10    # Can be lower than liveness
  periodSeconds: 5           # Check more frequently
  failureThreshold: 2        # Remove from service faster
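
As a rough rule, the worst-case time from a container hanging to its restart is periodSeconds × failureThreshold, plus up to timeoutSeconds waiting on the final probe:

```shell
# Approximate worst-case window from "container hangs" to kubelet restart.
restart_window() {
  period=$1; threshold=$2; timeout=$3
  echo $(( period * threshold + timeout ))
}

restart_window 10 3 5   # liveness: period=10s, failureThreshold=3, timeout=5s
```

If that window is too long for your traffic, tighten the readiness probe (so the pod leaves the service quickly) rather than the liveness probe (which triggers restarts).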

Separate liveness and readiness:

# Liveness: Is app alive? (restart if fails)
livenessProbe:
  httpGet:
    path: /health
    port: 8080

# Readiness: Can app serve traffic? (remove from service if fails)
readinessProbe:
  httpGet:
    path: /ready
    port: 8080

5. Missing ConfigMaps or Secrets

Symptoms:

kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# Error: configmap "app-config" not found

Check dependencies:

# List configmaps
kubectl get configmap -n production

# List secrets
kubectl get secrets -n production

# Check specific configmap
kubectl get configmap app-config -n production -o yaml

Create missing resources:

# Create configmap
kubectl create configmap app-config \
  --from-file=config.yaml \
  -n production

# Create secret
kubectl create secret generic app-secret \
  --from-literal=api-key=abc123 \
  -n production

Networking Issues

Pod-to-Pod Communication Problems

Diagnostic Commands

Step 1: Verify DNS resolution

# Get pod IPs
kubectl get pods -n production -o wide

# Test DNS from within a pod
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  nslookup kubernetes.default

# Test service DNS
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  nslookup myservice.production.svc.cluster.local
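
Service DNS names are assembled as <service>.<namespace>.svc.<cluster-domain>, where the default cluster domain is cluster.local:

```shell
# Build the fully qualified DNS name for a service, assuming the
# default cluster domain (cluster.local).
svc_fqdn() {
  echo "$1.$2.svc.cluster.local"
}

svc_fqdn myservice production
```

From a pod in the same namespace, the short name (myservice) resolves too, via the search domains in the pod's /etc/resolv.conf.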

Step 2: Test connectivity

# Ping another pod (if ICMP allowed)
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  ping 10.244.1.5

# Test service endpoint
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  curl -v http://myservice:8080/health

# Test with wget
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  wget -O- http://myservice:8080/health

Step 3: Check network policies

# List network policies
kubectl get networkpolicy -n production

# Describe specific policy
kubectl describe networkpolicy allow-frontend -n production

Common Networking Issues

1. Service Not Reachable

Check service configuration:

# Get service details
kubectl get svc myservice -n production

# Describe service
kubectl describe svc myservice -n production

# Check endpoints
kubectl get endpoints myservice -n production

Verify selectors match:

# Service selector
kubectl get svc myservice -n production -o yaml | grep -A 5 selector

# Pod labels
kubectl get pods -n production --show-labels

Fix selector mismatch:

# Service
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  selector:
    app: myapp  # Must match pod labels
  ports:
  - port: 8080
    targetPort: 8080

---
# Pod
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app: myapp  # Matches service selector
spec:
  containers:
  - name: myapp
    ports:
    - containerPort: 8080
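
The endpoints controller includes a pod behind a service only if the pod's labels contain every key=value pair in the service selector. A single-label sketch of that check (real selectors can carry multiple labels):

```shell
# Single-label sketch: does the pod's label set contain the selector pair?
matches_selector() {
  selector=$1; pod_labels=$2
  case ",$pod_labels," in
    *",$selector,"*) echo "match" ;;
    *)               echo "no match" ;;
  esac
}

matches_selector "app=myapp" "app=myapp,version=v1.2.3"   # in endpoints
matches_selector "app=myapp" "app=other,version=v1.2.3"   # not in endpoints
```

An empty ENDPOINTS column in kubectl get endpoints almost always means this check is failing for every pod.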

2. DNS Resolution Failures

Check CoreDNS:

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check CoreDNS config
kubectl get configmap coredns -n kube-system -o yaml

Test DNS from debug pod:

# Run debug pod
kubectl run debug --image=nicolaka/netshoot -i --tty --rm

# Inside debug pod:
nslookup kubernetes.default
nslookup myservice.production.svc.cluster.local
dig myservice.production.svc.cluster.local

# Check /etc/resolv.conf
cat /etc/resolv.conf

Fix DNS issues:

# Ensure pod uses cluster DNS
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  dnsPolicy: ClusterFirst  # Use cluster DNS
  containers:
  - name: myapp
    image: myapp:latest

3. Network Policy Blocking Traffic

Check if policies exist:

kubectl get networkpolicy -n production

# If policies exist, they might be blocking traffic

Allow pod-to-pod communication:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-within-namespace
  namespace: production
spec:
  podSelector: {}  # All pods in namespace
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}  # From any pod in namespace
  egress:
  - to:
    - podSelector: {}  # To any pod in namespace

Allow specific ingress:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

4. External Service Access Issues

Test external connectivity:

# Test from pod
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  curl -v https://api.external.com

# Test DNS resolution
kubectl exec myapp-7d8f9c6b5-xyz12 -n production -- \
  nslookup api.external.com

Allow egress to external services:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
  - Egress
  egress:
  - to:  # External traffic (a namespaceSelector only matches in-cluster pods,
         # so external destinations need an ipBlock)
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 80
  - to:  # DNS (namespaceSelector and podSelector combined in one entry)
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Resource Limit Issues

Understanding Requests and Limits

Requests: resources guaranteed to the container; the scheduler uses them to place the pod.
Limits: the maximum the container may use; memory limits are enforced by OOM-killing, CPU limits by throttling.

resources:
  requests:
    memory: "256Mi"  # Guaranteed
    cpu: "250m"      # Guaranteed
  limits:
    memory: "512Mi"  # Max allowed
    cpu: "500m"      # Max allowed (throttled if exceeded)
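
CPU quantities use millicores ("500m" is 500/1000 = 0.5 of a core) and memory uses binary units (1Gi = 1024Mi). Two small conversion helpers:

```shell
# Convert Kubernetes resource quantity strings.
millicores() { echo "${1%m}"; }                   # "500m" -> 500
gi_to_mi()   { echo "$(( ${1%Gi} * 1024 ))Mi"; }  # "1Gi"  -> "1024Mi"

echo "cpu limit 500m = $(millicores 500m) millicores (0.5 cores)"
echo "memory limit 1Gi = $(gi_to_mi 1Gi)"
```

Mixing units ("0.5" vs "500m", "1G" vs "1Gi") is a common source of quota surprises: "1G" is decimal (1000^3 bytes), "1Gi" is binary (1024^3 bytes).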

Diagnosing Resource Issues

Check resource usage:

# Current usage for all pods
kubectl top pods -n production

# Current usage for nodes
kubectl top nodes

# Detailed pod resource info
kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production | grep -A 10 "Limits\|Requests"

Check resource quotas:

# List quotas
kubectl get resourcequota -n production

# Describe quota
kubectl describe resourcequota production-quota -n production

Common Resource Problems

1. Pod Stuck in Pending (Insufficient Resources)

Symptoms:

kubectl get pods -n production
# NAME                    READY   STATUS    RESTARTS   AGE
# myapp-7d8f9c6b5-xyz12   0/1     Pending   0          5m

kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# 0/5 nodes are available: 5 Insufficient memory.

Solutions:

Scale down resource requests:

resources:
  requests:
    memory: "128Mi"  # Reduced from 512Mi
    cpu: "100m"      # Reduced from 500m

Add more nodes:

# AWS EKS example
eksctl scale nodegroup --cluster=my-cluster --name=my-nodegroup --nodes=5

Use autoscaling (an HPA scales pod replicas; pair it with the cluster autoscaler so new nodes are added when pods go Pending):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
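
The HPA computes its target as desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). In integer shell arithmetic:

```shell
# HPA scaling rule: desired = ceil(current * utilization / target).
# Integer ceiling: (a + b - 1) / b.
hpa_desired() {
  current=$1; utilization=$2; target=$3
  echo $(( (current * utilization + target - 1) / target ))
}

hpa_desired 4 90 70   # 4 pods at 90% CPU against a 70% target -> scale to 6
hpa_desired 2 70 70   # already on target -> stays at 2
```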

2. CPU Throttling

Symptoms:

# High CPU usage near limit
kubectl top pod myapp-7d8f9c6b5-xyz12 -n production
# NAME                    CPU(cores)   MEMORY(bytes)
# myapp-7d8f9c6b5-xyz12   495m         256Mi

# Application experiencing slowness

Check throttling metrics:

# metrics-server only reports usage; actual throttling counters
# (container_cpu_cfs_throttled_periods_total) come from cAdvisor/Prometheus
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/production/pods/myapp-7d8f9c6b5-xyz12" | jq

Solutions:

Increase CPU limits:

resources:
  requests:
    cpu: "250m"
  limits:
    cpu: "1000m"  # Increased from 500m

Remove CPU limits (allows bursting):

resources:
  requests:
    cpu: "250m"
  # No CPU limit - can burst to node capacity
  limits:
    memory: "512Mi"  # Keep memory limit

3. Exceeding Resource Quota

Symptoms:

kubectl describe pod myapp-7d8f9c6b5-xyz12 -n production
# Events:
# Error creating: pods "myapp-7d8f9c6b5-xyz12" is forbidden:
# exceeded quota: production-quota

Check quota usage:

kubectl describe resourcequota production-quota -n production
# Used:
#   requests.cpu: 8
#   requests.memory: 16Gi
# Hard:
#   requests.cpu: 10
#   requests.memory: 20Gi
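
A new pod is admitted only if its requests fit in the gap between Hard and Used. Computing the headroom for the quota above:

```shell
# Quota headroom per resource: hard limit minus current usage.
headroom() { echo $(( $1 - $2 )); }

echo "requests.cpu remaining: $(headroom 10 8)"       # cores
echo "requests.memory remaining: $(headroom 20 16)Gi"
```

Here a pod requesting more than 2 CPUs or 4Gi of memory would be rejected even though the nodes themselves may have capacity.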

Solutions:

Increase quota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"      # Increased
    requests.memory: "40Gi" # Increased
    limits.cpu: "40"
    limits.memory: "80Gi"

Or reduce resource requests in deployments.

Troubleshooting Toolkit

Essential Commands

Pod debugging:

# Get pod details
kubectl get pod POD_NAME -n NAMESPACE -o yaml

# Describe pod (events are key)
kubectl describe pod POD_NAME -n NAMESPACE

# Logs
kubectl logs POD_NAME -n NAMESPACE
kubectl logs POD_NAME -n NAMESPACE --previous
kubectl logs POD_NAME -n NAMESPACE -c CONTAINER_NAME

# Execute commands in pod
kubectl exec POD_NAME -n NAMESPACE -- COMMAND
kubectl exec -it POD_NAME -n NAMESPACE -- /bin/bash

# Port forwarding
kubectl port-forward POD_NAME -n NAMESPACE 8080:8080

# Copy files
kubectl cp NAMESPACE/POD_NAME:/path/to/file ./local-file

Service debugging:

# Get service
kubectl get svc SERVICE_NAME -n NAMESPACE

# Describe service
kubectl describe svc SERVICE_NAME -n NAMESPACE

# Get endpoints
kubectl get endpoints SERVICE_NAME -n NAMESPACE

# Test service from within cluster
kubectl run debug --image=nicolaka/netshoot -i --tty --rm -- \
  curl http://SERVICE_NAME.NAMESPACE.svc.cluster.local:PORT

Network debugging:

# Run debug pod
kubectl run debug --image=nicolaka/netshoot -i --tty --rm

# Or attach to existing pod
kubectl exec -it POD_NAME -- /bin/sh

# Inside pod:
ping IP_ADDRESS
curl http://service:port
nslookup service.namespace.svc.cluster.local
traceroute IP_ADDRESS
netstat -tulpn

Debug Pod Image

Use nicolaka/netshoot for network debugging:

kubectl run netshoot --image=nicolaka/netshoot -i --tty --rm

# Available tools:
# - curl, wget
# - dig, nslookup
# - ping, traceroute
# - netstat, ss
# - tcpdump
# - iperf3

Useful kubectl Plugins

With the krew plugin manager installed, add useful plugins:

kubectl krew install debug
kubectl krew install tail
kubectl krew install ctx
kubectl krew install ns

kubectl debug (Kubernetes 1.20+):

# Create debug container in pod
kubectl debug myapp-7d8f9c6b5-xyz12 -n production -it --image=busybox

# Debug with custom image
kubectl debug myapp-7d8f9c6b5-xyz12 -n production -it \
  --image=nicolaka/netshoot -- /bin/bash

Systematic Troubleshooting Approach

Step-by-Step Process

1. Identify the problem

  • What is the symptom?
  • When did it start?
  • What changed recently?

2. Gather information

# Pod status
kubectl get pods -n production

# Events
kubectl get events -n production --sort-by='.lastTimestamp'

# Logs
kubectl logs POD_NAME -n production --tail=100

3. Form hypothesis

  • Based on errors, what could be the cause?
  • Similar issues in the past?

4. Test hypothesis

# Test specific scenarios
# Check configurations
# Verify connectivity

5. Implement fix

  • Make minimal changes
  • Document changes
  • Monitor results

6. Verify resolution

  • Check pod is running
  • Verify functionality
  • Monitor for recurrence

Decision Tree

Pod not working?
├─ Pod status?
│  ├─ Pending
│  │  ├─ Check events: kubectl describe pod
│  │  ├─ Insufficient resources? Add nodes or reduce requests
│  │  └─ Image pull error? Fix image name or credentials
│  ├─ CrashLoopBackOff
│  │  ├─ Check logs: kubectl logs --previous
│  │  ├─ Exit code 137? OOMKilled - increase memory
│  │  ├─ Exit code 1? Application error - fix code or config
│  │  └─ Probe failure? Adjust probe timing
│  ├─ Running but not working
│  │  ├─ Check logs: kubectl logs -f
│  │  ├─ Test connectivity: kubectl exec -- curl
│  │  └─ Check service: kubectl get endpoints
│  └─ ImagePullBackOff
│     ├─ Check image name
│     ├─ Check image exists in registry
│     └─ Add imagePullSecrets if private

Best Practices

1. Always Set Resource Requests and Limits

# Good
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

# Bad - no limits
spec:
  containers:
  - name: myapp
    # Missing resources

2. Use Readiness and Liveness Probes

# Liveness: Restart unhealthy pods
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

# Readiness: Remove from service when not ready
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

3. Use Labels Consistently

metadata:
  labels:
    app: myapp
    version: v1.2.3
    environment: production
    team: platform

4. Enable Resource Monitoring

# Install metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Use kubectl top
kubectl top nodes
kubectl top pods -n production

5. Implement Logging Strategy

# Use stdout/stderr
# Don't write to files in container

# Example: Python
import logging
import sys

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)

Conclusion

Effective Kubernetes troubleshooting requires:

  1. Systematic approach - Follow consistent debugging process
  2. Understanding fundamentals - Know how pods, services, networking work
  3. Right tools - kubectl, debug containers, monitoring
  4. Good practices - Resource limits, probes, labels
  5. Documentation - Record solutions for future reference

Key commands to remember:

  • kubectl describe pod - Events are crucial
  • kubectl logs --previous - See why pod crashed
  • kubectl exec - Debug from inside pod
  • kubectl get events - Cluster-level issues

With practice, most Kubernetes issues can be diagnosed and resolved quickly using these techniques.