Introduction
Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. Rather than waiting for failures to happen, chaos engineering proactively injects failures to identify weaknesses before they impact users.
Why does this matter? In modern distributed systems (microservices, cloud infrastructure, containers), failures are inevitable. A network can partition, a server can crash, a database can slow down. Traditional testing can’t predict all the ways these components interact when things go wrong. Chaos engineering fills this gap by deliberately causing failures in a controlled way.
Real-world example: Netflix pioneered chaos engineering with “Chaos Monkey,” a tool that randomly kills production servers. By doing this regularly, Netflix ensured their systems could survive server failures without affecting customers watching movies. When AWS had a major outage in 2011, Netflix stayed online while competitors went down—because they had already tested their resilience.
Core Principle: “The best time to find out how your system fails is before your customers do.”
What is Chaos Engineering?
Definition
Chaos Engineering is a systematic approach to discovering system weaknesses by deliberately introducing failures and observing how the system responds.
Think of it like a fire drill: You don’t wait for a real fire to see if your evacuation plan works. Similarly, you don’t wait for a production outage to discover your system can’t handle a database failover.
Key Characteristics:
Proactive: Test before failures occur naturally
- What this means: Instead of reacting to outages at 3 AM, you intentionally cause failures during business hours when your team is ready to respond. This builds “muscle memory” for incident response.
Controlled: Experiments have clear boundaries and rollback plans
- What this means: You don’t randomly break things. Each experiment has a defined scope (e.g., “kill 10% of pods for 5 minutes”) and an automatic stop mechanism if things go wrong.
Observable: Measure system behavior during experiments
- What this means: Before breaking anything, you define metrics to watch (error rate, latency, throughput). If these metrics degrade beyond acceptable limits, the experiment stops automatically.
Incremental: Start small, increase blast radius gradually
- What this means: Begin by killing one pod in staging, then one pod in production, then 5%, then 10%. Don’t jump straight to “delete the entire database.”
Production-focused: Real-world conditions matter most
- What this means: Staging environments don’t have real traffic patterns, real data volumes, or real failure modes. Testing in production (carefully) gives you confidence that matters.
Chaos Engineering vs Traditional Testing
Understanding the difference:
Traditional Testing answers: “Does my code work as expected?”
- Unit tests check if a function returns the right value
- Integration tests check if two services can talk to each other
- You write tests for scenarios you can imagine
Chaos Engineering answers: “What happens when things I didn’t expect go wrong?”
- What if the network is slow but not completely down?
- What if 3 out of 10 servers crash at the same time?
- What if the database becomes read-only unexpectedly?
Visual comparison:
Traditional Testing:
┌─────────────┐
│   Known     │ ──> Test for expected failures
│   Failures  │     (unit tests, integration tests)
└─────────────┘
Example: "Test that API returns 404 when user doesn't exist"
Chaos Engineering:
┌─────────────┐
│   Unknown   │ ──> Discover unexpected failures
│   Unknowns  │     (what you didn’t think to test)
└─────────────┘
Example: "What happens when the user database is unreachable?"
Detailed Comparison:
| Aspect | Traditional Testing | Chaos Engineering | Example |
|---|---|---|---|
| Scope | Known failure modes | Unknown failure modes | Testing “user not found” vs discovering “what if user DB is slow” |
| Environment | Test/staging | Production (ideally) | Staging has 10 users, production has 10 million |
| Approach | Validate correctness | Discover weaknesses | “Does login work?” vs “Can we handle login when auth service is degraded?” |
| Timing | Before deployment | During production | CI/CD pipeline vs continuous production testing |
| Goal | Prevent bugs | Build resilience | Fix broken code vs survive infrastructure failures |
| Mindset | “It should work” | “What could go wrong?” | Optimistic vs paranoid (in a good way) |
Why you need both: Traditional testing catches bugs in your code. Chaos engineering catches weaknesses in your architecture and assumptions about how systems interact.
Principles of Chaos Engineering
1. Build a Hypothesis Around Steady State
What is steady state? It’s what “normal” looks like for your system—the baseline metrics when everything is working fine.
Why define it? Because during a chaos experiment, you need to know if things are getting worse. Without a baseline, you can’t tell if your experiment is causing problems.
The key insight: Measure business outcomes, not technical internals.
Bad hypothesis (technical):
"CPU should stay under 80%"
Why is this bad? CPU usage is an internal metric. Customers don’t care about CPU—they care about whether their order goes through. High CPU might be fine if the system is still processing orders successfully.
Good hypothesis (business):
"Order completion rate should stay above 99.5%
even when 20% of backend pods are unavailable"
Why is this good? It focuses on what matters to users (orders completing) and defines acceptable degradation (99.5% success rate). This tells you whether customers are affected.
Another example:
Bad: “Memory usage should stay below 8GB”
Good: “API p95 latency should stay below 500ms when 2 out of 5 database replicas fail”
The good hypothesis answers: “Can our customers still use the product when infrastructure fails?”
Steady State Indicators:
service: checkout-service
steady_state_metrics:
  - name: order_success_rate
    threshold: ">99.5%"
    measurement: "successful_orders / total_orders"
  - name: p95_latency
    threshold: "<500ms"
    measurement: "95th percentile checkout time"
  - name: payment_success
    threshold: ">99%"
    measurement: "successful_payments / total_attempts"
2. Vary Real-World Events
What does this mean? Inject failures that could actually happen in production—not theoretical edge cases that will never occur.
The goal: Simulate real disasters you’ve seen before (or that your competitors have experienced).
Common Real-World Failures (Explained):
Network latency/partition
- What it is: Network gets slow (latency) or completely cut off (partition) between services
- Real example: AWS availability zone loses connectivity to another zone
- Why test it: Your microservices might timeout or retry indefinitely, cascading the failure
Pod/container crashes
- What it is: Application container dies unexpectedly
- Real example: Out-of-memory (OOM) killer terminates your process, or a bug causes a panic
- Why test it: Verify Kubernetes restarts pods automatically and load balancers remove unhealthy instances
Resource exhaustion (CPU, memory, disk)
- What it is: System runs out of a critical resource
- Real example: Sudden traffic spike maxes out CPU, or logs fill up the disk
- Why test it: Check if auto-scaling works and if your app degrades gracefully
DNS failures
- What it is: DNS lookups fail or return wrong results
- Real example: DNS server becomes unreachable or cache expires
- Why test it: Many apps don’t handle DNS failures well, causing cascading failures
Cloud provider outages
- What it is: Entire AWS region or Google Cloud zone goes down
- Real example: AWS us-east-1 outage (happens every year)
- Why test it: Verify your multi-region failover actually works
Dependency failures
- What it is: External service (payment gateway, auth provider, database) becomes unavailable
- Real example: Stripe API returns 500 errors
- Why test it: Check if your app can degrade gracefully (e.g., queue orders for later)
Corrupt data
- What it is: Database contains bad data that crashes your app
- Real example: Migration bug writes NULL where code expects a value
- Why test it: Verify input validation and error handling
Clock skew
- What it is: Server system time drifts from actual time
- Real example: NTP sync fails, server thinks it’s 5 minutes in the future
- Why test it: Tokens expire early, logs have wrong timestamps, distributed systems get confused
Certificate expiration
- What it is: TLS/SSL certificate expires
- Real example: Let’s Encrypt cert renewal fails
- Why test it: Many services go down completely when certs expire (and auto-renewal might not work)
3. Run Experiments in Production
Wait, production? Isn’t that dangerous? Yes, if done carelessly. But it’s the only way to get real confidence.
Why staging isn’t enough:
Staging environments don’t have the same:
Traffic patterns
- Staging: 10 test users clicking buttons
- Production: 10,000 real users with unpredictable behavior
- Why it matters: Load balancing and caching behave totally differently at scale
Data volume
- Staging: 1,000 database rows
- Production: 10 million rows
- Why it matters: Queries that are fast in staging become slow in production, breaking timeouts
Service dependencies
- Staging: Mocked payment gateway, fake email service
- Production: Real Stripe API, real SendGrid
- Why it matters: External service failures (rate limits, timeouts) don’t happen in staging
Infrastructure scale
- Staging: 3 small servers
- Production: 50 large servers across 3 regions
- Why it matters: Network topology, failure domains, and scaling behavior are completely different
Real failure modes
- Staging: Clean environment, recently deployed
- Production: Months of accumulated state, edge cases, memory leaks
- Why it matters: Production has bugs and conditions that staging will never reproduce
The Netflix example: Testing in production let Netflix find and fix issues before customers were affected; when they had relied only on staging, real AWS failures still caught them off guard.
Making production testing safe: Start with tiny blast radius (1% of traffic, 1 pod) and automate rollback if metrics degrade.
Production Experiment Safety:
experiment:
  name: pod-failure-test
  environment: production
  safety_measures:
    - blast_radius: "5% of pods"
    - rollback_trigger: "error_rate > 1%"
    - time_limit: "5 minutes"
    - hours: "business_hours_only"
    - monitoring: "active_observation_required"
    - communication: "team_notified_in_advance"
4. Automate Experiments
Why automate? Because manually breaking things every week is:
- Time-consuming (toil)
- Inconsistent (humans forget steps)
- Not scalable (what about 50 services?)
- Easy to skip (“we’re too busy this week”)
Automation benefits:
Consistency
- Manual: “Did we test pod failure this week? I think Bob did it… or was that last week?”
- Automated: Same experiment runs every Monday at 2 AM, logs results, alerts if it fails
Repeatability
- Manual: Each engineer runs the experiment slightly differently
- Automated: Exact same steps, same blast radius, same rollback triggers every time
Continuous validation
- Manual: Test once per quarter (when you remember)
- Automated: Test every week automatically, catch regressions immediately
Reduced toil
- Manual: 30 minutes of engineer time per experiment
- Automated: Set it and forget it, only investigate when tests fail
Real-world example: Chaos Monkey runs automatically at Netflix, randomly killing servers. Engineers don’t manually pick servers to kill—automation does it continuously, ensuring systems stay resilient as code changes.
5. Minimize Blast Radius
What is blast radius? The percentage of your system (or users) affected by the chaos experiment.
Why minimize it? Because if your experiment goes wrong, you want to affect as few customers as possible.
The golden rule: Start tiny, increase slowly.
Blast Radius Progression (Explained):
Week 1: 1% of traffic, 1 pod, staging environment
↓ Why: Zero customer risk, verify the experiment works
↓ Outcome: Confirmed the chaos tool works, monitoring alerts fire
Week 2: 1% of traffic, 1 pod, production (off-peak)
↓ Why: Tiny production impact (1% of users), during low-traffic hours
↓ Outcome: Found issue: pod restart took 30 seconds (too slow), need to fix
Week 3: 5% of traffic, 5 pods, production (off-peak)
↓ Why: Increased blast radius after fixing Week 2 issue
↓ Outcome: System handled it well, confidence growing
Week 4: 10% of traffic, production (business hours)
↓ Why: Test during real traffic to see if systems can handle it
↓ Outcome: Success! No customer-facing errors, ready to increase to 20%
What if Week 2 failed? You don’t proceed to Week 3. You fix the issue first, then re-run Week 2 until it passes.
Real-world disaster: A company skipped this progression, jumped straight to “delete 50% of database replicas in production,” and caused a major outage. Don’t be that company.
Remember: It’s not a race. Slow and steady builds confidence without risking customer trust.
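With a tool like Chaos Mesh, widening the blast radius is usually a one-field change, which keeps the progression disciplined. A minimal sketch of a Week 2-style experiment (namespace and labels assumed); later weeks only change `mode` and `value`:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: week2-single-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one              # Week 2: exactly one pod
  # mode: fixed-percent  # Later weeks: switch to a percentage...
  # value: "5"           # ...and raise it gradually (5, 10, 20)
  selector:
    namespaces:
      - production       # Week 1 would point at staging instead
    labelSelectors:
      app: order-service # assumed label
  duration: "5m"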
Chaos Engineering Maturity Model
Level 1: Ad-Hoc Chaos
Characteristics:
- Manual experiments
- No formal process
- Reactive to incidents
- Limited tooling
Example:
# "Let's see what happens if I kill this pod"
kubectl delete pod api-server-xyz
Level 2: Planned Experiments
Characteristics:
- Documented experiments
- GameDay events
- Hypothesis-driven
- Team coordination
Example:
# chaos-experiment.yaml
experiment:
  name: "Q4-2025-GameDay-Zone-Failure"
  date: "2025-12-15"
  hypothesis: "System remains available during AZ failure"
  participants: ["SRE", "Platform", "Backend"]
  rollback_plan: "documented"
Level 3: Automated Continuous Chaos
Characteristics:
- Automated experiments
- Integrated into CI/CD
- Continuous validation
- Self-service for teams
Example:
# Automated weekly chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-chaos
spec:
  schedule: "0 2 * * 1"  # 2 AM every Monday
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: backend
Level 4: Chaos as a Service
Characteristics:
- Platform for all teams
- Advanced failure scenarios
- Automated remediation testing
- Culture of resilience
Designing Chaos Experiments
Experiment Template
# Experiment: [Name]
## Hypothesis
**Steady State:** [What is normal?]
**Disruption:** [What will we break?]
**Expected Outcome:** [System should remain in steady state]
## Scope
- **Service:** [Target service]
- **Blast Radius:** [Percentage/count of instances]
- **Duration:** [How long]
- **Environment:** [Production/staging]
## Preconditions
- [ ] Monitoring dashboards ready
- [ ] Team notified
- [ ] Rollback plan documented
- [ ] Off-hours/on-hours decision made
- [ ] Incident response team on standby
## Experiment Steps
1. [Establish baseline]
2. [Inject failure]
3. [Observe behavior]
4. [Measure metrics]
5. [Rollback]
6. [Analyze results]
## Success Criteria
- [ ] Steady state maintained
- [ ] No customer impact
- [ ] Graceful degradation observed
- [ ] Alerts fired appropriately
## Rollback Plan
[How to stop the experiment immediately]
## Results
[Document findings]
Example Experiment: Pod Failure
# chaos-experiment-001.yaml
experiment:
  name: "Backend Pod Failure Test"
  id: "EXP-001"
  date: "2025-10-16"
  hypothesis:
    steady_state: "Order success rate >99.5%, p95 latency <500ms"
    disruption: "Kill 10% of backend pods"
    expected: "Kubernetes auto-healing maintains service levels"
  scope:
    service: "order-service"
    namespace: "production"
    blast_radius: "10% of pods (2 out of 20)"
    duration: "5 minutes"
  preconditions:
    - monitoring: "Grafana dashboard open"
    - notification: "Team notified in #sre-chaos"
    - time: "Tuesday 2 PM PST (low traffic)"
    - oncall: "SRE on standby"
  steps:
    - baseline:
        action: "Observe metrics for 5 minutes"
        metrics: ["order_success_rate", "p95_latency", "error_rate"]
    - inject_failure:
        tool: "chaos-mesh"
        action: "pod-kill"
        target: "2 pods with label app=order-service"
    - observe:
        duration: "5 minutes"
        watch:
          - "Pod restart time"
          - "Service availability"
          - "Error rates"
    - measure:
        compare: "baseline vs during-chaos vs post-chaos"
    - rollback:
        automatic: true
        trigger: "error_rate > 1% OR p95_latency > 1000ms"
  success_criteria:
    - order_success_rate: ">99.5%"
    - p95_latency: "<600ms"
    - customer_complaints: "0"
    - auto_recovery_time: "<30s"
Common Chaos Experiments
1. Pod/Container Failures
Scenario: Random pod crashes
What you’re testing: Does your system handle pod failures gracefully?
Why this matters in production:
- Pods crash all the time: out-of-memory, bugs, node failures
- Kubernetes should automatically restart them
- Your load balancer should stop sending traffic to dead pods
- Users should never notice
What you expect to see:
- Pod crashes
- Kubernetes detects unhealthy pod within 10 seconds
- New pod starts automatically
- Load balancer routes traffic to healthy pods
- Zero customer-facing errors
What you might discover:
- Pod restart takes 60 seconds (too slow → optimize startup time)
- Load balancer keeps sending traffic to dead pod for 30 seconds (health check interval too long)
- Application doesn’t handle graceful shutdown (loses in-flight requests)
Chaos Mesh Example:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-experiment
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: backend
  duration: "30s"
  scheduler:
    cron: "@every 10m"
What to Observe:
- Pod restart time
- Service availability during restart
- Load balancer behavior
- Alert notifications
Expected Resilience:
✓ Kubernetes restarts pods automatically
✓ Service remains available (other pods handle traffic)
✓ No customer-facing errors
✓ Alerts fire and auto-resolve
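Several of the discoveries above (slow restarts, traffic still hitting dead pods, lost in-flight requests) usually trace back to probe and shutdown settings. A minimal Deployment fragment as a sketch, with assumed names, paths, and timings:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 20
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      terminationGracePeriodSeconds: 30   # give in-flight requests time to finish
      containers:
        - name: backend
          image: example.com/backend:1.0  # assumed image
          readinessProbe:                 # remove the pod from load balancing quickly
            httpGet:
              path: /healthz              # assumed health endpoint
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:                  # restart a wedged container
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
          lifecycle:
            preStop:                      # let the load balancer drain before shutdown begins
              exec:
                command: ["sh", "-c", "sleep 5"]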
2. Network Latency
Scenario: Degraded network between services
What you’re testing: Does your system handle slow networks gracefully?
Why this matters in production:
- Networks don’t just fail completely—they often get slow first
- 250ms latency might not seem like much, but it adds up across microservices
- If Service A calls Service B (250ms) which calls Service C (250ms) which calls Service D (250ms), your user waits 750ms+ for a response
What you expect to see:
- Request latency increases slightly (acceptable)
- Timeouts are configured correctly (requests fail fast instead of hanging)
- Circuit breakers open when latency exceeds threshold
- Retries don’t make the problem worse
What you might discover:
- No timeouts configured → requests hang for 60+ seconds
- Retry logic makes it worse (retrying slow requests overwhelms the service)
- Circuit breaker doesn’t exist or is misconfigured
- Downstream latency cascades to all services (everything becomes slow)
Chaos Mesh Example:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-experiment
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
  delay:
    latency: "250ms"
    correlation: "100"
    jitter: "50ms"
  duration: "2m"
  target:
    mode: all
    selector:
      labelSelectors:
        app: database
What to Observe:
- Request latency impact
- Timeout behavior
- Retry logic
- Circuit breaker activation
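If the experiment shows that nothing in the call path enforces a timeout or bounds retries, the fix often lives in the service mesh or client configuration rather than the chaos tooling. A minimal sketch, assuming Istio and a hypothetical orders service:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-timeouts
spec:
  hosts:
    - orders.production.svc.cluster.local     # assumed service
  http:
    - route:
        - destination:
            host: orders.production.svc.cluster.local
      timeout: 2s          # fail fast instead of hanging for 60+ seconds
      retries:
        attempts: 2        # bounded retries so slow calls are not amplified
        perTryTimeout: 1s
        retryOn: 5xx,reset,connect-failure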
3. Resource Exhaustion
Scenario: CPU/Memory pressure
Chaos Mesh Example:
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-experiment
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: worker
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "3m"
What to Observe:
- Auto-scaling triggers
- Performance degradation
- Resource limits enforcement
- OOM killer behavior
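To judge the “auto-scaling triggers” observation, it helps to know what the autoscaler is actually configured to do. A minimal HorizontalPodAutoscaler sketch (names and thresholds assumed) that should scale the worker Deployment out while the StressChaos above holds CPU at 80%:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker               # assumed Deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before CPU saturates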
4. Dependency Failures
Scenario: External service unavailable
Chaos Mesh Example:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: dependency-failure
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  direction: to
  target:
    mode: all
    selector:
      labelSelectors:
        app: payment-service
  duration: "1m"
What to Observe:
- Circuit breaker activation
- Fallback behavior
- Retry policies
- Graceful degradation
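If the observation step reveals there is no circuit breaker at all, a mesh-level one is a common first remedy. A minimal sketch assuming Istio, which ejects payment-service endpoints that keep returning 5xx so callers fall over to their fallback path quickly:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-circuit-breaker
spec:
  host: payment-service.production.svc.cluster.local   # assumed host
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # bound the number of queued requests
    outlierDetection:
      consecutive5xxErrors: 5          # trip after 5 consecutive 5xx responses
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 100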
5. DNS Failures
Scenario: DNS resolution failures
Litmus Example:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-chaos
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: "app"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
6. Clock Skew
Scenario: System time drift
Custom Script:
#!/bin/bash
# Inject clock skew (requires privileged access and GNU date/ntpdate inside the container)
POD=$(kubectl get pod -l app=api -o jsonpath='{.items[0].metadata.name}')

# Set time 5 minutes in the future
kubectl exec "$POD" -- date -s "$(date -d '+5 minutes' --rfc-3339=seconds)"

# Observe for 2 minutes
sleep 120

# Restore correct time
kubectl exec "$POD" -- ntpdate -s time.google.com
What to Observe:
- Token expiration issues
- Timestamp validation
- Logging accuracy
- Certificate validation
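If you already run Chaos Mesh, its TimeChaos type can inject the same drift declaratively instead of shelling into the pod; a minimal sketch with assumed labels:

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-experiment
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  timeOffset: "5m"          # shift the pod's clock 5 minutes into the future
  clockIds:
    - CLOCK_REALTIME
  duration: "2m"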
GameDay Planning
What is a GameDay?
Definition: A GameDay is a scheduled, team-wide chaos engineering event where you simulate a major disaster and practice your response.
Think of it as: A fire drill for your infrastructure. Everyone knows it’s happening, everyone participates, and you learn what works (and what doesn’t) in your incident response.
Why GameDays matter:
A planned chaos engineering event where teams:
Test disaster recovery procedures
- What this means: Your runbook says “if the database fails, promote the replica.” Does that actually work? GameDay finds out.
- Example: During a GameDay, one company discovered their database failover script had a typo that would have caused a 4-hour outage in a real disaster.
Practice incident response
- What this means: When a real outage happens at 3 AM, panicked engineers make mistakes. GameDays let you practice the response when you’re calm and prepared.
- Example: Teams learn who does what, how to communicate in Slack, when to escalate to management.
Validate runbooks
- What this means: Is your documentation actually correct and complete? Or does it say “Step 3: Fix the database” without explaining how?
- Example: A GameDay revealed a runbook referenced a server that had been decommissioned 6 months ago.
Build muscle memory
- What this means: The first time you handle a database failover, it takes 2 hours. The tenth time, it takes 10 minutes because you know exactly what to do.
- Example: Netflix runs GameDays monthly, so when real AWS outages happen, their teams execute flawlessly.
GameDay vs Regular Chaos Experiments:
- Regular experiment: Automated, small blast radius, runs weekly (e.g., kill 1 pod)
- GameDay: Manual, larger scope, quarterly event (e.g., entire region failure, whole team participates)
GameDay Template
# GameDay: [Scenario Name]
Date: [YYYY-MM-DD]
Duration: 2-4 hours
## Objectives
1. [Primary objective]
2. [Secondary objective]
3. [Learning goal]
## Participants
- **Incident Commander:** [Name]
- **SRE Team:** [Names]
- **Platform Team:** [Names]
- **Observers:** [Names]
## Scenario
[Description of the failure scenario]
## Timeline
- T-0:00: Baseline established
- T+0:05: Inject failure
- T+0:10: Teams detect and respond
- T+0:30: Mitigation implemented
- T+1:00: Recovery complete
- T+1:30: Retrospective
## Success Criteria
- [ ] Incident detected within 5 minutes
- [ ] Runbook followed correctly
- [ ] Service restored within 30 minutes
- [ ] No data loss
- [ ] Communication protocol followed
## Failure Injection Plan
[Detailed steps]
## Rollback Plan
[Emergency stop procedure]
## Post-GameDay
- [ ] Retrospective scheduled
- [ ] Runbooks updated
- [ ] Gaps identified
- [ ] Follow-up actions assigned
Example GameDay: Database Failover
gameday:
  name: "PostgreSQL Primary Failure"
  date: "2025-11-01"
  duration: "2 hours"
  scenario: |
    The primary PostgreSQL instance fails. Teams must:
    1. Detect the failure
    2. Promote replica to primary
    3. Update connection strings
    4. Verify data integrity
  objectives:
    - Test automated failover
    - Validate runbook accuracy
    - Practice cross-team coordination
    - Measure recovery time
  participants:
    ic: "Alice (SRE)"
    sre: ["Bob", "Carol"]
    platform: ["Dave", "Eve"]
    observers: ["CTO", "Product Lead"]
  timeline:
    "09:00": "Kick-off meeting, review procedures"
    "09:15": "Establish baseline metrics"
    "09:30": "Inject failure: kill primary DB"
    "09:35": "Teams detect and respond"
    "10:00": "Expected: failover complete"
    "10:30": "Verify all services healthy"
    "11:00": "Retrospective and learnings"
  metrics:
    - time_to_detect
    - time_to_failover
    - data_loss_amount
    - services_affected
    - customer_impact
  success_criteria:
    - detection_time: "<5 minutes"
    - failover_time: "<15 minutes"
    - data_loss: "0 transactions"
    - automated_failover: true
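The 09:30 injection step can be scripted rather than performed by hand, which also makes the GameDay repeatable next quarter. A minimal sketch, assuming the primary runs as a pod labelled role=primary in a databases namespace (hypothetical names):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: gameday-postgres-primary-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - databases            # hypothetical namespace
    labelSelectors:
      app: postgres
      role: primary          # hypothetical label identifying the primary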
Tools and Platforms
Chaos Mesh (Kubernetes)
Installation:
# Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
# Verify installation
kubectl get pods -n chaos-mesh
Basic Experiment:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  scheduler:
    cron: "@every 2m"
Dashboard Access:
# Port-forward to Chaos Mesh dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Access at http://localhost:2333
Litmus Chaos (Cloud-Native)
Installation:
# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml
# Install chaos experiments
kubectl apply -f https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/experiments.yaml
ChaosEngine Example:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: "active"
  chaosServiceAccount: litmus
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
Gremlin (Commercial)
Installation:
# Kubernetes installation
helm repo add gremlin https://helm.gremlin.com
helm install gremlin gremlin/gremlin \
--set gremlin.teamID=$GREMLIN_TEAM_ID \
--set gremlin.teamSecret=$GREMLIN_TEAM_SECRET
Attack Example (CLI):
# Shutdown attack on specific container
gremlin attack container shutdown \
--labels "app=api" \
--length 60
# CPU attack
gremlin attack container cpu \
--labels "app=worker" \
--cores 2 \
--length 120
# Network latency attack
gremlin attack container latency \
--labels "app=frontend" \
--delay 300 \
--length 180
Observability During Chaos
Pre-Experiment Checklist
observability_checklist:
  dashboards:
    - name: "Service Health Dashboard"
      url: "grafana/service-health"
      metrics: ["error_rate", "latency", "throughput"]
    - name: "Infrastructure Dashboard"
      url: "grafana/infra"
      metrics: ["cpu", "memory", "network"]
  alerts:
    - verify: "Alerts are configured"
    - test: "Alert routing works"
    - oncall: "On-call engineer available"
  logging:
    - check: "Log aggregation working"
    - access: "Team has log access"
    - retention: "Logs retained for analysis"
  tracing:
    - verify: "Distributed tracing enabled"
    - sample_rate: ">1% of requests"
Monitoring Chaos Experiments
Prometheus Queries:
# Error rate during experiment
sum(rate(http_requests_total{status=~"5.."}[1m]))
/
sum(rate(http_requests_total[1m])) * 100
# Latency percentiles
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[1m])) by (le)
)
# Pod restarts during chaos
sum(kube_pod_container_status_restarts_total{namespace="production"})
Grafana Annotations:
{
  "dashboardUID": "service-health",
  "time": 1697472000000,
  "timeEnd": 1697472300000,
  "tags": ["chaos-experiment", "pod-kill"],
  "text": "EXP-001: Pod Kill Experiment - 10% of backend pods"
}
Safety and Best Practices
Safety Guardrails
safety_guardrails:
  blast_radius:
    - rule: "Never affect >20% of instances"
    - rule: "Start with 1 instance, increase gradually"
  timing:
    - rule: "Run during business hours (with team available)"
    - rule: "Avoid Black Friday, tax season, etc."
  automation:
    - rule: "Automated rollback on error threshold breach"
    - rule: "Maximum experiment duration enforced"
  communication:
    - rule: "Notify team 24 hours in advance"
    - rule: "Announce in #incidents before starting"
    - rule: "Keep stakeholders informed"
  approval:
    - rule: "GameDays require manager approval"
    - rule: "Production chaos requires SRE lead approval"
Rollback Triggers
# Automatic experiment termination
rollback_conditions:
  error_rate:
    threshold: ">1%"
    window: "1m"
    action: "terminate_immediately"
  latency_p95:
    threshold: ">1000ms"
    window: "2m"
    action: "terminate_immediately"
  customer_complaints:
    threshold: ">5"
    window: "5m"
    action: "terminate_and_alert"
  manual:
    command: "kubectl delete -f chaos-experiment.yaml"
    hotkey: "Ctrl+C in terminal"
Communication Protocol
## Pre-Experiment (24 hours before)
**Slack #chaos-engineering:**
> 🧪 **Chaos Experiment Scheduled**
>
> **What:** Pod Kill - Backend Service
> **When:** Tuesday Oct 16, 2 PM PST
> **Duration:** 5 minutes
> **Blast Radius:** 10% of pods
> **Expected Impact:** None (auto-healing)
> **Rollback:** Automated on error >1%
>
> Questions? Reply here or DM @alice
## During Experiment
**Slack #incidents:**
> ⚠️ **CHAOS EXPERIMENT IN PROGRESS**
>
> **Status:** ACTIVE
> **Started:** 2:00 PM PST
> **Expected End:** 2:05 PM PST
> **Dashboard:** [link]
>
> This is a planned experiment. No action required.
## Post-Experiment
**Slack #chaos-engineering:**
> ✅ **Chaos Experiment Complete**
>
> **Results:** SUCCESS
> **Hypothesis:** Confirmed
> **Findings:** Auto-healing worked as expected
> **Learnings:** [link to doc]
> **Next:** Increase blast radius to 20%
Measuring Success
Chaos Engineering KPIs
kpis:
  experiment_velocity:
    metric: "Experiments per month"
    current: 4
    target: 12
    trend: "increasing"
  coverage:
    metric: "% of services with chaos tests"
    current: 40%
    target: 80%
  mttr_improvement:
    metric: "Mean time to recovery"
    before_chaos: "45 minutes"
    after_chaos: "15 minutes"
    improvement: "66%"
  incident_reduction:
    metric: "Production incidents per month"
    before: 12
    after: 5
    improvement: "58%"
  confidence_score:
    metric: "Team confidence in system resilience (1-10)"
    before: 5
    after: 8
Experiment Results Template
# Experiment Results: EXP-001
## Hypothesis
System maintains >99.5% availability when 10% of pods fail
## Result
✅ CONFIRMED
## Metrics
| Metric | Baseline | During Chaos | Impact |
|--------|----------|--------------|--------|
| Success Rate | 99.8% | 99.7% | -0.1% ✅ |
| P95 Latency | 280ms | 320ms | +40ms ✅ |
| Pod Restart Time | N/A | 12s | ✅ |
## Observations
### What Worked ✅
- Kubernetes auto-healing restarted pods in <15s
- Service mesh load balancer rerouted traffic immediately
- No customer-facing errors
- Alerts fired and auto-resolved correctly
### What Didn't Work ❌
- Brief latency spike (+40ms) during pod restart
- Grafana dashboard missing pod restart metric
### Surprises 🤔
- One pod failed to restart due to ImagePullBackOff
- Discovered stale image tag in deployment manifest
## Action Items
- [ ] Fix ImagePullBackOff issue (ticket #1234)
- [ ] Add pod restart time to Grafana dashboard
- [ ] Update runbook with observed behavior
- [ ] Schedule follow-up experiment with 20% blast radius
## Confidence Level
Before: 6/10 → After: 8/10
Common Pitfalls
Pitfall 1: Skipping Production
**Problem:** Only testing in staging
**Impact:** Miss real-world failure modes
**Solution:** Start with small blast radius in production
Pitfall 2: No Hypothesis
**Problem:** “Let’s break stuff and see what happens”
**Impact:** No learning, no improvement
**Solution:** Always define expected steady state
Pitfall 3: Too Much Chaos
**Problem:** Testing everything at once
**Impact:** Can’t identify root cause
**Solution:** One variable at a time
Pitfall 4: No Rollback Plan
**Problem:** Experiment goes wrong, no way to stop it
**Impact:** Real incident
**Solution:** Always have automated rollback
Pitfall 5: Chaos Without Observability
**Problem:** Can’t measure impact
**Impact:** Don’t know if experiment succeeded
**Solution:** Monitoring before chaos
Implementation Roadmap
Month 1: Foundation
**Goals:**
- Build chaos engineering awareness
- Set up tooling
- Run first experiments in staging
**Actions:**
- [ ] Install Chaos Mesh/Litmus
- [ ] Create experiment templates
- [ ] Define safety guardrails
- [ ] Run 2-3 staging experiments
- [ ] Document learnings
**Deliverable:** Chaos engineering playbook
Month 2: Production Experiments
**Goals:**
- Move to production with small blast radius
- Build team confidence
- Establish GameDay cadence
**Actions:**
- [ ] Run first production experiment (1% blast radius)
- [ ] Schedule monthly GameDay
- [ ] Create observability dashboards
- [ ] Document runbooks based on learnings
**Deliverable:** First production chaos report
Month 3: Automation
**Goals:**
- Automate recurring experiments
- Expand coverage to more services
- Build self-service capability
**Actions:**
- [ ] Automate top 5 experiments
- [ ] Integrate chaos into CI/CD
- [ ] Enable teams to run their own experiments
- [ ] Create chaos engineering dashboard
**Deliverable:** Automated chaos pipeline
Month 6+: Continuous Chaos
**Goals:**
- Chaos as normal operation
- Continuous resilience validation
- Culture of resilience
**Actions:**
- [ ] 80% of services have chaos tests
- [ ] Weekly automated chaos
- [ ] Monthly GameDays
- [ ] Chaos training for all engineers
**Deliverable:** Resilience-first culture
Conclusion
Chaos Engineering is not about breaking things—it’s about building confidence in your system’s ability to handle failure. Key takeaways:
- Start Small: Single pod, 1% traffic, staging first
- Be Scientific: Hypothesis → Experiment → Learn
- Automate: Manual chaos doesn’t scale
- Observe: You can’t improve what you can’t measure
- Communicate: Transparency builds trust
- Iterate: Increase complexity gradually
- Make it Cultural: Resilience is everyone’s responsibility
Remember: “It’s not a question of if your system will fail, but when. Chaos engineering helps you be ready.”
“Hope is not a strategy. Test your systems before your customers do.”