Introduction
Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. Rather than waiting for failures to happen, chaos engineering proactively injects failures to identify weaknesses before they impact users.
Why does this matter? In modern distributed systems (microservices, cloud infrastructure, containers), failures are inevitable. A network can partition, a server can crash, a database can slow down. Traditional testing can’t predict all the ways these components interact when things go wrong. Chaos engineering fills this gap by deliberately causing failures in a controlled way.
Real-world example: Netflix pioneered chaos engineering with “Chaos Monkey,” a tool that randomly kills production servers. By doing this regularly, Netflix ensured their systems could survive server failures without affecting customers watching movies. When AWS had a major outage in 2011, Netflix stayed online while competitors went down—because they had already tested their resilience.
Core Principle: “The best time to find out how your system fails is before your customers do.”
What is Chaos Engineering?
Definition
Chaos Engineering is a systematic approach to discovering system weaknesses by deliberately introducing failures and observing how the system responds.
Think of it like a fire drill: You don’t wait for a real fire to see if your evacuation plan works. Similarly, you don’t wait for a production outage to discover your system can’t handle a database failover.
Key Characteristics:
Proactive: Test before failures occur naturally
- What this means: Instead of reacting to outages at 3 AM, you intentionally cause failures during business hours when your team is ready to respond. This builds “muscle memory” for incident response.
Controlled: Experiments have clear boundaries and rollback plans
- What this means: You don’t randomly break things. Each experiment has a defined scope (e.g., “kill 10% of pods for 5 minutes”) and an automatic stop mechanism if things go wrong.
Observable: Measure system behavior during experiments
- What this means: Before breaking anything, you define metrics to watch (error rate, latency, throughput). If these metrics degrade beyond acceptable limits, the experiment stops automatically.
Incremental: Start small, increase blast radius gradually
- What this means: Begin by killing one pod in staging, then one pod in production, then 5%, then 10%. Don’t jump straight to “delete the entire database.”
Production-focused: Real-world conditions matter most
- What this means: Staging environments don’t have real traffic patterns, real data volumes, or real failure modes. Testing in production (carefully) gives you confidence that matters.
Chaos Engineering vs Traditional Testing
Understanding the difference:
Traditional Testing answers: “Does my code work as expected?”
- Unit tests check if a function returns the right value
- Integration tests check if two services can talk to each other
- You write tests for scenarios you can imagine
Chaos Engineering answers: “What happens when things I didn’t expect go wrong?”
- What if the network is slow but not completely down?
- What if 3 out of 10 servers crash at the same time?
- What if the database becomes read-only unexpectedly?
Visual comparison:
Traditional Testing:
┌─────────────┐
│   Known     │ ──> Test for expected failures
│   Failures  │     (unit tests, integration tests)
└─────────────┘
Example: "Test that API returns 404 when user doesn't exist"
Chaos Engineering:
┌─────────────┐
│   Unknown   │ ──> Discover unexpected failures
│   Unknowns  │     (what you didn’t think to test)
└─────────────┘
Example: "What happens when the user database is unreachable?"
Detailed Comparison:
| Aspect | Traditional Testing | Chaos Engineering | Example |
|---|---|---|---|
| Scope | Known failure modes | Unknown failure modes | Testing “user not found” vs discovering “what if user DB is slow” |
| Environment | Test/staging | Production (ideally) | Staging has 10 users, production has 10 million |
| Approach | Validate correctness | Discover weaknesses | “Does login work?” vs “Can we handle login when auth service is degraded?” |
| Timing | Before deployment | During production | CI/CD pipeline vs continuous production testing |
| Goal | Prevent bugs | Build resilience | Fix broken code vs survive infrastructure failures |
| Mindset | “It should work” | “What could go wrong?” | Optimistic vs paranoid (in a good way) |
Why you need both: Traditional testing catches bugs in your code. Chaos engineering catches weaknesses in your architecture and assumptions about how systems interact.
Principles of Chaos Engineering
1. Build a Hypothesis Around Steady State
What is steady state? It’s what “normal” looks like for your system—the baseline metrics when everything is working fine.
Why define it? Because during a chaos experiment, you need to know if things are getting worse. Without a baseline, you can’t tell if your experiment is causing problems.
The key insight: Measure business outcomes, not technical internals.
Bad hypothesis (technical):
"CPU should stay under 80%"
Why is this bad? CPU usage is an internal metric. Customers don’t care about CPU—they care about whether their order goes through. High CPU might be fine if the system is still processing orders successfully.
Good hypothesis (business):
"Order completion rate should stay above 99.5%
even when 20% of backend pods are unavailable"
Why is this good? It focuses on what matters to users (orders completing) and defines acceptable degradation (99.5% success rate). This tells you whether customers are affected.
Another example:
Bad: “Memory usage should stay below 8GB”
Good: “API p95 latency should stay below 500ms when 2 out of 5 database replicas fail”
The good hypothesis answers: “Can our customers still use the product when infrastructure fails?”
Steady State Indicators:
service: checkout-service
steady_state_metrics:
  - name: order_success_rate
    threshold: ">99.5%"
    measurement: "successful_orders / total_orders"
  - name: p95_latency
    threshold: "<500ms"
    measurement: "95th percentile checkout time"
  - name: payment_success
    threshold: ">99%"
    measurement: "successful_payments / total_attempts"
2. Vary Real-World Events
What does this mean? Inject failures that could actually happen in production—not theoretical edge cases that will never occur.
The goal: Simulate real disasters you’ve seen before (or that your competitors have experienced).
Common Real-World Failures (Explained):
Network latency/partition
- What it is: Network gets slow (latency) or completely cut off (partition) between services
- Real example: AWS availability zone loses connectivity to another zone
- Why test it: Your microservices might timeout or retry indefinitely, cascading the failure
Pod/container crashes
- What it is: Application container dies unexpectedly
- Real example: Out-of-memory (OOM) killer terminates your process, or a bug causes a panic
- Why test it: Verify Kubernetes restarts pods automatically and load balancers remove unhealthy instances
Resource exhaustion (CPU, memory, disk)
- What it is: System runs out of a critical resource
- Real example: Sudden traffic spike maxes out CPU, or logs fill up the disk
- Why test it: Check if auto-scaling works and if your app degrades gracefully
DNS failures
- What it is: DNS lookups fail or return wrong results
- Real example: DNS server becomes unreachable or cache expires
- Why test it: Many apps don’t handle DNS failures well, causing cascading failures
Cloud provider outages
- What it is: Entire AWS region or Google Cloud zone goes down
- Real example: AWS us-east-1 outage (happens every year)
- Why test it: Verify your multi-region failover actually works
Dependency failures
- What it is: External service (payment gateway, auth provider, database) becomes unavailable
- Real example: Stripe API returns 500 errors
- Why test it: Check if your app can degrade gracefully (e.g., queue orders for later)
Corrupt data
- What it is: Database contains bad data that crashes your app
- Real example: Migration bug writes NULL where code expects a value
- Why test it: Verify input validation and error handling
Clock skew
- What it is: Server system time drifts from actual time
- Real example: NTP sync fails, server thinks it’s 5 minutes in the future
- Why test it: Tokens expire early, logs have wrong timestamps, distributed systems get confused
Certificate expiration
- What it is: TLS/SSL certificate expires
- Real example: Let’s Encrypt cert renewal fails
- Why test it: Many services go down completely when certs expire (and auto-renewal might not work)
3. Run Experiments in Production
Wait, production? Isn’t that dangerous? Yes, if done carelessly. But it’s the only way to get real confidence.
Why staging isn’t enough:
Staging environments don’t have the same:
Traffic patterns
- Staging: 10 test users clicking buttons
- Production: 10,000 real users with unpredictable behavior
- Why it matters: Load balancing and caching behave totally differently at scale
Data volume
- Staging: 1,000 database rows
- Production: 10 million rows
- Why it matters: Queries that are fast in staging become slow in production, breaking timeouts
Service dependencies
- Staging: Mocked payment gateway, fake email service
- Production: Real Stripe API, real SendGrid
- Why it matters: External service failures (rate limits, timeouts) don’t happen in staging
Infrastructure scale
- Staging: 3 small servers
- Production: 50 large servers across 3 regions
- Why it matters: Network topology, failure domains, and scaling behavior are completely different
Real failure modes
- Staging: Clean environment, recently deployed
- Production: Months of accumulated state, edge cases, memory leaks
- Why it matters: Production has bugs and conditions that staging will never reproduce
The Netflix example: Testing in production let Netflix find and fix issues before customers were affected; when they had relied only on staging, real AWS failures still caught them off guard.
Making production testing safe: Start with tiny blast radius (1% of traffic, 1 pod) and automate rollback if metrics degrade.
Production Experiment Safety:
experiment:
  name: pod-failure-test
  environment: production
  safety_measures:
    - blast_radius: "5% of pods"
    - rollback_trigger: "error_rate > 1%"
    - time_limit: "5 minutes"
    - hours: "business_hours_only"
    - monitoring: "active_observation_required"
    - communication: "team_notified_in_advance"
4. Automate Experiments
Why automate? Because manually breaking things every week is:
- Time-consuming (toil)
- Inconsistent (humans forget steps)
- Not scalable (what about 50 services?)
- Easy to skip (“we’re too busy this week”)
Automation benefits:
Consistency
- Manual: “Did we test pod failure this week? I think Bob did it… or was that last week?”
- Automated: Same experiment runs every Monday at 2 AM, logs results, alerts if it fails
Repeatability
- Manual: Each engineer runs the experiment slightly differently
- Automated: Exact same steps, same blast radius, same rollback triggers every time
Continuous validation
- Manual: Test once per quarter (when you remember)
- Automated: Test every week automatically, catch regressions immediately
Reduced toil
- Manual: 30 minutes of engineer time per experiment
- Automated: Set it and forget it, only investigate when tests fail
Real-world example: Chaos Monkey runs automatically at Netflix, randomly killing servers. Engineers don’t manually pick servers to kill—automation does it continuously, ensuring systems stay resilient as code changes.
5. Minimize Blast Radius
What is blast radius? The percentage of your system (or users) affected by the chaos experiment.
Why minimize it? Because if your experiment goes wrong, you want to affect as few customers as possible.
The golden rule: Start tiny, increase slowly.
Blast Radius Progression (Explained):
Week 1: 1% of traffic, 1 pod, staging environment
↓ Why: Zero customer risk, verify the experiment works
↓ Outcome: Confirmed the chaos tool works, monitoring alerts fire
Week 2: 1% of traffic, 1 pod, production (off-peak)
↓ Why: Tiny production impact (1% of users), during low-traffic hours
↓ Outcome: Found issue: pod restart took 30 seconds (too slow), need to fix
Week 3: 5% of traffic, 5 pods, production (off-peak)
↓ Why: Increased blast radius after fixing Week 2 issue
↓ Outcome: System handled it well, confidence growing
Week 4: 10% of traffic, production (business hours)
↓ Why: Test during real traffic to see if systems can handle it
↓ Outcome: Success! No customer-facing errors, ready to increase to 20%
What if Week 2 failed? You don’t proceed to Week 3. You fix the issue first, then re-run Week 2 until it passes.
Real-world disaster: A company skipped this progression, jumped straight to “delete 50% of database replicas in production,” and caused a major outage. Don’t be that company.
Remember: It’s not a race. Slow and steady builds confidence without risking customer trust.
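With a tool like Chaos Mesh, widening the blast radius is usually a one-field change, which keeps the progression disciplined. A minimal sketch of a Week 2-style experiment (namespace and labels assumed); later weeks only change `mode` and `value`:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: week2-single-pod-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one              # Week 2: exactly one pod
  # mode: fixed-percent  # Later weeks: switch to a percentage...
  # value: "5"           # ...and raise it gradually (5, 10, 20)
  selector:
    namespaces:
      - production       # Week 1 would point at staging instead
    labelSelectors:
      app: order-service # assumed label
  duration: "5m"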
Chaos Engineering Maturity Model
Level 1: Ad-Hoc Chaos
Characteristics:
- Manual experiments
- No formal process
- Reactive to incidents
- Limited tooling
Example:
# "Let's see what happens if I kill this pod"
kubectl delete pod api-server-xyz
Level 2: Planned Experiments
Characteristics:
- Documented experiments
- GameDay events
- Hypothesis-driven
- Team coordination
Example:
# chaos-experiment.yaml
experiment:
  name: "Q4-2025-GameDay-Zone-Failure"
  date: "2025-12-15"
  hypothesis: "System remains available during AZ failure"
  participants: ["SRE", "Platform", "Backend"]
  rollback_plan: "documented"
Level 3: Automated Continuous Chaos
Characteristics:
- Automated experiments
- Integrated into CI/CD
- Continuous validation
- Self-service for teams
Example:
# Automated weekly chaos
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-chaos
spec:
  schedule: "0 2 * * 1"  # 2 AM every Monday
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: backend
Level 4: Chaos as a Service
Characteristics:
- Platform for all teams
- Advanced failure scenarios
- Automated remediation testing
- Culture of resilience
Designing Chaos Experiments
Experiment Template
# Experiment: [Name]
## Hypothesis
**Steady State:** [What is normal?]
**Disruption:** [What will we break?]
**Expected Outcome:** [System should remain in steady state]
## Scope
- **Service:** [Target service]
- **Blast Radius:** [Percentage/count of instances]
- **Duration:** [How long]
- **Environment:** [Production/staging]
## Preconditions
- [ ] Monitoring dashboards ready
- [ ] Team notified
- [ ] Rollback plan documented
- [ ] Off-hours/on-hours decision made
- [ ] Incident response team on standby
## Experiment Steps
1. [Establish baseline]
2. [Inject failure]
3. [Observe behavior]
4. [Measure metrics]
5. [Rollback]
6. [Analyze results]
## Success Criteria
- [ ] Steady state maintained
- [ ] No customer impact
- [ ] Graceful degradation observed
- [ ] Alerts fired appropriately
## Rollback Plan
[How to stop the experiment immediately]
## Results
[Document findings]
Example Experiment: Pod Failure
# chaos-experiment-001.yaml
experiment:
  name: "Backend Pod Failure Test"
  id: "EXP-001"
  date: "2025-10-16"
  hypothesis:
    steady_state: "Order success rate >99.5%, p95 latency <500ms"
    disruption: "Kill 10% of backend pods"
    expected: "Kubernetes auto-healing maintains service levels"
  scope:
    service: "order-service"
    namespace: "production"
    blast_radius: "10% of pods (2 out of 20)"
    duration: "5 minutes"
  preconditions:
    - monitoring: "Grafana dashboard open"
    - notification: "Team notified in #sre-chaos"
    - time: "Tuesday 2 PM PST (low traffic)"
    - oncall: "SRE on standby"
  steps:
    - baseline:
        action: "Observe metrics for 5 minutes"
        metrics: ["order_success_rate", "p95_latency", "error_rate"]
    - inject_failure:
        tool: "chaos-mesh"
        action: "pod-kill"
        target: "2 pods with label app=order-service"
    - observe:
        duration: "5 minutes"
        watch:
          - "Pod restart time"
          - "Service availability"
          - "Error rates"
    - measure:
        compare: "baseline vs during-chaos vs post-chaos"
    - rollback:
        automatic: true
        trigger: "error_rate > 1% OR p95_latency > 1000ms"
  success_criteria:
    - order_success_rate: ">99.5%"
    - p95_latency: "<600ms"
    - customer_complaints: "0"
    - auto_recovery_time: "<30s"
Common Chaos Experiments
1. Pod/Container Failures
Scenario: Random pod crashes
What you’re testing: Does your system handle pod failures gracefully?
Why this matters in production:
- Pods crash all the time: out-of-memory, bugs, node failures
- Kubernetes should automatically restart them
- Your load balancer should stop sending traffic to dead pods
- Users should never notice
What you expect to see:
- Pod crashes
- Kubernetes detects unhealthy pod within 10 seconds
- New pod starts automatically
- Load balancer routes traffic to healthy pods
- Zero customer-facing errors
What you might discover:
- Pod restart takes 60 seconds (too slow → optimize startup time)
- Load balancer keeps sending traffic to dead pod for 30 seconds (health check interval too long)
- Application doesn’t handle graceful shutdown (loses in-flight requests)
Chaos Mesh Example:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-experiment
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: backend
  duration: "30s"
  scheduler:
    cron: "@every 10m"
What to Observe:
- Pod restart time
- Service availability during restart
- Load balancer behavior
- Alert notifications
Expected Resilience:
✓ Kubernetes restarts pods automatically
✓ Service remains available (other pods handle traffic)
✓ No customer-facing errors
✓ Alerts fire and auto-resolve
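Several of the discoveries above (slow restarts, traffic still hitting dead pods, lost in-flight requests) usually trace back to probe and shutdown settings. A minimal Deployment fragment as a sketch, with assumed names, paths, and timings:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 20
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      terminationGracePeriodSeconds: 30   # give in-flight requests time to finish
      containers:
        - name: backend
          image: example.com/backend:1.0  # assumed image
          readinessProbe:                 # remove the pod from load balancing quickly
            httpGet:
              path: /healthz              # assumed health endpoint
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:                  # restart a wedged container
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
          lifecycle:
            preStop:                      # let the load balancer drain before shutdown begins
              exec:
                command: ["sh", "-c", "sleep 5"]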
2. Network Latency
Scenario: Degraded network between services
What you’re testing: Does your system handle slow networks gracefully?
Why this matters in production:
- Networks don’t just fail completely—they often get slow first
- 250ms latency might not seem like much, but it adds up across microservices
- If Service A calls Service B (250ms) which calls Service C (250ms) which calls Service D (250ms), your user waits 750ms+ for a response
What you expect to see:
- Request latency increases slightly (acceptable)
- Timeouts are configured correctly (requests fail fast instead of hanging)
- Circuit breakers open when latency exceeds threshold
- Retries don’t make the problem worse
What you might discover:
- No timeouts configured → requests hang for 60+ seconds
- Retry logic makes it worse (retrying slow requests overwhelms the service)
- Circuit breaker doesn’t exist or is misconfigured
- Downstream latency cascades to all services (everything becomes slow)
Chaos Mesh Example:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-experiment
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-gateway
  delay:
    latency: "250ms"
    correlation: "100"
    jitter: "50ms"
  duration: "2m"
  target:
    mode: all
    selector:
      labelSelectors:
        app: database
What to Observe:
- Request latency impact
- Timeout behavior
- Retry logic
- Circuit breaker activation
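If the experiment shows that nothing in the call path enforces a timeout or bounds retries, the fix often lives in the service mesh or client configuration rather than the chaos tooling. A minimal sketch, assuming Istio and a hypothetical orders service:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-timeouts
spec:
  hosts:
    - orders.production.svc.cluster.local     # assumed service
  http:
    - route:
        - destination:
            host: orders.production.svc.cluster.local
      timeout: 2s          # fail fast instead of hanging for 60+ seconds
      retries:
        attempts: 2        # bounded retries so slow calls are not amplified
        perTryTimeout: 1s
        retryOn: 5xx,reset,connect-failure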
3. Resource Exhaustion
Scenario: CPU/Memory pressure
Chaos Mesh Example:
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-experiment
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: worker
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "3m"
What to Observe:
- Auto-scaling triggers
- Performance degradation
- Resource limits enforcement
- OOM killer behavior
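To judge the “auto-scaling triggers” observation, it helps to know what the autoscaler is actually configured to do. A minimal HorizontalPodAutoscaler sketch (names and thresholds assumed) that should scale the worker Deployment out while the StressChaos above holds CPU at 80%:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker               # assumed Deployment name
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before CPU saturates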
4. Dependency Failures
Scenario: External service unavailable
Chaos Mesh Example:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: dependency-failure
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  direction: to
  target:
    mode: all
    selector:
      labelSelectors:
        app: payment-service
  duration: "1m"
What to Observe:
- Circuit breaker activation
- Fallback behavior
- Retry policies
- Graceful degradation
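If the observation step reveals there is no circuit breaker at all, a mesh-level one is a common first remedy. A minimal sketch assuming Istio, which ejects payment-service endpoints that keep returning 5xx so callers fall over to their fallback path quickly:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-circuit-breaker
spec:
  host: payment-service.production.svc.cluster.local   # assumed host
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # bound the number of queued requests
    outlierDetection:
      consecutive5xxErrors: 5          # trip after 5 consecutive 5xx responses
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 100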
5. DNS Failures
Scenario: DNS resolution failures
Litmus Example:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-chaos
spec:
  engineState: "active"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: "app"
            - name: TOTAL_CHAOS_DURATION
              value: "60"
6. Clock Skew
Scenario: System time drift
Custom Script:
#!/bin/bash
# Inject clock skew (requires privileged access and GNU date/ntpdate inside the container)
POD=$(kubectl get pod -l app=api -o jsonpath='{.items[0].metadata.name}')

# Set time 5 minutes in the future
kubectl exec "$POD" -- date -s "$(date -d '+5 minutes' --rfc-3339=seconds)"

# Observe for 2 minutes
sleep 120

# Restore correct time
kubectl exec "$POD" -- ntpdate -s time.google.com
What to Observe:
- Token expiration issues
- Timestamp validation
- Logging accuracy
- Certificate validation
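If you already run Chaos Mesh, its TimeChaos type can inject the same drift declaratively instead of shelling into the pod; a minimal sketch with assumed labels:

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew-experiment
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  timeOffset: "5m"          # shift the pod's clock 5 minutes into the future
  clockIds:
    - CLOCK_REALTIME
  duration: "2m"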
GameDay Planning
What is a GameDay?
Definition: A GameDay is a scheduled, team-wide chaos engineering event where you simulate a major disaster and practice your response.
Think of it as: A fire drill for your infrastructure. Everyone knows it’s happening, everyone participates, and you learn what works (and what doesn’t) in your incident response.
Why GameDays matter:
A planned chaos engineering event where teams:
Test disaster recovery procedures
- What this means: Your runbook says “if the database fails, promote the replica.” Does that actually work? GameDay finds out.
- Example: During a GameDay, one company discovered their database failover script had a typo that would have caused a 4-hour outage in a real disaster.
Practice incident response
- What this means: When a real outage happens at 3 AM, panicked engineers make mistakes. GameDays let you practice the response when you’re calm and prepared.
- Example: Teams learn who does what, how to communicate in Slack, when to escalate to management.
Validate runbooks
- What this means: Is your documentation actually correct and complete? Or does it say “Step 3: Fix the database” without explaining how?
- Example: A GameDay revealed a runbook referenced a server that had been decommissioned 6 months ago.
Build muscle memory
- What this means: The first time you handle a database failover, it takes 2 hours. The tenth time, it takes 10 minutes because you know exactly what to do.
- Example: Netflix runs GameDays monthly, so when real AWS outages happen, their teams execute flawlessly.
GameDay vs Regular Chaos Experiments:
- Regular experiment: Automated, small blast radius, runs weekly (e.g., kill 1 pod)
- GameDay: Manual, larger scope, quarterly event (e.g., entire region failure, whole team participates)
GameDay Template
# GameDay: [Scenario Name]
Date: [YYYY-MM-DD]
Duration: 2-4 hours
## Objectives
1. [Primary objective]
2. [Secondary objective]
3. [Learning goal]
## Participants
- **Incident Commander:** [Name]
- **SRE Team:** [Names]
- **Platform Team:** [Names]
- **Observers:** [Names]
## Scenario
[Description of the failure scenario]
## Timeline
- T-0:00: Baseline established
- T+0:05: Inject failure
- T+0:10: Teams detect and respond
- T+0:30: Mitigation implemented
- T+1:00: Recovery complete
- T+1:30: Retrospective
## Success Criteria
- [ ] Incident detected within 5 minutes
- [ ] Runbook followed correctly
- [ ] Service restored within 30 minutes
- [ ] No data loss
- [ ] Communication protocol followed
## Failure Injection Plan
[Detailed steps]
## Rollback Plan
[Emergency stop procedure]
## Post-GameDay
- [ ] Retrospective scheduled
- [ ] Runbooks updated
- [ ] Gaps identified
- [ ] Follow-up actions assigned
Example GameDay: Database Failover
gameday:
  name: "PostgreSQL Primary Failure"
  date: "2025-11-01"
  duration: "2 hours"
  scenario: |
    The primary PostgreSQL instance fails. Teams must:
    1. Detect the failure
    2. Promote replica to primary
    3. Update connection strings
    4. Verify data integrity
  objectives:
    - Test automated failover
    - Validate runbook accuracy
    - Practice cross-team coordination
    - Measure recovery time
  participants:
    ic: "Alice (SRE)"
    sre: ["Bob", "Carol"]
    platform: ["Dave", "Eve"]
    observers: ["CTO", "Product Lead"]
  timeline:
    "09:00": "Kick-off meeting, review procedures"
    "09:15": "Establish baseline metrics"
    "09:30": "Inject failure: kill primary DB"
    "09:35": "Teams detect and respond"
    "10:00": "Expected: failover complete"
    "10:30": "Verify all services healthy"
    "11:00": "Retrospective and learnings"
  metrics:
    - time_to_detect
    - time_to_failover
    - data_loss_amount
    - services_affected
    - customer_impact
  success_criteria:
    - detection_time: "<5 minutes"
    - failover_time: "<15 minutes"
    - data_loss: "0 transactions"
    - automated_failover: true
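The 09:30 injection step can be scripted rather than performed by hand, which also makes the GameDay repeatable next quarter. A minimal sketch, assuming the primary runs as a pod labelled role=primary in a databases namespace (hypothetical names):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: gameday-postgres-primary-kill
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - databases            # hypothetical namespace
    labelSelectors:
      app: postgres
      role: primary          # hypothetical label identifying the primary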
Tools and Platforms
Chaos Mesh (Kubernetes)
Installation:
# Install Chaos Mesh
curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
# Verify installation
kubectl get pods -n chaos-mesh
Basic Experiment:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  scheduler:
    cron: "@every 2m"
Dashboard Access:
# Port-forward to Chaos Mesh dashboard
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
# Access at http://localhost:2333
Litmus Chaos (Cloud-Native)
Installation:
# Install Litmus
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml
# Install chaos experiments
kubectl apply -f https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/experiments.yaml
ChaosEngine Example:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: "active"
  chaosServiceAccount: litmus
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
Gremlin (Commercial)
Installation:
# Kubernetes installation
helm repo add gremlin https://helm.gremlin.com
helm install gremlin gremlin/gremlin \
--set gremlin.teamID=$GREMLIN_TEAM_ID \
--set gremlin.teamSecret=$GREMLIN_TEAM_SECRET
Attack Example (CLI):
# Shutdown attack on specific container
gremlin attack container shutdown \
--labels "app=api" \
--length 60
# CPU attack
gremlin attack container cpu \
--labels "app=worker" \
--cores 2 \
--length 120
# Network latency attack
gremlin attack container latency \
--labels "app=frontend" \
--delay 300 \
--length 180
Observability During Chaos
Pre-Experiment Checklist
observability_checklist:
  dashboards:
    - name: "Service Health Dashboard"
      url: "grafana/service-health"
      metrics: ["error_rate", "latency", "throughput"]
    - name: "Infrastructure Dashboard"
      url: "grafana/infra"
      metrics: ["cpu", "memory", "network"]
  alerts:
    - verify: "Alerts are configured"
    - test: "Alert routing works"
    - oncall: "On-call engineer available"
  logging:
    - check: "Log aggregation working"
    - access: "Team has log access"
    - retention: "Logs retained for analysis"
  tracing:
    - verify: "Distributed tracing enabled"
    - sample_rate: ">1% of requests"
Monitoring Chaos Experiments
Prometheus Queries:
# Error rate during experiment
sum(rate(http_requests_total{status=~"5.."}[1m]))
/
sum(rate(http_requests_total[1m])) * 100
# Latency percentiles
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[1m])) by (le)
)
# Pod restarts during chaos
sum(kube_pod_container_status_restarts_total{namespace="production"})
Grafana Annotations:
{
  "dashboardUID": "service-health",
  "time": 1697472000000,
  "timeEnd": 1697472300000,
  "tags": ["chaos-experiment", "pod-kill"],
  "text": "EXP-001: Pod Kill Experiment - 10% of backend pods"
}
Safety and Best Practices
Safety Guardrails
safety_guardrails:
  blast_radius:
    - rule: "Never affect >20% of instances"
    - rule: "Start with 1 instance, increase gradually"
  timing:
    - rule: "Run during business hours (with team available)"
    - rule: "Avoid Black Friday, tax season, etc."
  automation:
    - rule: "Automated rollback on error threshold breach"
    - rule: "Maximum experiment duration enforced"
  communication:
    - rule: "Notify team 24 hours in advance"
    - rule: "Announce in #incidents before starting"
    - rule: "Keep stakeholders informed"
  approval:
    - rule: "GameDays require manager approval"
    - rule: "Production chaos requires SRE lead approval"
Rollback Triggers
# Automatic experiment termination
rollback_conditions:
  error_rate:
    threshold: ">1%"
    window: "1m"
    action: "terminate_immediately"
  latency_p95:
    threshold: ">1000ms"
    window: "2m"
    action: "terminate_immediately"
  customer_complaints:
    threshold: ">5"
    window: "5m"
    action: "terminate_and_alert"
  manual:
    command: "kubectl delete -f chaos-experiment.yaml"
    hotkey: "Ctrl+C in terminal"
Communication Protocol
## Pre-Experiment (24 hours before)
**Slack #chaos-engineering:**
> 🧪 **Chaos Experiment Scheduled**
>
> **What:** Pod Kill - Backend Service
> **When:** Tuesday Oct 16, 2 PM PST
> **Duration:** 5 minutes
> **Blast Radius:** 10% of pods
> **Expected Impact:** None (auto-healing)
> **Rollback:** Automated on error >1%
>
> Questions? Reply here or DM @alice
## During Experiment
**Slack #incidents:**
> ⚠️ **CHAOS EXPERIMENT IN PROGRESS**
>
> **Status:** ACTIVE
> **Started:** 2:00 PM PST
> **Expected End:** 2:05 PM PST
> **Dashboard:** [link]
>
> This is a planned experiment. No action required.
## Post-Experiment
**Slack #chaos-engineering:**
> ✅ **Chaos Experiment Complete**
>
> **Results:** SUCCESS
> **Hypothesis:** Confirmed
> **Findings:** Auto-healing worked as expected
> **Learnings:** [link to doc]
> **Next:** Increase blast radius to 20%
Measuring Success
Chaos Engineering KPIs
kpis:
  experiment_velocity:
    metric: "Experiments per month"
    current: 4
    target: 12
    trend: "increasing"
  coverage:
    metric: "% of services with chaos tests"
    current: 40%
    target: 80%
  mttr_improvement:
    metric: "Mean time to recovery"
    before_chaos: "45 minutes"
    after_chaos: "15 minutes"
    improvement: "66%"
  incident_reduction:
    metric: "Production incidents per month"
    before: 12
    after: 5
    improvement: "58%"
  confidence_score:
    metric: "Team confidence in system resilience (1-10)"
    before: 5
    after: 8
Experiment Results Template
# Experiment Results: EXP-001
## Hypothesis
System maintains >99.5% availability when 10% of pods fail
## Result
✅ CONFIRMED
## Metrics
| Metric | Baseline | During Chaos | Impact |
|--------|----------|--------------|--------|
| Success Rate | 99.8% | 99.7% | -0.1% ✅ |
| P95 Latency | 280ms | 320ms | +40ms ✅ |
| Pod Restart Time | N/A | 12s | ✅ |
## Observations
### What Worked ✅
- Kubernetes auto-healing restarted pods in <15s
- Service mesh load balancer rerouted traffic immediately
- No customer-facing errors
- Alerts fired and auto-resolved correctly
### What Didn't Work ❌
- Brief latency spike (+40ms) during pod restart
- Grafana dashboard missing pod restart metric
### Surprises 🤔
- One pod failed to restart due to ImagePullBackOff
- Discovered stale image tag in deployment manifest
## Action Items
- [ ] Fix ImagePullBackOff issue (ticket #1234)
- [ ] Add pod restart time to Grafana dashboard
- [ ] Update runbook with observed behavior
- [ ] Schedule follow-up experiment with 20% blast radius
## Confidence Level
Before: 6/10 → After: 8/10
Common Pitfalls
Pitfall 1: Skipping Production
**Problem:** Only testing in staging
**Impact:** Miss real-world failure modes
**Solution:** Start with small blast radius in production
Pitfall 2: No Hypothesis
**Problem:** “Let’s break stuff and see what happens”
**Impact:** No learning, no improvement
**Solution:** Always define expected steady state
Pitfall 3: Too Much Chaos
**Problem:** Testing everything at once
**Impact:** Can’t identify root cause
**Solution:** One variable at a time
Pitfall 4: No Rollback Plan
**Problem:** Experiment goes wrong, no way to stop it
**Impact:** Real incident
**Solution:** Always have automated rollback
Pitfall 5: Chaos Without Observability
**Problem:** Can’t measure impact
**Impact:** Don’t know if experiment succeeded
**Solution:** Monitoring before chaos
Implementation Roadmap
Month 1: Foundation
**Goals:**
- Build chaos engineering awareness
- Set up tooling
- Run first experiments in staging
**Actions:**
- [ ] Install Chaos Mesh/Litmus
- [ ] Create experiment templates
- [ ] Define safety guardrails
- [ ] Run 2-3 staging experiments
- [ ] Document learnings
**Deliverable:** Chaos engineering playbook
Month 2: Production Experiments
**Goals:**
- Move to production with small blast radius
- Build team confidence
- Establish GameDay cadence
**Actions:**
- [ ] Run first production experiment (1% blast radius)
- [ ] Schedule monthly GameDay
- [ ] Create observability dashboards
- [ ] Document runbooks based on learnings
**Deliverable:** First production chaos report
Month 3: Automation
**Goals:**
- Automate recurring experiments
- Expand coverage to more services
- Build self-service capability
**Actions:**
- [ ] Automate top 5 experiments
- [ ] Integrate chaos into CI/CD
- [ ] Enable teams to run their own experiments
- [ ] Create chaos engineering dashboard
**Deliverable:** Automated chaos pipeline
Month 6+: Continuous Chaos
**Goals:**
- Chaos as normal operation
- Continuous resilience validation
- Culture of resilience
**Actions:**
- [ ] 80% of services have chaos tests
- [ ] Weekly automated chaos
- [ ] Monthly GameDays
- [ ] Chaos training for all engineers
**Deliverable:** Resilience-first culture
Conclusion
Chaos Engineering is not about breaking things—it’s about building confidence in your system’s ability to handle failure. Key takeaways:
- Start Small: Single pod, 1% traffic, staging first
- Be Scientific: Hypothesis → Experiment → Learn
- Automate: Manual chaos doesn’t scale
- Observe: You can’t improve what you can’t measure
- Communicate: Transparency builds trust
- Iterate: Increase complexity gradually
- Make it Cultural: Resilience is everyone’s responsibility
Remember: “It’s not a question of if your system will fail, but when. Chaos engineering helps you be ready.”
“Hope is not a strategy. Test your systems before your customers do.”