Incident Description

Time: 2025-08-17 02:00 UTC
Duration: 45 minutes
Impact: Critical - all scheduled tasks stopped

Symptoms

  • DAGs disappeared from Airflow UI
  • Scheduler logs showing import errors
  • Tasks not running on schedule

Timeline

02:00 - Issue Detection

# Monitoring showed no tasks
airflow dags list | wc -l
# Result: 0 (should be ~50)

02:05 - Initial Diagnosis

# Check scheduler status
systemctl status airflow-scheduler

# Check logs
tail -f /var/log/airflow/scheduler.log

02:10 - Root Cause Found

# Error found in logs:
ImportError: No module named 'pandas'
# DAG file imports pandas, but library is missing

Root Cause Analysis

Cause

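A virtual environment update removed the pandas dependency imported by DAG code. Airflow normally records an import error only for the affected file and keeps loading the others, so the loss of all ~50 DAGs indicates that the failing import was reached from every DAG file, most likely through a shared module.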

Trigger

Automatic environment update as part of weekly maintenance.

Solution

Immediate Fix

# Install missing dependency
pip install pandas==1.5.3

# Restart scheduler
systemctl restart airflow-scheduler

# Verify recovery
airflow dags list

Long-term Fix

  1. Added dependencies to requirements.txt
     pandas==1.5.3
     numpy==1.24.3
  2. Improved monitoring
     # Alert on DAG count
     dag_count_alert = dag_count < expected_count
  3. CI/CD pipeline
     • DAG validation before deployment
     • Dependency checking
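
The dependency check in the CI/CD step could be approximated statically, without importing Airflow or executing any DAG code. The following is a minimal sketch, not the pipeline actually deployed: the function names are hypothetical, and it only inspects top-level `import` statements, so imports hidden inside functions in other modules or loaded dynamically would be missed.

```python
import ast
import importlib.util
from pathlib import Path


def imported_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import helpers"
            modules.add(node.module.split(".")[0])
    return modules


def missing_modules(source):
    """Return imported modules that cannot be found in the current environment."""
    return {m for m in imported_modules(source) if importlib.util.find_spec(m) is None}


def check_dag_folder(dag_folder):
    """Map each DAG file to its unresolved imports; an empty dict means none found."""
    problems = {}
    for path in sorted(Path(dag_folder).glob("**/*.py")):
        missing = missing_modules(path.read_text())
        if missing:
            problems[str(path)] = missing
    return problems
```

Run in the same virtual environment the scheduler uses, a non-empty result from `check_dag_folder` would be a reason to fail the build before a broken environment reaches production.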

Lessons Learned

What went wrong

  • Missing dependency pinning
  • Insufficient DAG monitoring
  • A single missing dependency took every DAG offline

Prevention

  • Use requirements.txt
  • Monitor active DAG count
  • DAG isolation on import errors

Monitoring

Key metrics

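-- Count of active DAGs (run against the Airflow metadata database)
SELECT COUNT(*) FROM dag WHERE is_active = TRUE;

# Import errors (Python 3 reports a missing module as ModuleNotFoundError)
grep -E "ImportError|ModuleNotFoundError" /var/log/airflow/scheduler.log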
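
The log check above can also be summarized per missing module, which makes it easier to see whether one dependency or several are involved. This is a rough sketch under the assumption that the scheduler log contains standard Python import error messages; the function name and the regular expression are illustrative, not part of Airflow.

```python
import re
from collections import Counter

# Matches both spellings: ImportError and its Python 3 subclass ModuleNotFoundError
MISSING_MODULE = re.compile(
    r"(?:ImportError|ModuleNotFoundError): No module named '?([\w.]+)'?"
)


def missing_modules_in_log(lines):
    """Count how often each missing module is reported in the given log lines."""
    counts = Counter()
    for line in lines:
        match = MISSING_MODULE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A possible use is to open `/var/log/airflow/scheduler.log` and pass the file object directly to `missing_modules_in_log`, since it accepts any iterable of lines.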

Preventive Measures

1. Dependency Management

# Use virtual environments
python -m venv airflow-env
source airflow-env/bin/activate

# Pin all dependencies (run inside the activated environment)
pip freeze > requirements.txt
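
Pinning only helps if the running environment is compared against the pins after each maintenance update. The sketch below shows one way this could be checked; the helper names are hypothetical, and it handles only plain `name==version` lines (extras, environment markers, and other specifiers are ignored).

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for every 'name==version' line; other lines are skipped."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_drift(pins, lookup=installed_version):
    """Return {package: (pinned, installed)} for each package that is missing or differs."""
    drift = {}
    for name, pinned in pins.items():
        actual = lookup(name)
        if actual != pinned:
            drift[name] = (pinned, actual)
    return drift
```

In this incident, such a check run after the weekly update would have reported pandas as pinned to 1.5.3 but not installed.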

2. DAG Validation

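# Pre-deployment DAG testing
import runpy

def validate_dag(dag_file):
    """Return True if the DAG file executes without raising (e.g. no missing imports)."""
    try:
        # Run the file in a fresh namespace so it cannot touch this module's globals
        runpy.run_path(dag_file)
        return True
    except Exception as e:
        print(f"DAG validation failed for {dag_file}: {e}")
        return False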
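
For a deployment gate, the same idea can be applied to a whole DAG folder, collecting one error per file so that a single broken file does not hide the others. This is a minimal sketch with hypothetical function names. It executes each file, so it should run in an environment that mirrors production; Airflow's own `DagBag` exposes comparable information through its `import_errors` attribute and may be the better choice where Airflow is installed in CI.

```python
import runpy
from pathlib import Path


def collect_import_errors(dag_folder):
    """Execute every .py file under dag_folder and return {path: error message}."""
    errors = {}
    for path in sorted(Path(dag_folder).rglob("*.py")):
        try:
            runpy.run_path(str(path))
        except Exception as exc:  # one broken file must not stop the scan
            errors[str(path)] = f"{type(exc).__name__}: {exc}"
    return errors


def main(argv):
    """Print every failure and return a process exit code (0 = all files loaded)."""
    found = collect_import_errors(argv[0])
    for file_name, message in found.items():
        print(f"{file_name}: {message}")
    return 1 if found else 0
```

A CI job could call `sys.exit(main(sys.argv[1:]))` with the DAG folder as its only argument and block the deployment on a non-zero exit code.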

3. Monitoring Setup

# Prometheus alert
- alert: AirflowDAGsLow
  expr: airflow_dag_total < 45
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Low number of Airflow DAGs detected"
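
The `for: 2m` clause means the alert fires only when the condition holds continuously for two minutes, which filters out brief dips during a scheduler restart. The following sketch models that behaviour in simplified form. It is an illustration of the rule's intent rather than Prometheus's actual evaluation logic, and the function name and sample format are assumptions.

```python
def alert_fires(samples, threshold=45, hold_seconds=120):
    """Return True if the DAG count stays below threshold for hold_seconds.

    samples is a list of (timestamp_seconds, dag_count) pairs in time order.
    """
    breach_started = None
    for timestamp, dag_count in samples:
        if dag_count < threshold:
            if breach_started is None:
                breach_started = timestamp  # condition just became true
            if timestamp - breach_started >= hold_seconds:
                return True
        else:
            breach_started = None  # any healthy sample resets the timer
    return False
```

With a threshold of 45 against roughly 50 expected DAGs, the alert tolerates a few DAGs being removed but would have caught the drop to zero within minutes of 02:00.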

4. Recovery Procedures

#!/bin/bash
# Auto-recovery script
# Note: wc -l also counts any header lines printed by the CLI
DAG_COUNT=$(airflow dags list | wc -l)
if [ "$DAG_COUNT" -lt 10 ]; then
    systemctl restart airflow-scheduler
    sleep 30
    # Send alert if still broken
fi
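
Counting lines of table output is fragile because headers and separators inflate the number. A more robust count could parse the CLI's JSON output instead. The sketch below assumes Airflow 2.x, where `airflow dags list --output json` is available, and assumes nothing else is written to stdout. The function names and the sample payload in the usage note are illustrative.

```python
import json
import subprocess


def count_dags(json_text):
    """Count DAG entries in the JSON printed by `airflow dags list --output json`."""
    return len(json.loads(json_text))


def needs_restart(dag_count, minimum=10):
    """Mirror the script's rule: restart only when fewer than `minimum` DAGs load."""
    return dag_count < minimum


def current_dag_count():
    """Ask the Airflow CLI for the DAG list; assumes Airflow 2.x is on PATH."""
    result = subprocess.run(
        ["airflow", "dags", "list", "--output", "json"],
        capture_output=True, text=True, check=True,
    )
    return count_dags(result.stdout)
```

For example, `count_dags('[{"dag_id": "a"}, {"dag_id": "b"}]')` returns 2. Note that a restart alone would not have resolved this incident, since the missing dependency had to be reinstalled first, so the follow-up alert remains essential.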