Incident Description
Time: 2025-08-17 02:00 UTC
Duration: 45 minutes
Impact: Critical - all scheduled tasks stopped
Symptoms
- DAGs disappeared from Airflow UI
- Scheduler logs showing import errors
- Tasks not running on schedule
Timeline
02:00 - Issue Detection
# Monitoring showed no tasks
airflow dags list | wc -l
# Result: 0 (should be ~50)
02:05 - Initial Diagnosis
# Check scheduler status
systemctl status airflow-scheduler
# Check logs
tail -f /var/log/airflow/scheduler.log
02:10 - Root Cause Found
# Error found in logs:
ImportError: No module named 'pandas'
# DAG file imports pandas, but library is missing
Root Cause Analysis
Cause
Virtual environment update removed the pandas
dependency used in one of the DAG files. Airflow stops loading ALL DAGs when any single DAG file has import errors.
Trigger
Automatic environment update as part of weekly maintenance.
Solution
Immediate Fix
# Install missing dependency
pip install pandas==1.5.3
# Restart scheduler
systemctl restart airflow-scheduler
# Verify recovery
airflow dags list
Long-term Fix
- Added dependencies to requirements.txt
pandas==1.5.3
numpy==1.24.3
- Improved monitoring
# Alert on DAG count
dag_count_alert = dag_count < expected_count
- CI/CD pipeline
- DAG validation before deployment
- Dependency checking
Lessons Learned
What went wrong
- Missing dependency pinning
- Insufficient DAG monitoring
- Single failing DAG affects all
Prevention
- Use requirements.txt
- Monitor active DAG count
- DAG isolation on import errors
Monitoring
Key metrics
# Count of active DAGs
SELECT COUNT(*) FROM dag WHERE is_active=1;
# Import errors
grep "ImportError" /var/log/airflow/scheduler.log
Preventive Measures
1. Dependency Management
# Pin all dependencies
pip freeze > requirements.txt
# Use virtual environments
python -m venv airflow-env
source airflow-env/bin/activate
2. DAG Validation
# Pre-deployment DAG testing
def validate_dag(dag_file):
try:
exec(open(dag_file).read())
return True
except Exception as e:
print(f"DAG validation failed: {e}")
return False
3. Monitoring Setup
# Prometheus alert
- alert: AirflowDAGsLow
expr: airflow_dag_total < 45
for: 2m
labels:
severity: critical
annotations:
summary: "Low number of Airflow DAGs detected"
4. Recovery Procedures
#!/bin/bash
# Auto-recovery script
DAG_COUNT=$(airflow dags list | wc -l)
if [ $DAG_COUNT -lt 10 ]; then
systemctl restart airflow-scheduler
sleep 30
# Send alert if still broken
fi