Scenario Brief
A flood of downstream alerts is obscuring a single upstream infrastructure failure during a high-pressure on-call window.
At 03:14 UTC, PagerDuty fires 47 alerts in 90 seconds. Your phone won't stop. Every service looks broken: payments, auth, the API gateway, three internal microservices, and the CDN edge nodes. You are the only engineer on call, and the SLA will be breached in 20 minutes.
The trap: 46 of these alerts are downstream symptoms, and one is the root cause. The telemetry conflicts: the database is unreachable, but the primary's CPU is idle; the CDN is alarming, but edge traffic is only partially degraded; one team insists DNS is at fault, while another thinks a bad deploy is still rolling out. You do not yet know which signal deserves trust.
Timeline
- 03:13 UTC: Rack-3 network switch drops offline.
- 03:14 UTC: 47 alerts fire across dependent services.
- 03:17 UTC: An incident manager from another team asks whether to declare a Sev-1 publicly.
- 03:18 UTC: Auth team reports that its last deploy was 20 minutes ago and suspects that deploy caused the outage.
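The timeline supports a quick timing check on the two suspected causes. The sketch below is illustrative only: it assumes "20 minutes ago" is measured from the 03:18 report, uses a placeholder calendar date, and introduces hypothetical names (at, minutes_before_burst, lead_times) that are not part of any tool mentioned in the brief.

```python
from datetime import datetime, timedelta

# The timeline gives only clock times; the date here is an arbitrary placeholder.
DAY = "2000-01-01"

def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry into a datetime on the placeholder day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")

switch_offline = at("03:13")
alert_burst = at("03:14")
auth_report = at("03:18")

# Assumption: "20 minutes ago" is relative to the 03:18 report.
auth_deploy = auth_report - timedelta(minutes=20)

def minutes_before_burst(event: datetime) -> int:
    """Whole minutes between an event and the start of the alert burst."""
    return int((alert_burst - event).total_seconds() // 60)

lead_times = {
    "auth deploy": minutes_before_burst(auth_deploy),
    "rack-3 switch offline": minutes_before_burst(switch_offline),
}

# The event closest in time to the burst is a hint, not proof of causation.
closest_precursor = min(lead_times, key=lead_times.get)

for name, minutes in lead_times.items():
    print(f"{name}: {minutes} min before the alert burst")
```

Under these assumptions the deploy landed around 02:58, 16 minutes before the burst, while the switch dropped 1 minute before it. Proximity alone does not clear the deploy, since a slow rollout could still be in progress, but it suggests where to look first.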
Stakeholder Pressure
- Executive channel: Is this isolated to one service or a company-wide outage?
- Support: Enterprise customers are reporting timeouts across multiple products.
- Auth team: We had a deploy tonight. Should we roll it back immediately, or will that make the situation worse?
Constraints
- You are the only on-call engineer immediately engaged.
- Most alerts are downstream noise and can easily consume the whole response window.
- You cannot manually inspect every failing system before the public status-page decision is due.
Current Telemetry
- Alerts firing: 47 (in last 90 sec)
- Payment service: timeouts, 100% failure
- Auth service: 503s, all requests failing
- API gateway: latency 30s+ (timeout)
- Database (primary): UNREACHABLE, connections 0/200 used
- Network switch (rack-3): last seen 03:13 UTC
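One way to cut through the noise is to walk a dependency map and keep only the unhealthy components that have no unhealthy dependency of their own. The sketch below is a minimal illustration, not a description of any real tooling: the dependency edges, component names, and helper functions are all assumptions, since the brief does not state which service depends on which. The CDN is left out because its dependencies are unknown.

```python
# Hypothetical dependency map: each component lists what it depends on.
# These edges are assumptions for illustration; the brief does not state them.
DEPENDS_ON = {
    "payment-service": ["database-primary"],
    "auth-service": ["database-primary"],
    "api-gateway": ["auth-service", "payment-service"],
    "database-primary": ["rack-3-switch"],
    "rack-3-switch": [],
}

# Components showing an unhealthy signal in the telemetry snapshot.
UNHEALTHY = {
    "payment-service",
    "auth-service",
    "api-gateway",
    "database-primary",
    "rack-3-switch",
}

def root_candidates(depends_on, unhealthy):
    """Return unhealthy components that have no unhealthy dependency."""
    return sorted(
        node
        for node in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(node, []))
    )

def downstream_of(root, depends_on):
    """Return every component that transitively depends on root."""
    affected = set()
    frontier = [root]
    while frontier:
        current = frontier.pop()
        for node, deps in depends_on.items():
            if current in deps and node not in affected:
                affected.add(node)
                frontier.append(node)
    return sorted(affected)

candidates = root_candidates(DEPENDS_ON, UNHEALTHY)
explained = downstream_of("rack-3-switch", DEPENDS_ON)

print("Root-cause candidates:", candidates)
print("Failures explained by rack-3-switch:", explained)
```

With this assumed map, the only candidate is the rack-3 switch, and its failure accounts for the database, auth, payment, and gateway symptoms. If the real topology differs, the candidate list changes with it, so the map itself is the first thing to verify.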