Active Incident | Hard

Alert Storm Masking the Real Issue

Scenario Brief

A flood of downstream alerts is obscuring a single upstream infrastructure failure during a high-pressure on-call window.

At 03:14 UTC, PagerDuty fires 47 alerts in 90 seconds. Your phone won't stop. Every service looks broken: payments, auth, the API gateway, three internal microservices, and the CDN edge nodes. You are the only engineer on call, and the SLA breaches in 20 minutes.

The trap: 46 of these alerts are downstream symptoms, and one is the root cause. The telemetry conflicts: the database is unreachable, but the primary's CPU is idle; the CDN is alarming, but edge traffic is only partially degraded; one team insists DNS is at fault, while another thinks a bad deploy is still rolling out. You do not yet know which signal deserves trust.

Timeline

03:13 UTC: Rack-3 network switch drops offline.
03:14 UTC: 47 alerts fire across dependent services.
03:17 UTC: Incident manager from another team asks whether to declare Sev-1 publicly.
03:18 UTC: Auth team reports their last deploy was 20 minutes ago and suspects they caused the outage.
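
One way to weigh the two leading suspects is to compare how long before the alert storm each event happened. The sketch below is illustrative only: it assumes the auth team's "20 minutes ago" is measured from their 03:18 report and is exact, and all variable names are hypothetical. Timing proximity is suggestive, not proof of cause.

```python
from datetime import datetime, timedelta


def utc(hhmm: str) -> datetime:
    """Parse a clock time from the timeline (the date is irrelevant here)."""
    return datetime.strptime(hhmm, "%H:%M")


# Times taken directly from the timeline.
switch_down = utc("03:13")
alerts_start = utc("03:14")
auth_report = utc("03:18")

# Assumption: "20 minutes ago" is counted back from the 03:18 report.
auth_deploy = auth_report - timedelta(minutes=20)

# Gap between each candidate cause and the first alerts.
deploy_to_alerts = alerts_start - auth_deploy
switch_to_alerts = alerts_start - switch_down

print("Auth deploy at:", auth_deploy.strftime("%H:%M"))
print("Deploy -> alerts gap:", deploy_to_alerts)
print("Switch -> alerts gap:", switch_to_alerts)
```

Under these assumptions the deploy landed around 02:58, 16 minutes before the alerts, while the switch dropped 1 minute before them. That favors checking the switch first, though a slow rollout could still be a contributing factor.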

Stakeholder Pressure

Executive channel: Is this isolated to one service or a company-wide outage?
Support: Enterprise customers are reporting timeouts across multiple products.
Auth team: We had a deploy tonight. Should we roll it back immediately, or will that make the situation worse?

Constraints

You are the only on-call engineer immediately engaged.
Most alerts are downstream noise and can easily consume the whole response window.
You cannot manually inspect every failing system before the public status-page decision is due.

Current Telemetry

Alerts firing: 47 (in last 90 sec)
Payment service: Timeouts, 100% failure
Auth service: 503s, all requests failing
API gateway: Latency 30s+ (timeout)
Database (primary): Connections 0/200 used, UNREACHABLE
Network switch (rack-3): Last seen 03:13 UTC
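
A common way to cut through this kind of alert storm is to walk a dependency map and keep only the unhealthy components whose own dependencies are all healthy. The sketch below is a minimal illustration, not a description of any real tooling: the dependency map is a hypothetical guess at the topology (the scenario does not specify it), the component names are invented labels for the telemetry entries above, and the switch is treated as unhealthy because its heartbeat is stale. The CDN and the three internal microservices are left out because nothing here says what they depend on.

```python
# Hypothetical dependency map: component -> components it depends on.
# This topology is an assumption made for illustration only.
DEPENDS_ON = {
    "api-gateway": ["auth", "payments"],
    "auth": ["database-primary"],
    "payments": ["database-primary"],
    "database-primary": ["rack-3-switch"],
    "rack-3-switch": [],
}

# Components that look unhealthy in the current telemetry.
UNHEALTHY = {
    "api-gateway",
    "auth",
    "payments",
    "database-primary",
    "rack-3-switch",  # assumption: stale "last seen" counts as unhealthy
}


def root_cause_candidates(depends_on, unhealthy):
    """Return unhealthy components with no unhealthy dependency.

    Anything that depends on another unhealthy component is treated as a
    likely downstream symptom and filtered out.
    """
    return sorted(
        component
        for component in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(component, []))
    )


candidates = root_cause_candidates(DEPENDS_ON, UNHEALTHY)
print(candidates)
```

With this assumed map, the only candidate left is the rack-3 switch. That fits the unreachable database with an idle CPU: the primary may be fine and simply cut off. The result is only as good as the dependency map, so treat it as a pointer for where to look first rather than a verdict.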