Active Incident | Hard

Alert Storm Masking the Real Issue

A flood of downstream alerts is obscuring a single upstream infrastructure failure during a high-pressure on-call window.

At 03:14 UTC, PagerDuty fires 47 alerts in 90 seconds. Your phone won't stop. Every service looks broken: payments, auth, the API gateway, three internal microservices, and the CDN edge nodes. You are the only engineer on call. The SLA breaches in 20 minutes.

The trap: 46 of these alerts are downstream symptoms. One is the root cause. Telemetry is conflicting: the database is unreachable, but primary CPU is idle; the CDN is alarming, but edge traffic is only partially degraded; one team insists DNS is at fault while another thinks a bad deploy is still rolling out. You do not yet know which signal deserves trust.

Timeline
  • 03:13 UTC: Rack-3 network switch drops offline.
  • 03:14 UTC: 47 alerts fire across dependent services.
  • 03:17 UTC: Incident manager from another team asks whether to declare Sev-1 publicly.
  • 03:18 UTC: Auth team reports their last deploy was 20 minutes ago and suspects they caused the outage.

Stakeholder messages
  • Executive channel: Is this isolated to one service or a company-wide outage?
  • Support: Enterprise customers are reporting timeouts across multiple products.
  • Auth team: We had a deploy tonight. Should we roll it back immediately, or will that make the situation worse?

Constraints
  • You are the only on-call engineer immediately engaged.
  • Most alerts are downstream noise and can easily consume the whole response window.
  • You cannot manually inspect every failing system before the public status-page decision is due.
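
To make the last constraint concrete, here is a rough time-budget sketch. It assumes the SLA clock started with the first alerts at 03:14 UTC and that a quick manual look costs about two minutes per alert; neither figure is stated in the scenario, and the calendar date is an arbitrary placeholder.

```python
from datetime import datetime, timedelta, timezone

# From the scenario: alerts start at 03:14 UTC and the SLA breaches in 20 minutes.
# The calendar date is an arbitrary placeholder; only the time of day matters.
ALERT_START = datetime(2024, 1, 1, 3, 14, tzinfo=timezone.utc)
SLA_WINDOW = timedelta(minutes=20)
ALERT_COUNT = 47

# Assumption (not from the scenario): a quick manual check takes 2 minutes per alert.
MINUTES_PER_ALERT = 2


def minutes_remaining(now: datetime) -> float:
    """Minutes left before the assumed SLA breach time."""
    return (ALERT_START + SLA_WINDOW - now).total_seconds() / 60


def alerts_inspectable(now: datetime) -> int:
    """How many alerts fit in the remaining window at the assumed pace."""
    return max(0, int(minutes_remaining(now) // MINUTES_PER_ALERT))


# 03:18 UTC is the latest entry in the timeline.
now = datetime(2024, 1, 1, 3, 18, tzinfo=timezone.utc)
print(minutes_remaining(now))            # 16.0
print(alerts_inspectable(now))           # 8
print(ALERT_COUNT * MINUTES_PER_ALERT)   # 94 minutes to inspect everything
```

Under these assumptions you could look at roughly 8 of the 47 alerts before the breach, so working through them one by one is not an option.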
Alerts firing: 47 (in last 90 sec)
Payment service: Timeouts, 100% failure
Auth service: 503s, all requests failing
API gateway: Latency 30s+ (timeout)
Database (primary): Connections 0/200 used, UNREACHABLE
Network switch (rack-3): Last seen 03:13 UTC
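
One way to cut through the noise is to map the unhealthy components onto a dependency graph and keep only those whose own dependencies are healthy. The sketch below is illustrative only: the `DEPENDS_ON` edges and component names are hypothetical, since the scenario does not describe the actual topology, and the CDN and internal microservices are left out because their dependencies are unknown.

```python
# Hypothetical dependency map: component -> components it depends on.
# These edges are assumptions for illustration; the scenario does not state them.
DEPENDS_ON = {
    "api-gateway": ["auth", "payments"],
    "payments": ["database-primary"],
    "auth": ["database-primary"],
    "database-primary": ["rack-3-switch"],
    "rack-3-switch": [],
}

# Components reported unhealthy in the telemetry above.
UNHEALTHY = {"api-gateway", "payments", "auth", "database-primary", "rack-3-switch"}


def root_cause_candidates(depends_on, unhealthy):
    """Return unhealthy components whose own dependencies are all healthy."""
    return sorted(
        node
        for node in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(node, []))
    )


def downstream_of(depends_on, root):
    """Return every component that depends on root, directly or transitively."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for node, deps in depends_on.items():
            if node not in affected and any(d == root or d in affected for d in deps):
                affected.add(node)
                changed = True
    return affected


candidates = root_cause_candidates(DEPENDS_ON, UNHEALTHY)
print(candidates)  # ['rack-3-switch']

# Every other unhealthy component sits downstream of the single candidate.
print(downstream_of(DEPENDS_ON, candidates[0]) == UNHEALTHY - set(candidates))  # True
```

With this assumed topology, the switch is the only unhealthy component with no unhealthy dependency, and everything else is explained by it. A real graph would likely be messier, so treat the output as a hypothesis to confirm first, not as a verdict.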