Scenario Brief
API traffic is failing because database connection demand exceeded capacity after a new batch job rollout.
At 09:15 UTC, the API starts returning 503s. Logs show "too many connections" and "connection pool timeout after 30s". The DB has max_connections=100. Your app has 8 pods, each with a connection pool of size 20, so the API pods alone can request up to 160 connections (8 × 20) against a limit of 100. A new batch job was deployed last night. The confusing part: DB CPU is normal, cache hit rate is normal, and one engineer insists this looks like an application memory issue rather than database connection saturation. You do not yet know whether the batch job is the trigger or is only exposing a pre-existing misconfiguration.
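The pool arithmetic in the brief can be checked directly. The following is a minimal sketch using only the figures stated above (8 API pods, pool size 20, max_connections=100). The function and variable names are illustrative, and the batch job's own pool size is not given in the brief, so it is left out; any connections it holds would only add to the total.

```python
def worst_case_connections(pods: int, pool_size: int) -> int:
    """Upper bound on connections if every pod fills its pool."""
    return pods * pool_size


MAX_CONNECTIONS = 100  # from the brief
API_PODS = 8           # from the brief
API_POOL_SIZE = 20     # from the brief

api_demand = worst_case_connections(API_PODS, API_POOL_SIZE)
overcommit = api_demand - MAX_CONNECTIONS

print(f"API worst-case demand: {api_demand} connections")
print(f"Over the limit by: {overcommit} connections")
```

If this arithmetic holds, the API tier was already able to exceed the database limit before the batch job existed, which is consistent with the possibility that the job exposed an existing misconfiguration rather than created one.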
Timeline
- 22:00 UTC: New batch job deploy completes.
- 09:15 UTC: API starts returning 503s.
- 09:18 UTC: Sales asks whether customer data is at risk or only delayed.
- 09:19 UTC: Another engineer suggests increasing max_connections immediately because the database host is not CPU-bound.
Stakeholder Pressure
- Sales: Customers cannot access the dashboard. Is this a full outage?
- Finance: The new batch job feeds a 10 AM report. Can we keep it running?
- Platform engineer: We can bump max_connections right now, but I’m not convinced that fixes the underlying pressure.
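The platform engineer's concern can be framed as a sizing question: either the connection limit rises to cover every pool, or the pools shrink to fit under the limit. The sketch below works through both directions under stated assumptions. The `reserved` parameter (connections held back for the batch job, admin sessions, and so on) is hypothetical, and the value 20 is a placeholder, not a figure from the brief.

```python
def required_max_connections(pods: int, pool_size: int, reserved: int = 0) -> int:
    """Limit needed so that every pod can fill its pool, plus a reserve."""
    return pods * pool_size + reserved


def max_pool_size_per_pod(max_connections: int, pods: int, reserved: int = 0) -> int:
    """Largest per-pod pool that fits under the limit after the reserve."""
    usable = max_connections - reserved
    if pods <= 0 or usable < 0:
        raise ValueError("pods must be positive and reserve must fit under the limit")
    return usable // pods


# Figures from the brief: 8 API pods, pool size 20, max_connections=100.
# reserved=20 is a hypothetical allowance for the batch job and admin access.
limit_needed = required_max_connections(pods=8, pool_size=20, reserved=20)
pool_cap_no_reserve = max_pool_size_per_pod(max_connections=100, pods=8)
pool_cap_with_reserve = max_pool_size_per_pod(max_connections=100, pods=8, reserved=20)

print(f"Limit needed to cover current pools: {limit_needed}")
print(f"Per-pod pool cap at the current limit: {pool_cap_no_reserve}")
print(f"Per-pod pool cap with the reserve: {pool_cap_with_reserve}")
```

Under these assumptions, raising the limit only helps if the new value covers the full worst-case demand; a smaller bump leaves the same contention in place.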
Constraints
- Database CPU looks normal, so the issue is easy to misdiagnose.
- The batch job powers morning finance exports that leadership is expecting.
- Scaling the database or restarting it would take longer than the current customer-facing error budget allows.
Current Telemetry
- API error rate: 67% (503s)
- DB connections: 100/100 (maxed)
- API pod count: 8
- Batch job pods: 4 (new, deployed 22:00 UTC)
- Avg query time: 28 s (↑ from 12 ms)
- CPU (DB host): 31% (normal)