Active Incident | Medium

Database Connection Pool Exhausted

API traffic is failing because database connection demand exceeded capacity after a new batch job rollout.

At 09:15 UTC, the API starts returning 503s. Logs show: "too many connections" and "connection pool timeout after 30s". The DB has max_connections=100. Your app has 8 pods, each with a connection pool of size 20. A new batch job was deployed last night. The confusing part: DB CPU is normal, cache hit rate is normal, and one engineer insists this looks like an application memory issue instead of a database saturation problem. You do not yet know whether the batch job is the trigger or only making a pre-existing misconfiguration visible.
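The pool figures above can be checked with simple arithmetic. The sketch below uses only the numbers stated in the brief (max_connections=100, 8 API pods, pool size 20, 4 batch pods). The batch job's per-pod pool size is not given anywhere in the incident data, so the values tried for it are hypothetical placeholders, not measurements.

```python
# Figures taken from the incident brief.
DB_MAX_CONNECTIONS = 100
API_PODS = 8
API_POOL_SIZE = 20
BATCH_PODS = 4


def max_connection_demand(batch_pool_size: int) -> int:
    """Worst-case connection count if every pool fills completely."""
    return API_PODS * API_POOL_SIZE + BATCH_PODS * batch_pool_size


# API pods alone, ignoring the batch job entirely.
api_only = max_connection_demand(batch_pool_size=0)
print(f"API only: {api_only} (limit {DB_MAX_CONNECTIONS})")  # API only: 160 (limit 100)

# Hypothetical batch pool sizes; the real value is unknown.
for assumed_size in (1, 5, 10):
    total = max_connection_demand(assumed_size)
    print(f"batch pool {assumed_size}: {total} potential connections")
```

Under these numbers the API pods alone can request 160 connections against a limit of 100, so the configuration was oversubscribed before the batch job existed. That is consistent with the batch job exposing a pre-existing misconfiguration, but it does not rule out the job as the trigger.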

  • 22:00 UTC (previous day): New batch job deploy completes.
  • 09:15 UTC: API starts returning 503s.
  • 09:18 UTC: Sales asks whether customer data is at risk or only delayed.
  • 09:19 UTC: Another engineer suggests increasing max_connections immediately because the database host is not CPU-bound.
  • Sales: Customers cannot access the dashboard. Is this a full outage?
  • Finance: The new batch job feeds a 10 AM report. Can we keep it running?
  • Platform engineer: We can bump max_connections right now, but I’m not convinced that fixes the underlying pressure.
  • Database CPU looks normal, so the issue is easy to misdiagnose.
  • The batch job powers morning finance exports that leadership is expecting.
  • Scaling the database or restarting it would take longer than the current customer-facing error budget allows.
API error rate: 67% (503s)
DB connections: 100/100 (maxed)
API pod count: 8
Batch job pods: 4 (new, deployed 22:00 UTC)
Avg query time: 28s (↑ from 12ms)
CPU (DB host): 31% (normal)
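One way to read these metrics together is through Little's law: with a fixed number of connections, sustainable throughput is roughly connections divided by the time each query holds a connection. The sketch below is a rough estimate only. It assumes each query occupies exactly one connection for its full duration, and that the reported average query time reflects how long a connection is held (it may also include time spent waiting for the pool, which the brief does not specify).

```python
# Figures taken from the metrics above.
DB_CONNECTIONS = 100
BASELINE_QUERY_SECONDS = 0.012  # 12 ms
INCIDENT_QUERY_SECONDS = 28.0   # 28 s


def max_throughput(connections: int, avg_query_seconds: float) -> float:
    """Little's law upper bound: queries per second the connections can serve."""
    return connections / avg_query_seconds


baseline = max_throughput(DB_CONNECTIONS, BASELINE_QUERY_SECONDS)
incident = max_throughput(DB_CONNECTIONS, INCIDENT_QUERY_SECONDS)

print(f"baseline ceiling: {baseline:.0f} queries/s")
print(f"incident ceiling: {incident:.1f} queries/s")
print(f"capacity reduced by a factor of about {baseline / incident:.0f}")
```

If the assumptions hold, the same 100 connections go from serving thousands of queries per second to fewer than four, which would explain a saturated pool alongside normal DB CPU: connections are being held, not worked. The estimate says nothing about why they are held (long batch queries, lock waits, or something else), so it narrows the diagnosis without settling it.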