Active Incident | Medium

Database Connection Pool Exhausted

Scenario Brief

API traffic is failing because database connection demand exceeded capacity after a new batch job rollout.

At 09:15 UTC, the API starts returning 503s. Logs show: "too many connections" and "connection pool timeout after 30s". The DB has max_connections=100. Your app has 8 pods, each with a connection pool of size 20. A new batch job was deployed last night. The confusing part: DB CPU is normal, cache hit rate is normal, and one engineer insists this looks like an application memory issue instead of a database saturation problem. You do not yet know whether the batch job is the trigger or only making a pre-existing misconfiguration visible.
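The numbers in the brief can be checked with simple arithmetic before anyone blames the batch job. The sketch below multiplies pod counts by pool sizes and compares the result to max_connections. The API figures come from the scenario. The batch job's pool size is not stated anywhere, so BATCH_POOL_SIZE is a placeholder assumption, not a known value.

```python
# Worst-case connection demand versus the database limit.
MAX_CONNECTIONS = 100   # from the scenario
API_PODS = 8            # from the scenario
API_POOL_SIZE = 20      # from the scenario
BATCH_PODS = 4          # from the telemetry
BATCH_POOL_SIZE = 5     # ASSUMPTION: the scenario does not state this value


def worst_case_demand(pods: int, pool_size: int) -> int:
    """Connections opened if every pod fills its pool completely."""
    return pods * pool_size


api_demand = worst_case_demand(API_PODS, API_POOL_SIZE)
batch_demand = worst_case_demand(BATCH_PODS, BATCH_POOL_SIZE)

print(f"API alone: {api_demand} of {MAX_CONNECTIONS}")
print(f"API plus batch (assumed pool size): {api_demand + batch_demand} of {MAX_CONNECTIONS}")
print(f"API oversubscribed without the batch job: {api_demand > MAX_CONNECTIONS}")
```

With the stated values the API pods alone can request 160 connections against a limit of 100, so the pools were oversubscribed before the batch job existed. That supports the "pre-existing misconfiguration" hypothesis but does not rule out the batch job as the trigger that pushed actual usage to the limit.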

Timeline

22:00 UTC: New batch job deploy completes.
09:15 UTC: API starts returning 503s.
09:18 UTC: Sales asks whether customer data is at risk or only delayed.
09:19 UTC: Another engineer suggests increasing max_connections immediately because the database host is not CPU-bound.

Stakeholder Pressure

Sales: Customers cannot access the dashboard. Is this a full outage?
Finance: The new batch job feeds a 10 AM report. Can we keep it running?
Platform engineer: We can bump max_connections right now, but I’m not convinced that fixes the underlying pressure.

Constraints

Database CPU looks normal, so the issue is easy to misdiagnose.
The batch job powers morning finance exports that leadership is expecting.
Scaling the database or restarting it would take longer than the current customer-facing error budget allows.
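Because the constraints rule out scaling or restarting the database, one option that stays on the application side is to size the pools so the fleet cannot exceed the limit. The sketch below is a minimal budgeting helper under stated assumptions: the function name and the reserved headroom of 20 connections (for the batch job, migrations, and admin sessions) are hypothetical, and whether pool sizes can be changed quickly enough depends on how the pods are configured.

```python
def max_pool_size_per_pod(max_connections: int, pods: int, reserved: int) -> int:
    """Largest per-pod pool size that keeps all pods within the DB limit.

    `reserved` is headroom kept free for other clients (hypothetical value).
    """
    usable = max_connections - reserved
    if pods <= 0:
        raise ValueError("pods must be positive")
    if usable <= 0:
        raise ValueError("reserved headroom leaves no usable connections")
    # Integer division: every pod gets the same size, rounding down.
    return usable // pods


RESERVED = 20  # ASSUMPTION: headroom for the batch job and admin sessions

pool_size = max_pool_size_per_pod(max_connections=100, pods=8, reserved=RESERVED)
print(f"Per-pod pool size that fits the budget: {pool_size}")
```

Under these assumptions each API pod would get a pool of 10 instead of 20. A smaller pool makes requests queue inside the application rather than fail at the database, so it trades 503s for latency and should be weighed against the other options.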

Current Telemetry

API error rate: 67% (503s)
DB connections: 100/100 (maxed)
API pod count: 8
Batch job pods: 4 (new, deployed 22:00 UTC)
Avg query time: 28s (↑ from 12ms)
CPU (DB host): 31% (normal)
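The telemetry looks contradictory (saturated connections, idle CPU) until connection hold time is taken into account. The sketch below applies Little's law (connections in use = throughput × hold time) to estimate how many queries per second 100 connections can sustain. It assumes the reported average query time approximates how long each connection is held, which the scenario does not confirm. The 28s figure sits close to the 30s pool timeout, so it may include time spent waiting rather than executing.

```python
def sustainable_throughput(connections: int, hold_seconds: float) -> float:
    """Queries per second a fixed set of connections can sustain.

    Rearranged Little's law: throughput = connections / hold time.
    """
    if hold_seconds <= 0:
        raise ValueError("hold_seconds must be positive")
    return connections / hold_seconds


CONNECTIONS = 100  # DB limit from the scenario

healthy = sustainable_throughput(CONNECTIONS, 0.012)  # 12 ms baseline
degraded = sustainable_throughput(CONNECTIONS, 28.0)  # 28 s during the incident

print(f"Baseline capacity: about {healthy:.0f} queries/s")
print(f"Incident capacity: about {degraded:.2f} queries/s")
```

On these assumptions capacity falls from roughly 8,300 queries per second to fewer than 4. Connections that are held but mostly waiting consume a slot without consuming CPU, which is consistent with a 31% CPU reading on a database that is refusing new clients.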