Incident Report: Prisma DB Reconnect Blocks the Event Loop and Kills Liveliness

Yuneng Jiang
Senior SWE @ LiteLLM

Date: April 2026
Duration: Multiple incidents across customer deployments before fix landed
Severity: High (surfaced as full proxy outages in Kubernetes)
Status: Resolved

Note: This fix is available starting from the release that contains PR #26225 (merged April 29, 2026).

Summary

When the upstream Postgres database became unreachable, the LiteLLM proxy's Prisma reconnect path called await self.db.disconnect(). Under prisma-client-py, that call invokes a synchronous subprocess.Popen.wait() on the Rust query-engine subprocess. Because wait() blocks the calling thread and never yields to the event loop, the asyncio loop froze for as long as the engine took to shut down: typically 30–120 seconds in production, when the engine was stuck on TCP close operations against the unresponsive database.
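The freeze can be reproduced in isolation. The sketch below is not the proxy's or prisma-client-py's code: the helper names (heartbeat, wait_for_slow_child, blocking_disconnect, offloaded_disconnect) are hypothetical, and a short-lived Python child process stands in for a query engine that is slow to exit. It measures the longest gap between wake-ups of a background task while a "disconnect" waits on the child, first with a direct Popen.wait() and then with the same wait moved to a worker thread.

```python
import asyncio
import subprocess
import sys
import time


async def heartbeat(gaps: list[float], interval: float = 0.05) -> None:
    """Record the wall-clock gap between consecutive wake-ups."""
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


def wait_for_slow_child(seconds: float) -> None:
    # Hypothetical stand-in for an engine subprocess that is slow to exit.
    child = subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"]
    )
    child.wait()  # synchronous: blocks whichever thread calls it


async def blocking_disconnect(seconds: float) -> None:
    # Declared async, but the wait runs on the event loop's own thread.
    wait_for_slow_child(seconds)


async def offloaded_disconnect(seconds: float) -> None:
    # Same wait in a worker thread, so the loop keeps scheduling other tasks.
    await asyncio.to_thread(wait_for_slow_child, seconds)


async def max_gap_during(disconnect, seconds: float) -> float:
    gaps: list[float] = []
    task = asyncio.create_task(heartbeat(gaps))
    await asyncio.sleep(0.1)  # let the heartbeat start ticking
    await disconnect(seconds)
    await asyncio.sleep(0.1)  # let the heartbeat record the gap it saw
    task.cancel()
    return max(gaps)


def run_demo(seconds: float = 0.5) -> tuple[float, float]:
    blocked = asyncio.run(max_gap_during(blocking_disconnect, seconds))
    offloaded = asyncio.run(max_gap_during(offloaded_disconnect, seconds))
    return blocked, offloaded
```

Calling run_demo() returns the two maximum gaps. In the blocking variant the heartbeat cannot wake until the child exits, so its gap is at least as long as the child's lifetime. In the offloaded variant the gap stays close to the heartbeat interval.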

While the loop was frozen, no coroutines ran, including the handler for /health/liveliness. Kubernetes liveness probes timed out and the kubelet SIGKILLed the pod. From the operator's point of view the proxy looked dead, even though the underlying issue was a transient DB outage that the reconnect logic was supposed to ride through.
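The probe failure can be modeled without Kubernetes. In the sketch below, a worker thread plays the role of an external prober with a deadline, a trivial coroutine stands in for the /health/liveliness handler, and time.sleep() stands in for the blocking disconnect. All names and timings are illustrative assumptions, not values from the proxy or from any probe configuration.

```python
import asyncio
import concurrent.futures
import time


async def liveliness() -> str:
    # Hypothetical stand-in for the /health/liveliness handler.
    return "alive"


def probe(loop: asyncio.AbstractEventLoop, timeout: float) -> str:
    # Stand-in for an external prober: submit the handler to the loop
    # from another thread and give up after a deadline.
    future = asyncio.run_coroutine_threadsafe(liveliness(), loop)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()
        return "timeout"


async def serve(freeze_for: float, probe_timeout: float) -> list[str]:
    loop = asyncio.get_running_loop()

    # Probe 1: the loop is idle while awaiting, so the handler runs promptly.
    first = await asyncio.to_thread(probe, loop, probe_timeout)

    # Probe 2: started in a worker thread immediately before the loop blocks.
    pending = loop.run_in_executor(None, probe, loop, probe_timeout)
    time.sleep(freeze_for)  # blocking call: nothing on the loop runs meanwhile
    second = await pending

    return [first, second]


def run_probe_demo() -> list[str]:
    return asyncio.run(serve(freeze_for=1.0, probe_timeout=0.2))
```

With these timings, run_probe_demo() returns one successful probe followed by one timeout. The handler itself is never broken. It simply is not scheduled while the loop's thread is stuck in the blocking call, which is all an external prober can observe.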

Impact: Any customer whose Postgres briefly became unresponsive saw proxy pods get killed and restarted instead of degrading gracefully and reconnecting once the DB came back. Reported externally by FLock and reproduced internally.