Health Check Driven Routing

Route traffic away from unhealthy deployments before users hit errors. Background health checks run on a configurable interval, and any deployment that fails gets removed from the routing pool proactively, not after a user request already failed.

Architecture

(Architecture diagram, described below.)

**Background loop** (every `health_check_interval` seconds): each deployment's `health_check()` is called. Deployment A returns 200 (healthy ✓), Deployment B returns 401 (hard failure ✗), Deployment C returns 429 (transient ⚡). With `ignore_transient_errors: true`, 429 / 408 results are ignored and not written to the cache. With `allowed_fails_policy`, a 401 increments a counter; once the counter exceeds the threshold, cooldown is triggered.

**Shared state:**

- `DeploymentHealthCache` — A → healthy ✓, B → unhealthy ✗, C → not written (ignored). TTL: `staleness_threshold` × 1.5.
- Cooldown cache — B → cooling down (after the policy threshold is crossed). TTL: `cooldown_time`.
- `failed_calls` counter — B: 2 / `AuthAllowedFails: 1` → threshold exceeded. TTL: `cooldown_time` (must be greater than the interval).

**Request path:** an incoming request starts with all deployments [A, B, C], then passes through ① the health check filter (bypassed if a policy is set, otherwise unhealthy deployments are removed), ② the cooldown filter (deployments in cooldown are removed; safety net: if all were removed, return all), and ③ the load balancer, which selects Deployment A ✓.

What problem does this solve?

By default, LiteLLM routes traffic to all deployments and only stops sending to a broken one after it has already failed a user request. The cooldown system is reactive.

Health check driven routing makes this proactive: a background loop pings every deployment on a configurable interval. If a deployment fails its health check, it gets removed from the routing pool immediately, before a user request lands on it.

When you also set allowed_fails_policy, you control exactly how many health check failures of each error type (auth errors, rate limits, timeouts) are needed before a deployment enters cooldown. This avoids false positives from transient noise.

Setup

Step 1: Enable background health checks

Background health checks are off by default. Turn them on in general_settings:

general_settings:
  background_health_checks: true
  health_check_interval: 60 # seconds between each full check cycle
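Conceptually, this setting starts a background loop that probes every deployment on a fixed interval and records the result. A minimal sketch of that idea (the `check_one` probe, `cycles` parameter, and cache shape are illustrative, not LiteLLM internals):

```python
import asyncio

async def check_one(deployment):
    # Hypothetical probe: a real check would call the provider's API
    # and return True only on a successful response.
    return deployment != "broken"

async def health_check_loop(deployments, cache, interval, cycles):
    # A real background loop runs forever; `cycles` bounds it for the demo.
    for _ in range(cycles):
        for d in deployments:
            cache[d] = await check_one(d)
        await asyncio.sleep(interval)

cache = {}
asyncio.run(health_check_loop(["a", "broken"], cache, interval=0, cycles=1))
print(cache)  # {'a': True, 'broken': False}
```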

Step 2: Enable health check routing

general_settings:
  background_health_checks: true
  health_check_interval: 60
  enable_health_check_routing: true # ← route away from unhealthy deployments

At this point, any deployment that fails its health check is immediately excluded from routing until the next check cycle clears it.
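The filtering step can be sketched as follows. This is an illustrative model of the behavior described above, not LiteLLM's actual code; note the safety net that keeps traffic flowing when every deployment would otherwise be removed:

```python
def filter_healthy(deployments, health_cache):
    """health_cache maps deployment id -> True (healthy) / False (unhealthy).
    Deployments with no cached state (e.g. ignored transient results) pass
    through unchanged."""
    healthy = [d for d in deployments if health_cache.get(d, True)]
    # Safety net: if every deployment failed its check, bypass the filter
    # rather than return an empty pool.
    return healthy if healthy else list(deployments)

pool = ["deployment-a", "deployment-b", "deployment-c"]
cache = {"deployment-a": True, "deployment-b": False}  # C never written
print(filter_healthy(pool, cache))  # ['deployment-a', 'deployment-c']
print(filter_healthy(pool, {d: False for d in pool}))  # safety net: all three
```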

Step 3: Add a policy to control how many failures trigger cooldown

Without a policy, the first health check failure marks a deployment as unhealthy. If you want more tolerance (e.g., only act after 2 consecutive auth failures), use allowed_fails_policy:

model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-5
      api_key: os.environ/ANTHROPIC_API_KEY_SECONDARY

general_settings:
  background_health_checks: true
  health_check_interval: 30
  enable_health_check_routing: true

router_settings:
  cooldown_time: 60 # how long a deployment stays in cooldown
  allowed_fails_policy:
    AuthenticationErrorAllowedFails: 1 # cooldown after 2nd auth failure
    TimeoutErrorAllowedFails: 3 # cooldown after 4th timeout

When allowed_fails_policy is set, the binary health check filter is bypassed. Only the cooldown system controls routing exclusion, and it only fires after your configured threshold is crossed.
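The threshold check itself is simple. A hedged sketch of the comparison described above (the policy dict and function names here are illustrative; the real router keys its counters per deployment and error type):

```python
ALLOWED_FAILS = {"AuthenticationError": 1, "TimeoutError": 3}

def should_cooldown(error_type, failed_calls):
    """failed_calls is the counter value *after* incrementing for this
    failure. Cooldown fires once the count exceeds the allowed number,
    i.e. on the (N+1)th failure for AllowedFails = N."""
    allowed = ALLOWED_FAILS.get(error_type)
    if allowed is None:
        return False  # no policy configured for this error type
    return failed_calls > allowed

print(should_cooldown("AuthenticationError", 1))  # False: 1st failure tolerated
print(should_cooldown("AuthenticationError", 2))  # True: 2nd failure trips it
```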

Step 4 (optional): Ignore transient errors

429 (rate limit) and 408 (timeout) from a health check usually mean the deployment is temporarily overloaded, not broken. To prevent these from affecting routing at all:

general_settings:
  background_health_checks: true
  health_check_interval: 30
  enable_health_check_routing: true
  health_check_ignore_transient_errors: true # 429 and 408 never affect routing

With this on, only hard failures (401, 404, 5xx) from health checks contribute to cooldown.
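One way to model this behavior (an assumption based on the description above, not the library's literal code): a transient result is simply never written to the health cache, so whatever state was recorded before stands:

```python
TRANSIENT = {429, 408}  # rate limit and timeout

def record_health_check(cache, deployment, status_code):
    if status_code in TRANSIENT:
        return  # ignored: neither marks unhealthy nor refreshes "healthy"
    cache[deployment] = 200 <= status_code < 300

cache = {}
record_health_check(cache, "b", 401)  # hard failure -> unhealthy
record_health_check(cache, "c", 429)  # transient -> not written at all
print(cache)  # {'b': False}
```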

Full example

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY_SECONDARY

  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY

general_settings:
  background_health_checks: true
  health_check_interval: 30
  enable_health_check_routing: true
  health_check_ignore_transient_errors: true

router_settings:
  cooldown_time: 60
  allowed_fails_policy:
    AuthenticationErrorAllowedFails: 0 # cooldown immediately on auth failure
    TimeoutErrorAllowedFails: 2 # cooldown after 3 timeouts
    RateLimitErrorAllowedFails: 5 # cooldown after 6 rate limits (if not ignoring transients)

Configuration reference

| Setting | Where | Default | Description |
|---|---|---|---|
| `enable_health_check_routing` | `general_settings` | `false` | Route away from deployments that fail health checks |
| `background_health_checks` | `general_settings` | `false` | Must be `true` for health check routing to work |
| `health_check_interval` | `general_settings` | `300` | Seconds between full health check cycles |
| `health_check_staleness_threshold` | `general_settings` | interval × 2 | Seconds before cached health state is ignored |
| `health_check_ignore_transient_errors` | `general_settings` | `false` | Ignore 429 and 408 from health checks; these never affect routing |
| `cooldown_time` | `router_settings` | `5` | Seconds a deployment stays in cooldown after the threshold is crossed |
| `allowed_fails_policy` | `router_settings` | `null` | Per-error-type failure thresholds before cooldown (see below) |

allowed_fails_policy fields

| Field | Error type | HTTP status |
|---|---|---|
| `AuthenticationErrorAllowedFails` | Bad API key | 401 |
| `TimeoutErrorAllowedFails` | Request timeout | 408 |
| `RateLimitErrorAllowedFails` | Rate limit exceeded | 429 |
| `BadRequestErrorAllowedFails` | Malformed request | 400 |
| `ContentPolicyViolationErrorAllowedFails` | Content filtered | 400 |

The value is the number of failures tolerated before cooldown. 0 means cooldown on the first failure. 2 means cooldown on the third.

Things to keep in mind

  • Counter TTL must be longer than the health check interval. allowed_fails_policy works by incrementing a failed_calls counter per deployment. That counter expires after cooldown_time seconds. If cooldown_time is shorter than health_check_interval, the counter resets between every check cycle and failures never accumulate. Set cooldown_time greater than health_check_interval when using allowed_fails_policy.

    router_settings:
      cooldown_time: 60 # must be > health_check_interval (30s here)

    general_settings:
      health_check_interval: 30
  • AllowedFails: N means cooldown on the (N+1)th failure. The counter check is updated_fails > allowed_fails, so 0 triggers on the 1st failure, 1 on the 2nd, 2 on the 3rd.

    | AllowedFails | Cooldown triggers after |
    |---|---|
    | 0 | 1st failure |
    | 1 | 2nd failure |
    | 2 | 3rd failure |
  • Without allowed_fails_policy, the first failure is enough. The first failed health check immediately excludes the deployment from routing. Use allowed_fails_policy when you want tolerance for flaky checks.

  • If all deployments are unhealthy, the filter is bypassed. Traffic keeps flowing rather than returning no deployment at all. Requests will fail, but the router keeps trying.

  • Health check failures and request failures share the same counters. When allowed_fails_policy is set, both sources increment the same failed_calls counter. A deployment at 1 health check failure that then receives 1 failing request will hit the threshold for AllowedFails: 1 and enter cooldown.
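The shared-counter point in the last bullet can be illustrated with a toy model (the counter shape and function name are illustrative; in the real router the counter also expires after `cooldown_time` seconds):

```python
from collections import Counter

failed_calls = Counter()
ALLOWED_FAILS = 1  # e.g. AuthenticationErrorAllowedFails: 1

def record_failure(deployment):
    """Increment the shared counter and report whether cooldown fires.
    Health check failures and failing user requests both land here."""
    failed_calls[deployment] += 1
    return failed_calls[deployment] > ALLOWED_FAILS

print(record_failure("b"))  # False: health check auth failure (1 tolerated)
print(record_failure("b"))  # True: one failing user request pushes it over
```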

Debugging

Run the proxy with --detailed_debug and look for these log lines:

After each health check cycle (written at DEBUG level):

health_check_routing_state_updated healthy=2 unhealthy=1

When a health check failure increments the counter and triggers cooldown (DEBUG level):

checks 'should_run_cooldown_logic'
Attempting to add <deployment_id> to cooldown list

When safety net fires because all deployments are in cooldown:

All deployments in cooldown via health-check routing, bypassing cooldown filter

When safety net fires because all deployments are unhealthy (binary filter, no allowed_fails_policy):

All deployments marked unhealthy by health checks, bypassing health filter