Dynamic TPM/RPM Allocation

Prevent projects from consuming more than their share of TPM/RPM.

Dynamically allocate TPM/RPM quota across API keys, based on the keys active in that minute. See Code
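Under the hood, the available quota is split across the keys that were active in the current minute, so no single key can monopolize the model's limit. Below is a minimal sketch of that split, assuming an even division across active keys (the helper is illustrative, not the proxy's actual implementation):

def available_tpm_per_key(model_tpm: int, active_keys: int) -> int:
    """Evenly divide the model's TPM across keys active this minute."""
    if active_keys == 0:
        return model_tpm
    return model_tpm // active_keys

# Quick start numbers: 60 TPM shared by 2 active keys -> 30 TPM per key,
# i.e. roughly one 30-token mock response per key per minute.
print(available_tpm_per_key(model_tpm=60, active_keys=2))  # 30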

Quick Start Usage​

  1. Setup config.yaml
config.yaml
model_list:
  - model_name: my-fake-model
    litellm_params:
      model: gpt-3.5-turbo
      api_key: my-fake-key
      mock_response: hello-world
      tpm: 60

litellm_settings:
  callbacks: ["dynamic_rate_limiter_v3"]

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env
  2. Start proxy
litellm --config /path/to/config.yaml
  3. Test it!
test.py
"""
- Run 2 concurrent teams calling same model
- model has 60 TPM
- Mock response returns 30 total tokens / request
- Each team will only be able to make 1 request per minute
"""

import requests
from openai import OpenAI, RateLimitError

def create_key(api_key: str, base_url: str):
    response = requests.post(
        url="{}/key/generate".format(base_url),
        json={},
        headers={
            "Authorization": "Bearer {}".format(api_key)
        }
    )

    _response = response.json()

    return _response["key"]

key_1 = create_key(api_key="sk-1234", base_url="http://0.0.0.0:4000")
key_2 = create_key(api_key="sk-1234", base_url="http://0.0.0.0:4000")

# call proxy with key 1 - works
openai_client_1 = OpenAI(api_key=key_1, base_url="http://0.0.0.0:4000")

response = openai_client_1.chat.completions.with_raw_response.create(
model="my-fake-model", messages=[{"role": "user", "content": "Hello world!"}],
)

print("Headers for call 1 - {}".format(response.headers))
_response = response.parse()
print("Total tokens for call - {}".format(_response.usage.total_tokens))


# call proxy with key 2 - works
openai_client_2 = OpenAI(api_key=key_2, base_url="http://0.0.0.0:4000")

response = openai_client_2.chat.completions.with_raw_response.create(
model="my-fake-model", messages=[{"role": "user", "content": "Hello world!"}],
)

print("Headers for call 2 - {}".format(response.headers))
_response = response.parse()
print("Total tokens for call - {}".format(_response.usage.total_tokens))
# call proxy with key 2 - fails
try:
    openai_client_2.chat.completions.with_raw_response.create(
        model="my-fake-model",
        messages=[{"role": "user", "content": "Hey, how's it going?"}],
    )
    raise Exception("This should have failed!")
except RateLimitError as e:
    print("This was rate limited b/c - {}".format(str(e)))

Expected Response

This was rate limited b/c - Error code: 429 - {'error': {'message': {'error': 'Key=<hashed_token> over available TPM=0. Model TPM=0, Active keys=2'}, 'type': 'None', 'param': 'None', 'code': 429}}
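Why the second call with key 2 fails, using the quick start numbers (a worked example, assuming the even split described above):

model_tpm = 60
active_keys = 2
per_key_tpm = model_tpm // active_keys   # 30 TPM available to each key
tokens_used_by_key_2 = 30                # one mock response already consumed
remaining = per_key_tpm - tokens_used_by_key_2
print(remaining)  # 0 -> the key is "over available TPM=0"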

[BETA] Set Priority / Reserve Quota​

Reserve TPM/RPM capacity for different environments or use cases. This ensures critical production workloads always have guaranteed capacity, while development or lower-priority tasks use the remaining quota.

Use Cases:

  • Production vs Development environments
  • Real-time applications vs batch processing
  • Critical services vs experimental features
tip

Reserving TPM/RPM on keys based on priority is a premium feature. Please get an enterprise license for it.

How Priority Reservation Works​

Priority reservation allocates a percentage of your model's total TPM/RPM to specific priority levels. Keys with higher priority get guaranteed access to their reserved quota first.

Example Scenario:

  • Model has 10 RPM total capacity
  • Priority reservation: {"prod": 0.9, "dev": 0.1}
  • Result: Production keys get 9 RPM guaranteed, Development keys get 1 RPM guaranteed
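A rough sketch of that arithmetic, assuming reserved capacity is simply the model limit multiplied by the priority's fraction (the helper name is illustrative, not the proxy's internal API):

def reserved_rpm(model_rpm: int, priority_reservation: dict, priority: str) -> int:
    """RPM guaranteed to a priority level under its reserved fraction."""
    return int(model_rpm * priority_reservation.get(priority, 0.0))

reservation = {"prod": 0.9, "dev": 0.1}
print(reserved_rpm(10, reservation, "prod"))  # 9
print(reserved_rpm(10, reservation, "dev"))   # 1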

Configuration​

1. Setup config.yaml​

config.yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: "gpt-3.5-turbo"
      api_key: os.environ/OPENAI_API_KEY
      rpm: 10 # Total model capacity

litellm_settings:
  callbacks: ["dynamic_rate_limiter_v3"]
  priority_reservation:
    "prod": 0.9 # 90% reserved for production (9 RPM)
    "dev": 0.1 # 10% reserved for development (1 RPM)

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env

Configuration Details:

priority_reservation: Dict[str, float]

  • Key (str): Priority level name (can be any string like "prod", "dev", "critical", etc.)
  • Value (float): Percentage of total TPM/RPM to reserve (0.0 to 1.0)
  • Note: Values should sum to 1.0 or less
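A quick pre-deploy sanity check for these constraints (a hypothetical helper, not part of LiteLLM):

def validate_priority_reservation(priority_reservation: dict) -> None:
    """Check each fraction is within [0.0, 1.0] and the total does not exceed 1.0."""
    for name, fraction in priority_reservation.items():
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"priority '{name}' has invalid fraction {fraction}")
    total = sum(priority_reservation.values())
    if total > 1.0:
        raise ValueError(f"priority_reservation sums to {total}, expected <= 1.0")

validate_priority_reservation({"prod": 0.9, "dev": 0.1})  # passes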

Start Proxy

litellm --config /path/to/config.yaml

2. Create Keys with Priority Levels​

Production Key:

curl -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{
"metadata": {"priority": "prod"}
}'

Development Key:

curl -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{
"metadata": {"priority": "dev"}
}'

Expected Response for both:

{
  "key": "sk-...",
  "metadata": {"priority": "prod"}, // or "dev"
  ...
}
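If you prefer generating these keys from Python, the create_key helper from the quick start can be extended to pass the priority in metadata (the master key and URL are the example values above; the helper name is ours):

import requests

def create_priority_key(api_key: str, base_url: str, priority: str) -> str:
    """Generate a proxy key whose metadata carries a priority level."""
    response = requests.post(
        url="{}/key/generate".format(base_url),
        json={"metadata": {"priority": priority}},
        headers={"Authorization": "Bearer {}".format(api_key)},
    )
    return response.json()["key"]

prod_key = create_priority_key("sk-1234", "http://0.0.0.0:4000", "prod")
dev_key = create_priority_key("sk-1234", "http://0.0.0.0:4000", "dev")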

3. Test Priority Allocation​

Test Production Key (should get 9 RPM):

curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-prod-key' \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello from prod"}]
}'

Test Development Key (should get 1 RPM):

curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-dev-key' \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello from dev"}]
}'

Expected Behavior​

With the configuration above:

  1. Production keys can make up to 9 requests per minute
  2. Development keys can make up to 1 request per minute
  3. Production requests are never blocked by development usage

Rate Limit Error Example:

{
  "error": {
    "message": "Key=sk-dev-... over available RPM=0. Model RPM=10, Reserved RPM for priority 'dev'=1, Active keys=1",
    "type": "rate_limit_exceeded",
    "code": 429
  }
}
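One way to verify the split end to end is to keep sending requests with each key until the proxy returns a 429 and count the successes. A sketch using the OpenAI client (the sk-prod-key / sk-dev-key placeholders stand for the keys generated above; exact counts depend on where you are in the minute window):

from openai import OpenAI, RateLimitError

def requests_before_429(key: str, base_url: str = "http://0.0.0.0:4000", cap: int = 20) -> int:
    """Send chat requests until the proxy rate limits the key (or cap is hit)."""
    client = OpenAI(api_key=key, base_url=base_url)
    successes = 0
    for _ in range(cap):
        try:
            client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": "ping"}],
            )
            successes += 1
        except RateLimitError:
            break
    return successes

print("prod requests before 429:", requests_before_429("sk-prod-key"))  # expect ~9
print("dev requests before 429:", requests_before_429("sk-dev-key"))    # expect ~1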

Demo Video​

This video walks through setting up dynamic rate limiting with priority reservation, plus Locust tests to validate the behavior.