Dynamic TPM/RPM Allocation
Prevent projects from consuming too much TPM/RPM.
Dynamically allocate TPM/RPM quota to API keys, based on the number of keys active in that minute.
Quick Start Usage
- Setup config.yaml

```yaml
model_list:
  - model_name: my-fake-model
    litellm_params:
      model: gpt-3.5-turbo
      api_key: my-fake-key
      mock_response: hello-world
      tpm: 60

litellm_settings:
  callbacks: ["dynamic_rate_limiter_v3"]

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env
```
- Start proxy

```bash
litellm --config /path/to/config.yaml
```
- Test it!

```python
"""
- Run 2 concurrent teams calling same model
- model has 60 TPM
- Mock response returns 30 total tokens / request
- Each team will only be able to make 1 request per minute
"""
import requests
from openai import OpenAI, RateLimitError


def create_key(api_key: str, base_url: str):
    response = requests.post(
        url="{}/key/generate".format(base_url),
        json={},
        headers={"Authorization": "Bearer {}".format(api_key)},
    )
    _response = response.json()
    return _response["key"]


key_1 = create_key(api_key="sk-1234", base_url="http://0.0.0.0:4000")
key_2 = create_key(api_key="sk-1234", base_url="http://0.0.0.0:4000")

# call proxy with key 1 - works
openai_client_1 = OpenAI(api_key=key_1, base_url="http://0.0.0.0:4000")

response = openai_client_1.chat.completions.with_raw_response.create(
    model="my-fake-model",
    messages=[{"role": "user", "content": "Hello world!"}],
)

print("Headers for call 1 - {}".format(response.headers))
_response = response.parse()
print("Total tokens for call - {}".format(_response.usage.total_tokens))

# call proxy with key 2 - works
openai_client_2 = OpenAI(api_key=key_2, base_url="http://0.0.0.0:4000")

response = openai_client_2.chat.completions.with_raw_response.create(
    model="my-fake-model",
    messages=[{"role": "user", "content": "Hello world!"}],
)

print("Headers for call 2 - {}".format(response.headers))
_response = response.parse()
print("Total tokens for call - {}".format(_response.usage.total_tokens))

# call proxy with key 2 - fails
try:
    openai_client_2.chat.completions.with_raw_response.create(
        model="my-fake-model",
        messages=[{"role": "user", "content": "Hey, how's it going?"}],
    )
    raise Exception("This should have failed!")
except RateLimitError as e:
    print("This was rate limited b/c - {}".format(str(e)))
```
Expected Response:

```
This was rate limited b/c - Error code: 429 - {'error': {'message': {'error': 'Key=<hashed_token> over available TPM=0. Model TPM=0, Active keys=2'}, 'type': 'None', 'param': 'None', 'code': 429}}
```
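For intuition, here is a small illustrative sketch of the arithmetic behind that 429 (not the proxy's internal code): the model's TPM is split across the keys active in the current minute, and each mock response consumes roughly 30 tokens.

```python
# Illustrative only: how per-key quota shrinks as more keys become active
# in the same minute. Variable names here are ours, not LiteLLM's.
model_tpm = 60
tokens_per_request = 30  # the mock response above returns ~30 total tokens

for active_keys in (1, 2):
    tpm_per_key = model_tpm // active_keys
    requests_per_key = tpm_per_key // tokens_per_request
    print(
        f"active_keys={active_keys}: {tpm_per_key} TPM per key "
        f"-> {requests_per_key} request(s) per minute each"
    )
```

With two active keys, each key's share drops to 30 TPM, so a second 30-token request in the same minute pushes a key over its share and is rejected.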
[BETA] Set Priority / Reserve Quota
Reserve TPM/RPM capacity for different environments or use cases. This ensures critical production workloads always have guaranteed capacity, while development or lower-priority tasks use remaining quota.
Use Cases:
- Production vs Development environments
- Real-time applications vs batch processing
- Critical services vs experimental features
Reserving TPM/RPM on keys based on priority is a premium feature. Please get an enterprise license for it.
How Priority Reservation Works
Priority reservation allocates a percentage of your model's total TPM/RPM to specific priority levels. Keys with higher priority get guaranteed access to their reserved quota first.
Example Scenario:
- Model has 10 RPM total capacity
- Priority reservation: `{"prod": 0.9, "dev": 0.1}`
- Result: Production keys get 9 RPM guaranteed, Development keys get 1 RPM guaranteed
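The arithmetic behind this result, as an illustrative sketch (the active-key counts are hypothetical, and this is not the proxy's internal implementation):

```python
# Illustrative sketch: each priority level gets its reserved share of the
# model's capacity, which is then split across that priority's active keys.
model_rpm = 10
priority_reservation = {"prod": 0.9, "dev": 0.1}
active_keys = {"prod": 3, "dev": 1}  # hypothetical active-key counts this minute

for priority, share in priority_reservation.items():
    reserved_rpm = int(model_rpm * share)  # prod -> 9 RPM, dev -> 1 RPM
    per_key_rpm = reserved_rpm // max(active_keys[priority], 1)
    print(f"{priority}: reserved={reserved_rpm} RPM, per active key={per_key_rpm} RPM")
```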
Configuration
1. Setup config.yaml

```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: "gpt-3.5-turbo"
      api_key: os.environ/OPENAI_API_KEY
      rpm: 10 # Total model capacity

litellm_settings:
  callbacks: ["dynamic_rate_limiter_v3"]
  priority_reservation:
    "prod": 0.9 # 90% reserved for production (9 RPM)
    "dev": 0.1  # 10% reserved for development (1 RPM)

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env
```
Configuration Details:
- `priority_reservation`: Dict[str, float]
  - Key (str): Priority level name (any string, e.g. "prod", "dev", "critical")
  - Value (float): Fraction of total TPM/RPM to reserve (0.0 to 1.0)
  - Note: Values should sum to 1.0 or less
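As a quick sanity check on that note before deploying, you can validate the shares yourself. This is an illustrative client-side check, not a proxy feature:

```python
# Illustrative pre-flight check: reservation shares should not exceed
# 100% of the model's capacity.
priority_reservation = {"prod": 0.9, "dev": 0.1}

total_share = sum(priority_reservation.values())
assert total_share <= 1.0, f"priority_reservation sums to {total_share}, which exceeds 1.0"
```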
Start Proxy

```bash
litellm --config /path/to/config.yaml
```
2. Create Keys with Priority Levels

Production Key:

```bash
curl -X POST 'http://0.0.0.0:4000/key/generate' \
  -H 'Authorization: Bearer sk-1234' \
  -H 'Content-Type: application/json' \
  -d '{
    "metadata": {"priority": "prod"}
  }'
```

Development Key:

```bash
curl -X POST 'http://0.0.0.0:4000/key/generate' \
  -H 'Authorization: Bearer sk-1234' \
  -H 'Content-Type: application/json' \
  -d '{
    "metadata": {"priority": "dev"}
  }'
```
Expected Response for both:

```json
{
  "key": "sk-...",
  "metadata": {"priority": "prod"}, // or "dev"
  ...
}
```
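If you prefer to script key creation instead of using curl, the same /key/generate payload can be sent with requests. This is a sketch; the helper name below is ours, not a LiteLLM SDK function:

```python
import requests


def create_key_with_priority(api_key: str, base_url: str, priority: str) -> str:
    # Same /key/generate call as the curl examples above, with the priority
    # attached via key metadata.
    response = requests.post(
        url="{}/key/generate".format(base_url),
        json={"metadata": {"priority": priority}},
        headers={"Authorization": "Bearer {}".format(api_key)},
    )
    response.raise_for_status()
    return response.json()["key"]


prod_key = create_key_with_priority("sk-1234", "http://0.0.0.0:4000", "prod")
dev_key = create_key_with_priority("sk-1234", "http://0.0.0.0:4000", "dev")
```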
3. Test Priority Allocation

Test Production Key (should get 9 RPM):

```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-prod-key' \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello from prod"}]
  }'
```

Test Development Key (should get 1 RPM):

```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-dev-key' \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello from dev"}]
  }'
```
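To verify the limits programmatically, you can mirror the quick-start script: send two back-to-back requests with the dev key and expect the second one to be rate limited. This is a sketch; the exact error text may differ from the example shown below.

```python
from openai import OpenAI, RateLimitError

dev_key = "sk-..."  # paste the dev-priority key generated in step 2

dev_client = OpenAI(api_key=dev_key, base_url="http://0.0.0.0:4000")

# First request should succeed - it consumes the single RPM reserved for "dev".
dev_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello from dev"}],
)

# A second request in the same minute should be rejected with a 429.
try:
    dev_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello again from dev"}],
    )
    raise Exception("This should have been rate limited!")
except RateLimitError as e:
    print("Rate limited as expected - {}".format(str(e)))
```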
Expected Behavior
With the configuration above:
- Production keys can make up to 9 requests per minute
- Development keys can make up to 1 request per minute
- Production requests are never blocked by development usage
Rate Limit Error Example:

```json
{
  "error": {
    "message": "Key=sk-dev-... over available RPM=0. Model RPM=10, Reserved RPM for priority 'dev'=1, Active keys=1",
    "type": "rate_limit_exceeded",
    "code": 429
  }
}
```
Demo Video
This video walks through setting up dynamic rate limiting with priority reservation and running Locust load tests to validate the behavior.