Prompt Caching

Supported Providers:
- OpenAI (`openai/`)
- Anthropic API (`anthropic/`)
- Google AI Studio (`gemini/`)
- Vertex AI (`vertex_ai/`, `vertex_ai_beta/`)
- Bedrock (`bedrock/`, `bedrock/invoke/`, `bedrock/converse`) (all models Bedrock supports prompt caching on)
- Deepseek API (`deepseek/`)
- xAI (`xai/`)
Prompt caching is silently skipped when the input is below the provider's minimum — no error is returned. Always verify caching occurred by checking the usage fields in the response: `cached_tokens` for a cache hit, or `cache_creation_input_tokens` (Anthropic) for a cache write. A helper for this check is sketched after the table below.
| Provider | Minimum input tokens |
|---|---|
| OpenAI | 1,024 |
| Anthropic (Claude 3.x) | 1,024 (2,048 for Claude 3.5 Haiku) |
| Anthropic (Claude Sonnet 4.x, Opus 4) | 2,048 |
| Anthropic (Claude Haiku 4.5+, Opus 4.5+) | 4,096 |
| Bedrock (Claude 3.5, 3.7) | 1,024 |
| Bedrock (Claude Sonnet 4.x) | 2,048 |
| Google Gemini | 1,024 |
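To make that check concrete, here is a minimal sketch (assuming a LiteLLM `response` object shaped like the examples in this doc):

```python
def was_cached(response) -> bool:
    """Return True if this call wrote to or read from the prompt cache."""
    usage = response.usage
    # Anthropic-style field: tokens written to the cache on this call
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    # OpenAI-style field: tokens served from the cache on this call
    details = getattr(usage, "prompt_tokens_details", None)
    read = (getattr(details, "cached_tokens", 0) or 0) if details else 0
    return written > 0 or read > 0
```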
For the supported providers, LiteLLM follows the OpenAI prompt caching usage object format:
"usage": {
"prompt_tokens": 2006,
"completion_tokens": 300,
"total_tokens": 2306,
"prompt_tokens_details": {
"cached_tokens": 1920
},
"completion_tokens_details": {
"reasoning_tokens": 0
}
# ANTHROPIC_ONLY #
"cache_creation_input_tokens": 0
}
- `prompt_tokens`: all prompt tokens, including both cache-miss and cache-hit input tokens.
- `completion_tokens`: the output tokens generated by the model.
- `total_tokens`: sum of `prompt_tokens` + `completion_tokens`.
- `prompt_tokens_details`: object containing `cached_tokens`.
- `cached_tokens`: tokens that were a cache hit for this call.
- `completion_tokens_details`: object containing `reasoning_tokens`.
- ANTHROPIC_ONLY: `cache_creation_input_tokens` is the number of tokens written to the cache (Anthropic charges for cache writes).
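To make the arithmetic concrete, here is how the example values above relate (numbers taken from the usage object shown):

```python
prompt_tokens = 2006      # all input tokens, cache hits included
cached_tokens = 1920      # the cache-hit portion
cache_miss_tokens = prompt_tokens - cached_tokens  # 86 tokens not served from cache
completion_tokens = 300
assert prompt_tokens + completion_tokens == 2306   # total_tokens
```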
Quick Start
Note: OpenAI caching is only available for prompts containing 1,024 tokens or more.
- SDK
- PROXY
from litellm import completion
import os
os.environ["OPENAI_API_KEY"] = ""
for _ in range(2):
response = completion(
model="gpt-4o",
messages=[
# System Message
{
"role": "system",
"content": [
{
"type": "text",
"text": "Here is the full text of a complex legal agreement"
* 400,
}
],
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What are the key terms and conditions in this agreement?",
}
],
},
{
"role": "assistant",
"content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What are the key terms and conditions in this agreement?",
}
],
},
],
temperature=0.2,
max_tokens=10,
)
print("response=", response)
print("response.usage=", response.usage)
assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0
- Setup config.yaml
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
- Start proxy
litellm --config /path/to/config.yaml
- Test it!
from openai import OpenAI
import os
client = OpenAI(
api_key="LITELLM_PROXY_KEY", # sk-1234
base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)
for _ in range(2):
response = client.chat.completions.create(
model="gpt-4o",
messages=[
# System Message
{
"role": "system",
"content": [
{
"type": "text",
"text": "Here is the full text of a complex legal agreement"
* 400,
}
],
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What are the key terms and conditions in this agreement?",
}
],
},
{
"role": "assistant",
"content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What are the key terms and conditions in this agreement?",
}
],
},
],
temperature=0.2,
max_tokens=10,
)
print("response=", response)
print("response.usage=", response.usage)
assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0
OpenAI prompt_cache_key and prompt_cache_retention
OpenAI prompt caching is automatic — no cache_control message annotations are needed. Any request with 1024+ prompt tokens is eligible for caching.
OpenAI also supports two optional parameters for more control over caching behavior:
- `prompt_cache_key` (string) — a routing hint that improves cache hit rates for requests sharing long common prefixes. Requests with the same cache key are routed to the same backend, increasing the likelihood of a cache hit.
- `prompt_cache_retention` (`"in_memory"` or `"24h"`) — controls cache TTL. Default is `"in_memory"` (5–10 min). Set to `"24h"` for extended caching that offloads KV tensors to GPU-local storage.
- SDK
- PROXY
from litellm import completion
import os
os.environ["OPENAI_API_KEY"] = ""
response = completion(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "You are an AI assistant tasked with analyzing legal documents. "
+ "Here is the full text of a complex legal agreement " * 400,
},
{
"role": "user",
"content": "What are the key terms and conditions?",
},
],
prompt_cache_key="legal-doc-analysis",
prompt_cache_retention="24h",
)
print(response.usage)
from openai import OpenAI
client = OpenAI(
api_key="LITELLM_PROXY_KEY",
base_url="LITELLM_PROXY_BASE",
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "You are an AI assistant tasked with analyzing legal documents. "
+ "Here is the full text of a complex legal agreement " * 400,
},
{
"role": "user",
"content": "What are the key terms and conditions?",
},
],
extra_body={
"prompt_cache_key": "legal-doc-analysis",
"prompt_cache_retention": "24h",
},
)
print(response.usage)
Anthropic Example
Anthropic charges for cache writes.
Specify the content to cache with "cache_control": {"type": "ephemeral"}.
This same format also works for Gemini / Vertex AI. For other providers, it will be ignored.
- SDK
- PROXY
from litellm import completion
import litellm
import os
litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""
response = completion(
model="anthropic/claude-3-5-sonnet-20240620",
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
]
)
print(response.usage)
- Setup config.yaml
model_list:
- model_name: claude-3-5-sonnet-20240620
litellm_params:
model: anthropic/claude-3-5-sonnet-20240620
api_key: os.environ/ANTHROPIC_API_KEY
- Start proxy
litellm --config /path/to/config.yaml
- Test it!
from openai import OpenAI
import os
client = OpenAI(
api_key="LITELLM_PROXY_KEY", # sk-1234
base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)
response = client.chat.completions.create(
model="claude-3-5-sonnet-20240620",
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
]
)
print(response.usage)
Prompts below the minimum are processed without caching — no error is returned. Check cache_creation_input_tokens in the response.
| Model | Min tokens |
|---|---|
| Claude 3 Haiku, 3 Sonnet, 3 Opus | 1,024 |
| Claude 3.5 Sonnet, 3.7 Sonnet | 1,024 |
| Claude 3.5 Haiku | 2,048 |
| Claude Sonnet 4.5, Sonnet 4.6, Opus 4 | 2,048 |
| Claude Haiku 4.5, Opus 4.5+ | 4,096 |
Bedrock Example
LiteLLM automatically translates OpenAI-format cache_control markers to Bedrock's native cachePoint format — no changes needed to your existing code if you're already using cache_control.
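For reference, a `cache_control` marker on a system block is translated to a `cachePoint` entry in the Converse request body. Roughly (a sketch of the wire shape, not LiteLLM's exact output):

```json
"system": [
  {"text": "<your large system prompt here>"},
  {"cachePoint": {"type": "default"}}
]
```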
Prompts below the minimum are processed without caching — no error is returned. Check cache_creation_input_tokens in the response.
| Model family | Min tokens per request |
|---|---|
| Claude 3.5 Sonnet v2, Claude 3.7 Sonnet | 1,024 |
| Claude Sonnet 4.5, Sonnet 4.6 | 2,048 |
- SDK
- PROXY
import litellm
response = litellm.completion(
model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "<your large system prompt here — min 1,024 tokens for Claude 3.x, 2,048 for Claude Sonnet 4.x>",
"cache_control": {"type": "ephemeral"}
}
]
},
{"role": "user", "content": "What is prompt caching?"}
]
)
print(response.usage)
# cache_creation_input_tokens > 0 on first call (cache written)
# cache_read_input_tokens > 0 on subsequent calls (cache hit)
- Setup config.yaml
model_list:
- model_name: bedrock-claude-sonnet
litellm_params:
model: bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0
- Start proxy
litellm --config /path/to/config.yaml
- Test it!
curl -X POST http://localhost:4000/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-d '{
"model": "bedrock-claude-sonnet",
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "<your large system prompt here — min 1,024 tokens for Claude 3.x, 2,048 for Claude Sonnet 4.x>",
"cache_control": {"type": "ephemeral"}
}
]
},
{"role": "user", "content": "What is prompt caching?"}
]
}'
Supported Bedrock models:
| Model | Bedrock Model ID | Min Tokens | TTL Options |
|---|---|---|---|
| Claude 3.5 Sonnet v2 | anthropic.claude-3-5-sonnet-20241022-v2:0 | 1,024 | 5 min, 1 hour |
| Claude 3.7 Sonnet | anthropic.claude-3-7-sonnet-20250219-v1:0 | 1,024 | 5 min, 1 hour |
| Claude Opus 4 | anthropic.claude-opus-4-20250514-v1:0 | 1,024 | 5 min, 1 hour |
| Claude Sonnet 4.5, 4.6 | us.anthropic.claude-sonnet-4-5-*, us.anthropic.claude-sonnet-4-6-* | 2,048 | 5 min, 1 hour |
Cross-region inference profiles are also supported for the models above.
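For example, `model="bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0"` should route through the US cross-region profile for Claude 3.7 Sonnet (model ID shown for illustration; confirm the exact profile ID in your AWS console).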
See the AWS Bedrock prompt caching docs for the full list of supported models and regions.
Google AI Studio / Vertex AI (Gemini) Example
Use the same Anthropic-style cache_control format — LiteLLM automatically translates it to Google's context caching API.
How it works under the hood:
- Messages with `cache_control` are separated out and sent to Google's `cachedContents` API
- The returned cached content ID is then passed as `cachedContent` in the Gemini request body (see the sketch below)
- Works across all three providers: `gemini/` (Google AI Studio), `vertex_ai/`, and `vertex_ai_beta/`
- Requires a minimum of 1,024 tokens in the cached content — below that, caching is silently skipped
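Conceptually, the two-step flow looks something like this (a simplified sketch of the Gemini REST shapes involved; the resource name is illustrative):

```text
# Step 1: the cache_control content is uploaded once
POST .../v1beta/cachedContents
  -> {"name": "cachedContents/abc123", ...}

# Step 2: the main request references the cached content by name
POST .../v1beta/models/gemini-2.5-flash:generateContent
{
  "cachedContent": "cachedContents/abc123",
  "contents": [{"role": "user", "parts": [{"text": "what are the key terms..."}]}]
}
```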
- SDK
- PROXY
from litellm import completion
import os
os.environ["GEMINI_API_KEY"] = ""
response = completion(
model="gemini/gemini-2.5-flash",
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
],
)
print(response.usage)
- Setup config.yaml
model_list:
- model_name: gemini-2.5-flash
litellm_params:
model: gemini/gemini-2.5-flash
api_key: os.environ/GEMINI_API_KEY
- Start proxy
litellm --config /path/to/config.yaml
- Test it!
from openai import OpenAI
client = OpenAI(
api_key="LITELLM_PROXY_KEY", # sk-1234
base_url="LITELLM_PROXY_BASE", # http://0.0.0.0:4000
)
response = client.chat.completions.create(
model="gemini-2.5-flash",
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
],
)
print(response.usage)
Vertex AI
For Vertex AI, use the `vertex_ai/` prefix:
- SDK
- PROXY
from litellm import completion
response = completion(
model="vertex_ai/gemini-2.5-flash",
vertex_project="my-gcp-project",
vertex_location="us-central1",
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
],
)
print(response.usage)
- Setup config.yaml
model_list:
- model_name: gemini-2.5-flash
litellm_params:
model: vertex_ai/gemini-2.5-flash
vertex_project: my-gcp-project
vertex_location: us-central1
- Start proxy
litellm --config /path/to/config.yaml
- Test it!
from openai import OpenAI
client = OpenAI(
api_key="LITELLM_PROXY_KEY", # sk-1234
base_url="LITELLM_PROXY_BASE", # http://0.0.0.0:4000
)
response = client.chat.completions.create(
model="gemini-2.5-flash",
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
],
)
print(response.usage)
Deepseek Example
Deepseek prompt caching works the same as OpenAI's: it is automatic, with no `cache_control` annotations needed.
from litellm import completion
import litellm
import os
os.environ["DEEPSEEK_API_KEY"] = ""
litellm.set_verbose = True # 👈 SEE RAW REQUEST
model_name = "deepseek/deepseek-chat"
messages_1 = [
{
"role": "system",
"content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
},
{
"role": "user",
"content": "In what year did Qin Shi Huang unify the six states?",
},
{"role": "assistant", "content": "Answer: 221 BC"},
{"role": "user", "content": "Who was the founder of the Han Dynasty?"},
{"role": "assistant", "content": "Answer: Liu Bang"},
{"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
{"role": "assistant", "content": "Answer: Li Zhu"},
{
"role": "user",
"content": "Who was the founding emperor of the Ming Dynasty?",
},
{"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
{
"role": "user",
"content": "Who was the founding emperor of the Qing Dynasty?",
},
]
messages_2 = [
{
"role": "system",
"content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
},
{
"role": "user",
"content": "In what year did Qin Shi Huang unify the six states?",
},
{"role": "assistant", "content": "Answer: 221 BC"},
{"role": "user", "content": "Who was the founder of the Han Dynasty?"},
{"role": "assistant", "content": "Answer: Liu Bang"},
{"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
{"role": "assistant", "content": "Answer: Li Zhu"},
{
"role": "user",
"content": "Who was the founding emperor of the Ming Dynasty?",
},
{"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
{"role": "user", "content": "When did the Shang Dynasty fall?"},
]
response_1 = litellm.completion(model=model_name, messages=messages_1)
response_2 = litellm.completion(model=model_name, messages=messages_2)
# Add any assertions here to check the response
print(response_2.usage)
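Both calls share a long common prefix (the system prompt plus the first several Q&A turns), so the second call should report that prefix as a cache hit, assuming the prefix clears Deepseek's cache minimum:

```python
# Sketch: the prefix shared with response_1 should show up as cached tokens here
cached = response_2.usage.prompt_tokens_details.cached_tokens
print(f"cache-hit tokens on second call: {cached}")  # expect > 0
```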
Calculate Cost
Cache-hit prompt tokens can be priced differently from cache-miss prompt tokens.
Use the `completion_cost()` function to calculate cost; it handles prompt caching pricing as well. See more helper functions
cost = completion_cost(completion_response=response, model=model)
Usage
- SDK
- PROXY
from litellm import completion, completion_cost
import litellm
import os
litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""
model = "anthropic/claude-3-5-sonnet-20240620"
response = completion(
model=model,
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
]
)
print(response.usage)
cost = completion_cost(completion_response=response, model=model)
formatted_string = f"${float(cost):.10f}"
print(formatted_string)
LiteLLM returns the calculated cost in the response header `x-litellm-response-cost`.
from openai import OpenAI
client = OpenAI(
api_key="LITELLM_PROXY_KEY", # sk-1234..
base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)
response = client.chat.completions.with_raw_response.create(
messages=[{
"role": "user",
"content": "Say this is a test",
}],
model="gpt-3.5-turbo",
)
print(response.headers.get('x-litellm-response-cost'))
completion = response.parse() # get the object that `chat.completions.create()` would have returned
print(completion)
Check Model Support
Check if a model supports prompt caching with `supports_prompt_caching()`
- SDK
- PROXY
from litellm.utils import supports_prompt_caching
supports_pc: bool = supports_prompt_caching(model="anthropic/claude-3-5-sonnet-20240620")
assert supports_pc
Use the /model/info endpoint to check if a model on the proxy supports prompt caching
- Setup config.yaml
model_list:
- model_name: claude-3-5-sonnet-20240620
litellm_params:
model: anthropic/claude-3-5-sonnet-20240620
api_key: os.environ/ANTHROPIC_API_KEY
- Start proxy
litellm --config /path/to/config.yaml
- Test it!
curl -L -X GET 'http://0.0.0.0:4000/v1/model/info' \
    -H 'Authorization: Bearer sk-1234'
Expected Response
{
"data": [
{
"model_name": "claude-3-5-sonnet-20240620",
"litellm_params": {
"model": "anthropic/claude-3-5-sonnet-20240620"
},
"model_info": {
"key": "claude-3-5-sonnet-20240620",
...
"supports_prompt_caching": true # 👈 LOOK FOR THIS!
}
}
]
}
This checks our maintained model info/cost map
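The same map can be queried programmatically. A small sketch using `litellm.get_model_info`, which reads this map:

```python
from litellm import get_model_info

info = get_model_info(model="anthropic/claude-3-5-sonnet-20240620")
print(info.get("supports_prompt_caching"))  # True
```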
Read More
Want LiteLLM to automatically add cache_control directives without modifying your code?
See Auto-Inject Prompt Caching Tutorial to learn how to use cache_control_injection_points to automatically cache system messages, specific messages by index, or custom injection patterns.
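A minimal sketch of what that can look like in config.yaml (see the tutorial for the exact options; the injection rule below is illustrative):

```yaml
model_list:
  - model_name: claude-3-5-sonnet-20240620
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
      # Illustrative rule: add cache_control to every system message
      cache_control_injection_points:
        - location: message
          role: system
```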