
/responses

LiteLLM provides an endpoint in the spec of OpenAI's /responses API

Requests to /chat/completions may be bridged here automatically when the provider lacks support for that endpoint. The model's default mode determines how bridging works (see model_prices_and_context_window).

| Feature | Supported | Notes |
|---|---|---|
| Cost Tracking | ✅ | Works with all supported models |
| Logging | ✅ | Works across all integrations |
| End-user Tracking | ✅ | |
| Streaming | ✅ | |
| WebSocket Mode | ✅ | Lower-latency persistent connections for all providers |
| Image Generation Streaming | ✅ | Progressive image generation with partial images (1-3) |
| Fallbacks | ✅ | Works between supported models |
| Loadbalancing | ✅ | Works between supported models |
| Guardrails | ✅ | Applies to input and output text (non-streaming only) |
| Supported operations | Create a response, Get a response, Delete a response | |
| Supported LiteLLM Versions | 1.63.8+ | |
| Supported LLM providers | All LiteLLM supported providers | openai, anthropic, bedrock, vertex_ai, gemini, azure, azure_ai etc. |

Usage​

LiteLLM Python SDK​

Non-streaming​

OpenAI Non-streaming Response
import litellm

# Non-streaming response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

print(response)

Response Format (OpenAI Responses API Format)​

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1734366691,
  "status": "completed",
  "model": "o1-pro-2025-01-30",
  "output": [
    {
      "type": "message",
      "id": "msg_abc123",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Once upon a time, a little unicorn named Stardust lived in a magical meadow where flowers sang lullabies. One night, she discovered that her horn could paint dreams across the sky, and she spent the evening creating the most beautiful aurora for all the forest creatures to enjoy. As the animals drifted off to sleep beneath her shimmering lights, Stardust curled up on a cloud of moonbeams, happy to have shared her magic with her friends.",
          "annotations": []
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 18,
    "output_tokens": 98,
    "total_tokens": 116
  }
}

Streaming​

OpenAI Streaming Response
import litellm

# Streaming response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

Image Generation with Streaming​

OpenAI Streaming Image Generation
import litellm
import base64

# Streaming image generation with partial images
stream = litellm.responses(
    model="gpt-4.1",  # Use an actual image generation model
    input="Generate a gorgeous image of a river made of white owl feathers",
    stream=True,
    tools=[{"type": "image_generation", "partial_images": 2}],
)

for event in stream:
    if event.type == "response.image_generation_call.partial_image":
        idx = event.partial_image_index
        image_base64 = event.partial_image_b64
        image_bytes = base64.b64decode(image_base64)
        with open(f"river{idx}.png", "wb") as f:
            f.write(image_bytes)

Image Generation (Non-streaming)​

Image generation is supported for models that generate images. Generated images are returned in the output array with type: "image_generation_call".

Gemini (Google AI Studio):

Gemini Image Generation
import litellm
import base64

# Gemini image generation models don't require tools parameter
response = litellm.responses(
    model="gemini/gemini-2.5-flash-image",
    input="Generate a cute cat playing with yarn"
)

# Access generated images from output
for item in response.output:
    if item.type == "image_generation_call":
        # item.result contains pure base64 (no data: prefix)
        image_bytes = base64.b64decode(item.result)

        # Save the image
        with open(f"generated_{item.id}.png", "wb") as f:
            f.write(image_bytes)

        print(f"Image saved: generated_{item.id}.png")

OpenAI:

OpenAI Image Generation
import litellm
import base64

# OpenAI models require tools parameter for image generation
response = litellm.responses(
    model="openai/gpt-4o",
    input="Generate a futuristic city at sunset",
    tools=[{"type": "image_generation"}]
)

# Access generated images from output
for item in response.output:
    if item.type == "image_generation_call":
        image_bytes = base64.b64decode(item.result)
        with open(f"generated_{item.id}.png", "wb") as f:
            f.write(image_bytes)

Response Format:

When image generation is successful, the response contains:

{
  "id": "resp_abc123",
  "status": "completed",
  "output": [
    {
      "type": "image_generation_call",
      "id": "resp_abc123_img_0",
      "status": "completed",
      "result": "iVBORw0KGgo..." // Pure base64 string (no data: prefix)
    }
  ]
}

Supported Models:

| Provider | Models | Requires tools Parameter |
|---|---|---|
| Google AI Studio | gemini/gemini-2.5-flash-image | ❌ No |
| Vertex AI | vertex_ai/gemini-2.5-flash-image-preview | ❌ No |
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o3 | ✅ Yes |
| AWS Bedrock | Stability AI, Amazon Nova Canvas models | Model-specific |
| Fal AI | Various image generation models | Check model docs |

Note: The result field contains pure base64-encoded image data without the data:image/png;base64, prefix. You must decode it with base64.b64decode() before saving.

GET a Response​

Get Response by ID
import litellm

# First, create a response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

# Get the response ID
response_id = response.id

# Retrieve the response by ID
retrieved_response = litellm.get_responses(
    response_id=response_id
)

print(retrieved_response)

# For async usage
# retrieved_response = await litellm.aget_responses(response_id=response_id)

CANCEL a Response​

You can cancel an in-progress response (if supported by the provider):

Cancel Response by ID
import litellm

# First, create a response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

# Get the response ID
response_id = response.id

# Cancel the response by ID
cancel_response = litellm.cancel_responses(
    response_id=response_id
)

print(cancel_response)

# For async usage
# cancel_response = await litellm.acancel_responses(response_id=response_id)

REST API:

curl -X POST http://localhost:4000/v1/responses/response_id/cancel \
    -H "Authorization: Bearer sk-1234"

This will attempt to cancel the in-progress response with the given ID. Note: Not all providers support response cancellation. If unsupported, an error will be raised.

DELETE a Response​

Delete Response by ID
import litellm

# First, create a response
response = litellm.responses(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

# Get the response ID
response_id = response.id

# Delete the response by ID
delete_response = litellm.delete_responses(
    response_id=response_id
)

print(delete_response)

# For async usage
# delete_response = await litellm.adelete_responses(response_id=response_id)

LiteLLM Proxy with OpenAI SDK​

First, set up and start your LiteLLM proxy server.

Start LiteLLM Proxy Server
litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

Add this to your litellm proxy config.yaml:

OpenAI Proxy Configuration
model_list:
  - model_name: openai/o1-pro
    litellm_params:
      model: openai/o1-pro
      api_key: os.environ/OPENAI_API_KEY

Non-streaming​

OpenAI Proxy Non-streaming Response
from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"  # Your proxy API key
)

# Non-streaming response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn."
)

print(response)

Streaming​

OpenAI Proxy Streaming Response
from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"  # Your proxy API key
)

# Streaming response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn.",
    stream=True
)

for event in response:
    print(event)

Image Generation with Streaming​

OpenAI Proxy Streaming Image Generation
from openai import OpenAI
import base64

client = OpenAI(api_key="sk-1234", base_url="http://localhost:4000")

stream = client.responses.create(
    model="gpt-4.1",
    input="Draw a gorgeous image of a river made of white owl feathers, snaking its way through a serene winter landscape",
    stream=True,
    tools=[{"type": "image_generation", "partial_images": 2}],
)

for event in stream:
    print(f"event: {event}")
    if event.type == "response.image_generation_call.partial_image":
        idx = event.partial_image_index
        image_base64 = event.partial_image_b64
        image_bytes = base64.b64decode(image_base64)
        with open(f"river{idx}.png", "wb") as f:
            f.write(image_bytes)

GET a Response​

Get Response by ID with OpenAI SDK
from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"  # Your proxy API key
)

# First, create a response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn."
)

# Get the response ID
response_id = response.id

# Retrieve the response by ID
retrieved_response = client.responses.retrieve(response_id)

print(retrieved_response)

DELETE a Response​

Delete Response by ID with OpenAI SDK
from openai import OpenAI

# Initialize client with your proxy URL
client = OpenAI(
    base_url="http://localhost:4000",  # Your proxy URL
    api_key="your-api-key"  # Your proxy API key
)

# First, create a response
response = client.responses.create(
    model="openai/o1-pro",
    input="Tell me a three sentence bedtime story about a unicorn."
)

# Get the response ID
response_id = response.id

# Delete the response by ID
delete_response = client.responses.delete(response_id)

print(delete_response)

WebSocket Mode​

The Responses API supports WebSocket mode for lower-latency, persistent connections ideal for agentic workflows. WebSocket mode works with all LiteLLM providers, not just those with native WebSocket support.

Architecture​

LiteLLM provides two WebSocket modes:

  1. Native WebSocket: Direct wss:// connection to providers that support it (OpenAI, Azure)
  2. Managed WebSocket: HTTP streaming over WebSocket for all other providers (Anthropic, Gemini, Bedrock, etc.)

The system automatically selects the appropriate mode based on provider capabilities.
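The selection logic above can be sketched as a simple lookup. This is an illustration only, not LiteLLM internals: the provider set mirrors the Provider Support table later in this doc, and the function name is hypothetical.

```python
# Providers with native wss:// support, per the Provider Support table below.
# All other providers fall back to managed mode (HTTP streaming over WebSocket).
NATIVE_WEBSOCKET_PROVIDERS = {"openai", "azure"}

def select_websocket_mode(provider: str) -> str:
    """Pick "native" for providers with direct wss:// support, else "managed"."""
    return "native" if provider in NATIVE_WEBSOCKET_PROVIDERS else "managed"
```

Either way, the client sees the same event stream; the mode is an internal routing detail.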

Usage​

WebSocket with Python
import json
from websocket import create_connection  # pip install websocket-client

# Connect to LiteLLM proxy WebSocket endpoint
ws = create_connection(
    "ws://localhost:4000/v1/responses?model=gemini-2.5-flash",
    header=["Authorization: Bearer sk-1234"]
)

try:
    # Send initial message
    ws.send(json.dumps({
        "type": "response.create",
        "model": "gemini-2.5-flash",
        "store": True,
        "input": [{
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "My favorite color is blue."}]
        }]
    }))

    # Collect response events
    response_id = None
    while True:
        event = json.loads(ws.recv())
        print(f"Event: {event['type']}")

        if event["type"] == "response.completed":
            response_id = event["response"]["id"]
            break
        elif event["type"] == "response.output_text.delta":
            print(f"Text: {event.get('delta', '')}", end="", flush=True)

    print(f"\nResponse ID: {response_id}")

    # Send follow-up with previous_response_id for multi-turn
    ws.send(json.dumps({
        "type": "response.create",
        "model": "gemini-2.5-flash",
        "previous_response_id": response_id,
        "input": [{
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": "What is my favorite color?"}]
        }]
    }))

    # Collect follow-up response
    while True:
        event = json.loads(ws.recv())
        if event["type"] == "response.completed":
            break
        elif event["type"] == "response.output_text.delta":
            print(event.get("delta", ""), end="", flush=True)

finally:
    ws.close()

Event Types​

WebSocket connections receive the same event types as SSE streaming, delivered as JSON messages:

| Event Type | Description |
|---|---|
| response.created | Response generation started |
| response.in_progress | Response is being generated |
| response.output_item.added | New output item (message, tool call, etc.) added |
| response.output_text.delta | Incremental text chunk |
| response.output_text.done | Text output completed |
| response.content_part.done | Content part completed |
| response.output_item.done | Output item completed |
| response.completed | Full response completed successfully |
| response.failed | Response generation failed |
| response.incomplete | Response incomplete (e.g., max tokens reached) |
| error | Error occurred |

Multi-Turn Conversations​

Use previous_response_id to maintain conversation context across multiple WebSocket messages:

Multi-turn WebSocket Conversation
# Turn 1
ws.send(json.dumps({
    "type": "response.create",
    "model": "gemini-2.5-flash",
    "store": True,  # Required for multi-turn
    "input": [{"type": "message", "role": "user", "content": [{"type": "input_text", "text": "Hello"}]}]
}))

# ... collect events and get response_id from response.completed event ...

# Turn 2 - reference previous response
ws.send(json.dumps({
    "type": "response.create",
    "model": "gemini-2.5-flash",
    "previous_response_id": response_id,  # Links to previous turn
    "input": [{"type": "message", "role": "user", "content": [{"type": "input_text", "text": "Continue"}]}]
}))

Provider Support​

| Provider | WebSocket Mode | Notes |
|---|---|---|
| OpenAI | Native | Direct wss:// connection to OpenAI |
| Azure OpenAI | Native | Direct wss:// connection to Azure |
| Anthropic | Managed | HTTP streaming over WebSocket |
| Google AI Studio (Gemini) | Managed | HTTP streaming over WebSocket |
| Vertex AI | Managed | HTTP streaming over WebSocket |
| AWS Bedrock | Managed | HTTP streaming over WebSocket |
| All other providers | Managed | HTTP streaming over WebSocket |

Note: Both native and managed modes provide the same event stream format. The difference is transparent to clients.

Configuration​

No special configuration needed. WebSocket mode is automatically available on the /v1/responses endpoint when accessed via WebSocket protocol (ws:// or wss://).

For LiteLLM Proxy, ensure your models are configured normally:

config.yaml
model_list:
  - model_name: gemini-2.5-flash
    litellm_params:
      model: gemini/gemini-2.5-flash
      api_key: os.environ/GEMINI_API_KEY

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

Both models will automatically support WebSocket mode at ws://localhost:4000/v1/responses.

Response ID Security​

By default, LiteLLM Proxy prevents users from accessing other users' response IDs.

This is done by encrypting the response ID with the user ID, so users can only access their own response IDs.

Trying to access someone else's response ID returns 403:

{
  "error": {
    "message": "Forbidden. The response id is not associated with the user, who this key belongs to.",
    "code": 403
  }
}

To disable this, set disable_responses_id_security: true:

general_settings:
  disable_responses_id_security: true

This allows any user to access any response ID.
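To make the mechanism concrete, here is a toy sketch of binding a response ID to a user. This is not LiteLLM's actual implementation: the HMAC construction, secret, and function names are all assumptions chosen purely to illustrate why one user's key cannot retrieve another user's response ID.

```python
import base64
import hashlib
import hmac

SECRET = b"proxy-master-secret"  # hypothetical proxy-side signing key

def protect_response_id(response_id: str, user_id: str) -> str:
    """Embed the owning user ID plus a MAC into the ID handed to the client."""
    payload = f"{user_id}:{response_id}".encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()[:16]
    return base64.urlsafe_b64encode(payload + b":" + tag.encode()).decode()

def unprotect_response_id(token: str, user_id: str) -> str:
    """Return the raw response ID, or raise 403-style if it belongs to another user."""
    payload = base64.urlsafe_b64decode(token.encode()).decode()
    owner, response_id, tag = payload.rsplit(":", 2)
    expected = hmac.new(SECRET, f"{owner}:{response_id}".encode(),
                        hashlib.sha256).hexdigest()[:16]
    if not hmac.compare_digest(tag, expected) or owner != user_id:
        raise PermissionError("403: response id not associated with this user")
    return response_id
```

A lookup with the wrong user raises, mirroring the 403 response shown above.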

Supported Responses API Parameters​

| Provider | Supported Parameters |
|---|---|
| openai | All Responses API parameters are supported |
| azure | All Responses API parameters are supported |
| anthropic | See supported parameters here |
| bedrock | See supported parameters here |
| gemini | See supported parameters here |
| vertex_ai | See supported parameters here |
| azure_ai | See supported parameters here |
| All other llm api providers | See supported parameters here |

Load Balancing with Session Continuity​

When using the Responses API with multiple deployments of the same model (e.g., multiple Azure OpenAI endpoints), LiteLLM provides session continuity. This ensures that follow-up requests using a previous_response_id are routed to the same deployment that generated the original response.

Example Usage​

Python SDK with Session Continuity
import litellm

# Set up router with multiple deployments of the same model
router = litellm.Router(
    model_list=[
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-1",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint1.openai.azure.com",
            },
        },
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-2",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint2.openai.azure.com",
            },
        },
    ],
    # `responses_api_deployment_check` ensures requests with `previous_response_id`
    # are routed to the same deployment. `deployment_affinity` adds sticky sessions
    # for requests without `previous_response_id` (useful for implicit caching).
    # `session_affinity` adds sticky sessions based on `session_id` metadata.
    optional_pre_call_checks=["responses_api_deployment_check", "deployment_affinity", "session_affinity"],
    # Optional (default is 3600 seconds / 1 hour)
    deployment_affinity_ttl_seconds=3600,
)

# Initial request
response = await router.aresponses(
    model="azure-gpt4-turbo",
    input="Hello, who are you?",
    truncation="auto",
)

# Store the response ID
response_id = response.id

# Follow-up request - will be automatically routed to the same deployment
follow_up = await router.aresponses(
    model="azure-gpt4-turbo",
    input="Tell me more about yourself",
    truncation="auto",
    previous_response_id=response_id  # This ensures routing to the same deployment
)

Encrypted Content Affinity (Multi-Region Load Balancing)​

When load balancing Responses API across deployments with different API keys (e.g., different Azure regions or OpenAI organizations), encrypted content items (like rs_... reasoning items) can only be decrypted by the API key that created them.

The Problem​

{
  "error": {
    "message": "The encrypted content for item rs_0d09d6e56879e76500699d6feee41c8197bd268aae76141f87 could not be verified. Reason: Encrypted content organization_id did not match the target organization.",
    "type": "invalid_request_error",
    "code": "invalid_encrypted_content"
  }
}

This error occurs when:

  1. Initial request goes to Deployment A (API Key 1) β†’ produces encrypted item rs_xyz
  2. Follow-up request with rs_xyz in input gets load balanced to Deployment B (API Key 2)
  3. Deployment B cannot decrypt content created by Deployment A β†’ request fails

The Solution: encrypted_content_affinity​

The encrypted_content_affinity pre-call check routes follow-up requests containing encrypted items to the originating deployment, and only when necessary.

Key Benefits:

  • βœ… No quota reduction: Unlike deployment_affinity, only pins requests that contain encrypted items
  • βœ… Bypasses rate limits: When encrypted content requires a specific deployment, RPM/TPM limits are bypassed (the request would fail on any other deployment anyway)
  • βœ… No previous_response_id required: Works by encoding model_id directly into item IDs
  • βœ… No cache required: model_id is decoded on-the-fly β€” no Redis dependency, no TTL to manage
  • βœ… Globally safe: Can be enabled for all models; non-Responses-API calls (chat, embeddings) are unaffected

How It Works​

  1. Encoding Phase (on response):

    • For each output item that contains encrypted_content, LiteLLM rewrites the item ID to embed the originating model_id: rs_xyz β†’ encitem_{base64("litellm:model_id:{model_id};item_id:rs_xyz")}
    • The original item ID is restored before forwarding the request to the upstream provider
  2. Routing Phase (before request):

    • Scans request input for encitem_ prefixed IDs
    • If found β†’ decodes model_id, pins to originating deployment, bypasses rate limits
    • If no encoded items β†’ normal load balancing

Configuration​

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-5.1-codex",
            "litellm_params": {
                "model": "openai/gpt-5.1-codex",
                "api_key": "org-1-api-key",  # Different API key
            },
            "model_info": {"id": "deployment-us-east"},
        },
        {
            "model_name": "gpt-5.1-codex",
            "litellm_params": {
                "model": "openai/gpt-5.1-codex",
                "api_key": "org-2-api-key",  # Different API key
            },
            "model_info": {"id": "deployment-eu-west"},
        },
    ],
    optional_pre_call_checks=["encrypted_content_affinity"],
)

# Initial request - routes to any deployment
response1 = await router.aresponses(
    model="gpt-5.1-codex",
    input="Explain quantum computing",
)

# Follow-up with encrypted items - automatically routes to same deployment
response2 = await router.aresponses(
    model="gpt-5.1-codex",
    input=response1.output,  # Contains encrypted items from response1
)

When to Use Each Affinity Type​

| Affinity Type | Use Case | Scope | Quota Impact |
|---|---|---|---|
| encrypted_content_affinity | [Recommended] Multi-region Responses API with different API keys | Only requests with tracked encrypted items | ✅ None (surgical pinning) |
| responses_api_deployment_check | When previous_response_id is available | Requests with previous_response_id | ✅ None |
| session_affinity | Session-based applications | All requests with same session_id | ⚠️ Reduces quota by # of sessions |
| deployment_affinity | Simple sticky sessions | All requests from same API key | ❌ Reduces quota by # of users |

Calling non-Responses API models (/responses to /chat/completions Bridge)​

LiteLLM allows you to call non-Responses API models via a bridge to LiteLLM's /chat/completions endpoint. This is useful for calling Anthropic, Gemini and even non-Responses API OpenAI models.

Python SDK Usage​

SDK Usage
import litellm
import os

# Set API key
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"

# Non-streaming response
response = litellm.responses(
    model="anthropic/claude-3-5-sonnet-20240620",
    input="Tell me a three sentence bedtime story about a unicorn.",
    max_output_tokens=100
)

print(response)

LiteLLM Proxy Usage​

Setup Config:

Example Configuration
model_list:
  - model_name: anthropic-model
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY

Start Proxy:

Start LiteLLM Proxy
litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

Make Request:

non-Responses API Model Request
curl http://localhost:4000/v1/responses \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-1234" \
    -d '{
        "model": "anthropic-model",
        "input": "who is Michael Jordan"
    }'

Server-side compaction​

For long-running conversations, you can enable server-side compaction so that when the rendered context size crosses a threshold, the server automatically runs compaction in-stream and emits a compaction itemβ€”no separate POST /v1/responses/compact call is required.

Supported on the OpenAI Responses API when using the openai or azure provider. Pass context_management with a compaction entry and compact_threshold (token count; minimum 1000). When the context crosses the threshold, the server compacts in-stream and continues. Chain turns with previous_response_id or by appending output items to your next input array. See OpenAI Compaction guide for details.

Note: You can use OpenAI's context_management format with Anthropic models through LiteLLM's Responses API. LiteLLM automatically translates this format for Anthropic and handles context management for you.

For explicit control over when compaction runs, use the standalone compact endpoint (POST /v1/responses/compact) instead.

Python SDK​

Server-side compaction with LiteLLM Python SDK
import litellm

# Non-streaming: enable compaction when context exceeds 200k tokens
response = litellm.responses(
    model="openai/gpt-4o",
    input="Your conversation input...",
    context_management=[{"type": "compaction", "compact_threshold": 200000}],
    max_output_tokens=1024,
)
print(response)

# Streaming: same context_management, compaction runs in-stream if threshold is crossed
stream = litellm.responses(
    model="openai/gpt-4o",
    input="Your conversation input...",
    context_management=[{"type": "compaction", "compact_threshold": 200000}],
    stream=True,
)
for event in stream:
    print(event)

LiteLLM Proxy (AI Gateway)​

Use the OpenAI SDK with your proxy as base_url, or call the proxy with curl. The proxy forwards context_management to the provider.

OpenAI Python SDK (proxy as base_url):

Server-side compaction via LiteLLM Proxy
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # LiteLLM Proxy (AI Gateway)
    api_key="your-proxy-api-key",
)

response = client.responses.create(
    model="openai/gpt-4o",
    input="Your conversation input...",
    context_management=[{"type": "compaction", "compact_threshold": 200000}],
    max_output_tokens=1024,
)
print(response)

curl (proxy):

Server-side compaction via curl to LiteLLM Proxy
curl -X POST "http://localhost:4000/v1/responses" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer your-proxy-api-key" \
    -d '{
        "model": "openai/gpt-4o",
        "input": "Your conversation input...",
        "context_management": [{"type": "compaction", "compact_threshold": 200000}],
        "max_output_tokens": 1024
    }'

Shell tool​

The Shell tool lets the model run commands in a hosted container or local runtime (OpenAI Responses API). You pass tools=[{"type": "shell", "environment": {...}}]; the environment object configures the runtime (e.g. type: "container_auto" for auto-provisioned containers). See OpenAI Shell tool guide for full options.

Supported when using the openai or azure provider with a model that supports the Shell tool.

Python SDK​

Shell tool with LiteLLM Python SDK
import litellm

response = litellm.responses(
    model="openai/gpt-5.2",
    input="List files in /mnt/data and run python --version.",
    tools=[{"type": "shell", "environment": {"type": "container_auto"}}],
    tool_choice="auto",
    max_output_tokens=1024,
)

LiteLLM Proxy (AI Gateway)​

Use the OpenAI SDK with your proxy as base_url, or call the proxy with curl. The proxy forwards tools (including type: "shell") to the provider.

OpenAI Python SDK (proxy as base_url):

Shell tool via LiteLLM Proxy
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="your-proxy-api-key",
)

response = client.responses.create(
    model="openai/gpt-5.2",
    input="List files in /mnt/data.",
    tools=[{"type": "shell", "environment": {"type": "container_auto"}}],
    tool_choice="auto",
    max_output_tokens=1024,
)

curl:

Shell tool via curl to LiteLLM Proxy
curl -X POST "http://localhost:4000/v1/responses" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer your-proxy-api-key" \
    -d '{
        "model": "openai/gpt-5.2",
        "input": "List files in /mnt/data.",
        "tools": [{"type": "shell", "environment": {"type": "container_auto"}}],
        "tool_choice": "auto",
        "max_output_tokens": 1024
    }'

Session Management​

LiteLLM Proxy supports session management for all supported models. This allows you to store and fetch conversation history (state) in LiteLLM Proxy.

Usage​

  1. Enable storing request / response content in cold storage

Set store_prompts_in_cold_storage: true in your proxy config.yaml. When this is enabled, LiteLLM will store the request and response content in the S3 bucket you specify.

config.yaml with Session Continuity
litellm_settings:
  callbacks: ["s3_v2"]
  cold_storage_custom_logger: s3_v2
  s3_callback_params: # learn more https://docs.litellm.ai/docs/proxy/logging#s3-buckets
    s3_bucket_name: litellm-logs # AWS Bucket Name for S3
    s3_region_name: us-west-2

general_settings:
  store_prompts_in_cold_storage: true
  store_prompts_in_spend_logs: true

  2. Make request 1 with no previous_response_id (new session)

Start a new conversation by making a request without specifying a previous response ID.

curl http://localhost:4000/v1/responses \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-1234" \
    -d '{
        "model": "anthropic/claude-3-5-sonnet-latest",
        "input": "who is Michael Jordan"
    }'

Response:

{
  "id": "resp_123abc",
  "model": "claude-3-5-sonnet-20241022",
  "output": [{
    "type": "message",
    "content": [{
      "type": "output_text",
      "text": "Michael Jordan is widely considered one of the greatest basketball players of all time. He played for the Chicago Bulls (1984-1993, 1995-1998) and Washington Wizards (2001-2003), winning 6 NBA Championships with the Bulls."
    }]
  }]
}

  3. Make request 2 with previous_response_id (same session)

Continue the conversation by referencing the previous response ID to maintain conversation context.

curl http://localhost:4000/v1/responses \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-1234" \
    -d '{
        "model": "anthropic/claude-3-5-sonnet-latest",
        "input": "can you tell me more about him",
        "previous_response_id": "resp_123abc"
    }'

Response:

{
  "id": "resp_456def",
  "model": "claude-3-5-sonnet-20241022",
  "output": [{
    "type": "message",
    "content": [{
      "type": "output_text",
      "text": "Michael Jordan was born February 17, 1963. He attended University of North Carolina before being drafted 3rd overall by the Bulls in 1984. Beyond basketball, he built the Air Jordan brand with Nike and later became owner of the Charlotte Hornets."
    }]
  }]
}

  4. Make request 3 with no previous_response_id (new session)

Start a brand new conversation without referencing previous context to demonstrate how context is not maintained between sessions.

curl http://localhost:4000/v1/responses \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-1234" \
    -d '{
        "model": "anthropic/claude-3-5-sonnet-latest",
        "input": "can you tell me more about him"
    }'

Response:

{
  "id": "resp_789ghi",
  "model": "claude-3-5-sonnet-20241022",
  "output": [{
    "type": "message",
    "content": [{
      "type": "output_text",
      "text": "I don't see who you're referring to in our conversation. Could you let me know which person you'd like to learn more about?"
    }]
  }]
}