
[PRE-RELEASE] v1.72.6-stable

Krrish Dholakia
Ishaan Jaffer

This is a pre-release version.

The production version will be released on Wednesday.

Deploy this version​

```shell
docker run \
  -e STORE_MODEL_IN_DB=True \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-v1.72.6.rc
```

TLDR​

  • Why Upgrade
    • Codex-mini on Claude Code: You can now use codex-mini (OpenAI’s code assistant model) via Claude Code.
    • MCP Permissions Management: Manage permissions for MCP Servers by Keys, Teams, Organizations (entities) on LiteLLM.
    • UI: Turn on/off auto refresh on logs view.
    • Rate Limiting: Support for output token-only rate limiting.
  • Who Should Read
    • Teams using /v1/messages API (Claude Code)
    • Teams using MCP
    • Teams giving access to self-hosted models and setting rate limits
  • Risk of Upgrade
    • Low
      • No major changes to existing functionality or package updates.

Key Highlights​

MCP Permissions Management​

This release brings support for managing permissions for MCP Servers by Keys, Teams, Organizations (entities) on LiteLLM. When an MCP client attempts to list tools, LiteLLM only returns the tools the entity has permission to access.

This is great for use cases involving MCP servers that expose restricted data (e.g., a Jira MCP server) that you don't want everyone to access.

For Proxy Admins, this enables centralized management of all MCP Servers with access control. For developers, this means you'll only see the MCP tools assigned to you.
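
As a rough client-side sketch of this behavior: assuming a proxy at localhost:4000 serving MCP over streamable HTTP at /mcp and authenticated via an x-litellm-api-key header (both are assumptions here, not confirmed by this release note), listing tools with a given key returns only that entity's permitted tools:

```python
# Sketch: list the MCP tools visible to a given LiteLLM key, using the official
# MCP Python SDK. The /mcp/ path and x-litellm-api-key header are assumptions --
# check the MCP Permission Management docs linked below for the exact setup.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def list_permitted_tools(litellm_key: str) -> list[str]:
    async with streamablehttp_client(
        "http://localhost:4000/mcp/",
        headers={"x-litellm-api-key": f"Bearer {litellm_key}"},
    ) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            result = await session.list_tools()
            return [tool.name for tool in result.tools]


# A key on a team with Jira MCP access sees the Jira tools; other keys don't.
print(asyncio.run(list_permitted_tools("sk-placeholder-key")))
```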

Codex-mini on Claude Code​

This release brings support for calling codex-mini (OpenAI’s code assistant model) via Claude Code.

Under the hood, LiteLLM now lets any Responses API model (including o3-pro) be called via the /chat/completions and /v1/messages endpoints. This includes:

  • Streaming calls
  • Non-streaming calls
  • Cost Tracking on success + failure for Responses API models

Here's how to use it today:
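
Below is a minimal sketch using the Anthropic SDK pointed at a LiteLLM proxy; the proxy URL, virtual key, and codex-mini model alias are placeholders rather than confirmed names from this release.

```python
# Sketch: call codex-mini through LiteLLM's /v1/messages bridge.
# Assumes a LiteLLM proxy at localhost:4000 with a model named "codex-mini"
# configured; the key and model alias below are placeholders.
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:4000",  # your LiteLLM proxy
    api_key="sk-1234",                 # your LiteLLM virtual key
)

response = client.messages.create(
    model="codex-mini",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(response.content[0].text)
```

Since the bridge also covers /chat/completions, the same model can be called with the OpenAI SDK against the proxy.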


New / Updated Models​

Pricing / Context Window Updates​

| Provider | Model | Context Window | Input ($/1M tokens) | Output ($/1M tokens) | Type |
|----------|-------|----------------|---------------------|----------------------|------|
| VertexAI | vertex_ai/claude-opus-4 | 200K | $15.00 | $75.00 | New |
| OpenAI | gpt-4o-audio-preview-2025-06-03 | 128k | $2.50 (text), $40.00 (audio) | $10.00 (text), $80.00 (audio) | New |
| OpenAI | o3-pro | 200k | $20.00 | $80.00 | New |
| OpenAI | o3-pro-2025-06-10 | 200k | $20.00 | $80.00 | New |
| OpenAI | o3 | 200k | $2.00 | $8.00 | Updated |
| OpenAI | o3-2025-04-16 | 200k | $2.00 | $8.00 | Updated |
| Azure | azure/gpt-4o-mini-transcribe | 16k | $1.25 (text), $3.00 (audio) | $5.00 (text) | New |
| Mistral | mistral/magistral-medium-latest | 40k | $2.00 | $5.00 | New |
| Mistral | mistral/magistral-small-latest | 40k | $0.50 | $1.50 | New |

Updated Models​

Bugs​

  • Watsonx
    • Ignore space_id on Watsonx deployments (it was throwing JSON errors) - PR
  • Ollama
    • Set tool call id for streaming calls - PR
  • Gemini (VertexAI + Google AI Studio)
    • Fix tool call indexes - PR
    • Handle empty string for arguments in function calls - PR
    • Add audio/ogg mime type support when inferring mime types from file URLs - PR
  • Custom LLM
    • Fix passing api_base, api_key, litellm_params_dict to custom_llm embedding methods - PR (s/o ElefHead)
  • Huggingface
    • Add /chat/completions to endpoint url when missing - PR
  • Deepgram
    • Support async httpx calls - PR
  • Anthropic
    • Append prefix (if set) to assistant content start - PR

Features​

  • VertexAI
    • Support vertex credentials set via env var on passthrough - PR
    • Support for choosing ‘global’ region when model is only available there - PR
    • Anthropic passthrough cost calculation + token tracking - PR
    • Support ‘global’ vertex region on passthrough - PR
  • Anthropic
  • Perplexity
  • Mistral
  • SGLang
    • Map context window exceeded error for proper handling - PR
  • Deepgram
    • Provider specific params support - PR
  • Azure
    • Return content safety filter results - PR

LLM API Endpoints​

Bugs​

  • Chat Completion
    • Streaming - Ensure consistent ‘created’ across chunks - PR

Features​

  • MCP
    • Add controls for MCP Permission Management - PR, Docs
    • Add permission management for MCP List + Call Tool operations - PR, Docs
    • Streamable HTTP server support - PR, PR, Docs
    • Use experimental dedicated REST endpoints for listing and calling MCP tools - PR
  • Responses API
    • NEW API Endpoint - List input items - PR
    • Background mode for OpenAI + Azure OpenAI - PR
    • Langfuse (and other) logging support on Responses API requests - PR
  • Chat Completions
    • Bridge for Responses API - allows calling codex-mini via /chat/completions and /v1/messages - PR, PR


Management Endpoints / UI​

Bugs​

  • Users
    • /user/info - fix passing user IDs containing a + character
    • Add admin-initiated password reset flow - PR
    • Fix default user settings UI rendering error - PR
  • Budgets
    • Correct success message when new user budget is created - PR

Features​

  • Leftnav
    • Show remaining Enterprise users on UI
  • MCP
    • New server add form - PR
    • Allow editing mcp servers - PR
  • Models
    • Add Deepgram models on UI
    • Model Access Group support on UI - PR
  • Keys
    • Trim long user IDs - PR
  • Logs
    • Add live tail feature to logs view, letting users disable auto refresh in high-traffic environments - PR
    • Audit Logs - preview screenshot - PR

Logging / Guardrails Integrations​

Features​

  • Lasso Guardrails
    • [NEW] Lasso Guardrails support - PR
  • Users
    • New organizations param on /user/new - allows adding users to orgs on creation - PR
  • Prevent double logging when using bridge logic - PR

Performance / Reliability Improvements​

Features​

  • Caching
    • New optional ‘litellm[caching]’ pip install for adding disk cache dependencies - PR (see the sketch below)
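
A minimal sketch of using the new extra, assuming the default disk cache behavior (the cache directory below is arbitrary):

```python
# Sketch: enable LiteLLM's disk cache after `pip install "litellm[caching]"`,
# which pulls in the disk cache dependencies. The cache directory is arbitrary.
import litellm
from litellm.caching.caching import Cache

litellm.cache = Cache(type="disk", disk_cache_dir="/tmp/litellm-cache")

# Repeated identical calls are served from the on-disk cache.
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is LiteLLM?"}],
)
print(response.choices[0].message.content)
```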

General Proxy Improvements​

Bugs​

  • aiohttp
    • Fixes for transfer-encoding error on aiohttp transport - PR

Features​

  • aiohttp
    • Enable System Proxy Support for aiohttp transport - PR (s/o idootop)
  • CLI
    • Make all commands show server URL - PR
  • Uvicorn
    • Allow setting keep alive timeout - PR
  • Experimental Rate Limiting v2 (enable via EXPERIMENTAL_MULTI_INSTANCE_RATE_LIMITING="True")
    • Support specifying rate limit by output_tokens only - PR (see the sketch after this list)
    • Decrement parallel requests on call failure - PR
    • In-memory only rate limiting support - PR
    • Return remaining rate limits by key/user/team - PR
  • Helm
    • Support extraContainers in migrations-job.yaml - PR
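
As a hedged sketch of the output-token-only rate limit: the output_tpm_limit field below is a hypothetical name (check the linked PR for the exact parameter), and the proxy is assumed to be started with EXPERIMENTAL_MULTI_INSTANCE_RATE_LIMITING="True".

```python
# Hypothetical sketch: issue a key whose rate limit counts output tokens only.
# "output_tpm_limit" is a placeholder field name -- see the linked PR for the
# real one. Assumes an admin key and a proxy at localhost:4000.
import requests

resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-1234"},  # admin key (placeholder)
    json={"output_tpm_limit": 10_000},            # hypothetical output-tokens/minute cap
)
print(resp.json()["key"])
```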


Demo Instance​

Here's a Demo Instance to test changes:

Git Diff​