7 posts tagged with "ai-gateway"

Benchmarking the LiteLLM Rust AI Gateway: Overhead, Memory, and Cost

July 22, 2026

CTO, LiteLLM

Last Updated: July 2026

We are launching an early beta of the LiteLLM AI Gateway in Rust, and we built AIGatewayBench to measure it against Portkey, Bifrost, and the current LiteLLM Python proxy. Across all four, the LiteLLM Rust gateway has the lowest p99 added latency and the smallest memory footprint by a wide margin: roughly 7x lower overhead and 9x less memory than the next-closest gateway (Bifrost), the lowest cost footprint, and the fastest whole-session times for coding agents. On raw sustained throughput it is close to Bifrost; the separation is in overhead, memory, and cost.

Migrating LiteLLM to Rust - Building the Fastest and Litest AI Gateway

June 22, 2026

Ishaan Jaffer

CTO, LiteLLM

Last Updated: June 2026

Over the past year, we have heard the same thing from our users and our community: they want the fastest, most lightweight AI gateway they can run. We have heard you. We are addressing it by moving LiteLLM to Rust, and committing to sub-1ms overhead with a sub-100MB memory binary you can deploy. By the end of this migration, you will get a pure Rust server that can serve 100% of your AI traffic, with every hot path operation, including auth and rate limiting, running in Rust.

Want to help us build it?

We are opening an early beta and want to work directly with teams who care about a fast, lightweight gateway. If that is you, sign up here and we will get you testing the Rust gateway in your own stack, with a direct line to our team.

The reason it matters: under real load, CPU and memory climb with concurrency, and pods get OOM-killed at the worst time. Today the LiteLLM Python proxy peaks around 359MB of memory under load, and that cost multiplies across every pod, region, and retry you run.

We are already seeing the payoff in benchmarks. The Rust gateway serves about 15x the throughput (453 to 6,782 requests per second) on about 11x less memory (359MB to 32MB), and cuts per-request overhead from about 7.5ms on the Python path to about 0.05ms, well under the 1ms we commit to.

What you get

You deploy a single Rust binary. It uses about 65MB of memory, gateway overhead stays under 1ms, and nothing in your setup changes: same config.yaml, same database, same client API, same providers. You keep LiteLLM's coverage of 100+ LLM providers behind one OpenAI-compatible API, with /chat/completions, /messages, /responses, and every other LLM endpoint LiteLLM supports today, now as the fastest and most lightweight LLM gateway you can self-host.

This is not a v2 and not a rewrite. There is no new major version to migrate to and nothing for you to change. The runtime under the hot path gets faster and lighter while your config stays exactly where it is.

We ship this the careful way. Each route moves to Rust only after it passes our full parity and end-to-end test suite, and it runs in production before the next route starts. Stability is the priority, and we target zero regressions on every release.

A Unified Agent Control Plane

June 10, 2026

Krrish Dholakia

CEO, LiteLLM

Last updated: June 2026

Agent infrastructure is already separating into three layers: models, harnesses, and runtimes. We believe a fourth layer will emerge: the unified agent control plane. This will allow calling agents living in different agent runtimes, all from 1 place.

The reason is that companies will not run every agent on one runtime. Coding agents may run on Bedrock AgentCore or Claude Managed Agents. Data agents may run inside Elastic, Databricks, or Snowflake. Internal workflow agents may run on custom infrastructure. The control plane emerges because companies want one place where all of these agents can be used, regardless of where they were built or run.

But a registry alone is not enough. Anyone can build a list of agents.

The harder problem is invocation. Agent runtimes expose similar primitives — agents, sessions, events, tools — but they do not expose them through the same APIs. So if you want one place to actually use these agents, not just list them, the control plane has to manage agent runtimes, schedules, memory, and sessions.

This is the same pattern LiteLLM saw with models. Companies did not just need a catalog of models. They needed one interface to call them. The only change, is that the primitive is now the agent session, not the model call.

The Stack of the Future

Model stack — today

calling models

Agent stack — future

calling harnesses

Unified API

one interface, many backends

LiteLLM

one API across 100+ models

→

one API across agent runtimes

Managed cloud service

fully hosted, pay-per-use

Bedrock

cloud model inference

→

Claude Managed Agents

cloud model + harness API

Deployment platform

run open-source yourself

SageMaker

deploy OSS models

→

AgentCore · Vertex Agents

deploy OSS harnesses

High-perf serving

throughput & latency engine

vLLM

fast model serving

→

fast harness serving

open gap — no clear winner yet

established / announced player

Each model-stack layer has a mirror in the agent stack. Dashed boxes mark open opportunities.

The important shift is that the gateway is no longer just routing model calls. It is routing agent work.

With LLMs, the stack became:

Models: GPT, Claude, Gemini, Llama
Inference providers: OpenAI, Anthropic, Bedrock, Vertex, Azure, vLLM
Gateway: routing, fallbacks, logging, spend tracking, auth, billing
Applications: copilots, workflows, internal tools, products

With agents, we think the stack becomes:

Models: Claude, GPT, Gemini, open-source models
Harnesses: Claude Code, Codex, OpenCode, Hermes, DeepAgents
Agent runtimes: Claude Managed Agents, Bedrock AgentCore, Gemini Enterprise Agent Platform, self-hosted runtimes
Agent control plane: multi-runtime platform where teams manage agent runtimes, schedules, memory, and sessions.
Applications: coding agents, support agents, data agents, security agents

Why companies will need this

At LiteLLM, we are already seeing our team work across multiple agent runtimes. Some people are building on Claude Managed Agents, others are on N8N or Cursor.

This fragmentation makes it hard for agents built on these platforms to be shareable, and everyone to benefit from the work done so far.

By having the agents live in 1 place, everyone can leverage these agents - even if the PR Babysitter Agent was written in Claude Managed Agents, which not everyone has direct access to.

That is the control plane problem.

This is also why we think the AI Gateway moves up the stack. The gateway starts by managing model calls. But as agents become the dominant use-case for AI, the gateway has to manage agent sessions too.

What we are building

LiteLLM Agent Platform is our experiment in this direction.

LiteLLM Agent Platform is a Rust-based AI Gateway and Agent Control Plane. The goal is to let teams register, invoke, observe, and govern agents across multiple runtimes.

We are starting with coding agents because the need is obvious. They are long-running, stateful, tool-heavy, and expensive enough to require real infrastructure.

We are already seeing early users resonate with this pattern. Some companies want LAP to act as a central control plane for agents built by different teams on different runtimes. For example, one team might build an agent on Elastic’s runtime to analyze Kibana logs, but the company may want to expose that agent internally through a common gateway.

This is the architecture we believe is coming: models become interchangeable, harnesses become specialized, runtimes become managed, and the gateway becomes the control plane for agent work.

If this matches what you are seeing, we would love feedback on LiteLLM Agent Platform:

https://github.com/LiteLLM-Labs/litellm-agent-platform

Frequently Asked Questions

Is LiteLLM building a second product?

No. LAP is an experimental project. The goal is to learn quickly and bring the right pieces into LiteLLM over time.

Is LAP production-ready?

No. LAP is pre-v0. APIs may change as we work with early users and contributors.

If you want to contribute, file an issue or join our Discord:

https://discord.gg/Nkxw3rm3EE

LiteLLM Labs: Announcing Lite-Harness SDK — Unified API for Claude Code, Codex, and Pi AI

June 2, 2026

Krrish Dholakia

CEO, LiteLLM

Ishaan Jaffer

CTO, LiteLLM

Harnesses are the next frontier of vendor lock-in. LiteLLM was built to swap across model providers easily. However, as the models get saturated, the next area for competition becomes the harnesses and managed agents. To make it easy to go across vendors at the harness layer, we're launching the Lite-Harness SDK. This is a simple TypeScript+Python SDK which allows developers to change harnesses, like they change models.

It exposes harnesses in a unified Claude Agents SDK spec. This means that if you wrote your app with the Claude Agents SDK, and want to try another harness (Pi AI, Hermes, Codex, OpenCode), you can do so without rewriting your code.

Today, it supports 3 harnesses - Claude Code, Codex, and Pi AI. Please file an issue here, if you want us to add another harness.

Here's how it works:

TypeScript Example

import { query } from "@lite-harness/sdk";

const prompt = "Fix the failing test";

// Claude Code harness
for await (const message of query({
  prompt,
  options: { harness: "claude-code", model: "claude-opus-4-8" },
})) {
  console.log(message);
}

// Codex harness
for await (const message of query({
  prompt,
  options: { harness: "codex", model: "gpt-5.5" },
})) {
  console.log(message);
}

Python Example

from lite_harness import query, AgentOptions

prompt = "Fix the failing test"

# Claude Code harness
async for message in query(
    prompt=prompt,
    options=AgentOptions(harness="claude-code", model="claude-opus-4-8"),
):
    print(message)

# Codex harness
async for message in query(
    prompt=prompt,
    options=AgentOptions(harness="codex", model="gpt-5.5"),
):
    print(message)

LiteLLM AI Gateway

Lite-Harness supports proxy'ing harnesses via LiteLLM AI Gateway. This enables easy model swapping, cost controls and logging.

Point Lite-Harness at your gateway by setting two environment variables:

export LITELLM_API_BASE=https://litellm.your-company.com/v1
export LITELLM_API_KEY=sk-litellm-...

Then call as usual — every underlying model request routes through the gateway:

from lite_harness import query, AgentOptions

prompt = "Fix the failing test"

# Claude Code harness
async for message in query(
    prompt=prompt,
    options=AgentOptions(harness="claude-code", model="claude-opus-4-8"),
):
    print(message)

# Codex harness
async for message in query(
    prompt=prompt,
    options=AgentOptions(harness="codex", model="gpt-5.5"),
):
    print(message)

Frequently Asked Questions

Do I have to use the LiteLLM AI Gateway?

No. lite-harness works standalone — point it at provider APIs with native keys. AI Gateway integration is opt-in for teams that want central key management, budgets, fallbacks, and a single audit log across every model call.

Does swapping harnesses change agent behavior?

Yes — that's the point. Each harness keeps its native loop, tool-calling semantics, and prompt format. lite-harness unifies how you invoke them, not how they run internally. Run the same prompt across all three to see which combo lands the task best.

Is this ready for production?

lite-harness is an early, experimental project. This is in public beta. Please join our discord, to help design it to your preference.

Is this available in LiteLLM OSS?

Yes. lite-harness is MIT-licensed at github.com/LiteLLM-Labs/lite-harness. LiteLLM Enterprise adds SSO/SCIM, air-gapped deployment, 24/7 SLA, and advanced guardrails on top of the AI Gateway it pairs with.

How we built a background agent to cover 30% of our backlog

May 27, 2026

Krrish Dholakia

CEO, LiteLLM

Ishaan Jaffer

CTO, LiteLLM

LiteLLM Agent Platform: agent.litellm.ai

info

The platform we built is open source. Check out litellm-agent-platform. The swappable harness layer is lite-harness.

Building the same thing inside your company?

Our goal was to 10x the productivity of our company with agents.

Three weeks ago we began building an agent that could own 30% of our engineering tickets. Here's what we've learnt so far.

Announcing Componentized Deployments

May 18, 2026

Yassin Kortam

Senior SWE @ LiteLLM

Last Updated: May 2026

The LiteLLM proxy container does 2 very different things. It's an LLM data plane, /chat/completions, /v1/messages, embeddings, passthroughs, where latency is measured in single-digit milliseconds of overhead and traffic is high-volume and bursty. It's also a management control plane — keys, teams, SSO, audit logs, and the spend/usage analytics that power the dashboard, where a single request can scan millions of rows.

Run both on the same event loop, and the slowest thing the control plane does sets the reliability floor for the fastest thing the data plane does. This post is about how we've improved LiteLLM's reliability at scale by offering a componentized deployment model.

Making the AI Gateway Resilient to Redis Failures

April 11, 2026

Ishaan Jaffer

CTO, LiteLLM

Last Updated: April 2026

Enterprise AI Gateway deployments put Redis in the hot path for nearly every request: rate limiting, cache lookups, spend tracking. When Redis is healthy, the latency contribution is single-digit milliseconds — invisible to end users. When it degrades, a production AI Gateway needs to stay up regardless.

Running LiteLLM at scale across 100+ pods means designing for failure modes before they appear. The easy case is Redis going fully down: fail fast, fall through to the database, continue serving requests. The hard case — the one that takes down gateways — is a slow Redis: still accepting connections, still responding, but timing out after 20-30 seconds per operation.

What you get​

The Stack of the Future​

Why companies will need this​

What we are building​

Frequently Asked Questions​

Is LiteLLM building a second product?​

Is LAP production-ready?​

LiteLLM AI Gateway​

Frequently Asked Questions​

Do I have to use the LiteLLM AI Gateway?​

Does swapping harnesses change agent behavior?​

Is this ready for production?​

Is this available in LiteLLM OSS?​

Recommended Reading​

What you get

The Stack of the Future

Why companies will need this

What we are building

Frequently Asked Questions

Is LiteLLM building a second product?

Is LAP production-ready?

LiteLLM AI Gateway

Frequently Asked Questions

Do I have to use the LiteLLM AI Gateway?

Does swapping harnesses change agent behavior?

Is this ready for production?

Is this available in LiteLLM OSS?

Recommended Reading