Blog
Skip to main content

How we built a background agent to cover 30% of our backlog

Krrish Dholakia
CEO, LiteLLM
Ishaan Jaffer
CTO, LiteLLM
LiteLLM Agent Platform: agent.litellm.ai
info

The platform we built is open source. Check out litellm-agent-platform. The swappable harness layer is lite-harness.

Building the same thing inside your company?

Our goal was to 10x the productivity of our company with agents.

Three weeks ago we began building an agent that could own 30% of our engineering tickets. Here's what we've learnt so far.

What we shipped​

Three weeks in, on BerriAI/litellm: 43 open PRs, 160 closed. Between the PRs it lands and the Slack questions it answers, the agent now covers roughly 30% of the eng tickets that used to hit a human every week. Browse all agent-filed PRs on GitHub.

Why we built our own​

We wanted an agent that runs autonomously, in the background, pulling tickets off Linear and filing PRs for us. We evaluated Cursor and Anthropic's managed agent platforms first. Neither fit:

  • Cursor: agents were not stateful. You could not store memory, skills, etc. per agent. The platform equated an agent to a session; we wanted an agent that persists across them.
  • Anthropic: close to what we wanted, but we wanted to swap models and harnesses freely. We did not want to be locked to one platform.

So we built on the LiteLLM Agent Platform.

1. Infrastructure: separate the brain from the sandbox​

Our first version ran the agent inside the sandbox, following the same shape as Ramp Inspect. Every new session booted a fresh sandbox. That is fine when the work is "go edit code." It is wasteful when an engineer just asks a question in Slack. You pay a full sandbox boot to answer something that needs a few tool calls.

So we split the agent in two. The brain (reasoning, planning, model calls) lives in a shared, persistent pod. It has no shell: no BASH, no filesystem. The sandbox is ephemeral, one per session, and the only thing that can run git, gh, or pytest. The brain reaches it through two tool calls. This is similar to how Anthropic's managed agent platform works.

Architecture: a persistent brain pod with no shell, talking to an ephemeral per-session sandbox pool through two tool calls

Response time dropped, session success rates climbed, and cost per session fell.

The cold start showed up most visibly in Slack, where everyone could feel the wait.

Slack thread waiting on a cold sandbox boot before the agent could respond

2. Architecture: pick a harness, not an agent framework​

We started with agent frameworks: Pydantic AI, LangGraph, the PI SDK. Each one made us rebuild things a coding harness already ships: context compaction, sub-agent spawning, tool loops. We already trusted Claude Code locally for this work, so we went looking for a harness, not a framework.

We landed on OpenCode. The Claude Agents SDK spawns a CLI session per run and OOM'd for us at ~1 RPM. OpenCode hits the same fundamental bottleneck (long-running sessions held in memory), but its memory usage grew more slowly, making it the better fit for now.

That choice stays flexible because we also wrote a harness unification layer, BerriAI/lite-harness, which adapts OpenCode, Claude Code, Codex, and others to a single HTTP contract:

lite-harness/
opencode/ # runtime adapter
claude-agent-sdk/ # runtime adapter
contract.py # the one interface every runtime implements

The agent platform doesn't know which harness is behind a session, so swapping is a config change, not a rewrite.

Our next goal: 100 RPM on the harness.

3. Security: scope every credential to one endpoint​

Our agent kept leaking API keys from its environment into commits and Slack messages. First mitigation: a small HTTP proxy vault. We stubbed the real credentials in the environment and swapped the stub for the real value only when the agent made an outbound call.

The agent defeated it. It noticed the credentials were stubbed, then wrote its own endpoint, called it with the stubbed credentials, let the vault swap in the real ones on the way out, and read the real keys back off its own server, then stored them to memory via a tool call. A clean man-in-the-middle against our own vault.

Ishaan catching the agent writing real credentials to its memory after circumventing the stub vault

The fix was to stop trusting the value and start binding it to a destination. Each credential is pinned to one upstream host; the vault refuses the swap if the outbound request is going anywhere else:

# vault: a credential is only ever swapped in for its bound host
credentials:
GITHUB_TOKEN:
allowed_host: api.github.com
OPENAI_API_KEY:
allowed_host: api.openai.com

The lesson: guardrails must sit at the agent's input/output boundary. LLM-level guardrails can't distinguish between a user query and an internal tool loop, so they're either too permissive or too slow.

Where the AI Gateway fits​

The AI Gateway is a useful access control point: it's how we gave our agent access to models and MCP tools. But it's only half the picture. The agent boundary needs its own guardrails and capabilities (skills, memory), because the agent (not the model) is what takes actions. The guardrails needed when an agent answers a user differ from those needed inside an internal tool loop. Running model-level guardrails on every tool call also adds ~5 minutes per session.

What we believe now​

Autonomous agents are where the 10x productivity gains are, and the technical risk is largely solved. Models are already smart enough to file a decent PR. The hard problems left are product problems: scale, reliability, and security.

For us, that means two open problems:

  • Scale: how do you serve 100 RPM on a harness that keeps sessions in memory?
  • Security: how do you prevent the agent server from leaking sensitive information or taking destructive actions? (We tried MCPs but hit rate limits and structural issues, so direct API keys were more reliable, which is what made scoping credentials critical.)

Try it​

Both repos are open source and self-hostable: litellm-agent-platform and lite-harness. If you're building something similar and want to skip the three weeks of mistakes, book a 30-minute chat or join the LAP Discord.

This blog was inspired in shape by Ramp's Why we built our background agent.

We're hiring

Like what you see? Join us

Come build the future of AI infrastructure.