
/evals

LiteLLM Proxy supports OpenAI's Evaluations (Evals) API, allowing you to create, manage, and run evaluations to measure model performance against defined testing criteria.

What are Evals?​

The OpenAI Evals API provides a structured way to:

  • Create Evaluations: Define testing criteria and data sources for evaluating model outputs
  • Run Evaluations: Execute evaluations against specific models and datasets
  • Track Results: Monitor evaluation progress and review detailed results

Quick Start​

Setup LiteLLM Proxy​

First, start your LiteLLM Proxy server:

litellm --config config.yaml

# Proxy will run on http://localhost:4000
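
The contents of config.yaml depend on your deployment. A minimal sketch, assuming the proxy serves OpenAI models with the API key read from the environment (the model names here are placeholders):

# config.yaml -- minimal sketch; model names and key handling are assumptions
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY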

Initialize OpenAI Client​

from openai import OpenAI

# Point to your LiteLLM Proxy
client = OpenAI(
    api_key="sk-1234",               # Your LiteLLM proxy API key
    base_url="http://localhost:4000" # Your proxy URL
)

For async operations:

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="sk-1234",
    base_url="http://localhost:4000"
)
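
The async examples later on this page use top-level await, which works in a notebook or async REPL. In a plain Python script, wrap the calls in a coroutine and run it with asyncio.run, for example:

import asyncio
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(
        api_key="sk-1234",
        base_url="http://localhost:4000"
    )
    # Place the `await ...` snippets from this page inside this coroutine
    evals_page = await client.evals.list(limit=1)
    print(evals_page.data)

asyncio.run(main())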

Evaluation Management​

Create an Evaluation​

Create an evaluation with testing criteria and data source configuration.

Example: Sentiment Classification Eval​

from openai import OpenAI

client = OpenAI(
    api_key="sk-1234",
    base_url="http://localhost:4000"
)

# Create evaluation with label model grader
eval_obj = client.evals.create(
    name="Sentiment Classification",
    data_source_config={
        "type": "stored_completions",
        "metadata": {"usecase": "chatbot"}
    },
    testing_criteria=[
        {
            "type": "label_model",
            "model": "gpt-4o-mini",
            "input": [
                {
                    "role": "developer",
                    "content": "Classify the sentiment of the following statement as one of 'positive', 'neutral', or 'negative'"
                },
                {
                    "role": "user",
                    "content": "Statement: {{item.input}}"
                }
            ],
            "passing_labels": ["positive"],
            "labels": ["positive", "neutral", "negative"],
            "name": "Sentiment Grader"
        }
    ]
)

# Note: To use model-specific credentials for this evaluation, specify the model name in the request's extra body parameters (see the sketch below).

print(f"Created eval: {eval_obj.id}")
print(f"Eval name: {eval_obj.name}")

Example: Push Notifications Summarizer Monitoring​

This example shows how to monitor prompt changes for regressions in a push notifications summarizer:

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="sk-1234",
    base_url="http://localhost:4000"
)

# Define data source for stored completions
data_source_config = {
    "type": "stored_completions",
    "metadata": {
        "usecase": "push_notifications_summarizer"
    }
}

# Define grader criteria
GRADER_DEVELOPER_PROMPT = """
Label the following push notification summary as either correct or incorrect.
The push notification and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct, if not, then incorrect.
"""

GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.input}}
Summary: {{sample.output_text}}
"""

push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "gpt-4o-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["correct"],
    "labels": ["correct", "incorrect"],
}

# Create the evaluation
eval_result = await client.evals.create(
    name="Push Notification Completion Monitoring",
    metadata={"description": "This eval monitors completions"},
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)

eval_id = eval_result.id
print(f"Created eval: {eval_id}")

List Evaluations​

Retrieve a list of all your evaluations with pagination support.

# List all evaluations
evals_response = client.evals.list(
    limit=20,
    order="desc"
)

for eval_item in evals_response.data:
    print(f"Eval ID: {eval_item.id}, Name: {eval_item.name}")

# Check if there are more evals
if evals_response.has_more:
    # Fetch next page
    next_evals = client.evals.list(
        after=evals_response.last_id,
        limit=20
    )
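
If you have more evaluations than a single page returns, you can keep following last_id until has_more is false, using the same response fields as above. A minimal sketch:

# Page through every evaluation by following last_id until has_more is False
all_evals = []
page = client.evals.list(limit=20, order="desc")
all_evals.extend(page.data)
while page.has_more:
    page = client.evals.list(limit=20, order="desc", after=page.last_id)
    all_evals.extend(page.data)

print(f"Fetched {len(all_evals)} evaluations in total")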

Get a Specific Evaluation​

Retrieve details of a specific evaluation by ID.

eval_obj = client.evals.retrieve(
    eval_id="eval_abc123"
)

print(f"Eval ID: {eval_obj.id}")
print(f"Name: {eval_obj.name}")
print(f"Data Source: {eval_obj.data_source_config}")
print(f"Testing Criteria: {eval_obj.testing_criteria}")

Update an Evaluation​

Update evaluation metadata or name.

updated_eval = client.evals.update(
    eval_id="eval_abc123",
    name="Updated Evaluation Name",
    metadata={
        "version": "2.0",
        "updated_by": "user@example.com"
    }
)

print(f"Updated eval: {updated_eval.name}")

Delete an Evaluation​

Permanently delete an evaluation.

delete_response = client.evals.delete(
    eval_id="eval_abc123"
)

print(f"Deleted: {delete_response.deleted}") # True

Evaluation Runs​

Create a Run​

Execute an evaluation by creating a run. The run processes your data through the model and applies testing criteria.

Using Stored Completions​

First, generate some test data by making chat completions with metadata:

from openai import AsyncOpenAI
import asyncio

client = AsyncOpenAI(
    api_key="sk-1234",
    base_url="http://localhost:4000"
)

# Generate test data with different prompt versions
push_notification_data = [
    """
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
    """
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
"""
]

PROMPTS = [
    (
        """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
""",
        "v1"
    ),
    (
        """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
The summary should be longer than it needs to be and include more information than is necessary.
Output only the final summary, nothing else.
""",
        "v2"
    )
]

# Create completions with metadata for tracking
tasks = []
for notifications in push_notification_data:
    for (prompt, version) in PROMPTS:
        tasks.append(client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "developer", "content": prompt},
                {"role": "user", "content": notifications},
            ],
            metadata={
                "prompt_version": version,
                "usecase": "push_notifications_summarizer"
            }
        ))

await asyncio.gather(*tasks)

Now create runs to evaluate different prompt versions:

# Grade prompt_version=v1
eval_run_result = await client.evals.runs.create(
    eval_id=eval_id,
    name="v1-run",
    data_source={
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {
                "prompt_version": "v1",
            }
        }
    }
)

print(f"Run ID: {eval_run_result.id}")
print(f"Status: {eval_run_result.status}")
print(f"Report URL: {eval_run_result.report_url}")

# Grade prompt_version=v2
eval_run_result_v2 = await client.evals.runs.create(
    eval_id=eval_id,
    name="v2-run",
    data_source={
        "type": "completions",
        "source": {
            "type": "stored_completions",
            "metadata": {
                "prompt_version": "v2",
            }
        }
    }
)

print(f"Run ID: {eval_run_result_v2.id}")
print(f"Report URL: {eval_run_result_v2.report_url}")

Using Completions with Different Models​

Test how different models perform on the same inputs:

# Test with GPT-4o using stored completions as input
tasks = []
for prompt_version in ["v1", "v2"]:
tasks.append(client.evals.runs.create(
eval_id=eval_id,
name=f"gpt-4o-run-{prompt_version}",
data_source={
"type": "completions",
"input_messages": {
"type": "item_reference",
"item_reference": "item.input",
},
"model": "gpt-4o",
"source": {
"type": "stored_completions",
"metadata": {
"prompt_version": prompt_version,
}
}
}
))

results = await asyncio.gather(*tasks)
for run in results:
print(f"Report URL: {run.report_url}")

List Runs​

Get all runs for a specific evaluation.

# List all runs for an evaluation
runs_response = client.evals.runs.list(
    eval_id="eval_abc123",
    limit=20,
    order="desc"
)

for run in runs_response.data:
    print(f"Run ID: {run.id}")
    print(f"Status: {run.status}")
    print(f"Name: {run.name}")
    if run.result_counts:
        print(f"Results: {run.result_counts.passed}/{run.result_counts.total} passed")

Get Run Details​

Retrieve detailed information about a specific run, including results.

run = client.evals.runs.retrieve(
    eval_id="eval_abc123",
    run_id="run_def456"
)

print(f"Run ID: {run.id}")
print(f"Status: {run.status}")
print(f"Started: {run.started_at}")
print(f"Completed: {run.completed_at}")

# Check results
if run.result_counts:
    print(f"\nOverall Results:")
    print(f"Total: {run.result_counts.total}")
    print(f"Passed: {run.result_counts.passed}")
    print(f"Failed: {run.result_counts.failed}")
    print(f"Errored: {run.result_counts.errored}")

# Per-criteria results
if run.per_testing_criteria_results:
    for criteria_result in run.per_testing_criteria_results:
        print(f"\nCriteria {criteria_result.testing_criteria_index}:")
        print(f"  Passed: {criteria_result.result_counts.passed}")
        print(f"  Average Score: {criteria_result.average_score}")

Delete a Run​

Permanently delete a run and its results.

delete_response = await client.evals.runs.delete(
    eval_id="eval_abc123",
    run_id="run_def456"
)

print(f"Deleted: {delete_response.deleted}") # True
print(f"Run ID: {delete_response.run_id}")