Skip to main content

5 posts tagged with "gemini"

View All Tags

DAY 0 Support: Gemini 3.1 Flash Lite Preview on LiteLLM

Sameer Kankute
SWE @ LiteLLM (LLM Translation)
Krrish Dholakia
CEO, LiteLLM
Ishaan Jaff
CTO, LiteLLM

LiteLLM now supports gemini-3.1-flash-lite-preview with full day 0 support!

note

If you only want cost tracking, you need no change in your current Litellm version. But if you want the support for new features introduced along with it like thinking levels, you will need to use v1.80.8-stable.1 or above.

Deploy this version​

docker run litellm
docker run \
-e STORE_MODEL_IN_DB=True \
-p 4000:4000 \
ghcr.io/berriai/litellm:main-v1.80.8-stable.1

What's New​

Supports all four thinking levels:

  • MINIMAL: Ultra-fast responses with minimal reasoning
  • LOW: Simple instruction following
  • MEDIUM: Balanced reasoning for complex tasks
  • HIGH: Maximum reasoning depth (dynamic)

Quick Start​

Basic Usage

from litellm import completion

response = completion(
model="gemini/gemini-3.1-flash-lite-preview",
messages=[{"role": "user", "content": "Extract key entities from this text: ..."}],
)

print(response.choices[0].message.content)

With Thinking Levels

from litellm import completion

# Use MEDIUM thinking for complex reasoning tasks
response = completion(
model="gemini/gemini-3.1-flash-lite-preview",
messages=[{"role": "user", "content": "Analyze this dataset and identify patterns"}],
reasoning_effort="medium", # low, medium , high
)

print(response.choices[0].message.content)

Supported Endpoints​

LiteLLM provides full end-to-end support for Gemini 3.1 Flash Lite Preview on:

  • βœ… /v1/chat/completions - OpenAI-compatible chat completions endpoint
  • βœ… /v1/responses - OpenAI Responses API endpoint (streaming and non-streaming)
  • βœ… /v1/messages - Anthropic-compatible messages endpoint
  • βœ… /v1/generateContent – Google Gemini API compatible endpoint

All endpoints support:

  • Streaming and non-streaming responses
  • Function calling with thought signatures
  • Multi-turn conversations
  • All Gemini 3-specific features (thinking levels, thought signatures)
  • Full multimodal support (text, image, audio, video)

reasoning_effort Mapping for Gemini 3.1​

LiteLLM automatically maps OpenAI's reasoning_effort parameter to Gemini's thinkingLevel:

reasoning_effortthinking_levelUse Case
minimalminimalUltra-fast responses, simple queries
lowlowBasic instruction following
mediummediumBalanced reasoning for moderate complexity
highhighMaximum reasoning depth, complex problems
disableminimalDisable extended reasoning
noneminimalNo extended reasoning

DAY 0 Support: Gemini 3.1 Pro on LiteLLM

Sameer Kankute
SWE @ LiteLLM (LLM Translation)
Krrish Dholakia
CEO, LiteLLM
Ishaan Jaff
CTO, LiteLLM

LiteLLM now supports gemini-3.1-pro-preview and all the new API changes along with it.

Deploy this version​

docker run litellm
docker run \
-e STORE_MODEL_IN_DB=True \
-p 4000:4000 \
ghcr.io/berriai/litellm:main-v1.81.9-stable.gemini.3.1-pro

What's New​

1. New Thinking Levels: thinkingLevel with MINIMAL & MEDIUM​

Gemini 3.1 Pro introduces support for medium thinking level

LiteLLM automatically maps the OpenAI reasoning_effort parameter to Gemini's thinkingLevel, so you can use familiar reasoning_effort values (minimal, low, medium, high) without changing your code!


Supported Endpoints​

LiteLLM provides full end-to-end support for Gemini 3.1 Pro on:

  • βœ… /v1/chat/completions - OpenAI-compatible chat completions endpoint
  • βœ… /v1/responses - OpenAI Responses API endpoint (streaming and non-streaming)
  • βœ… /v1/messages - Anthropic-compatible messages endpoint
  • βœ… /v1/generateContent – Google Gemini API compatible endpoint

All endpoints support:

  • Streaming and non-streaming responses
  • Function calling with thought signatures
  • Multi-turn conversations
  • All Gemini 3-specific features
  • Conversion of provider specific thinking related param to thinkingLevel

Quick Start​

Basic Usage with MEDIUM thinking (NEW)

from litellm import completion

# No need to make any changes to your code as we map openai reasoning param to thinkingLevel
response = completion(
model="gemini/gemini-3.1-pro-preview",
messages=[{"role": "user", "content": "Solve this complex math problem: 25 * 4 + 10"}],
reasoning_effort="medium", # NEW: MEDIUM thinking level
)

print(response.choices[0].message.content)

reasoning_effort Mapping for Gemini 3+​

reasoning_effortthinking_level
minimalminimal
lowlow
mediummedium
highhigh
disableminimal
noneminimal

DAY 0 Support: Gemini 3 Flash on LiteLLM

Sameer Kankute
SWE @ LiteLLM (LLM Translation)
Krrish Dholakia
CEO, LiteLLM
Ishaan Jaff
CTO, LiteLLM

LiteLLM now supports gemini-3-flash-preview and all the new API changes along with it.

note

If you only want cost tracking, you need no change in your current Litellm version. But if you want the support for new features introduced along with it like thinking levels, you will need to use v1.80.8-stable.1 or above.

Deploy this version​

docker run litellm
docker run \
-e STORE_MODEL_IN_DB=True \
-p 4000:4000 \
ghcr.io/berriai/litellm:main-v1.80.8-stable.1

What's New​

1. New Thinking Levels: thinkingLevel with MINIMAL & MEDIUM​

Gemini 3 Flash introduces granular thinking control with thinkingLevel instead of thinkingBudget.

  • MINIMAL: Ultra-lightweight thinking for fast responses
  • MEDIUM: Balanced thinking for complex reasoning
  • HIGH: Maximum reasoning depth

LiteLLM automatically maps the OpenAI reasoning_effort parameter to Gemini's thinkingLevel, so you can use familiar reasoning_effort values (minimal, low, medium, high) without changing your code!

2. Thought Signatures​

Like gemini-3-pro, this model also includes thought signatures for tool calls. LiteLLM handles signature extraction and embedding internally. Learn more about thought signatures.

Edge Case Handling: If thought signatures are missing in the request, LiteLLM adds a dummy signature ensuring the API call doesn't break


Supported Endpoints​

LiteLLM provides full end-to-end support for Gemini 3 Flash on:

  • βœ… /v1/chat/completions - OpenAI-compatible chat completions endpoint
  • βœ… /v1/responses - OpenAI Responses API endpoint (streaming and non-streaming)
  • βœ… /v1/messages - Anthropic-compatible messages endpoint
  • βœ… /v1/generateContent – Google Gemini API compatible endpoint All endpoints support:
  • Streaming and non-streaming responses
  • Function calling with thought signatures
  • Multi-turn conversations
  • All Gemini 3-specific features
  • Converstion of provider specific thinking related param to thinkingLevel

Quick Start​

Basic Usage with MEDIUM thinking (NEW)

from litellm import completion

# No need to make any changes to your code as we map openai reasoning param to thinkingLevel
response = completion(
model="gemini/gemini-3-flash-preview",
messages=[{"role": "user", "content": "Solve this complex math problem: 25 * 4 + 10"}],
reasoning_effort="medium", # NEW: MEDIUM thinking level
)

print(response.choices[0].message.content)

Key Features​

βœ… Thinking Levels: MINIMAL, LOW, MEDIUM, HIGH
βœ… Thought Signatures: Track reasoning with unique identifiers
βœ… Seamless Integration: Works with existing OpenAI-compatible client
βœ… Backward Compatible: Gemini 2.5 models continue using thinkingBudget


Installation​

pip install litellm --upgrade
import litellm
from litellm import completion

response = completion(
model="gemini/gemini-3-flash-preview",
messages=[{"role": "user", "content": "Your question here"}],
reasoning_effort="medium", # Use MEDIUM thinking
)
print(response)
note

If using this model via vertex_ai, keep the location as global as this is the only supported location as of now.

reasoning_effort Mapping for Gemini 3+​

reasoning_effortthinking_level
minimalminimal
lowlow
mediummedium
highhigh
disableminimal
noneminimal

DAY 0 Support: Gemini 3 on LiteLLM

Sameer Kankute
SWE @ LiteLLM (LLM Translation)
Krrish Dholakia
CEO, LiteLLM
Ishaan Jaff
CTO, LiteLLM
info

This guide covers common questions and best practices for using gemini-3-pro-preview with LiteLLM Proxy and SDK.

Quick Start​

from litellm import completion
import os

os.environ["GEMINI_API_KEY"] = "your-api-key"

response = completion(
model="gemini/gemini-3-pro-preview",
messages=[{"role": "user", "content": "Hello!"}],
reasoning_effort="low"
)

print(response.choices[0].message.content)

Supported Endpoints​

LiteLLM provides full end-to-end support for Gemini 3 Pro Preview on:

  • βœ… /v1/chat/completions - OpenAI-compatible chat completions endpoint
  • βœ… /v1/responses - OpenAI Responses API endpoint (streaming and non-streaming)
  • βœ… /v1/messages - Anthropic-compatible messages endpoint
  • βœ… /v1/generateContent – Google Gemini API compatible endpoint (for code, see: client.models.generate_content(...))

All endpoints support:

  • Streaming and non-streaming responses
  • Function calling with thought signatures
  • Multi-turn conversations
  • All Gemini 3-specific features

Thought Signatures​

What are Thought Signatures?​

Thought signatures are encrypted representations of the model's internal reasoning process. They're essential for maintaining context across multi-turn conversations, especially with function calling.

How Thought Signatures Work​

  1. Automatic Extraction: When Gemini 3 returns a function call, LiteLLM automatically extracts the thought_signature from the response
  2. Storage: Thought signatures are stored in provider_specific_fields.thought_signature of tool calls
  3. Automatic Preservation: When you include the assistant's message in conversation history, LiteLLM automatically preserves and returns thought signatures to Gemini

Example: Multi-Turn Function Calling​

Streaming with Thought Signatures​

When using streaming mode with stream_chunk_builder(), thought signatures are now automatically preserved:

import os
import litellm
from litellm import completion

os.environ["GEMINI_API_KEY"] = "your-api-key"

MODEL = "gemini/gemini-3-pro-preview"

messages = [
{"role": "system", "content": "You are a helpful assistant. Use the calculate tool."},
{"role": "user", "content": "What is 2+2?"},
]

tools = [{
"type": "function",
"function": {
"name": "calculate",
"description": "Calculate a mathematical expression",
"parameters": {
"type": "object",
"properties": {"expression": {"type": "string"}},
"required": ["expression"],
},
},
}]

print("Step 1: Sending request with stream=True...")
response = completion(
model=MODEL,
messages=messages,
stream=True,
tools=tools,
reasoning_effort="low"
)

# Collect all chunks
chunks = []
for part in response:
chunks.append(part)

# Reconstruct message using stream_chunk_builder
# Thought signatures are now preserved automatically!
full_response = litellm.stream_chunk_builder(chunks, messages=messages)
print(f"Full response: {full_response}")

assistant_msg = full_response.choices[0].message

# βœ… Thought signature is now preserved in provider_specific_fields
if assistant_msg.tool_calls and assistant_msg.tool_calls[0].provider_specific_fields:
thought_sig = assistant_msg.tool_calls[0].provider_specific_fields.get("thought_signature")
print(f"Thought signature preserved: {thought_sig is not None}")

# Append assistant message (includes thought signatures automatically)
messages.append(assistant_msg)

# Mock tool execution
messages.append({
"role": "tool",
"content": "4",
"tool_call_id": assistant_msg.tool_calls[0].id
})

print("\nStep 2: Sending tool result back to model...")
response_2 = completion(
model=MODEL,
messages=messages,
stream=True,
tools=tools,
reasoning_effort="low"
)

for part in response_2:
if part.choices[0].delta.content:
print(part.choices[0].delta.content, end="")
print() # New line

Key Points:

  • βœ… stream_chunk_builder() now preserves provider_specific_fields including thought signatures
  • βœ… Thought signatures are automatically included when appending assistant_msg to conversation history
  • βœ… Multi-turn conversations work seamlessly with streaming

Important Notes on Thought Signatures​

  1. Automatic Handling: LiteLLM automatically extracts and preserves thought signatures. You don't need to manually manage them.

  2. Parallel Function Calls: When the model makes parallel function calls, only the first function call has a thought signature.

  3. Sequential Function Calls: In multi-step function calling, each step's first function call has its own thought signature that must be preserved.

  4. Required for Context: Thought signatures are essential for maintaining reasoning context. Without them, the model may lose context of its previous reasoning.

Conversation History: Switching from Non-Gemini-3 Models​

Common Question: Will switching from a non-Gemini-3 model to Gemini-3 break conversation history?​

Answer: No! LiteLLM automatically handles this by adding dummy thought signatures when needed.

How It Works​

When you switch from a model that doesn't use thought signatures (e.g., gemini-2.5-flash) to Gemini 3, LiteLLM:

  1. Detects missing signatures: Identifies assistant messages with tool calls that lack thought signatures
  2. Adds dummy signature: Automatically injects a dummy thought signature (skip_thought_signature_validator) for compatibility
  3. Maintains conversation flow: Your conversation history continues to work seamlessly

Example: Switching Models Mid-Conversation​

from openai import OpenAI

client = OpenAI(api_key="sk-1234", base_url="http://localhost:4000")

# Step 1: Start with gemini-2.5-flash (no thought signatures)
messages = [{"role": "user", "content": "What's the weather?"}]

response1 = client.chat.completions.create(
model="gemini-2.5-flash",
messages=messages,
tools=[...],
reasoning_effort="low"
)

# Append assistant message (no tool call thought signature from gemini-2.5-flash)
messages.append(response1.choices[0].message)

# Step 2: Switch to gemini-3-pro-preview
# LiteLLM automatically adds dummy thought signature to the previous assistant message
response2 = client.chat.completions.create(
model="gemini-3-pro-preview", # πŸ‘ˆ Switched model
messages=messages, # πŸ‘ˆ Same conversation history
tools=[...],
reasoning_effort="low"
)

# βœ… Works seamlessly! No errors, no breaking changes
print(response2.choices[0].message.content)

Dummy Signature Details​

The dummy signature used is: base64("skip_thought_signature_validator")

This is the recommended approach by Google for handling conversation history from models that don't support thought signatures. It allows Gemini 3 to:

  • Accept the conversation history without validation errors
  • Continue the conversation seamlessly
  • Maintain context across model switches

Thinking Level Parameter​

How reasoning_effort Maps to thinking_level​

For Gemini 3 Pro Preview, LiteLLM automatically maps reasoning_effort to the new thinking_level parameter:

reasoning_effortthinking_levelNotes
"minimal""low"Maps to low thinking level
"low""low"Default for most use cases
"medium""high"Medium not available yet, maps to high
"high""high"Maximum reasoning depth
"disable""low"Gemini 3 cannot fully disable thinking
"none""low"Gemini 3 cannot fully disable thinking

Default Behavior​

If you don't specify reasoning_effort, LiteLLM automatically sets thinking_level="low" for Gemini 3 models, to avoid high costs.

Example Usage​

from litellm import completion

# Low thinking level (faster, lower cost)
response = completion(
model="gemini/gemini-3-pro-preview",
messages=[{"role": "user", "content": "What's the weather?"}],
reasoning_effort="low" # Maps to thinking_level="low"
)

# High thinking level (deeper reasoning, higher cost)
response = completion(
model="gemini/gemini-3-pro-preview",
messages=[{"role": "user", "content": "Solve this complex math problem step by step."}],
reasoning_effort="high" # Maps to thinking_level="high"
)

Important Notes​

  1. Gemini 3 Cannot Disable Thinking: Unlike Gemini 2.5 models, Gemini 3 cannot fully disable thinking. Even when you set reasoning_effort="none" or "disable", it maps to thinking_level="low".

  2. Temperature Recommendation: For Gemini 3 models, LiteLLM defaults temperature to 1.0 and strongly recommends keeping it at this default. Setting temperature < 1.0 can cause:

    • Infinite loops
    • Degraded reasoning performance
    • Failure on complex tasks
  3. Automatic Defaults: If you don't specify reasoning_effort, LiteLLM automatically sets thinking_level="low" for optimal performance.

Cost Tracking: Prompt Caching & Context Window​

LiteLLM provides comprehensive cost tracking for Gemini 3 Pro Preview, including support for prompt caching and tiered pricing based on context window size.

Prompt Caching Cost Tracking​

Gemini 3 supports prompt caching, which allows you to cache frequently used prompt prefixes to reduce costs. LiteLLM automatically tracks and calculates costs for:

  • Cache Hit Tokens: Tokens that are read from cache (charged at a lower rate)
  • Cache Creation Tokens: Tokens that are written to cache (one-time cost)
  • Text Tokens: Regular prompt tokens that are processed normally

How It Works​

LiteLLM extracts caching information from the prompt_tokens_details field in the usage object:

{
"usage": {
"prompt_tokens": 50000,
"completion_tokens": 1000,
"total_tokens": 51000,
"prompt_tokens_details": {
"cached_tokens": 30000, # Cache hit tokens
"cache_creation_tokens": 5000, # Tokens written to cache
"text_tokens": 15000 # Regular processed tokens
}
}
}

Context Window Tiered Pricing​

Gemini 3 Pro Preview supports up to 1M tokens of context, with tiered pricing that automatically applies when your prompt exceeds 200k tokens.

Automatic Tier Detection​

LiteLLM automatically detects when your prompt exceeds the 200k token threshold and applies the appropriate tiered pricing:

from litellm import completion_cost

# Example: Small prompt (< 200k tokens)
response_small = completion(
model="gemini/gemini-3-pro-preview",
messages=[{"role": "user", "content": "Hello!"}]
)
# Uses base pricing: $0.000002/input token, $0.000012/output token

# Example: Large prompt (> 200k tokens)
response_large = completion(
model="gemini/gemini-3-pro-preview",
messages=[{"role": "user", "content": "..." * 250000}] # 250k tokens
)
# Automatically uses tiered pricing: $0.000004/input token, $0.000018/output token

Cost Breakdown​

The cost calculation includes:

  1. Text Processing Cost: Regular tokens processed at base or tiered rate
  2. Cache Read Cost: Cached tokens read at discounted rate
  3. Cache Creation Cost: One-time cost for writing tokens to cache (applies tiered rate if above 200k)
  4. Output Cost: Generated tokens at base or tiered rate

Example: Viewing Cost Breakdown​

You can view the detailed cost breakdown using LiteLLM's cost tracking:

from litellm import completion, completion_cost

response = completion(
model="gemini/gemini-3-pro-preview",
messages=[{"role": "user", "content": "Explain prompt caching"}],
caching=True # Enable prompt caching
)

# Get total cost
total_cost = completion_cost(completion_response=response)
print(f"Total cost: ${total_cost:.6f}")

# Access usage details
usage = response.usage
print(f"Prompt tokens: {usage.prompt_tokens}")
print(f"Completion tokens: {usage.completion_tokens}")

# Access caching details
if usage.prompt_tokens_details:
print(f"Cache hit tokens: {usage.prompt_tokens_details.cached_tokens}")
print(f"Cache creation tokens: {usage.prompt_tokens_details.cache_creation_tokens}")
print(f"Text tokens: {usage.prompt_tokens_details.text_tokens}")

Cost Optimization Tips​

  1. Use Prompt Caching: For repeated prompt prefixes, enable caching to reduce costs by up to 90% for cached portions
  2. Monitor Context Size: Be aware that prompts above 200k tokens use tiered pricing (2x for input, 1.5x for output)
  3. Cache Management: Cache creation tokens are charged once when writing to cache, then subsequent reads are much cheaper
  4. Track Usage: Use LiteLLM's built-in cost tracking to monitor spending across different token types

Integration with LiteLLM Proxy​

When using LiteLLM Proxy, all cost tracking is automatically logged and available through:

  • Usage Logs: Detailed token and cost breakdowns in proxy logs
  • Budget Management: Set budgets and alerts based on actual usage
  • Analytics Dashboard: View cost trends and breakdowns by token type
# config.yaml
model_list:
- model_name: gemini-3-pro-preview
litellm_params:
model: gemini/gemini-3-pro-preview
api_key: os.environ/GEMINI_API_KEY

litellm_settings:
# Enable detailed cost tracking
success_callback: ["langfuse"] # or your preferred logging service

Using with Claude Code CLI​

You can use gemini-3-pro-preview with Claude Code CLI - Anthropic's command-line interface. This allows you to use Gemini 3 Pro Preview with Claude Code's native syntax and workflows.

Setup​

1. Add Gemini 3 Pro Preview to your config.yaml:

model_list:
- model_name: gemini-3-pro-preview
litellm_params:
model: gemini/gemini-3-pro-preview
api_key: os.environ/GEMINI_API_KEY

litellm_settings:
master_key: os.environ/LITELLM_MASTER_KEY

2. Set environment variables:

export GEMINI_API_KEY="your-gemini-api-key"
export LITELLM_MASTER_KEY="sk-1234567890" # Generate a secure key

3. Start LiteLLM Proxy:

litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000

4. Configure Claude Code to use LiteLLM Proxy:

export ANTHROPIC_BASE_URL="http://0.0.0.0:4000"
export ANTHROPIC_AUTH_TOKEN="$LITELLM_MASTER_KEY"

5. Use Gemini 3 Pro Preview with Claude Code:

# Claude Code will use gemini-3-pro-preview from your LiteLLM proxy
claude --model gemini-3-pro-preview

Example Usage​

Once configured, you can interact with Gemini 3 Pro Preview using Claude Code's native interface:

$ claude --model gemini-3-pro-preview
> Explain how thought signatures work in multi-turn conversations.

# Gemini 3 Pro Preview responds through Claude Code interface

Benefits​

  • βœ… Native Claude Code Experience: Use Gemini 3 Pro Preview with Claude Code's familiar CLI interface
  • βœ… Unified Authentication: Single API key for all models through LiteLLM proxy
  • βœ… Cost Tracking: All usage tracked through LiteLLM's centralized logging
  • βœ… Seamless Model Switching: Easily switch between Claude and Gemini models
  • βœ… Full Feature Support: All Gemini 3 features (thought signatures, function calling, etc.) work through Claude Code

Troubleshooting​

Claude Code not finding the model:

  • Ensure the model name in Claude Code matches exactly: gemini-3-pro-preview
  • Verify your proxy is running: curl http://0.0.0.0:4000/health
  • Check that ANTHROPIC_BASE_URL points to your LiteLLM proxy

Authentication errors:

  • Verify ANTHROPIC_AUTH_TOKEN matches your LiteLLM master key
  • Ensure GEMINI_API_KEY is set correctly
  • Check LiteLLM proxy logs for detailed error messages

Responses API Support​

LiteLLM fully supports the OpenAI Responses API for Gemini 3 Pro Preview, including both streaming and non-streaming modes. The Responses API provides a structured way to handle multi-turn conversations with function calling, and LiteLLM automatically preserves thought signatures throughout the conversation.

Example: Using Responses API with Gemini 3​

from openai import OpenAI
import json

client = OpenAI()

# 1. Define a list of callable tools for the model
tools = [
{
"type": "function",
"name": "get_horoscope",
"description": "Get today's horoscope for an astrological sign.",
"parameters": {
"type": "object",
"properties": {
"sign": {
"type": "string",
"description": "An astrological sign like Taurus or Aquarius",
},
},
"required": ["sign"],
},
},
]

def get_horoscope(sign):
return f"{sign}: Next Tuesday you will befriend a baby otter."

# Create a running input list we will add to over time
input_list = [
{"role": "user", "content": "What is my horoscope? I am an Aquarius."}
]

# 2. Prompt the model with tools defined
response = client.responses.create(
model="gemini-3-pro-preview",
tools=tools,
input=input_list,
)

# Save function call outputs for subsequent requests
input_list += response.output

for item in response.output:
if item.type == "function_call":
if item.name == "get_horoscope":
# 3. Execute the function logic for get_horoscope
horoscope = get_horoscope(json.loads(item.arguments))

# 4. Provide function call results to the model
input_list.append({
"type": "function_call_output",
"call_id": item.call_id,
"output": json.dumps({
"horoscope": horoscope
})
})

print("Final input:")
print(input_list)

response = client.responses.create(
model="gemini-3-pro-preview",
instructions="Respond only with a horoscope generated by a tool.",
tools=tools,
input=input_list,
)

# 5. The model should be able to give a response!
print("Final output:")
print(response.model_dump_json(indent=2))
print("\n" + response.output_text)

Key Points:

  • βœ… Thought signatures are automatically preserved in function calls
  • βœ… Works seamlessly with multi-turn conversations
  • βœ… All Gemini 3-specific features are fully supported

Responses API Benefits​

  • βœ… Structured Output: Responses API provides a clear structure for handling function calls and multi-turn conversations
  • βœ… Thought Signature Preservation: LiteLLM automatically preserves thought signatures in both streaming and non-streaming modes
  • βœ… Seamless Integration: Works with existing OpenAI SDK patterns
  • βœ… Full Feature Support: All Gemini 3 features (thought signatures, function calling, reasoning) are fully supported

Best Practices​

1. Always Include Thought Signatures in Conversation History​

When building multi-turn conversations with function calling:

βœ… Do:

# Append the full assistant message (includes thought signatures)
messages.append(response.choices[0].message)

❌ Don't:

# Don't manually construct assistant messages without thought signatures
messages.append({
"role": "assistant",
"tool_calls": [...] # Missing thought signatures!
})

2. Use Appropriate Thinking Levels​

  • reasoning_effort="low": For simple queries, quick responses, cost optimization
  • reasoning_effort="high": For complex problems requiring deep reasoning

3. Keep Temperature at Default​

For Gemini 3 models, always use temperature=1.0 (default). Lower temperatures can cause issues.

4. Handle Model Switches Gracefully​

When switching from non-Gemini-3 to Gemini-3:

  • βœ… LiteLLM automatically handles missing thought signatures
  • βœ… No manual intervention needed
  • βœ… Conversation history continues seamlessly

Troubleshooting​

Issue: Missing Thought Signatures​

Symptom: Error when including assistant messages in conversation history

Solution: Ensure you're appending the full assistant message from the response:

messages.append(response.choices[0].message)  # βœ… Includes thought signatures

Issue: Conversation Breaks When Switching Models​

Symptom: Errors when switching from gemini-2.5-flash to gemini-3-pro-preview

Solution: This should work automatically! LiteLLM adds dummy signatures. If you see errors, ensure you're using the latest LiteLLM version.

Issue: Infinite Loops or Poor Performance​

Symptom: Model gets stuck or produces poor results

Solution:

  • Ensure temperature=1.0 (default for Gemini 3)
  • Check that reasoning_effort is set appropriately
  • Verify you're using the correct model name: gemini/gemini-3-pro-preview

Additional Resources​

Gemini Embedding 2 Preview: Multimodal Embeddings on LiteLLM

Sameer Kankute
SWE @ LiteLLM (LLM Translation)

LiteLLM now supports multimodal embeddings with gemini-embedding-2-previewβ€”generating a single embedding from a mix of text, images, audio, video, and PDF content. Available via both the Gemini API (API key) and Vertex AI (GCP credentials).

Supported Input Types​

ModalitySupported Formats
TextPlain text
ImagePNG, JPEG
AudioMP3, WAV
VideoMP4, MOV
DocumentsPDF

Input Formats​

LiteLLM accepts three input formats for multimodal content:

  1. Data URIs – Base64-encoded inline: data:image/png;base64,<encoded_data>
  2. GCS URLs – Cloud Storage paths (Vertex AI): gs://bucket/path/to/file.png
  3. Gemini File References – Pre-uploaded files (Gemini API): files/abc123

Quick Start​

from litellm import embedding
import os

os.environ["GEMINI_API_KEY"] = "your-api-key"

# Text + Image (base64)
response = embedding(
model="gemini/gemini-embedding-2-preview",
input=[
"The food was delicious and the waiter...",
"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAgAAAAIAQMAAAD+wSzIAAAABlBMVEX///+/v7+jQ3Y5AAAADklEQVQI12P4AIX8EAgALgAD/aNpbtEAAAAASUVORK5CYII"
],
)
print(response)

Input Format Examples​

FormatExampleProvider
Data URIdata:image/png;base64,...Gemini, Vertex AI
GCS URLgs://bucket/path/image.pngVertex AI
File referencefiles/abc123Gemini API only

Supported MIME Types for Data URIs​

  • Images: image/png, image/jpeg
  • Audio: audio/mpeg, audio/wav
  • Video: video/mp4, video/quicktime
  • Documents: application/pdf

GCS URL MIME Inference​

For Vertex AI, MIME types are inferred from file extensions:

  • .png β†’ image/png
  • .jpg / .jpeg β†’ image/jpeg
  • .mp3 β†’ audio/mpeg
  • .wav β†’ audio/wav
  • .mp4 β†’ video/mp4
  • .mov β†’ video/quicktime
  • .pdf β†’ application/pdf

Optional Parameters​

ParameterDescriptionMaps to
dimensionsOutput embedding sizeoutputDimensionality
response = embedding(
model="gemini/gemini-embedding-2-preview",
input=["text to embed"],
dimensions=768, # Optional: control output vector size
)