
Gemini Embedding 2 Preview: Multimodal Embeddings on LiteLLM

Sameer Kankute
SWE @ LiteLLM (LLM Translation)

LiteLLM now supports multimodal embeddings with gemini-embedding-2-preview, mixing text, images, audio, video, and PDF content in a single request. Available via both the Gemini API (API key) and Vertex AI (GCP credentials).

Response shape differs by provider
  • Gemini API (gemini/...): each input element returns its own embedding, indexed 0..N-1, the same shape as OpenAI's /embeddings. LiteLLM routes to the batchEmbedContents endpoint with one EmbedContentRequest per input.
  • Vertex AI (vertex_ai/...): all input elements are combined into a single unified embedding via embedContent. Vertex AI does not expose batchEmbedContents for Gemini embedding models, so N parts yield 1 vector. To get one vector per item, call embedding(...) once per input.
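
Getting per-item vectors on Vertex AI therefore means one request per input. A minimal sketch: `embed_each` is our own illustrative helper, not a LiteLLM API, and it assumes the OpenAI-style `response.data[i]["embedding"]` shape LiteLLM returns; `embed_fn` is injectable so the loop can be exercised without credentials.

```python
from typing import Callable, List, Optional


def embed_each(model: str, inputs: List[str],
               embed_fn: Optional[Callable] = None) -> List[list]:
    """Return one embedding per input by issuing one request per item.

    Useful on vertex_ai/ models, where a single multi-part request is
    fused into one vector instead of returning N vectors.
    """
    if embed_fn is None:
        # Deferred import: only needed when making real calls.
        from litellm import embedding
        embed_fn = embedding

    vectors = []
    for item in inputs:
        response = embed_fn(model=model, input=[item])
        vectors.append(response.data[0]["embedding"])
    return vectors
```

For example, `embed_each("vertex_ai/gemini-embedding-2-preview", [text, image_uri])` would return two vectors where a single batched call would return one.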

Supported Input Types

| Modality  | Supported Formats |
| --------- | ----------------- |
| Text      | Plain text        |
| Image     | PNG, JPEG         |
| Audio     | MP3, WAV          |
| Video     | MP4, MOV          |
| Documents | PDF               |

Input Formats

LiteLLM accepts three input formats for multimodal content:

  1. Data URIs – Base64-encoded inline: data:image/png;base64,<encoded_data>
  2. GCS URLs – Cloud Storage paths (Vertex AI): gs://bucket/path/to/file.png
  3. Gemini File References – Pre-uploaded files (Gemini API): files/abc123
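
The three forms can be told apart by their prefixes. A small classifier sketch (our own illustrative helper, not a LiteLLM API):

```python
def classify_input(item: str) -> str:
    """Label an input string as one of the accepted multimodal forms."""
    if item.startswith("data:"):
        return "data_uri"        # base64-encoded inline content
    if item.startswith("gs://"):
        return "gcs_url"         # Cloud Storage path (Vertex AI)
    if item.startswith("files/"):
        return "file_reference"  # pre-uploaded file (Gemini API)
    return "text"                # plain text
```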

Quick Start

```python
from litellm import embedding
import os

os.environ["GEMINI_API_KEY"] = "your-api-key"

# Text + Image (base64)
response = embedding(
    model="gemini/gemini-embedding-2-preview",
    input=[
        "The food was delicious and the waiter...",
        "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAgAAAAIAQMAAAD+wSzIAAAABlBMVEX///+/v7+jQ3Y5AAAADklEQVQI12P4AIX8EAgALgAD/aNpbtEAAAAASUVORK5CYII",
    ],
)
print(response)
```

Input Format Examples

| Format         | Example                    | Provider          |
| -------------- | -------------------------- | ----------------- |
| Data URI       | data:image/png;base64,...  | Gemini, Vertex AI |
| GCS URL        | gs://bucket/path/image.png | Vertex AI         |
| File reference | files/abc123               | Gemini API only   |

Supported MIME Types for Data URIs

  • Images: image/png, image/jpeg
  • Audio: audio/mpeg, audio/wav
  • Video: video/mp4, video/quicktime
  • Documents: application/pdf
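
Building a data URI from raw bytes is just base64 encoding plus the MIME prefix. A minimal sketch (our own helper, shown for illustration):

```python
import base64


def to_data_uri(mime_type: str, raw: bytes) -> str:
    """Encode raw bytes as a data URI suitable for the embedding input list."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# e.g. wrap PNG bytes read from disk:
# uri = to_data_uri("image/png", open("photo.png", "rb").read())
```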

GCS URL MIME Inference

For Vertex AI, MIME types are inferred from file extensions:

  • .png → image/png
  • .jpg / .jpeg → image/jpeg
  • .mp3 → audio/mpeg
  • .wav → audio/wav
  • .mp4 → video/mp4
  • .mov → video/quicktime
  • .pdf → application/pdf
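
That inference amounts to a small extension-to-MIME lookup. Sketched below as our own reimplementation for illustration, not LiteLLM's internal code:

```python
from pathlib import Path

_EXT_TO_MIME = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".mp4": "video/mp4",
    ".mov": "video/quicktime",
    ".pdf": "application/pdf",
}


def infer_mime(gcs_url: str) -> str:
    """Infer the MIME type of a gs:// URL from its file extension."""
    ext = Path(gcs_url).suffix.lower()
    try:
        return _EXT_TO_MIME[ext]
    except KeyError:
        raise ValueError(f"unsupported extension: {ext!r}")
```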

Optional Parameters

| Parameter  | Description           | Maps to              |
| ---------- | --------------------- | -------------------- |
| dimensions | Output embedding size | outputDimensionality |

```python
response = embedding(
    model="gemini/gemini-embedding-2-preview",
    input=["text to embed"],
    dimensions=768,  # Optional: control output vector size
)
```

Combined Embeddings (Gemini API, opt-in)

By default, the Gemini API path returns one embedding per input element (OpenAI-compatible). To fuse several modalities into a single vector (e.g., a product represented by its name plus its photo), wrap them in a nested list:

```python
from litellm import embedding

# Default: 2 inputs -> 2 separate embeddings
embedding(
    model="gemini/gemini-embedding-2-preview",
    input=["a red shoe", "data:image/png;base64,..."],
)

# Combined: text + image fused into 1 embedding
embedding(
    model="gemini/gemini-embedding-2-preview",
    input=[["a red shoe", "data:image/png;base64,..."]],
)

# Mixed: 1 combined entity + 1 plain text -> 2 embeddings total
embedding(
    model="gemini/gemini-embedding-2-preview",
    input=[["a red shoe", "data:image/png;base64,..."], "just text"],
)
```

Useful for multimodal retrieval where a single entity spans more than one modality; see the embedding docs for details. On Vertex AI this opt-in is unnecessary: every request already returns one combined vector.
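
When indexing entities with a varying number of parts, the nesting rule can be applied mechanically: wrap multi-part entities in an inner list, pass lone parts flat. A hypothetical helper (`build_combined_input` is our own name, not a LiteLLM function):

```python
from typing import List, Union

Entity = Union[str, List[str]]


def build_combined_input(entities: List[Entity]) -> List[Entity]:
    """Shape entities for the Gemini API path: multi-part entities become
    nested lists (fused into one vector); single parts stay flat."""
    shaped: List[Entity] = []
    for parts in entities:
        if isinstance(parts, str):
            shaped.append(parts)        # plain text -> its own embedding
        elif len(parts) == 1:
            shaped.append(parts[0])     # single part: no nesting needed
        else:
            shaped.append(list(parts))  # fuse parts into one embedding
    return shaped
```

The result can be passed directly as `input=` to `embedding(...)` on a `gemini/` model.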

We're hiring

Like what you see? Join us

Come build the future of AI infrastructure.