
Gemini Embedding 2 Preview: Multimodal Embeddings on LiteLLM

Sameer Kankute
SWE @ LiteLLM (LLM Translation)

LiteLLM now supports multimodal embeddings with gemini-embedding-2-preview, mixing text, images, audio, video, and PDF content in a single request. Available via both the Gemini API (API key) and Vertex AI (GCP credentials).

Response shape differs by provider
  • Gemini API (gemini/...): each input element returns its own embedding, indexed 0..N-1, the same shape as OpenAI's /embeddings. LiteLLM routes to the batchEmbedContents endpoint with one EmbedContentRequest per input.
  • Vertex AI (vertex_ai/...): all input elements are combined into a single unified embedding via embedContent. Vertex AI does not expose batchEmbedContents for Gemini embedding models, so N parts yield 1 vector. To get one vector per item, call embedding(...) once per input.
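
Getting per-item vectors on Vertex AI therefore means one request per input. A minimal sketch: `embed_each` is our own illustrative helper, not a LiteLLM API, and it assumes the OpenAI-style `response.data[i]["embedding"]` shape LiteLLM returns; `embed_fn` is injectable so the loop can be exercised without credentials.

```python
from typing import Callable, List, Optional


def embed_each(model: str, inputs: List[str],
               embed_fn: Optional[Callable] = None) -> List[list]:
    """Return one embedding per input by issuing one request per item.

    Useful on vertex_ai/ models, where a single multi-part request is
    fused into one vector instead of returning N vectors.
    """
    if embed_fn is None:
        # Deferred import: only needed when making real calls.
        from litellm import embedding
        embed_fn = embedding

    vectors = []
    for item in inputs:
        response = embed_fn(model=model, input=[item])
        vectors.append(response.data[0]["embedding"])
    return vectors
```

For example, `embed_each("vertex_ai/gemini-embedding-2-preview", [text, image_uri])` would return two vectors where a single batched call would return one.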

Supported Input Types

| Modality  | Supported Formats |
| --------- | ----------------- |
| Text      | Plain text        |
| Image     | PNG, JPEG         |
| Audio     | MP3, WAV          |
| Video     | MP4, MOV          |
| Documents | PDF               |

Input Formats

LiteLLM accepts three input formats for multimodal content:

  1. Data URIs – Base64-encoded inline: data:image/png;base64,<encoded_data>
  2. GCS URLs – Cloud Storage paths (Vertex AI): gs://bucket/path/to/file.png
  3. Gemini File References – Pre-uploaded files (Gemini API): files/abc123
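
The three forms can be told apart by their prefixes. A small classifier sketch (our own illustrative helper, not a LiteLLM API):

```python
def classify_input(item: str) -> str:
    """Label an input string as one of the accepted multimodal forms."""
    if item.startswith("data:"):
        return "data_uri"        # base64-encoded inline content
    if item.startswith("gs://"):
        return "gcs_url"         # Cloud Storage path (Vertex AI)
    if item.startswith("files/"):
        return "file_reference"  # pre-uploaded file (Gemini API)
    return "text"                # plain text
```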

Quick Start

```python
from litellm import embedding
import os

os.environ["GEMINI_API_KEY"] = "your-api-key"

# Text + Image (base64)
response = embedding(
    model="gemini/gemini-embedding-2-preview",
    input=[
        "The food was delicious and the waiter...",
        "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAgAAAAIAQMAAAD+wSzIAAAABlBMVEX///+/v7+jQ3Y5AAAADklEQVQI12P4AIX8EAgALgAD/aNpbtEAAAAASUVORK5CYII",
    ],
)
print(response)
```

Input Format Examples

| Format         | Example                    | Provider          |
| -------------- | -------------------------- | ----------------- |
| Data URI       | data:image/png;base64,...  | Gemini, Vertex AI |
| GCS URL        | gs://bucket/path/image.png | Vertex AI         |
| File reference | files/abc123               | Gemini API only   |

Supported MIME Types for Data URIs

  • Images: image/png, image/jpeg
  • Audio: audio/mpeg, audio/wav
  • Video: video/mp4, video/quicktime
  • Documents: application/pdf
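
Building a data URI from raw bytes is just base64 encoding plus the MIME prefix. A minimal sketch (our own helper, shown for illustration):

```python
import base64


def to_data_uri(mime_type: str, raw: bytes) -> str:
    """Encode raw bytes as a data URI suitable for the embedding input list."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# e.g. wrap PNG bytes read from disk:
# uri = to_data_uri("image/png", open("photo.png", "rb").read())
```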

GCS URL MIME Inference

For Vertex AI, MIME types are inferred from file extensions:

  • .png → image/png
  • .jpg / .jpeg → image/jpeg
  • .mp3 → audio/mpeg
  • .wav → audio/wav
  • .mp4 → video/mp4
  • .mov → video/quicktime
  • .pdf → application/pdf
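
That inference amounts to a small extension-to-MIME lookup. Sketched below as our own reimplementation for illustration, not LiteLLM's internal code:

```python
from pathlib import Path

_EXT_TO_MIME = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".mp4": "video/mp4",
    ".mov": "video/quicktime",
    ".pdf": "application/pdf",
}


def infer_mime(gcs_url: str) -> str:
    """Infer the MIME type of a gs:// URL from its file extension."""
    ext = Path(gcs_url).suffix.lower()
    try:
        return _EXT_TO_MIME[ext]
    except KeyError:
        raise ValueError(f"unsupported extension: {ext!r}")
```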

Optional Parameters

| Parameter  | Description           | Maps to              |
| ---------- | --------------------- | -------------------- |
| dimensions | Output embedding size | outputDimensionality |

```python
response = embedding(
    model="gemini/gemini-embedding-2-preview",
    input=["text to embed"],
    dimensions=768,  # Optional: control output vector size
)
```

Combined Embeddings (Gemini API, opt-in)

By default, the Gemini API path returns one embedding per input element (OpenAI-compatible). To fuse several modalities into a single vector (e.g., a product represented by its name plus its photo), wrap them in a nested list:

```python
from litellm import embedding

# Default: 2 inputs -> 2 separate embeddings
embedding(
    model="gemini/gemini-embedding-2-preview",
    input=["a red shoe", "data:image/png;base64,..."],
)

# Combined: text + image fused into 1 embedding
embedding(
    model="gemini/gemini-embedding-2-preview",
    input=[["a red shoe", "data:image/png;base64,..."]],
)

# Mixed: 1 combined entity + 1 plain text -> 2 embeddings total
embedding(
    model="gemini/gemini-embedding-2-preview",
    input=[["a red shoe", "data:image/png;base64,..."], "just text"],
)
```

Useful for multimodal retrieval where a single entity spans more than one modality; see the embedding docs for details. On Vertex AI this opt-in is unnecessary: every request already returns one combined vector.
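
When indexing entities with a varying number of parts, the nesting rule can be applied mechanically: wrap multi-part entities in an inner list, pass lone parts flat. A hypothetical helper (`build_combined_input` is our own name, not a LiteLLM function):

```python
from typing import List, Union

Entity = Union[str, List[str]]


def build_combined_input(entities: List[Entity]) -> List[Entity]:
    """Shape entities for the Gemini API path: multi-part entities become
    nested lists (fused into one vector); single parts stay flat."""
    shaped: List[Entity] = []
    for parts in entities:
        if isinstance(parts, str):
            shaped.append(parts)        # plain text -> its own embedding
        elif len(parts) == 1:
            shaped.append(parts[0])     # single part: no nesting needed
        else:
            shaped.append(list(parts))  # fuse parts into one embedding
    return shaped
```

The result can be passed directly as `input=` to `embedding(...)` on a `gemini/` model.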

We're hiring

Like what you see? Join us

Come build the future of AI infrastructure.