Using Audio Models

How to send audio to and receive audio from a /chat/completions endpoint

Audio Output from a model

Example of generating a human-like audio response to a prompt:

import os
import base64
import litellm

os.environ["OPENAI_API_KEY"] = "your-api-key"

# openai call - request both text and audio output modalities
completion = litellm.completion(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Is a golden retriever a good family dog?"}],
)

# decode the base64-encoded audio and write it to a wav file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("dog.wav", "wb") as f:
    f.write(wav_bytes)
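The decode-and-save step at the end can be factored into a small reusable helper. A minimal sketch; the `save_audio` name is ours, not part of litellm:

```python
import base64


def save_audio(b64_data: str, path: str) -> None:
    """Decode a base64 payload (as found in message.audio.data) and write it to disk."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_data))


# usage with a completion response:
# save_audio(completion.choices[0].message.audio.data, "dog.wav")
```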

Audio Input to a model

import base64
import litellm
import requests

# fetch a sample wav file and base64-encode it for the request
url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode("utf-8")

completion = litellm.completion(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this recording?"},
                {
                    "type": "input_audio",
                    "input_audio": {"data": encoded_string, "format": "wav"},
                },
            ],
        },
    ],
)

print(completion.choices[0].message)
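If the audio is a local file rather than a URL, the same base64 step applies. A minimal sketch; the `encode_audio_file` helper name is ours:

```python
import base64


def encode_audio_file(path: str) -> str:
    # read raw audio bytes and return the base64 string expected by input_audio.data
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# the result plugs into the message content, e.g.:
# {"type": "input_audio", "input_audio": {"data": encode_audio_file("alloy.wav"), "format": "wav"}}
```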

Checking if a model supports audio input and audio output

Use litellm.supports_audio_output(model="") -> returns True if the model can generate audio output.

Use litellm.supports_audio_input(model="") -> returns True if the model can accept audio input.

assert litellm.supports_audio_output(model="gpt-4o-audio-preview")
assert litellm.supports_audio_input(model="gpt-4o-audio-preview")

assert not litellm.supports_audio_output(model="gpt-3.5-turbo")
assert not litellm.supports_audio_input(model="gpt-3.5-turbo")

Response Format with Audio

Below is an example JSON data structure for a message you might receive from a /chat/completions endpoint when sending audio input to a model.

{
  "index": 0,
  "message": {
    "role": "assistant",
    "content": null,
    "refusal": null,
    "audio": {
      "id": "audio_abc123",
      "expires_at": 1729018505,
      "data": "<bytes omitted>",
      "transcript": "Yes, golden retrievers are known to be ..."
    }
  },
  "finish_reason": "stop"
}
  • audio If the audio output modality is requested, this object contains data about the audio response from the model.
    • audio.id Unique identifier for the audio response
    • audio.expires_at The Unix timestamp (in seconds) for when this audio response will no longer be accessible on the server for use in multi-turn conversations.
    • audio.data Base64 encoded audio bytes generated by the model, in the format specified in the request.
    • audio.transcript Transcript of the audio generated by the model.
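These fields can be consumed directly from the parsed response. A short sketch using the example values from the JSON above; the data field here is a placeholder payload, not real model output:

```python
import base64

# example message dict, mirroring the response structure documented above
message = {
    "role": "assistant",
    "content": None,
    "refusal": None,
    "audio": {
        "id": "audio_abc123",
        "expires_at": 1729018505,
        "data": base64.b64encode(b"<placeholder audio bytes>").decode("utf-8"),
        "transcript": "Yes, golden retrievers are known to be ...",
    },
}

# transcript is plain text; data must be base64-decoded before writing to a file
transcript = message["audio"]["transcript"]
wav_bytes = base64.b64decode(message["audio"]["data"])
```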