
Lemonade

Lemonade Server is an OpenAI-compatible local language model inference provider optimized for AMD GPUs and NPUs. The Lemonade LiteLLM provider supports standard chat completions with full OpenAI API compatibility.

Description: OpenAI-compatible AI provider for local and cloud-based language model inference
Provider Route on LiteLLM: lemonade/ (add this prefix to the model name, e.g. lemonade/your-model-name)
API Endpoint for Provider: http://localhost:8000/api/v1 (default)
Supported Endpoints: /chat/completions

Supported OpenAI Parameters

Lemonade is fully OpenAI-compatible and supports the following parameters (see the sketch after this list for checking them at runtime):

"repeat_penalty"
"functions"
"logit_bias"
"max_tokens"
"max_completion_tokens"
"presence_penalty"
"stop"
"temperature"
"top_p"
"top_k"
"response_format"
"tools"

API Key Setup

Lemonade doesn't enforce strict API key validation, so no API key is required for a local server. To change the base URL (for example, if the server runs on a different host or port), set the LEMONADE_API_BASE environment variable.
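
You can also pass the base URL per call via the api_base argument to completion(), which overrides the environment variable for that request; a minimal sketch:

import os
from litellm import completion

# Option 1: configure the base URL globally via the environment
os.environ["LEMONADE_API_BASE"] = "http://localhost:8000/api/v1"

# Option 2: override the base URL for a single request
response = completion(
    model="lemonade/your-model-name",
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:8000/api/v1",
)
print(response)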

Usage

from litellm import completion
import os

# Optional: Set custom API base. Useful if your lemonade server is on
# a different port
os.environ['LEMONADE_API_BASE'] = "http://localhost:8000/api/v1"

response = completion(
    model="lemonade/your-model-name",
    messages=[
        {"role": "user", "content": "Hello from LiteLLM!"}
    ],
)
print(response)

Streaming

from litellm import completion
import os

# Optional: Set custom API base. Useful if your lemonade server is on
# a different port
os.environ['LEMONADE_API_BASE'] = "http://localhost:8000/api/v1"

response = completion(
    model="lemonade/your-model-name",
    messages=[
        {"role": "user", "content": "Write a short story"}
    ],
    stream=True
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end='', flush=True)
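
LiteLLM also provides an async client, litellm.acompletion, which works the same way against Lemonade; a minimal sketch (assumes the same local server and model name as above):

import asyncio
import os
from litellm import acompletion

os.environ['LEMONADE_API_BASE'] = "http://localhost:8000/api/v1"

async def main():
    # acompletion with stream=True yields chunks as an async iterator
    response = await acompletion(
        model="lemonade/your-model-name",
        messages=[{"role": "user", "content": "Write a short story"}],
        stream=True,
    )
    async for chunk in response:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end='', flush=True)

asyncio.run(main())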

Advanced Usage

Custom Parameters

Lemonade supports additional parameters beyond the standard OpenAI set:

from litellm import completion

response = completion(
    model="lemonade/your-model-name",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=500,
    top_p=0.9,
    top_k=50,
    repeat_penalty=1.1,
    stop=["Human:", "AI:"]
)
print(response)

Function Calling

Lemonade supports OpenAI-compatible function calling:

from litellm import completion

functions = [
    {
        "name": "get_weather",
        "description": "Get current weather information",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state"
                }
            },
            "required": ["location"]
        }
    }
]

response = completion(
    model="lemonade/your-model-name",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=[{"type": "function", "function": f} for f in functions],
    tool_choice="auto"
)
print(response)
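
When the model decides to call the function, the request arrives on the response message as a tool call rather than as plain text. A minimal sketch of reading it (continuing from the example above; the get_weather implementation itself is hypothetical and up to you):

import json

# Inspect the tool calls the model requested, if any
tool_calls = response.choices[0].message.tool_calls or []
for tool_call in tool_calls:
    name = tool_call.function.name             # e.g. "get_weather"
    args = json.loads(tool_call.function.arguments)
    print(f"Model wants to call {name} with {args}")
    # Run your own implementation of the function here and send its result
    # back in a follow-up "tool" message to continue the conversation.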

Response Format

Lemonade supports structured output with response format:

from litellm import completion
import json

# Define schema in response_format
response = completion(
    model="lemonade/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Generate JSON data for a person with their name, age, and city."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age"]
            }
        }
    }
)

print(f"Model: {response.model}")
print(f"JSON Output:")
json_data = json.loads(response.choices[0].message.content)
print(json.dumps(json_data, indent=2))
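
Depending on the model, the returned content may occasionally not be valid JSON, so it is worth guarding the parse; a minimal sketch (continuing from the example above):

import json

raw = response.choices[0].message.content
try:
    person = json.loads(raw)
except json.JSONDecodeError:
    # Fall back or retry if the model did not emit valid JSON
    print("Model returned non-JSON output:", raw)
else:
    # "city" is not in the schema's required list, so use .get() for it
    print(person["name"], person["age"], person.get("city"))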

Available Models

Lemonade automatically validates available models by querying the /models endpoint. You can check available models programmatically:

import httpx

api_base = "http://localhost:8000" # or your custom base
response = httpx.get(f"{api_base}/api/v1/models")
models = response.json()
print("Available models:", [model['id'] for model in models.get('data', [])])

Support

For more information about Lemonade, see the Lemonade website or the Lemonade repository.