# Lemonade
Lemonade Server is an OpenAI-compatible local language model inference provider optimized for AMD GPUs and NPUs. The `lemonade` LiteLLM provider supports standard chat completions with full OpenAI API compatibility.
| Property | Details |
|---|---|
| Description | OpenAI-compatible AI provider for local and cloud-based language model inference |
| Provider Route on LiteLLM | `lemonade/` (add this prefix to the model name, e.g. `lemonade/your-model-name`) |
| API Endpoint for Provider | `http://localhost:8000/api/v1` (default) |
| Supported Endpoints | `/chat/completions` |
## Supported OpenAI Parameters
Lemonade is fully OpenAI-compatible and supports the following parameters:
"repeat_penalty"
"functions"
"logit_bias"
"max_tokens"
"max_completion_tokens"
"presence_penalty"
"stop"
"temperature"
"top_p"
"top_k"
"response_format"
"tools"
## API Key Setup
Lemonade can be configured with a custom API URL and doesn't require strict API key validation. Set the `LEMONADE_API_BASE` environment variable to change the base URL.
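
You can also pass the base URL per call via LiteLLM's standard `api_base` argument instead of the environment variable. A sketch, where the model name and URL are placeholders:

```python
from litellm import completion

# Pass the server URL directly instead of setting LEMONADE_API_BASE.
response = completion(
    model="lemonade/your-model-name",         # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="http://localhost:8000/api/v1",  # your Lemonade server URL
)
print(response)
```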
## Usage
```python
from litellm import completion
import os

# Optional: set a custom API base. Useful if your Lemonade server is on
# a different port.
os.environ['LEMONADE_API_BASE'] = "http://localhost:8000/api/v1"

response = completion(
    model="lemonade/your-model-name",
    messages=[
        {"role": "user", "content": "Hello from LiteLLM!"}
    ],
)
print(response)
```
### Streaming
```python
from litellm import completion
import os

# Optional: set a custom API base. Useful if your Lemonade server is on
# a different port.
os.environ['LEMONADE_API_BASE'] = "http://localhost:8000/api/v1"

response = completion(
    model="lemonade/your-model-name",
    messages=[
        {"role": "user", "content": "Write a short story"}
    ],
    stream=True
)
for chunk in response:
    # delta.content can be None on some chunks (e.g. the final one)
    print(chunk.choices[0].delta.content or "", end='', flush=True)
```
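
The same call works asynchronously via LiteLLM's `acompletion`, which is useful when serving concurrent requests. A sketch with a placeholder model name:

```python
import asyncio
from litellm import acompletion

async def main():
    # Async variant of the streaming call above.
    response = await acompletion(
        model="lemonade/your-model-name",
        messages=[{"role": "user", "content": "Write a short story"}],
        stream=True,
    )
    async for chunk in response:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

asyncio.run(main())
```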
## Advanced Usage
### Custom Parameters
Lemonade supports additional parameters beyond the standard OpenAI set:
```python
from litellm import completion

response = completion(
    model="lemonade/your-model-name",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=500,
    top_p=0.9,
    top_k=50,
    repeat_penalty=1.1,
    stop=["Human:", "AI:"]
)
print(response)
```
### Function Calling
Lemonade supports OpenAI-compatible function calling:
```python
from litellm import completion

functions = [
    {
        "name": "get_weather",
        "description": "Get current weather information",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state"
                }
            },
            "required": ["location"]
        }
    }
]

response = completion(
    model="lemonade/your-model-name",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=[{"type": "function", "function": f} for f in functions],
    tool_choice="auto"
)
print(response)
```
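
When the model decides to call the tool, the reply arrives as a `tool_calls` entry on the assistant message rather than plain text. A sketch of the full round trip, continuing from the call above and using a hypothetical local `get_weather` implementation:

```python
import json
from litellm import completion

def get_weather(location):
    # Hypothetical stand-in for a real weather lookup.
    return {"location": location, "forecast": "sunny"}

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = get_weather(**args)
    # Send the tool result back so the model can produce a final answer.
    followup = completion(
        model="lemonade/your-model-name",
        messages=[
            {"role": "user", "content": "What's the weather in San Francisco?"},
            message,  # assistant message containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
    )
    print(followup.choices[0].message.content)
```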
### Response Format
Lemonade supports structured output with response format:
```python
from litellm import completion
import json

# Define schema in response_format
response = completion(
    model="lemonade/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Generate JSON data for a person with their name, age, and city."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age"]
            }
        }
    }
)

print(f"Model: {response.model}")
print("JSON Output:")
json_data = json.loads(response.choices[0].message.content)
print(json.dumps(json_data, indent=2))
```
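
Recent LiteLLM versions can also derive the JSON schema from a Pydantic model passed as `response_format`. A sketch, assuming your installed LiteLLM version supports this:

```python
from pydantic import BaseModel
from litellm import completion

class Person(BaseModel):
    name: str
    age: int
    city: str

response = completion(
    model="lemonade/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Generate JSON data for a person."}],
    response_format=Person,  # converted to a json_schema request by LiteLLM
)
person = Person.model_validate_json(response.choices[0].message.content)
print(person)
```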
## Available Models
Lemonade automatically validates available models by querying the `/models` endpoint. You can check available models programmatically:
```python
import httpx

api_base = "http://localhost:8000"  # or your custom base
response = httpx.get(f"{api_base}/api/v1/models")
models = response.json()
print("Available models:", [model['id'] for model in models.get('data', [])])
```
## Support
For more information about Lemonade, see the Lemonade website or the Lemonade repository.