
Commit 92ccffa

New quants + Environment update
README.md:
- Added new quants (Q2_K, Q4_K_M)

TTS_ENGINE:
- Updated the External Inference Server section:
  - Made the model parameter configurable via the ORPHEUS_MODEL_NAME environment variable

Environment:
- Updated .env.example to include this new parameter
1 parent a95e814 commit 92ccffa

File tree

- .env.example
- README.md
- tts_engine/inference.py

3 files changed: +25 -4 lines changed


.env.example

Lines changed: 1 addition & 0 deletions

```diff
@@ -12,6 +12,7 @@ ORPHEUS_TOP_P=0.9
 # Repetition penalty is now hardcoded to 1.1 for stability (this is a model constraint) - this setting is no longer used
 # ORPHEUS_REPETITION_PENALTY=1.1
 ORPHEUS_SAMPLE_RATE=24000
+ORPHEUS_MODEL_NAME=Orpheus-3b-FT-Q8_0.gguf # Model name sent to inference server (Q2_K, Q4_K_M, or Q8_0 variants)
 
 # Web UI settings (keep in mind that the web UI is not secure and should not be exposed to the internet)
 ORPHEUS_PORT=5005
```
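As a quick illustration of how the new variable is consumed, here is a minimal sketch of reading these settings in Python. It assumes the `python-dotenv` package and defaults copied from `.env.example`; the project's actual configuration loading may differ, and depending on the dotenv parser, the inline `#` comment in `.env.example` may need to be on its own line.

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

# Populate os.environ from a local .env file
load_dotenv()

# Defaults mirror .env.example
model_name = os.environ.get("ORPHEUS_MODEL_NAME", "Orpheus-3b-FT-Q8_0.gguf")
sample_rate = int(os.environ.get("ORPHEUS_SAMPLE_RATE", "24000"))
port = int(os.environ.get("ORPHEUS_PORT", "5005"))

print(f"model={model_name} sample_rate={sample_rate} port={port}")
```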

README.md

Lines changed: 18 additions & 2 deletions

````diff
@@ -18,6 +18,15 @@ High-performance Text-to-Speech server with OpenAI-compatible API, 8 voices, emo
 
 [GitHub Repository](https://github.com/Lex-au/Orpheus-FastAPI)
 
+## Model Collection
+
+🚀 **NEW:** Try the quantized models for improved performance!
+- **Q2_K**: Ultra-fast inference with 2-bit quantization
+- **Q4_K_M**: Balanced quality/speed with 4-bit quantization (mixed)
+- **Q8_0**: Original high-quality 8-bit model
+
+[Browse the Orpheus-FASTAPI Model Collection on HuggingFace](https://huggingface.co/collections/lex-au/orpheus-fastapi-67e125ae03fc96dae0517707)
+
 ## Voice Demos
 
 Listen to sample outputs with different voices and emotions:
@@ -271,7 +280,14 @@ This application requires a separate LLM inference server running the Orpheus mo
 - [llama.cpp server](https://github.com/ggerganov/llama.cpp) - Run with the appropriate model parameters
 - Any compatible OpenAI API-compatible server
 
-Download the quantised model from [lex-au/Orpheus-3b-FT-Q8_0.gguf](https://huggingface.co/lex-au/Orpheus-3b-FT-Q8_0.gguf) and load it in your inference server.
+**Quantized Model Options:**
+- **lex-au/Orpheus-3b-FT-Q2_K.gguf**: Fastest inference (~50% faster tokens/sec than Q8_0)
+- **lex-au/Orpheus-3b-FT-Q4_K_M.gguf**: Balanced quality/speed
+- **lex-au/Orpheus-3b-FT-Q8_0.gguf**: Original high-quality model
+
+Choose based on your hardware and needs. Lower bit models (Q2_K, Q4_K_M) provide ~2x realtime performance on high-end GPUs.
+
+[Browse all models in the collection](https://huggingface.co/collections/lex-au/orpheus-fastapi-67e125ae03fc96dae0517707)
 
 The inference server should be configured to expose an API endpoint that this FastAPI application will connect to.
@@ -313,7 +329,7 @@ To add new voices, update the `AVAILABLE_VOICES` list in `tts_engine/inference.p
 When running the Orpheus model with llama.cpp, use these parameters to ensure optimal performance:
 
 ```bash
-./llama-server -m models/Orpheus-3b-FT-Q8_0.gguf \
+./llama-server -m models/Modelname.gguf \
 --ctx-size={{your ORPHEUS_MAX_TOKENS from .env}} \
 --n-predict={{your ORPHEUS_MAX_TOKENS from .env}} \
 --rope-scaling=linear
````
tts_engine/inference.py

Lines changed: 6 additions & 2 deletions

```diff
@@ -218,9 +218,8 @@ def generate_tokens_from_api(prompt: str, voice: str = DEFAULT_VOICE, temperatur
     elif torch.cuda.is_available():
         print("Using optimized parameters for GPU acceleration")
 
-    # Create the request payload
+    # Create the request payload (model field may not be required by some endpoints but included for compatibility)
     payload = {
-        "model": "Orpheus-3b-FT-Q8_0.gguf",  # Model name can be anything, endpoint will use loaded model
         "prompt": formatted_prompt,
         "max_tokens": max_tokens,
         "temperature": temperature,
@@ -229,6 +228,11 @@ def generate_tokens_from_api(prompt: str, voice: str = DEFAULT_VOICE, temperatur
         "stream": True  # Always stream for better performance
     }
 
+    # Add model field - this is ignored by many local inference servers for /v1/completions
+    # but included for compatibility with OpenAI API and some servers that may use it
+    model_name = os.environ.get("ORPHEUS_MODEL_NAME", "Orpheus-3b-FT-Q8_0.gguf")
+    payload["model"] = model_name
+
     # Session for connection pooling and retry logic
     session = requests.Session()
 
```
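The net effect of the change is that the `model` field now follows the environment rather than a hardcoded filename. A tiny self-contained check of that fallback behaviour; `resolve_model_name` is a hypothetical helper written here only to mirror the commit's logic, not part of the repository:

```python
import os

def resolve_model_name() -> str:
    """Use ORPHEUS_MODEL_NAME if set, else fall back to the original Q8_0 filename."""
    return os.environ.get("ORPHEUS_MODEL_NAME", "Orpheus-3b-FT-Q8_0.gguf")

# Unset -> default
os.environ.pop("ORPHEUS_MODEL_NAME", None)
assert resolve_model_name() == "Orpheus-3b-FT-Q8_0.gguf"

# Set -> override, e.g. to the faster Q4_K_M quant
os.environ["ORPHEUS_MODEL_NAME"] = "Orpheus-3b-FT-Q4_K_M.gguf"
assert resolve_model_name() == "Orpheus-3b-FT-Q4_K_M.gguf"
print("fallback behaviour verified")
```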
