
Commit 92ccffa

New quants + Environment update
README.md:
- Added new quants (Q2_K, Q4_K_M)

TTS_ENGINE:
- Updated the External Inference Server section:
  - Made the model parameter configurable via the ORPHEUS_MODEL_NAME environment variable

Environment:
- Updated .env.example to include this new parameter
1 parent a95e814 commit 92ccffa

File tree

- .env.example
- README.md
- tts_engine/inference.py

3 files changed: +25 -4 lines changed


.env.example

Lines changed: 1 addition & 0 deletions

```diff
@@ -12,6 +12,7 @@ ORPHEUS_TOP_P=0.9
 # Repetition penalty is now hardcoded to 1.1 for stability (this is a model constraint) - this setting is no longer used
 # ORPHEUS_REPETITION_PENALTY=1.1
 ORPHEUS_SAMPLE_RATE=24000
+ORPHEUS_MODEL_NAME=Orpheus-3b-FT-Q8_0.gguf # Model name sent to inference server (Q2_K, Q4_K_M, or Q8_0 variants)
 
 # Web UI settings (keep in mind that the web UI is not secure and should not be exposed to the internet)
 ORPHEUS_PORT=5005
```
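As a quick illustration of how the new variable is consumed, here is a minimal sketch of reading these settings in Python. It assumes the `python-dotenv` package and defaults copied from `.env.example`; the project's actual configuration loading may differ, and depending on the dotenv parser, the inline `#` comment in `.env.example` may need to be on its own line.

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

# Populate os.environ from a local .env file
load_dotenv()

# Defaults mirror .env.example
model_name = os.environ.get("ORPHEUS_MODEL_NAME", "Orpheus-3b-FT-Q8_0.gguf")
sample_rate = int(os.environ.get("ORPHEUS_SAMPLE_RATE", "24000"))
port = int(os.environ.get("ORPHEUS_PORT", "5005"))

print(f"model={model_name} sample_rate={sample_rate} port={port}")
```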

README.md

Lines changed: 18 additions & 2 deletions

````diff
@@ -18,6 +18,15 @@ High-performance Text-to-Speech server with OpenAI-compatible API, 8 voices, emo
 
 [GitHub Repository](https://github.com/Lex-au/Orpheus-FastAPI)
 
+## Model Collection
+
+🚀 **NEW:** Try the quantized models for improved performance!
+- **Q2_K**: Ultra-fast inference with 2-bit quantization
+- **Q4_K_M**: Balanced quality/speed with 4-bit quantization (mixed)
+- **Q8_0**: Original high-quality 8-bit model
+
+[Browse the Orpheus-FASTAPI Model Collection on HuggingFace](https://huggingface.co/collections/lex-au/orpheus-fastapi-67e125ae03fc96dae0517707)
+
 ## Voice Demos
 
 Listen to sample outputs with different voices and emotions:
@@ -271,7 +280,14 @@ This application requires a separate LLM inference server running the Orpheus mo
 - [llama.cpp server](https://github.com/ggerganov/llama.cpp) - Run with the appropriate model parameters
 - Any compatible OpenAI API-compatible server
 
-Download the quantised model from [lex-au/Orpheus-3b-FT-Q8_0.gguf](https://huggingface.co/lex-au/Orpheus-3b-FT-Q8_0.gguf) and load it in your inference server.
+**Quantized Model Options:**
+- **lex-au/Orpheus-3b-FT-Q2_K.gguf**: Fastest inference (~50% faster tokens/sec than Q8_0)
+- **lex-au/Orpheus-3b-FT-Q4_K_M.gguf**: Balanced quality/speed
+- **lex-au/Orpheus-3b-FT-Q8_0.gguf**: Original high-quality model
+
+Choose based on your hardware and needs. Lower bit models (Q2_K, Q4_K_M) provide ~2x realtime performance on high-end GPUs.
+
+[Browse all models in the collection](https://huggingface.co/collections/lex-au/orpheus-fastapi-67e125ae03fc96dae0517707)
 
 The inference server should be configured to expose an API endpoint that this FastAPI application will connect to.
@@ -313,7 +329,7 @@ To add new voices, update the `AVAILABLE_VOICES` list in `tts_engine/inference.p
 When running the Orpheus model with llama.cpp, use these parameters to ensure optimal performance:
 
 ```bash
-./llama-server -m models/Orpheus-3b-FT-Q8_0.gguf \
+./llama-server -m models/Modelname.gguf \
 --ctx-size={{your ORPHEUS_MAX_TOKENS from .env}} \
 --n-predict={{your ORPHEUS_MAX_TOKENS from .env}} \
 --rope-scaling=linear
````
tts_engine/inference.py

Lines changed: 6 additions & 2 deletions

```diff
@@ -218,9 +218,8 @@ def generate_tokens_from_api(prompt: str, voice: str = DEFAULT_VOICE, temperatur
     elif torch.cuda.is_available():
         print("Using optimized parameters for GPU acceleration")
 
-    # Create the request payload
+    # Create the request payload (model field may not be required by some endpoints but included for compatibility)
     payload = {
-        "model": "Orpheus-3b-FT-Q8_0.gguf",  # Model name can be anything, endpoint will use loaded model
         "prompt": formatted_prompt,
         "max_tokens": max_tokens,
         "temperature": temperature,
@@ -229,6 +228,11 @@ def generate_tokens_from_api(prompt: str, voice: str = DEFAULT_VOICE, temperatur
         "stream": True  # Always stream for better performance
     }
 
+    # Add model field - this is ignored by many local inference servers for /v1/completions
+    # but included for compatibility with OpenAI API and some servers that may use it
+    model_name = os.environ.get("ORPHEUS_MODEL_NAME", "Orpheus-3b-FT-Q8_0.gguf")
+    payload["model"] = model_name
+
     # Session for connection pooling and retry logic
     session = requests.Session()
 
```
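The net effect of the change is that the `model` field now follows the environment rather than a hardcoded filename. A tiny self-contained check of that fallback behaviour; `resolve_model_name` is a hypothetical helper written here only to mirror the commit's logic, not part of the repository:

```python
import os

def resolve_model_name() -> str:
    """Use ORPHEUS_MODEL_NAME if set, else fall back to the original Q8_0 filename."""
    return os.environ.get("ORPHEUS_MODEL_NAME", "Orpheus-3b-FT-Q8_0.gguf")

# Unset -> default
os.environ.pop("ORPHEUS_MODEL_NAME", None)
assert resolve_model_name() == "Orpheus-3b-FT-Q8_0.gguf"

# Set -> override, e.g. to the faster Q4_K_M quant
os.environ["ORPHEUS_MODEL_NAME"] = "Orpheus-3b-FT-Q4_K_M.gguf"
assert resolve_model_name() == "Orpheus-3b-FT-Q4_K_M.gguf"
print("fallback behaviour verified")
```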
