README.md:
- Added new quants (Q2_K, Q4_K_M)
TTS_ENGINE:
- Updated the External Inference Server section:
  - Made the model parameter configurable via the ORPHEUS_MODEL_NAME environment variable
Environment:
- Updated .env.example to include this new parameter
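A minimal sketch of how this setting could be consumed, assuming the engine falls back to the original Q8_0 model when the variable is unset (the helper name and fallback value are illustrative, not this repository's actual code):

```python
import os

# Assumed fallback: the original Q8_0 model named in the README diff below.
DEFAULT_MODEL = "lex-au/Orpheus-3b-FT-Q8_0.gguf"

def resolve_model_name() -> str:
    """Return the model the external inference server should load,
    preferring the new ORPHEUS_MODEL_NAME environment variable."""
    return os.environ.get("ORPHEUS_MODEL_NAME", DEFAULT_MODEL)
```

With the matching line in .env (e.g. `ORPHEUS_MODEL_NAME=lex-au/Orpheus-3b-FT-Q2_K.gguf`), a faster quant can be selected without code changes.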
+🚀 **NEW:** Try the quantized models for improved performance!
+- **Q2_K**: Ultra-fast inference with 2-bit quantization
+- **Q4_K_M**: Balanced quality/speed with 4-bit quantization (mixed)
+- **Q8_0**: Original high-quality 8-bit model
+
+[Browse the Orpheus-FASTAPI Model Collection on HuggingFace](https://huggingface.co/collections/lex-au/orpheus-fastapi-67e125ae03fc96dae0517707)
+
 ## Voice Demos

 Listen to sample outputs with different voices and emotions:
@@ -271,7 +280,14 @@ This application requires a separate LLM inference server running the Orpheus mo
 - [llama.cpp server](https://github.com/ggerganov/llama.cpp) - Run with the appropriate model parameters
 - Any OpenAI API-compatible server

-Download the quantised model from [lex-au/Orpheus-3b-FT-Q8_0.gguf](https://huggingface.co/lex-au/Orpheus-3b-FT-Q8_0.gguf) and load it in your inference server.
+**Quantized Model Options:**
+- **lex-au/Orpheus-3b-FT-Q2_K.gguf**: Fastest inference (~50% faster tokens/sec than Q8_0)
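Since the section above accepts any OpenAI API-compatible server, here is a hedged client-side sketch of how such a server might be queried. The host, port, the ORPHEUS_API_URL variable name, and the /v1/completions route are assumptions based on the common OpenAI-compatible convention (which llama.cpp's server follows), not values confirmed by this commit:

```python
import os
import requests

# Assumed defaults: adjust the URL to match your inference server; only
# ORPHEUS_MODEL_NAME and the model filenames come from this change.
API_URL = os.environ.get("ORPHEUS_API_URL", "http://127.0.0.1:8080/v1/completions")
MODEL = os.environ.get("ORPHEUS_MODEL_NAME", "lex-au/Orpheus-3b-FT-Q8_0.gguf")

def generate(prompt: str, max_tokens: int = 256) -> str:
    """POST a completion request to the external inference server."""
    resp = requests.post(
        API_URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=120,
    )
    resp.raise_for_status()
    # Standard OpenAI-style completions response shape.
    return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    print(generate("Hello from Orpheus!"))
```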