See the top-level README.md for more information.
This provides the Rust backend (including MoshiRAG with ARC-Encoder, Mimi, and the STT model) and client implementation.
You will need a recent version of the Rust toolchain.
To compile GPU support, you will also need the CUDA properly installed for your GPU, in particular with nvcc.
If you don't have ssl certificates yet, generate a key.pem and cert.pem file
using the following command.
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes -subj "/CN=localhost"Use a config where the required paths are set. Either copy moshi-backend/config.json and edit it, or add these keys to your existing config:
lm_model_file- path to the main MoshiRAG LM safetensorstext_tokenizer_file- path to the text tokenizer of the main MoshiRAG LMmimi_model_file- path to the mimi safetensor checkpointstt_lm_model_file— path to the STT LM safetensorsstt_text_tokenizer_file— path to the STT text tokenizer (e.g..model)stt_mimi_model_file— path to the STT Mimi checkpoint (can be the same asmimi_model_fileif you use one Mimi for both)arc_encoder_tokenizer_path- path to the text tokenizer of ARC-Encoderarc_encoder_model_file- path to the ARC-Encoder safetensors checkpoint
MoshiRAG calls an OpenAI-compatible chat endpoint for retrieval. You can choose a local vLLM server or an online API.
- Local vLLM retrieval server:
- Start vLLM with your retrieval model exposed at
/v1/chat/completions.vllm serve google/gemma-3-27b-it --host 0.0.0.0 --port 8002
- Set
LLM_BASE_URL=http://127.0.0.1:8002/v1 - Set
LLM_MODEL_NAME=<your-vllm-model-name> - Set
LLM_API_KEY=""
- Start vLLM with your retrieval model exposed at
- Online API (OpenAI-compatible provider; choose low-latency service to guarantee MoshiRAG response quality):
- Set
LLM_BASE_URL=<provider-base-url>/v1 - Set
LLM_MODEL_NAME=<provider-model-name> - Set
LLM_API_KEY=<your-api-key>
- Set
- Multiple retrieval LLMs: set
MOSHI_RETRIEVAL_LLMS_JSONto a JSON array of objects withid,base_url,model, optionalapi_key, and exactly one"default": truewhen you list two or more profiles. In Rust, this can only be set through environment variables. When set and non-empty, it replaces single-endpointLLM_BASE_URL/LLM_MODEL_NAMEretrieval selection.- Example:
MOSHI_RETRIEVAL_LLMS_JSON='[{"id": "gpt-oss-20b", "base_url": "https://api.groq.com/openai/v1", "model": "openai/gpt-oss-20b", "prompt_style": "simplified"}, {"id": "gemma-3-27b-it", "base_url": "http://localhost:8002/v1", "model": "google/gemma-3-27b-it", "default": true, "prompt_style": "original"}]' LLM_API_KEY=... - In this example:
gpt-oss-20buses thesimplifiedbundled reference prompt.gemma-3-27b-itis marked as the default fallback profile viadefault: true.- If a profile does not set
api_key, Rust uses globalLLM_API_KEY. - The prompt style of each individual profile may be
"original"or"simplified"(defaultsimplified).
- Example:
Run the comment below to start the MoshiRAG server (MoshiRAG model with ARC encoder + Kyutai STT ASR + mimi).
# From the rust/ directory (use --features metal on macOS)
export MOSHI_RETRIEVAL_LLMS_JSON='[{"id": "gpt-oss-20b", "base_url": "https://api.groq.com/openai/v1", "model": "openai/gpt-oss-20b", "prompt_style": "simplified"}, {"id": "gemma-3-27b-it", "base_url": "http://localhost:8002/v1", "model": "google/gemma-3-27b-it", "default": true, "prompt_style": "original"}]'
export LLM_API_KEY=...
cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config.json standaloneWhen using macOS, you can replace --features cuda with --features metal.
Once the server prints standalone worker listening, open the web UI at
localhost:8998 (HTTPS by default).
You will get some warnings about the site being unsafe. When using chrome you can bypass it by selecting "Details" or "Advanced", then "Visit this unsafe site" or "Proceed to localhost (unsafe)".
We recommend using the web UI as it provides some echo cancellation that helps the overall model quality.
The present code is provided under the Apache license.