A proxy service that converts the Xiaomi MiMo TTS API into an OpenAI-compatible TTS API, ready to serve as a TTS backend for OpenClaw and other applications.
- OpenAI-Compatible API: Standard POST /v1/audio/speech endpoint, works with any OpenAI TTS client
- OpenClaw Support: Drop-in TTS provider for OpenClaw
- Full Format Support: mp3, wav, opus, aac, flac, pcm; unsupported formats are automatically transcoded
- Voice Mapping: OpenAI voice names auto-map to MiMo voices; unsupported voices fall back gracefully
- Style Tags: Web UI with built-in MiMo style and audio event tags, one-click insertion
- Web UI: Built-in visual TTS testing page at /
- Docker Ready: Single-command deployment
One of the primary use cases for this service is as a TTS backend for OpenClaw, enabling speech synthesis via Xiaomi MiMo.
docker run -d -p 8080:8080 -e XIAOMI_API_KEY=sk-your-xiaomi-key yshtcn/xiaomitts-to-openai:latest

Add or modify the TTS configuration in .openclaw/openclaw.json:
{
  "tts": {
    "auto": "always",
    "provider": "openai",
    "providers": {
      "openai": {
        "apiKey": "sk-your-xiaomi-key",
        "baseUrl": "http://your-server:8080/v1/",
        "model": "mimo-v2-tts",
        "voice": "mimo_default"
      }
    }
  }
}

| Field | Description |
|---|---|
| apiKey | Xiaomi MiMo API key (required, passed through to Xiaomi API) |
| baseUrl | Proxy service URL, must end with /v1/ |
| model | Use mimo-v2-tts |
| voice | mimo_default, default_zh (Chinese female), or default_en (English female) |
Any audio format requested by OpenClaw (including opus) is automatically handled by the proxy.
docker run -d -p 8080:8080 -e XIAOMI_API_KEY=sk-xxx yshtcn/xiaomitts-to-openai:latest

The Docker image includes ffmpeg out of the box.
- Copy .env.example to .env and fill in your Xiaomi API key:
XIAOMI_API_KEY=sk-your-key-here
- Install dependencies and run (local setup requires ffmpeg for audio format transcoding):
pip install -r requirements.txt
python main.py
- Open http://localhost:8080 for the Web UI
curl http://localhost:8080/v1/audio/speech \
-H "Authorization: Bearer sk-your-xiaomi-key" \
-H "Content-Type: application/json" \
-d '{"model":"mimo-v2-tts","input":"Hello world","voice":"mimo_default"}' \
--output speech.mp3

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (any value maps to mimo-v2-tts) |
| input | string | Yes | Text to synthesize (max 4096 chars), supports <style> and audio event tags |
| voice | string | Yes | Voice (see voice list below; unsupported voices fall back to mimo_default) |
| response_format | string | No | Audio format: mp3 (default), wav, opus, aac, flac, pcm |
| speed | number | No | Speed 0.25–4.0 (default 1.0) |
| instructions | string | No | Natural language instruction |
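The parameters above can be assembled into a request from Python. The following is a minimal sketch using only the standard library; `build_speech_request` is a hypothetical helper, and its local checks simply mirror the limits documented in the table rather than the proxy's own validation logic.

```python
import json
import urllib.request

# Limits and formats taken from the parameter table above.
MAX_INPUT_CHARS = 4096
FORMATS = {"mp3", "wav", "opus", "aac", "flac", "pcm"}


def build_speech_request(base_url, api_key, text, voice="mimo_default",
                         response_format="mp3", speed=1.0):
    """Build (but do not send) a POST request for /v1/audio/speech.

    Hypothetical helper: the checks mirror the documented limits and are
    not necessarily how the proxy itself validates input.
    """
    if not text or len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input must be 1 to {MAX_INPUT_CHARS} characters")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")

    payload = {
        "model": "mimo-v2-tts",
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and write the response body to a file such as speech.mp3.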

Supported voice values:

| Value | Description |
|---|---|
| mimo_default | MiMo default |
| default_zh | Chinese female |
| default_en | English female |
Standard OpenAI voice names (alloy, echo, nova, etc.) are automatically mapped to mimo_default.
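The mapping rule can be pictured as a single lookup with a default. This is an illustrative sketch, not the proxy's actual code; `resolve_voice` and `MIMO_VOICES` are hypothetical names, and the fallback value is assumed to be the configured default voice.

```python
# Voices listed in the table above.
MIMO_VOICES = {"mimo_default", "default_zh", "default_en"}
DEFAULT_VOICE = "mimo_default"


def resolve_voice(requested, default=DEFAULT_VOICE):
    """Return a MiMo voice name for any requested voice name."""
    if requested in MIMO_VOICES:
        return requested
    # OpenAI names (alloy, echo, nova, ...) and unknown names both land here.
    return default
```

Under this sketch a client that keeps sending `alloy` gets the default MiMo voice, while `default_zh` and `default_en` pass through unchanged.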
The Xiaomi API natively supports only mp3, wav, pcm, and pcm16. When the caller requests opus, aac, or flac, the proxy will:
- Request the closest supported format from Xiaomi (mp3 or wav)
- Transcode to the requested format using ffmpeg
- Return with the correct Content-Type
If transcoding fails, the proxy gracefully falls back to the original format instead of returning an error.
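The transcode-with-fallback flow described above might look roughly like the sketch below. Several details are assumptions rather than facts about this project: the function names, the choice of wav as the upstream format for every transcoded request, the Content-Type values, and the exact ffmpeg arguments.

```python
import subprocess

# Formats the Xiaomi API returns natively, per the text above.
XIAOMI_NATIVE = {"mp3", "wav", "pcm", "pcm16"}

# Assumed ffmpeg muxer names for the formats that need transcoding.
FFMPEG_MUXERS = {"opus": "opus", "aac": "adts", "flac": "flac"}

# Assumed Content-Type values; the proxy's real table may differ.
CONTENT_TYPES = {
    "mp3": "audio/mpeg", "wav": "audio/wav", "opus": "audio/opus",
    "aac": "audio/aac", "flac": "audio/flac", "pcm": "audio/pcm",
}


def plan_format(requested):
    """Return (upstream_format, needs_transcode) for a requested format."""
    if requested in XIAOMI_NATIVE:
        return requested, False
    # Assumption: always fetch lossless wav when transcoding is needed.
    return "wav", True


def transcode_or_fallback(audio, source_format, target_format,
                          run=subprocess.run):
    """Transcode via ffmpeg pipes; on any failure return the original audio.

    Returns (audio_bytes, format_actually_returned). `run` is injectable
    so the fallback path can be exercised without ffmpeg installed.
    """
    cmd = ["ffmpeg", "-hide_banner", "-loglevel", "error",
           "-i", "pipe:0", "-f", FFMPEG_MUXERS[target_format], "pipe:1"]
    try:
        done = run(cmd, input=audio, capture_output=True,
                   check=True, timeout=60)
    except (OSError, subprocess.SubprocessError):
        # ffmpeg missing, non-zero exit, or timeout: keep the original.
        return audio, source_format
    if not done.stdout:
        return audio, source_format
    return done.stdout, target_format
```

Returning the format alongside the bytes lets the caller pick the Content-Type from what was actually produced, so a fallback response is still labeled correctly.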
MiMo TTS supports style control and audio event embedding through specific text formats. The Web UI includes shortcut buttons for these tags.
Add a <style> tag at the beginning of the text to specify a style. Use commas to combine multiple styles:
<style>happy</style>Tomorrow is Friday, I'm so excited!
<style>Northeastern dialect</style>哎呀妈呀,这天儿也忒冷了吧!
<style>Cantonese</style>呢个真係好正啊!食过一次就唔会忘记!
<style>singing</style>原谅我这一生不羁放纵爱自由...
Recommended styles (custom styles beyond this list are also supported):
| Category | Styles |
|---|---|
| Speed | faster, slower |
| Emotion | happy, sad, angry |
| Role-play | Sun Wukong, Lin Daiyu |
| Style | whisper, cutesy voice, Taiwanese accent |
| Dialect | Northeastern, Sichuan, Henan, Cantonese |
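When building input text programmatically, the style prefix can be generated by a small helper. The sketch below is illustrative only: `with_style` is a hypothetical name, and joining styles with a plain comma and no space is an assumption based on the "use commas" rule above.

```python
def with_style(text, *styles):
    """Prefix text with a MiMo <style> tag; multiple styles are comma-joined."""
    cleaned = [s.strip() for s in styles if s and s.strip()]
    if not cleaned:
        # No usable style given: leave the text untouched.
        return text
    return f"<style>{','.join(cleaned)}</style>{text}"
```

For example, `with_style("Tomorrow is Friday, I'm so excited!", "happy")` reproduces the first sample line above, and the result can be sent as the `input` field of a speech request.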
Use Chinese parentheses () to wrap audio event descriptions anywhere in the text:
(nervous, deep breath)Calm down... it's just an interview...
(exhausted, barely audible)Driver... wake me when we arrive...
(sobbing)Why did it have to end this way?
(shouting)Hey! Fresh fish here! Just caught this morning!
For more details, see the Xiaomi MiMo TTS official documentation.
| Variable | Default | Description |
|---|---|---|
| XIAOMI_API_KEY | - | Xiaomi MiMo API key |
| XIAOMI_BASE_URL | https://api.xiaomimimo.com | Xiaomi API base URL |
| XIAOMI_MODEL | mimo-v2-tts | Xiaomi TTS model name |
| DEFAULT_VOICE | mimo_default | Default voice |
| PORT | 8080 | Server port |
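As a rough illustration of how these variables and defaults fit together, the sketch below reads them into a settings object. `Settings` and `load_settings` are hypothetical names, not taken from the project, and treating a missing API key as an empty string is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    xiaomi_api_key: str
    xiaomi_base_url: str
    xiaomi_model: str
    default_voice: str
    port: int


def load_settings(env=None):
    """Read configuration from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    return Settings(
        # No default in the table; an empty string stands in here.
        xiaomi_api_key=env.get("XIAOMI_API_KEY", ""),
        xiaomi_base_url=env.get("XIAOMI_BASE_URL", "https://api.xiaomimimo.com"),
        xiaomi_model=env.get("XIAOMI_MODEL", "mimo-v2-tts"),
        default_voice=env.get("DEFAULT_VOICE", "mimo_default"),
        port=int(env.get("PORT", "8080")),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in isolation.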
Three build scripts are provided:
- build-docker-gitea.ps1: Push to Gitea (latest tag)
- build-docker-gitea-beta.ps1: Push to Gitea (beta tag)
- build-docker-hub.ps1: Push to Docker Hub