XiaomiTTS-to-OpenAI-TTS-API


A proxy service that converts the Xiaomi MiMo TTS API into an OpenAI-compatible TTS API, ready to serve as a TTS backend for OpenClaw and other applications.

Features

OpenAI-Compatible API — Standard POST /v1/audio/speech endpoint, works with any OpenAI TTS client
OpenClaw Support — Drop-in TTS provider for OpenClaw
Full Format Support — mp3, wav, opus, aac, flac, pcm; unsupported formats are automatically transcoded
Voice Mapping — OpenAI voice names auto-map to MiMo voices; unsupported voices fall back gracefully
Style Tags — Web UI with built-in MiMo style and audio event tags, one-click insertion
Web UI — Built-in visual TTS testing page at /
Docker Ready — Single command deployment

OpenClaw Integration

A primary use case for this service is acting as the TTS backend for OpenClaw, so that OpenClaw can synthesize speech through Xiaomi MiMo.

Deploy the Proxy

docker run -d -p 8080:8080 -e XIAOMI_API_KEY=sk-your-xiaomi-key yshtcn/xiaomitts-to-openai:latest

Configure OpenClaw

Add or modify the TTS configuration in .openclaw/openclaw.json:

{
  "tts": {
    "auto": "always",
    "provider": "openai",
    "providers": {
      "openai": {
        "apiKey": "sk-your-xiaomi-key",
        "baseUrl": "http://your-server:8080/v1/",
        "model": "mimo-v2-tts",
        "voice": "mimo_default"
      }
    }
  }
}

Field	Description
apiKey	Xiaomi MiMo API key (required, passed through to Xiaomi API)
baseUrl	Proxy service URL, must end with `/v1/`
model	Use `mimo-v2-tts`
voice	`mimo_default`, `default_zh` (Chinese female), or `default_en` (English female)

Any audio format requested by OpenClaw (including opus) is automatically handled by the proxy.

Quick Start

Docker

docker run -d -p 8080:8080 -e XIAOMI_API_KEY=sk-xxx yshtcn/xiaomitts-to-openai:latest

The Docker image includes ffmpeg out of the box.

Local

Copy .env.example to .env and fill in your Xiaomi API key:

XIAOMI_API_KEY=sk-your-key-here

Install dependencies and run (local setup requires ffmpeg for audio format transcoding):

pip install -r requirements.txt
python main.py

Then open http://localhost:8080 to use the Web UI.

API Usage

Generate Speech

curl http://localhost:8080/v1/audio/speech \
  -H "Authorization: Bearer sk-your-xiaomi-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"mimo-v2-tts","input":"Hello world","voice":"mimo_default"}' \
  --output speech.mp3
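
The same request can be sent from Python. The following is a minimal sketch using only the standard library; `build_speech_request` and `synthesize` are hypothetical helper names, and the base URL and key are the placeholders from the curl example above.

```python
import json
import urllib.request


def build_speech_request(base_url, api_key, text, voice="mimo_default",
                         model="mimo-v2-tts", response_format="mp3"):
    """Build a POST request for the proxy's /v1/audio/speech endpoint."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(base_url, api_key, text, out_path, **options):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(base_url, api_key, text, **options)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the proxy running locally, `synthesize("http://localhost:8080", "sk-your-xiaomi-key", "Hello world", "speech.mp3")` should produce the same file as the curl command.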

Parameters

Parameter	Type	Required	Description
model	string	Yes	Model name (any value maps to mimo-v2-tts)
input	string	Yes	Text to synthesize (max 4096 chars), supports `<style>` and audio event tags
voice	string	Yes	Voice (see voice list below; unsupported voices fall back to mimo_default)
response_format	string	No	Audio format: mp3 (default), wav, opus, aac, flac, pcm
speed	number	No	Speed 0.25–4.0 (default 1.0)
instructions	string	No	Natural language instruction
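
The limits in the table above can be checked on the client before a request is sent. This is a sketch only: `build_payload` is a hypothetical helper, rejecting empty input is an assumption, and whether the proxy itself rejects out-of-range values is not stated here.

```python
SUPPORTED_FORMATS = ("mp3", "wav", "opus", "aac", "flac", "pcm")
MAX_INPUT_CHARS = 4096


def build_payload(text, voice, model="mimo-v2-tts",
                  response_format="mp3", speed=1.0, instructions=None):
    """Validate arguments against the documented limits and return a JSON-ready dict."""
    if not text:
        # Assumption: an empty input string is treated as invalid.
        raise ValueError("input must not be empty")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")

    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    # Optional field: only included when the caller supplies it.
    if instructions is not None:
        payload["instructions"] = instructions
    return payload
```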

Voices

Value	Description
mimo_default	MiMo default
default_zh	Chinese female
default_en	English female

Standard OpenAI voice names (alloy, echo, nova, etc.) are automatically mapped to mimo_default.
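
The mapping rule can be pictured as a single lookup. The sketch below is illustrative and not taken from the proxy's source; `resolve_voice` is a hypothetical name, and the fallback is passed as a parameter on the assumption that it may be configurable.

```python
MIMO_VOICES = {"mimo_default", "default_zh", "default_en"}


def resolve_voice(requested, fallback="mimo_default"):
    """Return a MiMo voice: pass known voices through, map everything else to the fallback."""
    if requested in MIMO_VOICES:
        return requested
    # OpenAI names such as "alloy" or "nova", and any unknown value, land here.
    return fallback
```

Under this rule `resolve_voice("default_zh")` is returned unchanged, while `resolve_voice("alloy")` yields `mimo_default`.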

Audio Format Compatibility

The Xiaomi API supports only mp3, wav, pcm, and pcm16. When a caller requests opus, aac, or flac, the proxy will:

Request the closest supported format from Xiaomi (mp3 or wav)
Transcode to the requested format using ffmpeg
Return with the correct Content-Type

If transcoding fails, the proxy gracefully falls back to the original format instead of returning an error.
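
The request, transcode, and fallback steps above can be sketched as follows. This is not the proxy's actual code: the function names are hypothetical, requesting wav upstream for every transcoded format is an assumption (the text only says mp3 or wav), and choosing the Content-Type is left to the caller based on the returned format name.

```python
import subprocess

XIAOMI_FORMATS = {"mp3", "wav", "pcm", "pcm16"}

# Assumed pairing: requested format -> (format to request upstream, ffmpeg muxer name).
TRANSCODE_PLAN = {
    "opus": ("wav", "opus"),
    "aac": ("wav", "adts"),
    "flac": ("wav", "flac"),
}


def plan_request(requested):
    """Return (upstream_format, needs_transcode) for a requested format."""
    if requested in XIAOMI_FORMATS:
        return requested, False
    upstream, _ = TRANSCODE_PLAN[requested]
    return upstream, True


def transcode_or_fallback(audio, source_format, requested, ffmpeg="ffmpeg"):
    """Transcode with ffmpeg; on any failure return the source audio unchanged."""
    _, muxer = TRANSCODE_PLAN[requested]
    cmd = [ffmpeg, "-hide_banner", "-loglevel", "error",
           "-f", source_format, "-i", "pipe:0",
           "-f", muxer, "pipe:1"]
    try:
        result = subprocess.run(cmd, input=audio, capture_output=True,
                                check=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        # ffmpeg missing, non-zero exit, or timeout: keep the upstream audio.
        return audio, source_format
    if not result.stdout:
        return audio, source_format
    return result.stdout, requested
```

Returning the format name alongside the bytes lets the caller set a Content-Type that matches what is actually sent, including in the fallback case.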

Style and Audio Tags

MiMo TTS supports style control and audio event embedding through specific text formats. The Web UI includes shortcut buttons for these tags.

Style Tags

Add a <style> tag at the beginning of the text to specify a style. Use commas to combine multiple styles:

<style>happy</style>Tomorrow is Friday, I'm so excited!
<style>Northeastern dialect</style>哎呀妈呀，这天儿也忒冷了吧！
<style>Cantonese</style>呢个真係好正啊！食过一次就唔会忘记！
<style>singing</style>原谅我这一生不羁放纵爱自由...

Recommended styles (custom styles beyond this list are also supported):

Category	Styles
Speed	faster, slower
Emotion	happy, sad, angry
Role-play	Sun Wukong, Lin Daiyu
Style	whisper, cutesy voice, Taiwanese accent
Dialect	Northeastern, Sichuan, Henan, Cantonese
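
When the input text is built in code, the style tag can be added with a small helper. A sketch, where `with_style` is a hypothetical name and joining styles with a bare comma (no space) is an assumption about the accepted separator:

```python
def with_style(text, *styles):
    """Prefix text with a <style> tag; multiple styles are joined with commas."""
    names = [s.strip() for s in styles if s and s.strip()]
    if not names:
        # No styles given: leave the text untouched.
        return text
    return f"<style>{','.join(names)}</style>{text}"
```

For example, `with_style("Tomorrow is Friday, I'm so excited!", "happy")` reproduces the first sample line above.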

Audio Event Tags

Wrap audio event descriptions in full-width (Chinese) parentheses （）, not ASCII parentheses (). They can appear anywhere in the text:

（nervous, deep breath）Calm down... it's just an interview...
（exhausted, barely audible）Driver... wake me when we arrive...
（sobbing）Why did it have to end this way?
（shouting）Hey! Fresh fish here! Just caught this morning!
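
Because full-width parentheses are easy to mistype as ASCII ones, generating them in code can help. A tiny sketch, with `event` as a hypothetical helper name:

```python
def event(description):
    """Wrap an audio event description in full-width parentheses."""
    return f"\uff08{description.strip()}\uff09"  # U+FF08 and U+FF09


line = event("sobbing") + "Why did it have to end this way?"
```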

For more details, see the Xiaomi MiMo TTS official documentation.

Environment Variables

Variable	Default	Description
XIAOMI_API_KEY	-	Xiaomi MiMo API key
XIAOMI_BASE_URL	https://api.xiaomimimo.com	Xiaomi API base URL
XIAOMI_MODEL	mimo-v2-tts	Xiaomi TTS model name
DEFAULT_VOICE	mimo_default	Default voice
PORT	8080	Server port
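
For reference, reading these variables with the defaults from the table might look like the sketch below. `Settings` and `load_settings` are hypothetical names, not the project's actual code, and treating an unset `XIAOMI_API_KEY` as an empty string is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    xiaomi_api_key: str
    xiaomi_base_url: str
    xiaomi_model: str
    default_voice: str
    port: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        xiaomi_api_key=env.get("XIAOMI_API_KEY", ""),
        xiaomi_base_url=env.get("XIAOMI_BASE_URL", "https://api.xiaomimimo.com"),
        xiaomi_model=env.get("XIAOMI_MODEL", "mimo-v2-tts"),
        default_voice=env.get("DEFAULT_VOICE", "mimo_default"),
        port=int(env.get("PORT", "8080")),
    )
```

Accepting a mapping argument keeps the function easy to exercise without touching the real process environment.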

Docker Build

Three build scripts are provided:

build-docker-gitea.ps1 — Push to Gitea (latest tag)
build-docker-gitea-beta.ps1 — Push to Gitea (beta tag)
build-docker-hub.ps1 — Push to Docker Hub
