XiaomiTTS-to-OpenAI-TTS-API

中文

A proxy service that converts the Xiaomi MiMo TTS API into an OpenAI-compatible TTS API, ready to serve as a TTS backend for OpenClaw and other applications.

Features

  • OpenAI-Compatible API — Standard POST /v1/audio/speech endpoint, works with any OpenAI TTS client
  • OpenClaw Support — Drop-in TTS provider for OpenClaw
  • Full Format Support — mp3, wav, opus, aac, flac, pcm; unsupported formats are automatically transcoded
  • Voice Mapping — OpenAI voice names auto-map to MiMo voices; unsupported voices fall back gracefully
  • Style Tags — Web UI with built-in MiMo style and audio event tags, one-click insertion
  • Web UI — Built-in visual TTS testing page at /
  • Docker Ready — Single command deployment

OpenClaw Integration

One of the primary use cases for this service is as a TTS backend for OpenClaw, enabling speech synthesis via Xiaomi MiMo.

Deploy the Proxy

docker run -d -p 8080:8080 -e XIAOMI_API_KEY=sk-your-xiaomi-key yshtcn/xiaomitts-to-openai:latest

Configure OpenClaw

Add or modify the TTS configuration in .openclaw/openclaw.json:

{
  "tts": {
    "auto": "always",
    "provider": "openai",
    "providers": {
      "openai": {
        "apiKey": "sk-your-xiaomi-key",
        "baseUrl": "http://your-server:8080/v1/",
        "model": "mimo-v2-tts",
        "voice": "mimo_default"
      }
    }
  }
}

| Field | Description |
| --- | --- |
| apiKey | Xiaomi MiMo API key (required; passed through to the Xiaomi API) |
| baseUrl | Proxy service URL; must end with /v1/ |
| model | Use mimo-v2-tts |
| voice | mimo_default, default_zh (Chinese female), or default_en (English female) |

Any audio format requested by OpenClaw (including opus) is automatically handled by the proxy.
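
As a quick sanity check before restarting OpenClaw, the constraints listed above can be verified with a few lines of Python. This is a minimal sketch: `check_tts_config` is a hypothetical helper, not part of this project or of OpenClaw, and it checks only the fields shown in the example configuration.

```python
import json


def check_tts_config(config_text):
    """Return a list of problems found in the "tts" block of openclaw.json."""
    tts = json.loads(config_text).get("tts", {})
    provider = tts.get("providers", {}).get("openai", {})
    problems = []
    if tts.get("provider") != "openai":
        problems.append('tts.provider should be "openai"')
    if not provider.get("apiKey"):
        problems.append("apiKey is required")
    if not provider.get("baseUrl", "").endswith("/v1/"):
        problems.append("baseUrl must end with /v1/")
    return problems


sample = """
{"tts": {"auto": "always", "provider": "openai", "providers": {"openai": {
  "apiKey": "sk-your-xiaomi-key",
  "baseUrl": "http://your-server:8080/v1",
  "model": "mimo-v2-tts",
  "voice": "mimo_default"}}}}
"""
print(check_tts_config(sample))  # the sample baseUrl lacks the trailing slash
```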

Quick Start

Docker

docker run -d -p 8080:8080 -e XIAOMI_API_KEY=sk-xxx yshtcn/xiaomitts-to-openai:latest

The Docker image includes ffmpeg out of the box.

Local

  1. Copy .env.example to .env and fill in your Xiaomi API key:
     XIAOMI_API_KEY=sk-your-key-here
  2. Install dependencies and run (a local setup requires ffmpeg for audio format transcoding):
     pip install -r requirements.txt
     python main.py
  3. Open http://localhost:8080 for the Web UI

API Usage

Generate Speech

curl http://localhost:8080/v1/audio/speech \
  -H "Authorization: Bearer sk-your-xiaomi-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"mimo-v2-tts","input":"Hello world","voice":"mimo_default"}' \
  --output speech.mp3
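
The same request can be made from Python with only the standard library. The sketch below mirrors the curl call above; `build_speech_request` and `synthesize` are illustrative names, and the base URL assumes the proxy is running locally on the default port.

```python
import json
import urllib.request


def build_speech_request(text, api_key, base_url="http://localhost:8080",
                         voice="mimo_default", response_format="mp3"):
    """Build the POST /v1/audio/speech request without sending it."""
    payload = {
        "model": "mimo-v2-tts",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, api_key, out_path="speech.mp3", **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, api_key, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path
```

With the proxy running, `synthesize("Hello world", "sk-your-xiaomi-key")` should write `speech.mp3` to the current directory.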

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model name (any value maps to mimo-v2-tts) |
| input | string | Yes | Text to synthesize (max 4096 chars); supports <style> and audio event tags |
| voice | string | Yes | Voice (see the voice list below; unsupported voices fall back to mimo_default) |
| response_format | string | No | Audio format: mp3 (default), wav, opus, aac, flac, pcm |
| speed | number | No | Speed, 0.25–4.0 (default 1.0) |
| instructions | string | No | Natural language instruction |

Voices

| Value | Description |
| --- | --- |
| mimo_default | MiMo default |
| default_zh | Chinese female |
| default_en | English female |

Standard OpenAI voice names (alloy, echo, nova, etc.) are automatically mapped to mimo_default.
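
The mapping rule described here is simple enough to sketch in a few lines. This is an illustration of the documented behavior, not the proxy's actual code: `resolve_voice` is a hypothetical name, and exact, case-sensitive matching is an assumption.

```python
# Voices listed in the table above.
MIMO_VOICES = {"mimo_default", "default_zh", "default_en"}
FALLBACK_VOICE = "mimo_default"


def resolve_voice(requested):
    """Return a MiMo voice for any requested name (assumes exact matching)."""
    name = (requested or "").strip()
    if name in MIMO_VOICES:
        return name
    # OpenAI names (alloy, echo, nova, ...) and unknown names share the fallback.
    return FALLBACK_VOICE


print(resolve_voice("default_zh"))
print(resolve_voice("alloy"))
```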

Audio Format Compatibility

The Xiaomi API supports only mp3, wav, pcm, and pcm16. When the caller requests opus, aac, or flac, the proxy will:

  1. Request the closest supported format from Xiaomi (mp3 or wav)
  2. Transcode to the requested format using ffmpeg
  3. Return with the correct Content-Type

If transcoding fails, the proxy returns the audio in the format it received from Xiaomi instead of returning an error.
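
The three steps and the fallback can be sketched as follows. This is a hedged illustration, not the project's implementation: the function names, the choice of wav as the intermediate format, the ffmpeg arguments, and the Content-Type values are all assumptions.

```python
import subprocess

XIAOMI_FORMATS = {"mp3", "wav", "pcm", "pcm16"}
# Caller format -> ffmpeg muxer name (assumed mapping).
FFMPEG_MUXERS = {"opus": "opus", "aac": "adts", "flac": "flac"}
# Assumed Content-Type values; the proxy may use different ones.
CONTENT_TYPES = {
    "mp3": "audio/mpeg", "wav": "audio/wav", "opus": "audio/ogg",
    "aac": "audio/aac", "flac": "audio/flac",
}


def upstream_format(requested):
    """Step 1: pick the format to request from Xiaomi."""
    if requested in XIAOMI_FORMATS:
        return requested
    return "wav"  # assumption: a lossless source is the closest match


def transcode_or_fallback(audio, source_format, target_format, ffmpeg_bin="ffmpeg"):
    """Step 2: transcode with ffmpeg; on any failure keep the source audio."""
    if target_format == source_format or target_format not in FFMPEG_MUXERS:
        return audio, source_format
    cmd = [ffmpeg_bin, "-hide_banner", "-loglevel", "error",
           "-f", source_format, "-i", "pipe:0",
           "-f", FFMPEG_MUXERS[target_format], "pipe:1"]
    try:
        result = subprocess.run(cmd, input=audio, capture_output=True,
                                check=True, timeout=60)
    except (OSError, subprocess.SubprocessError):
        return audio, source_format  # ffmpeg missing, failed, or timed out
    if not result.stdout:
        return audio, source_format
    return result.stdout, target_format


def content_type(fmt):
    """Step 3: Content-Type for the format that is actually returned."""
    return CONTENT_TYPES.get(fmt, "application/octet-stream")
```

Returning the format alongside the bytes lets the caller set the Content-Type from what was actually produced, so a failed transcode still yields a correctly labeled response.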

Style and Audio Tags

MiMo TTS supports style control and audio event embedding through specific text formats. The Web UI includes shortcut buttons for these tags.

Style Tags

Add a <style> tag at the beginning of the text to specify a style. Use commas to combine multiple styles:

<style>happy</style>Tomorrow is Friday, I'm so excited!
<style>Northeastern dialect</style>哎呀妈呀,这天儿也忒冷了吧!
<style>Cantonese</style>呢个真係好正啊!食过一次就唔会忘记!
<style>singing</style>原谅我这一生不羁放纵爱自由...

Recommended styles (custom styles beyond this list are also supported):

| Category | Styles |
| --- | --- |
| Speed | faster, slower |
| Emotion | happy, sad, angry |
| Role-play | Sun Wukong, Lin Daiyu |
| Style | whisper, cutesy voice, Taiwanese accent |
| Dialect | Northeastern, Sichuan, Henan, Cantonese |
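
When building the input string in code, a small helper keeps the tag at the start of the text and joins several styles with commas. This is a minimal sketch: `with_style` is a hypothetical name, and joining with a plain ASCII comma is an assumption.

```python
def with_style(text, *styles):
    """Prefix text with a <style> tag; multiple styles are comma-separated."""
    cleaned = [s.strip() for s in styles if s and s.strip()]
    if not cleaned:
        return text  # no styles given: leave the text untouched
    return f"<style>{','.join(cleaned)}</style>{text}"


print(with_style("Tomorrow is Friday, I'm so excited!", "happy", "faster"))
```

The result can be passed directly as the `input` field of a speech request.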

Audio Event Tags

Use Chinese parentheses () to wrap audio event descriptions anywhere in the text:

(nervous, deep breath)Calm down... it's just an interview...
(exhausted, barely audible)Driver... wake me when we arrive...
(sobbing)Why did it have to end this way?
(shouting)Hey! Fresh fish here! Just caught this morning!

For more details, see the Xiaomi MiMo TTS official documentation.

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| XIAOMI_API_KEY | - | Xiaomi MiMo API key |
| XIAOMI_BASE_URL | https://api.xiaomimimo.com | Xiaomi API base URL |
| XIAOMI_MODEL | mimo-v2-tts | Xiaomi TTS model name |
| DEFAULT_VOICE | mimo_default | Default voice |
| PORT | 8080 | Server port |
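
For reference, reading these variables with their documented defaults might look like the sketch below. The `Settings` class and `load_settings` function are hypothetical and only illustrate the defaults listed above; the project's own configuration code may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: str
    model: str
    default_voice: str
    port: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default) with defaults."""
    env = os.environ if env is None else env
    return Settings(
        api_key=env.get("XIAOMI_API_KEY", ""),  # no default is documented
        base_url=env.get("XIAOMI_BASE_URL", "https://api.xiaomimimo.com"),
        model=env.get("XIAOMI_MODEL", "mimo-v2-tts"),
        default_voice=env.get("DEFAULT_VOICE", "mimo_default"),
        port=int(env.get("PORT", "8080")),
    )


settings = load_settings({"XIAOMI_API_KEY": "sk-your-key-here", "PORT": "9000"})
print(settings.port, settings.model)
```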

Docker Build

Three build scripts are provided:

  • build-docker-gitea.ps1 — Push to Gitea (latest tag)
  • build-docker-gitea-beta.ps1 — Push to Gitea (beta tag)
  • build-docker-hub.ps1 — Push to Docker Hub