Add local Kokoro TTS server #1208
Open
rophec wants to merge 36 commits into Project-N-E-K-O:main from rophec:tinytts
Changes from 5 commits
Commits
7ed3e85 Add local Kokoro TTS server (rophec)
8c46d01 Merge branch 'Project-N-E-K-O:main' into tinytts (rophec)
99a02ee Address local TTS review feedback (rophec)
0956b20 Merge branch 'main' into tinytts (rophec)
f51c588 Address follow-up local TTS review (rophec)
2499338 Merge remote-tracking branch 'upstream/main' into tinytts (rophec)
14470f9 Fix Kokoro panel voice remapping (rophec)
2002c00 Merge branch 'main' into tinytts (rophec)
42d8cb0 Route websocket custom TTS by endpoint (rophec)
7b584b0 Accept unprefixed local TTS voices (rophec)
55ad618 Add Kokoro profile selection and probe guards (rophec)
3a2a29e Tighten local TTS URL and voice parsing (rophec)
ef6e9e3 Clarify local TTS websocket voice routing (rophec)
02d531f Repair Kokoro endpoint restore and probing (rophec)
a460398 Narrow Kokoro profile inference (rophec)
d1dafa5 Add standalone Kokoro package builder (rophec)
a30fe5a Harden local Kokoro TTS review fixes (rophec)
44d23c0 Fix latest Kokoro review feedback (rophec)
1e874d3 Fix latest local TTS review issues (rophec)
44d1a49 Merge branch 'main' into tinytts (rophec)
fa1602a Merge remote-tracking branch 'upstream/main' into tinytts (rophec)
f336018 refactor(local-tts): centralize Kokoro profile handling (rophec)
41bdca6 Merge remote-tracking branch 'upstream/main' into tinytts (rophec)
3c48d65 fix(scripts): ignore local caches in hygiene checks (rophec)
d7f79c6 fix(local-tts): validate websocket speed and chunk limits (rophec)
4fbc899 feat(local-tts): package portable Kokoro runtime (rophec)
299f7f9 Merge remote-tracking branch 'upstream/main' into tinytts (rophec)
8aa7753 Revert "feat(local-tts): package portable Kokoro runtime" (rophec)
9538a26 Merge remote-tracking branch 'upstream/main' into tinytts (rophec)
b638f96 fix(local-tts): preserve legacy speed and clarify output mode (rophec)
94d66f3 fix(local-tts): enable custom API for Kokoro apply (rophec)
dfc5dd7 fix(local-tts): validate websocket origins (rophec)
3cd9cc5 fix(local-tts): hide Kokoro preset for other TTS (rophec)
97c9aeb fix(local-tts): probe Kokoro server before showing preset (rophec)
b4dbd4c fix(local-tts): allow custom Kokoro local ports (rophec)
b1a0e16 fix(local-tts): keep Kokoro profile and voice aligned (rophec)
@@ -0,0 +1,84 @@

# NEKO Local Lightweight TTS

This service is the first-phase local TTS bridge for NEKO. It deliberately
implements the same WebSocket protocol expected by `local_cosyvoice_worker`, so
NEKO can use it without changing the main TTS pipeline.

## Protocol

Endpoint:

```text
ws://127.0.0.1:50000/v1/audio/speech/stream
```

Client messages:

```json
{"voice":"kokoro:zf_001","speed":1.0}
{"text":"Hello from NEKO."}
{"text":"Local Kokoro TTS test."}
{"event":"end"}
```

Server response:

```text
binary PCM s16le chunks, mono, 22050 Hz
```

NEKO's existing `local_cosyvoice_worker` then resamples this audio to 48 kHz.

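As a rough illustration of the message sequence above, the sketch below builds the JSON text frames a client would send and estimates playback time from the size of the returned PCM. The helper names `build_client_messages` and `pcm_duration_seconds` are hypothetical and not part of this PR. Sending each JSON object as its own WebSocket text frame is an assumption based on the example above.

```python
import json


def build_client_messages(voice: str, speed: float, texts: list[str]) -> list[str]:
    """Return the JSON frames in send order: config, text chunks, end marker."""
    frames = [json.dumps({"voice": voice, "speed": speed})]
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable on the wire.
    frames += [json.dumps({"text": t}, ensure_ascii=False) for t in texts]
    frames.append(json.dumps({"event": "end"}))
    return frames


def pcm_duration_seconds(num_bytes: int, sample_rate: int = 22050) -> float:
    """Duration of mono s16le PCM: 2 bytes per sample, one channel."""
    return num_bytes / (2 * sample_rate)


if __name__ == "__main__":
    for frame in build_client_messages("kokoro:zf_001", 1.0, ["Hello from NEKO."]):
        print(frame)
    print(pcm_duration_seconds(44100))
```

A real client would send these frames with a WebSocket library and collect the binary frames that come back. That part is omitted here to keep the sketch dependency-free.
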
## Start

From the repository root:

```bash
uv run python local_server/local_tts_server/server.py --host 127.0.0.1 --port 50000
```

In NEKO settings, use the existing local custom TTS path:

```text
ws://127.0.0.1:50000
```

Keep the existing custom/GPT-SoVITS toggle enabled, because the current router
uses that switch to route `ws://` custom TTS URLs into `local_cosyvoice_worker`.

## Voice Selector

The service accepts a model prefix in `voice`:

```text
kokoro:<voice>
melotts:<voice>
melo:<voice>
chattts:<voice>
```

If the prefix is missing, `LOCAL_TTS_DEFAULT_MODEL` is used. The default is
`kokoro`.

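The prefix rule can be sketched as a small parser. This is only an illustration: `split_voice` and `KNOWN_PREFIXES` are hypothetical names, and treating `melo` as an alias of `melotts` is an assumption. The server's real parsing code is not shown in this diff.

```python
from __future__ import annotations

import os

# Assumption: `melo` is shorthand for `melotts`; the README lists both prefixes.
KNOWN_PREFIXES = {
    "kokoro": "kokoro",
    "melotts": "melotts",
    "melo": "melotts",
    "chattts": "chattts",
}


def split_voice(voice: str, default_model: str | None = None) -> tuple[str, str]:
    """Split `model:voice` into (model, voice), using the default model if unprefixed."""
    if default_model is None:
        default_model = os.getenv("LOCAL_TTS_DEFAULT_MODEL", "kokoro")
    prefix, sep, rest = voice.partition(":")
    if sep and prefix.lower() in KNOWN_PREFIXES:
        return KNOWN_PREFIXES[prefix.lower()], rest
    # No recognized prefix: treat the whole string as the voice name.
    return default_model, voice


if __name__ == "__main__":
    print(split_voice("kokoro:zf_001", "kokoro"))
    print(split_voice("zf_001", "kokoro"))
```
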
## Kokoro / MeloTTS / ChatTTS

These are exposed through command adapters for now. The command must write a
16-bit WAV file to `{out_file}`. Command templates may use the `{text_file}`,
`{out_file}`, `{voice}`, and `{speed}` placeholders.

The Kokoro launcher defaults to the Chinese-enhanced
`hexgrad/Kokoro-82M-v1.1-zh` model and voice `zf_001`.
If `local_server/local_tts_server/kokoro_models/Kokoro-82M-v1.1-zh` exists,
the launcher uses that local model directory before falling back to Hugging
Face cache/download.

Example overrides (Windows `cmd` syntax):

```bat
set LOCAL_TTS_KOKORO_MODEL_DIR=F:\models\Kokoro-82M-v1.1-zh
set LOCAL_TTS_KOKORO_REPO_ID=hexgrad/Kokoro-82M-v1.1-zh
set LOCAL_TTS_KOKORO_DEFAULT_VOICE=zf_001
set LOCAL_TTS_KOKORO_CMD=python F:\tts_wrappers\kokoro_cli.py "{text_file}" "{out_file}" "{voice}" {speed}
set LOCAL_TTS_MELOTTS_CMD=python F:\tts_wrappers\melotts_cli.py --text-file "{text_file}" --out "{out_file}" --voice "{voice}" --speed {speed}
set LOCAL_TTS_CHATTTS_CMD=python F:\tts_wrappers\chattts_cli.py --text-file "{text_file}" --out "{out_file}" --voice "{voice}" --speed {speed}
```

coderabbitai[bot] marked this conversation as resolved.

ChatTTS is AGPL-3.0. Keep it as an optional external backend unless the product
licensing story is settled.
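
One way a command adapter might expand such a template into an argument list is sketched below. `render_command` is a hypothetical helper, not the server's actual code, which this diff does not show. It splits the template first and substitutes afterwards, so a voice name or path containing spaces stays a single argument and nothing is passed through a shell.

```python
import shlex


def render_command(template: str, text_file: str, out_file: str,
                   voice: str, speed: float) -> list[str]:
    """Expand {text_file}/{out_file}/{voice}/{speed} into an argv list."""
    values = {
        "{text_file}": text_file,
        "{out_file}": out_file,
        "{voice}": voice,
        "{speed}": str(speed),
    }
    argv = []
    # posix=False keeps Windows backslashes intact; quotes stay on the tokens.
    for token in shlex.split(template, posix=False):
        if len(token) >= 2 and token[0] == token[-1] == '"':
            token = token[1:-1]  # drop the surrounding double quotes
        for placeholder, value in values.items():
            token = token.replace(placeholder, value)
        argv.append(token)
    return argv


if __name__ == "__main__":
    template = r'python F:\tts_wrappers\kokoro_cli.py "{text_file}" "{out_file}" "{voice}" {speed}'
    print(render_command(template, "in.txt", "out.wav", "zf_001", 1.0))
```

The resulting list could be handed to `subprocess.run(argv, check=True)` without `shell=True`.
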
@@ -0,0 +1,202 @@

```python
"""Minimal Kokoro v1.1-zh CLI wrapper for local_tts_server.

Usage:
    python kokoro_cli.py <text_file> <out_file> <voice> <speed>

Reads text from <text_file>, synthesizes with kokoro, writes WAV to <out_file>.
"""

from __future__ import annotations

import argparse
import os
import sys
import wave
from pathlib import Path

import numpy as np


DEFAULT_REPO_ID = "hexgrad/Kokoro-82M-v1.1-zh"
DEFAULT_VOICE = "zf_001"
SAMPLE_RATE = 24000
SCRIPT_DIR = Path(__file__).resolve().parent
DEFAULT_LOCAL_REPO = SCRIPT_DIR / "kokoro_models" / "Kokoro-82M-v1.1-zh"


def _audio_from_result(result):
    # Accept result objects exposing `.audio` as well as plain tuples (audio last).
    if hasattr(result, "audio"):
        return result.audio
    if isinstance(result, tuple) and result:
        return result[-1]
    return None


def _infer_lang_code(voice: str) -> str:
    # Kokoro uses single-letter lang codes: z=zh, a=en-us, b=en-gb.
    if voice.startswith(("a", "af", "am")):
        return "a"
    if voice.startswith(("b", "bf", "bm")):
        return "b"
    return "z"


def _speed_callable(base_speed: float):
    """Mitigate rushed long Chinese phoneme sequences in v1.1-zh."""

    base = base_speed if base_speed > 0 else 1.0

    def speed_by_len(len_ps: int) -> float:
        # Slow down linearly between 83 and 183 phonemes, then hold at 0.8.
        speed = 1.0
        if len_ps > 83 and len_ps < 183:
            speed = 1.0 - (len_ps - 83) / 500.0
        elif len_ps >= 183:
            speed = 0.8
        return max(0.5, speed * base)

    return speed_by_len


def _resolve_local_model_dir() -> Path | None:
    raw = os.getenv("LOCAL_TTS_KOKORO_MODEL_DIR", "").strip()
    if raw:
        path = Path(raw)
        return path if path.is_dir() else None
    return DEFAULT_LOCAL_REPO if DEFAULT_LOCAL_REPO.is_dir() else None


def _find_model_file(model_dir: Path) -> Path | None:
    preferred = model_dir / "kokoro-v1_1-zh.pth"
    if preferred.is_file():
        return preferred
    candidates = sorted(model_dir.glob("*.pth"))
    return candidates[0] if candidates else None


def _resolve_voice(voice: str, model_dir: Path | None) -> str:
    # Prefer a local voices/<voice>.pt file; otherwise pass the name through.
    if not model_dir:
        return voice
    if voice.endswith(".pt"):
        return voice
    local_voice = model_dir / "voices" / f"{voice}.pt"
    return str(local_voice) if local_voice.is_file() else voice


def _available_local_voices(model_dir: Path | None) -> set[str]:
    if not model_dir:
        return set()
    voices_dir = model_dir / "voices"
    if not voices_dir.is_dir():
        return set()
    return {path.stem for path in voices_dir.glob("*.pt") if path.is_file()}


def synthesize(text_path: str, out_path: str, voice: str, speed: float) -> int:
    try:
        import torch
        from kokoro import KModel, KPipeline
    except ImportError:
        print(
            'kokoro v1.1-zh deps missing. Run: uv pip install "kokoro>=0.8.2" "misaki[zh]>=0.8.2"',
            file=sys.stderr,
        )
        return 1

    text = Path(text_path).read_text(encoding="utf-8").strip()
    if not text:
        print("Empty text file", file=sys.stderr)
        return 1

    model_dir = _resolve_local_model_dir()
    repo_id = os.getenv("LOCAL_TTS_KOKORO_REPO_ID", DEFAULT_REPO_ID).strip() or DEFAULT_REPO_ID
    # Strip the env default as well, so a blank or padded value cannot leak through.
    default_voice = os.getenv("LOCAL_TTS_KOKORO_DEFAULT_VOICE", DEFAULT_VOICE).strip() or DEFAULT_VOICE
    voice = (voice or "").strip() or default_voice
    available_voices = _available_local_voices(model_dir)
    if available_voices and voice not in available_voices:
        fallback_voice = default_voice
        if fallback_voice not in available_voices:
            fallback_voice = sorted(available_voices)[0]
        print(
            f"Kokoro voice '{voice}' not found in local model dir; falling back to '{fallback_voice}'.",
            file=sys.stderr,
        )
        voice = fallback_voice
    pipeline_voice = _resolve_voice(voice, model_dir)
    lang = _infer_lang_code(voice)
    device = os.getenv("LOCAL_TTS_KOKORO_DEVICE", "").strip()
    if not device:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    if model_dir:
        config_path = model_dir / "config.json"
        model_path = _find_model_file(model_dir)
        if not config_path.is_file() or model_path is None:
            print(
                f"Invalid LOCAL_TTS_KOKORO_MODEL_DIR: {model_dir} "
                "(expected config.json and a .pth model file)",
                file=sys.stderr,
            )
            return 1
        model = KModel(repo_id=repo_id, config=str(config_path), model=str(model_path)).to(device).eval()
    else:
        model = KModel(repo_id=repo_id).to(device).eval()

    en_pipeline = None
    en_callable = None
    if lang == "z":
        # English G2P pipeline (no model) for English words embedded in Chinese text.
        en_pipeline = KPipeline(lang_code="a", repo_id=repo_id, model=False)

        def en_callable(text_part: str):
            if text_part == "Kokoro":
                return "kˈOkəɹO"
            if text_part == "Sol":
                return "sˈOl"
            return next(en_pipeline(text_part)).phonemes

    pipeline = KPipeline(
        lang_code=lang,
        repo_id=repo_id,
        model=model,
        en_callable=en_callable,
    )
    effective_speed = _speed_callable(speed) if lang == "z" else speed
    generator = pipeline(text, voice=pipeline_voice, speed=effective_speed)

    chunks: list[np.ndarray] = []
    for result in generator:
        audio = _audio_from_result(result)
        if audio is not None:
            chunks.append(np.asarray(audio, dtype=np.float32))

    if not chunks:
        print("No audio generated", file=sys.stderr)
        return 1

    # Convert float audio in [-1, 1] to 16-bit signed PCM.
    pcm = np.concatenate(chunks)
    pcm = np.clip(pcm, -1.0, 1.0)
    pcm_int16 = (pcm * 32767.0).astype(np.int16)

    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm_int16.tobytes())

    print(
        f"Wrote {out_path}: {len(pcm_int16)} samples @ {SAMPLE_RATE} Hz "
        f"repo={repo_id} model_dir={model_dir or '<hf-cache>'} voice={voice} device={device}"
    )
    return 0


def main() -> int:
    parser = argparse.ArgumentParser(description="Kokoro CLI wrapper for local_tts")
    parser.add_argument("text_file")
    parser.add_argument("out_file")
    parser.add_argument("voice")
    parser.add_argument("speed", type=float)
    args = parser.parse_args()
    return synthesize(args.text_file, args.out_file, args.voice, args.speed)


if __name__ == "__main__":
    raise SystemExit(main())
```
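
To make the length-dependent speed schedule in `_speed_callable` concrete, here is a standalone restatement of the same arithmetic with a few sample inputs. `speed_for_length` is a hypothetical name used only for this illustration. The thresholds (83, 183), the slope (1/500), the 0.8 plateau, and the 0.5 floor are taken from the diff.

```python
def speed_for_length(len_ps: int, base_speed: float = 1.0) -> float:
    """Speed multiplier for a phoneme sequence of length len_ps."""
    base = base_speed if base_speed > 0 else 1.0  # non-positive base falls back to 1.0
    if 83 < len_ps < 183:
        speed = 1.0 - (len_ps - 83) / 500.0  # linear ramp from 1.0 down toward 0.8
    elif len_ps >= 183:
        speed = 0.8  # plateau for long sequences
    else:
        speed = 1.0  # short sequences are left untouched
    return max(0.5, speed * base)  # never go below half speed


if __name__ == "__main__":
    for n in (50, 83, 133, 182, 183, 300):
        print(n, round(speed_for_length(n), 3))
```

For example, a 133-phoneme chunk at base speed 1.0 gets `1.0 - 50/500 = 0.9`. With a base speed of 0.5, a long chunk would compute `0.8 * 0.5 = 0.4`, which the floor raises to 0.5.
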
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||