Feature request: silent mode / no-TTS option (bring-your-own voiceover) #74

@iavankda

Summary

Add a "silent mode" option to POST /api/short-video that produces the video without TTS narration, returning visuals + captions + optional background music only. Users who want spoken audio record their own voiceover externally and mux it in post.

Proposed API shape (either works):

{
  "config": { "voice": "none" }
}

or

{
  "config": { "tts": false }
}

When set, the render skips the Kokoro TTS step entirely; timing comes from durationInSeconds per scene (or the sum of paddingBack values).
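A minimal sketch of how the server could detect the opt-out under either proposed shape. The function name `is_silent` is hypothetical and not part of the project; the sketch assumes the `config` object arrives as a plain dictionary.

```python
def is_silent(config: dict) -> bool:
    """Return True when the request opts out of TTS under either proposed shape."""
    # Shape 1: {"config": {"voice": "none"}}
    if config.get("voice") == "none":
        return True
    # Shape 2: {"config": {"tts": false}}. Only an explicit False counts,
    # so a missing key keeps today's behavior (TTS enabled).
    return config.get("tts") is False
```

Checking for an explicit `False` rather than any falsy value means existing clients that never send `tts` are unaffected.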

Motivation

Non-English content is blocked today

Kokoro only ships English voices (28 across af_*, am_*, bf_*, bm_*). Submitting non-English text either produces phonetically mangled pronunciation or crashes the render process: I hit a pod restart today after submitting Brazilian Portuguese text, which lost all in-memory video state for that session. The current workarounds are unpleasant:

  1. Use English TTS and live with robotic/mismatched narration for non-EN audiences (not viable for brand work).
  2. Render with placeholder English text, then strip audio with ffmpeg, then mux in a human voiceover — three manual steps per video.

A silent mode would collapse that to one render + one mux (or zero mux if music-only is fine).
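For reference, the single remaining mux step could look roughly like the sketch below, which builds an ffmpeg command that keeps the rendered video stream and attaches an external voiceover. The helper name `mux_command` and the file names are hypothetical. Note that this version replaces any audio already in the render; keeping background music under the voiceover would need an audio mix filter instead.

```python
from typing import List


def mux_command(video_path: str, voiceover_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg argv that pairs the rendered video with an external voiceover."""
    return [
        "ffmpeg",
        "-i", video_path,        # input 0: the silent render
        "-i", voiceover_path,    # input 1: the recorded voiceover
        "-map", "0:v:0",         # take the video stream from input 0
        "-map", "1:a:0",         # take the audio stream from input 1
        "-c:v", "copy",          # do not re-encode the video
        "-c:a", "aac",           # encode the voiceover as AAC for MP4
        "-shortest",             # stop at the end of the shorter stream
        output_path,
    ]
```

The list can be passed to `subprocess.run(mux_command("render.mp4", "vo.wav", "final.mp4"), check=True)` on a machine with ffmpeg installed.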

Beyond non-English users

Silent mode is also valuable for creators who want:

  • Human voice for trust — founder narration, customer testimonial audio, recorded interview clips. TTS can't substitute for a known voice.
  • Multi-language distribution — render visuals once, overlay different VO tracks per locale.
  • Higher production quality on a budget — self-recorded VO on a decent mic beats Kokoro for brand content, at zero API cost.
  • Integration with other TTS providers — users who already pay for ElevenLabs/Azure/Google TTS in another pipeline can feed output into the mux step.

Proposed behavior

  • If config.voice === "none" (or config.tts === false), skip TTS entirely.
  • Captions are still rendered on-screen if config.captionPosition or config.captionStyle is set — they're visual elements, not audio-derived.
  • Scene timing:
    • If durationInSeconds is set (global or per-scene), honor it.
    • Otherwise, use the sum of paddingBack values, or a sensible default (e.g., 3s per scene).
  • Output: MP4 with video track + optional music track + no voice track.
  • GET /api/voices should include "none" in the list (or null) for clients enumerating options.
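The scene-timing fallback above could be sketched as follows. This is one reading of the proposal, not the project's code: the function names are hypothetical, scenes and config are assumed to be plain dictionaries, and paddingBack is assumed to be in milliseconds, which this issue does not state.

```python
from typing import List

DEFAULT_SCENE_SECONDS = 3.0  # the "sensible default" suggested above


def scene_duration(scene: dict, config: dict) -> float:
    """Resolve one scene's duration in seconds when TTS is skipped."""
    # 1. A per-scene durationInSeconds wins, then the global one.
    for source in (scene, config):
        value = source.get("durationInSeconds")
        if value is not None:
            return float(value)
    # 2. Otherwise fall back to paddingBack (ASSUMPTION: milliseconds).
    padding_ms = scene.get("paddingBack", config.get("paddingBack"))
    if padding_ms:
        return padding_ms / 1000.0
    # 3. Nothing set: use the default.
    return DEFAULT_SCENE_SECONDS


def total_duration(scenes: List[dict], config: dict) -> float:
    """Sum the resolved durations of all scenes."""
    return sum(scene_duration(scene, config) for scene in scenes)
```

With per-scene resolution, a video whose scenes only carry paddingBack ends up with a total length equal to the sum of those values, which matches the fallback described in the list.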

Context

Using the service via https://remotion.abckx.com.br in production for a small Brazilian-Portuguese content holding. Happy to test a PR or provide real-world payload examples if that helps. Thanks for the work on the project — the REST API is really clean once you get past the language limit.
