[Feature Request] Could the tts generator support the output arg too? #225

@xfyuan

First, thanks so much for your great work!

I noticed that the stt command supports an --output argument:

python -m mlx_audio.stt.generate --help
usage: generate.py [-h] --model MODEL --audio AUDIO --output OUTPUT [--format {txt,srt,vtt,json}] [--verbose]
                   [--max_tokens MAX_TOKENS]

Generate transcriptions from audio files

options:
  -h, --help            show this help message and exit
  --model MODEL         Path to the model
  --audio AUDIO         Path to the audio file
  --output OUTPUT       Path to save the output
  --format {txt,srt,vtt,json}
                        Output format (txt, srt, vtt, or json)
  --verbose             Verbose output
  --max_tokens MAX_TOKENS
                        Maximum number of new tokens to generate

But the tts command does NOT support it:

python -m mlx_audio.tts.generate --help
usage: generate.py [-h] [--model MODEL] [--max_tokens MAX_TOKENS] [--text TEXT] [--voice VOICE] [--speed SPEED]
                   [--gender GENDER] [--pitch PITCH] [--lang_code LANG_CODE] [--file_prefix FILE_PREFIX] [--verbose]
                   [--join_audio] [--play] [--audio_format AUDIO_FORMAT] [--ref_audio REF_AUDIO] [--ref_text REF_TEXT]
                   [--stt_model STT_MODEL] [--temperature TEMPERATURE] [--top_p TOP_P] [--top_k TOP_K]
                   [--repetition_penalty REPETITION_PENALTY] [--stream] [--streaming_interval STREAMING_INTERVAL]

Generate audio from text using TTS.

options:
  -h, --help            show this help message and exit
  --model MODEL         Path or repo id of the model
  --max_tokens MAX_TOKENS
                        Maximum number of tokens to generate
  --text TEXT           Text to generate (leave blank to input via stdin)
  --voice VOICE         Voice name
  --speed SPEED         Speed of the audio
  --gender GENDER       Gender of the voice [male, female]
  --pitch PITCH         Pitch of the voice
  --lang_code LANG_CODE
                        Language code
  --file_prefix FILE_PREFIX
                        Output file name prefix
  --verbose             Print verbose output
  --join_audio          Join all audio files into one
  --play                Play the output audio
  --audio_format AUDIO_FORMAT
                        Output audio format
  --ref_audio REF_AUDIO
                        Path to reference audio
  --ref_text REF_TEXT   Caption for reference audio
  --stt_model STT_MODEL
                        STT model to use to transcribe reference audio
  --temperature TEMPERATURE
                        Temperature for the model
  --top_p TOP_P         Top-p for the model
  --top_k TOP_K         Top-k for the model
  --repetition_penalty REPETITION_PENALTY
                        Repetition penalty for the model
  --stream              Stream the audio as segments instead of saving to a file
  --streaming_interval STREAMING_INTERVAL
                        The time interval in seconds for streaming segments

Could the tts command support an --output argument too? I think it would be a great addition.

Or is there a technical reason not to implement it?
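
To make the request concrete, here is a minimal sketch of how an --output flag could sit alongside the existing --file_prefix and --audio_format options. It is not taken from the mlx_audio source: the --output flag, the helper names build_parser and resolve_output, and the default values "audio" and "wav" are all assumptions for illustration. The idea is simply that --output, when given, overrides both the prefix and the format.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser with only the flags relevant to this request.
    # The defaults below are assumed, not read from mlx_audio.
    parser = argparse.ArgumentParser(description="Hypothetical TTS CLI sketch")
    parser.add_argument("--file_prefix", default="audio", help="Output file name prefix")
    parser.add_argument("--audio_format", default="wav", help="Output audio format")
    # Hypothetical flag: it does not appear in the current tts help output.
    parser.add_argument("--output", default=None, help="Path to save the output")
    return parser


def resolve_output(args: argparse.Namespace) -> tuple[str, str]:
    """Return (file_prefix, audio_format), letting --output override both."""
    if args.output is None:
        return args.file_prefix, args.audio_format
    path = Path(args.output)
    # Path.suffix keeps the leading dot; fall back to --audio_format if there is no suffix.
    audio_format = path.suffix.lstrip(".") or args.audio_format
    return str(path.with_suffix("")), audio_format


args = build_parser().parse_args(["--output", "speech.mp3"])
print(resolve_output(args))  # ('speech', 'mp3')
```

With this approach the existing flags keep working unchanged, and a path without an extension (for example --output speech --audio_format flac) still picks up its format from --audio_format.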
