Conversation
demos/audio/requirements.txt
Outdated
Use CPU-only versions so it doesn't download CUDA. Check out our other demos; the same issue appears in the OpenWebUI demo.
demos/audio/README.md
Outdated
```
pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
```
```python
from speechbrain.inference.speaker import EncoderClassifier
import sys

file = sys.argv[1]
signal, fs = torchaudio.load(file)
if signal.shape[0] > 1:
    ...
embedding = torch.nn.functional.normalize(embedding, dim=2)
embedding = embedding.squeeze().cpu().numpy().astype("float32")

output_file = sys.argv[2]
```
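The tail of this script (normalize, cast to float32, dump to a `.bin` file) can be illustrated without the speechbrain or torch dependencies. The following is a hypothetical NumPy-only sketch of those final steps, not the demo's actual code; the function name and the 512-dim dummy vector are assumptions for illustration:

```python
import numpy as np
import os
import tempfile

def save_speaker_embedding(embedding, output_file):
    """L2-normalize an embedding vector and write it as raw float32 bytes,
    mirroring the normalize()/astype("float32") steps in the excerpt above."""
    arr = np.asarray(embedding, dtype="float64")
    norm = np.linalg.norm(arr)
    if norm == 0.0:
        raise ValueError("embedding has zero norm")
    normalized = (arr / norm).astype("float32")
    normalized.tofile(output_file)  # raw little-endian float32, no header
    return normalized

# SpeechT5 speaker x-vectors are 512-dimensional, so use a dummy 512-dim vector.
path = os.path.join(tempfile.gettempdir(), "speaker.bin")
vec = np.ones(512)
saved = save_speaker_embedding(vec, path)
loaded = np.fromfile(path, dtype="float32")
```

Reading the file back with `np.fromfile(path, dtype="float32")` recovers the same unit-norm vector, which is a quick sanity check for any embedding file produced this way.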
demos/audio/README.md
Outdated
```diff
- When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recorded as well:
+ When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well:
```
```bash
pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
mkdir -p audio_samples
curl --output audio_samples/audio.wav "https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0032_8k.wav"
mkdir -p models
```
demos/audio/requirements.txt
Outdated
Add the specific versions used in the demo so that dependency updates do not break it:

```
speechbrain=
openai=
```
```python
add_common_arguments(parser_text2speech)
parser_text2speech.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams')
parser_text2speech.add_argument('--vocoder', type=str, help='The vocoder model to use for text2speech. For example microsoft/speecht5_hifigan', dest='vocoder')
parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
```
When making changes to `export_model.py` we should also incorporate them into `ovms --pull` and the ovms parameters. If this applies here, please file a Jira to plan the changes in the OVMS C++ code.
CVS-181526
Pull request overview
This PR improves the audio demo’s speaker-voice workflow by adding a speaker embedding generator, documenting how to use it, and wiring optional voice configuration into the text-to-speech export graph template.
Changes:
- Add `create_speaker_embedding.py` utility to generate a `.bin` speaker embedding from an audio file.
- Extend `export_model.py` (text2speech) to optionally emit a `voices` section in `graph.pbtxt` based on `--speaker_name`/`--speaker_path`.
- Add `demos/audio/requirements.txt` and update the audio demo README with speaker-embedding and request examples.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| demos/common/export_models/export_model.py | Adds CLI args and conditional voices rendering to include speaker embeddings in the TTS graph config. |
| demos/audio/requirements.txt | Introduces demo dependencies needed for embedding creation and OpenAI client usage. |
| demos/audio/create_speaker_embedding.py | New script to produce normalized speaker embeddings from audio input. |
| demos/audio/README.md | Documents how to create embeddings, export with speaker config, and call the endpoint with/without a voice. |
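To picture what the conditional `voices` rendering could look like, here is a hypothetical sketch in plain Python. The snippet layout and field names are illustrative placeholders, not the real `graph.pbtxt` schema or the actual `export_model.py` implementation:

```python
def render_voices_section(speaker_name=None, speaker_path=None):
    """Emit a voices snippet only when both values are set; an empty string
    keeps the default speaker voice."""
    if speaker_name is None and speaker_path is None:
        return ""  # no voices section: default voice is used
    if speaker_name is None or speaker_path is None:
        # fail loudly instead of silently ignoring one of the two arguments
        raise ValueError("speaker_name and speaker_path must be set together")
    return (
        "voices {\n"
        f'  name: "{speaker_name}"\n'
        f'  path: "{speaker_path}"\n'
        "}\n"
    )

snippet = render_voices_section("my_voice", "speaker.bin")
```

Returning an empty string when neither argument is given lets the template concatenate the result unconditionally, so the default-voice path needs no special casing.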
````markdown
### Prepare speaker embeddings
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well:
```bash
pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
```
````
The install command is malformed (`pip install -r pip3 install -r ...`) and will fail as written. Please replace it with a single valid pip command (e.g., `pip3 install -r <url>` or `pip install -r <url>`).
```diff
- pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
+ pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
```
```markdown
### Prepare speaker embeddings
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well:
```
The PR description still contains the default template text (no linked issue/summary and checklist not filled out). Please update the PR description to reflect the actual changes and any validation performed so reviewers can assess intent and risk.
```python
parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
parser_text2speech.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path')
```
`--speaker_name` and `--speaker_path` are intended to be used together, but the CLI currently allows providing only one and silently ignores it (because the template checks both `speaker_name` and `speaker_path`). Consider adding argument validation (e.g., raise an error if exactly one of the two is provided) so misconfiguration is caught early.
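A post-parse check along these lines would surface the problem. This is a minimal standalone sketch with a stripped-down parser (only the two arguments from the diff), not the actual `export_model.py` code:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--speaker_name', type=str, dest='speaker_name')
parser.add_argument('--speaker_path', type=str, dest='speaker_path')

def validate_speaker_args(args):
    # exactly one of the two provided -> misconfiguration, reject early
    if (args.speaker_name is None) != (args.speaker_path is None):
        parser.error("--speaker_name and --speaker_path must be provided together")

ok = parser.parse_args(['--speaker_name', 'my_voice', '--speaker_path', 'speaker.bin'])
validate_speaker_args(ok)  # both provided: passes silently
```

`parser.error()` prints the usage message and exits with status 2, which matches how argparse reports its own argument errors.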
```python
file = sys.argv[1]
signal, fs = torchaudio.load(file)
```
`file` shadows the legacy Python 2 built-in name and is less descriptive. Rename it to something like `input_path` or `audio_path` to avoid confusion.
```diff
- file = sys.argv[1]
- signal, fs = torchaudio.load(file)
+ input_audio_file = sys.argv[1]
+ signal, fs = torchaudio.load(input_audio_file)
```
```python
embedding = classifier.encode_batch(signal)
embedding = torch.nn.functional.normalize(embedding, dim=2)
```
Speaker embedding extraction runs with autograd enabled. Wrapping the encode/normalize steps in `torch.no_grad()` (and optionally setting the model to eval mode) will reduce memory usage and speed up the script for large inputs.
```diff
- embedding = classifier.encode_batch(signal)
- embedding = torch.nn.functional.normalize(embedding, dim=2)
+ classifier.eval()
+ with torch.no_grad():
+     embedding = classifier.encode_batch(signal)
+     embedding = torch.nn.functional.normalize(embedding, dim=2)
```
🛠 Summary
JIRA/Issue if applicable.
Describe the changes.
🧪 Checklist