Speaker embeddings demo improvements#3987

Open
michalkulakowski wants to merge 6 commits into main from mkulakow/speaker_embeddings_demo_fix
Conversation

@michalkulakowski
Collaborator

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.

@michalkulakowski michalkulakowski changed the title improvements Speaker embeddings demo improvements Feb 17, 2026
Collaborator

Use CPU-only versions so it doesn't download CUDA. Check out our other demos; the same issue appears in the OpenWebUI demo.
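One common way to keep the demo CPU-only is to point pip at PyTorch's public CPU wheel index from the requirements file; a sketch of what the requirements could specify (the exact package set for this demo is an assumption):

```
# Pull CPU-only torch wheels instead of the default CUDA builds.
--extra-index-url https://download.pytorch.org/whl/cpu
torch
torchaudio
```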

Collaborator

```python
from speechbrain.inference.speaker import EncoderClassifier
import sys

file = sys.argv[1]
```
Collaborator

check arg size
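A minimal sketch of the requested check, assuming the script takes an input audio path and an output embedding path as its two positional arguments:

```python
import sys

def parse_args(argv):
    # Fail fast with a usage message instead of an IndexError.
    if len(argv) != 3:
        raise SystemExit(f"Usage: {argv[0]} <input_audio.wav> <output_embedding.bin>")
    return argv[1], argv[2]

if __name__ == "__main__":
    input_audio_file, output_file = parse_args(sys.argv)
```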


```python
file = sys.argv[1]
signal, fs = torchaudio.load(file)
if signal.shape[0] > 1:
```
Collaborator

check shape size
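The shape check guards against multi-channel input; a common fix is to average the channels down to mono before encoding (on the torch tensor that would typically be `signal.mean(dim=0, keepdim=True)`, though the intended fix here is an assumption). A dependency-free sketch of the idea:

```python
def downmix_to_mono(channels):
    """Average per-channel samples into a single mono channel.

    channels: list of per-channel sample lists, shape (n_channels, n_samples).
    """
    if not channels:
        raise ValueError("empty signal")
    n_channels = len(channels)
    if n_channels == 1:
        return list(channels[0])
    # Average corresponding samples across channels.
    return [sum(samples) / n_channels for samples in zip(*channels)]
```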

```python
embedding = torch.nn.functional.normalize(embedding, dim=2)
embedding = embedding.squeeze().cpu().numpy().astype("float32")

output_file = sys.argv[2]
```
Collaborator

check arg size

Collaborator

Suggested change
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recorded as well:
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well:

```bash
pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
mkdir -p audio_samples
curl --output audio_samples/audio.wav "https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0032_8k.wav"
mkdir -p models
```
Collaborator

Suggested change
mkdir -p models

@rasapala (Collaborator) Feb 19, 2026

Add the specific versions used in the demo to make sure we do not break when dependency updates change components.
speechbrain=
openai=

```python
add_common_arguments(parser_text2speech)
parser_text2speech.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams')
parser_text2speech.add_argument('--vocoder', type=str, help='The vocoder model to use for text2speech. For example microsoft/speecht5_hifigan', dest='vocoder')
parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
```
Collaborator

When making changes to export_model.py we should also incorporate them into `ovms --pull` and the ovms parameters. If this applies here, please file a JIRA ticket to plan the changes in the ovms C++ code.

Collaborator (Author)

CVS-181526

Copilot AI review requested due to automatic review settings February 19, 2026 13:34
Copilot AI (Contributor) left a comment

Pull request overview

This PR improves the audio demo’s speaker-voice workflow by adding a speaker embedding generator, documenting how to use it, and wiring optional voice configuration into the text-to-speech export graph template.

Changes:

  • Add create_speaker_embedding.py utility to generate a .bin speaker embedding from an audio file.
  • Extend export_model.py (text2speech) to optionally emit a voices section in graph.pbtxt based on --speaker_name/--speaker_path.
  • Add demos/audio/requirements.txt and update the audio demo README with speaker-embedding and request examples.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| demos/common/export_models/export_model.py | Adds CLI args and conditional voices rendering to include speaker embeddings in the TTS graph config. |
| demos/audio/requirements.txt | Introduces demo dependencies needed for embedding creation and OpenAI client usage. |
| demos/audio/create_speaker_embedding.py | New script to produce normalized speaker embeddings from audio input. |
| demos/audio/README.md | Documents how to create embeddings, export with speaker config, and call the endpoint with/without a voice. |


### Prepare speaker embeddings
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well:
```bash
pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
```
Copilot AI Feb 19, 2026

The install command is malformed (pip install -r pip3 install -r ...) and will fail as written. Please replace it with a single valid pip command (e.g., pip3 install -r <url> or pip install -r <url>).

Suggested change
pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt

Copilot uses AI. Check for mistakes.
Comment on lines +19 to +20
### Prepare speaker embeddings
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well:
Copilot AI Feb 19, 2026

The PR description still contains the default template text (no linked issue/summary and checklist not filled out). Please update the PR description to reflect the actual changes and any validation performed so reviewers can assess intent and risk.

Comment on lines +94 to +95
```python
parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
parser_text2speech.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path')
```
Copilot AI Feb 19, 2026

--speaker_name and --speaker_path are intended to be used together, but the CLI currently allows providing only one and silently ignores it (because the template checks speaker_name and speaker_path). Consider adding argument validation (e.g., raise an error if exactly one of the two is provided) so misconfiguration is caught early.
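A sketch of the validation Copilot suggests, using the argument names from the diff (the error-message wording and helper names are assumptions):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--speaker_name', type=str, dest='speaker_name')
    parser.add_argument('--speaker_path', type=str, dest='speaker_path')
    return parser

def validate_speaker_args(args):
    # The two flags only make sense together: reject exactly-one-provided.
    if (args.speaker_name is None) != (args.speaker_path is None):
        raise SystemExit("--speaker_name and --speaker_path must be provided together")
    return args
```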

Comment on lines +14 to +15
```python
file = sys.argv[1]
signal, fs = torchaudio.load(file)
```
Copilot AI Feb 19, 2026

file shadows Python’s built-in file name (from legacy Python 2) and is less descriptive. Rename to something like input_path or audio_path to avoid confusion.

Suggested change
file = sys.argv[1]
signal, fs = torchaudio.load(file)
input_audio_file = sys.argv[1]
signal, fs = torchaudio.load(input_audio_file)

Comment on lines +31 to +32
```python
embedding = classifier.encode_batch(signal)
embedding = torch.nn.functional.normalize(embedding, dim=2)
```
Copilot AI Feb 19, 2026

Speaker embedding extraction runs with autograd enabled. Wrapping the encode/normalize steps in torch.no_grad() (and optionally setting the model to eval mode) will reduce memory usage and speed up the script for large inputs.

Suggested change
embedding = classifier.encode_batch(signal)
embedding = torch.nn.functional.normalize(embedding, dim=2)
classifier.eval()
with torch.no_grad():
    embedding = classifier.encode_batch(signal)
    embedding = torch.nn.functional.normalize(embedding, dim=2)

Copilot uses AI. Check for mistakes.