Speaker embeddings demo improvements#3987

Open
michalkulakowski wants to merge 6 commits into main from mkulakow/speaker_embeddings_demo_fix
Conversation

@michalkulakowski
Collaborator

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.

@michalkulakowski michalkulakowski changed the title improvements Speaker embeddings demo improvements Feb 17, 2026
Collaborator

Use CPU-only versions so it doesn't download CUDA. Check out our other demos; the same issue appears in the OpenWebUI demo.
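One common way to keep the demo CPU-only is to point pip at PyTorch's public CPU wheel index from the requirements file; a sketch of what the requirements could specify (the exact package set for this demo is an assumption):

```
# Pull CPU-only torch wheels instead of the default CUDA builds.
--extra-index-url https://download.pytorch.org/whl/cpu
torch
torchaudio
```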

Collaborator

```python
from speechbrain.inference.speaker import EncoderClassifier
import sys

file = sys.argv[1]
```
Collaborator

check arg size
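A minimal sketch of the requested check, assuming the script takes an input audio path and an output embedding path as its two positional arguments:

```python
import sys

def parse_args(argv):
    # Fail fast with a usage message instead of an IndexError.
    if len(argv) != 3:
        raise SystemExit(f"Usage: {argv[0]} <input_audio.wav> <output_embedding.bin>")
    return argv[1], argv[2]

if __name__ == "__main__":
    input_audio_file, output_file = parse_args(sys.argv)
```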


```python
file = sys.argv[1]
signal, fs = torchaudio.load(file)
if signal.shape[0] > 1:
```
Collaborator

check shape size
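The shape check guards against multi-channel input; a common fix is to average the channels down to mono before encoding (on the torch tensor that would typically be `signal.mean(dim=0, keepdim=True)`, though the intended fix here is an assumption). A dependency-free sketch of the idea:

```python
def downmix_to_mono(channels):
    """Average per-channel samples into a single mono channel.

    channels: list of per-channel sample lists, shape (n_channels, n_samples).
    """
    if not channels:
        raise ValueError("empty signal")
    n_channels = len(channels)
    if n_channels == 1:
        return list(channels[0])
    # Average corresponding samples across channels.
    return [sum(samples) / n_channels for samples in zip(*channels)]
```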

```python
embedding = torch.nn.functional.normalize(embedding, dim=2)
embedding = embedding.squeeze().cpu().numpy().astype("float32")

output_file = sys.argv[2]
```
Collaborator

check arg size

Collaborator

Suggested change
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recorded as well:
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well:

```bash
pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
mkdir -p audio_samples
curl --output audio_samples/audio.wav "https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0032_8k.wav"
mkdir -p models
```
Collaborator

Suggested change
mkdir -p models

@rasapala (Collaborator) Feb 19, 2026

Add the specific versions used in the demo to make sure we do not break when dependency updates change components.
speechbrain=
openai=

```python
add_common_arguments(parser_text2speech)
parser_text2speech.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams')
parser_text2speech.add_argument('--vocoder', type=str, help='The vocoder model to use for text2speech. For example microsoft/speecht5_hifigan', dest='vocoder')
parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
```
Collaborator

When making changes to export_model.py we should also incorporate them into `ovms --pull` and the ovms parameters. If this applies here, please file a JIRA ticket to plan the changes in the ovms C++ code.

Collaborator (Author)

CVS-181526

Copilot AI review requested due to automatic review settings February 19, 2026 13:34
Copilot AI (Contributor) left a comment

Pull request overview

This PR improves the audio demo’s speaker-voice workflow by adding a speaker embedding generator, documenting how to use it, and wiring optional voice configuration into the text-to-speech export graph template.

Changes:

  • Add create_speaker_embedding.py utility to generate a .bin speaker embedding from an audio file.
  • Extend export_model.py (text2speech) to optionally emit a voices section in graph.pbtxt based on --speaker_name/--speaker_path.
  • Add demos/audio/requirements.txt and update the audio demo README with speaker-embedding and request examples.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| demos/common/export_models/export_model.py | Adds CLI args and conditional voices rendering to include speaker embeddings in the TTS graph config. |
| demos/audio/requirements.txt | Introduces demo dependencies needed for embedding creation and OpenAI client usage. |
| demos/audio/create_speaker_embedding.py | New script to produce normalized speaker embeddings from audio input. |
| demos/audio/README.md | Documents how to create embeddings, export with speaker config, and call the endpoint with/without a voice. |


### Prepare speaker embeddings
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well:
```bash
pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
```
Copilot AI Feb 19, 2026

The install command is malformed (pip install -r pip3 install -r ...) and will fail as written. Please replace it with a single valid pip command (e.g., pip3 install -r <url> or pip install -r <url>).

Suggested change
pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt

Copilot uses AI. Check for mistakes.
Comment on lines +19 to +20
### Prepare speaker embeddings
When generating speech you can use default speaker voice or you can prepare your own speaker embedding file. Here you can see how to do it with downloaded file from online repository, but you can try with your own speech recording as well:
Copilot AI Feb 19, 2026

The PR description still contains the default template text (no linked issue/summary and checklist not filled out). Please update the PR description to reflect the actual changes and any validation performed so reviewers can assess intent and risk.

Comment on lines +94 to +95
```python
parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
parser_text2speech.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path')
```
Copilot AI Feb 19, 2026

--speaker_name and --speaker_path are intended to be used together, but the CLI currently allows providing only one and silently ignores it (because the template checks speaker_name and speaker_path). Consider adding argument validation (e.g., raise an error if exactly one of the two is provided) so misconfiguration is caught early.
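A sketch of the validation Copilot suggests, using the argument names from the diff (the error-message wording and helper names are assumptions):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--speaker_name', type=str, dest='speaker_name')
    parser.add_argument('--speaker_path', type=str, dest='speaker_path')
    return parser

def validate_speaker_args(args):
    # The two flags only make sense together: reject exactly-one-provided.
    if (args.speaker_name is None) != (args.speaker_path is None):
        raise SystemExit("--speaker_name and --speaker_path must be provided together")
    return args
```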

Comment on lines +14 to +15
```python
file = sys.argv[1]
signal, fs = torchaudio.load(file)
```
Copilot AI Feb 19, 2026

file shadows Python’s built-in file name (from legacy Python 2) and is less descriptive. Rename to something like input_path or audio_path to avoid confusion.

Suggested change
file = sys.argv[1]
signal, fs = torchaudio.load(file)
input_audio_file = sys.argv[1]
signal, fs = torchaudio.load(input_audio_file)

Comment on lines +31 to +32
```python
embedding = classifier.encode_batch(signal)
embedding = torch.nn.functional.normalize(embedding, dim=2)
```
Copilot AI Feb 19, 2026

Speaker embedding extraction runs with autograd enabled. Wrapping the encode/normalize steps in torch.no_grad() (and optionally setting the model to eval mode) will reduce memory usage and speed up the script for large inputs.

Suggested change
embedding = classifier.encode_batch(signal)
embedding = torch.nn.functional.normalize(embedding, dim=2)
classifier.eval()
with torch.no_grad():
    embedding = classifier.encode_batch(signal)
    embedding = torch.nn.functional.normalize(embedding, dim=2)

Copilot uses AI. Check for mistakes.