Speaker embeddings demo improvements #3987
base: main
Changes from 5 commits
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -16,6 +16,17 @@ Check supported [Speech Recognition Models](https://openvinotoolkit.github.io/op | |||||
| **Client**: curl or Python for using OpenAI client package | ||||||
|
|
||||||
| ## Speech generation | ||||||
| ### Prepare speaker embeddings | ||||||
| When generating speech, you can use the default speaker voice or prepare your own speaker embedding file. The steps below use a file downloaded from an online repository, but you can also use your own recorded speech: | ||||||
| ```bash | ||||||
| pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt | ||||||
|
||||||
| pip install -r pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt | |
| pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/audio/requirements.txt |
| mkdir -p models |
dkalinowski marked this conversation as resolved.
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,25 @@ | ||||||||||||||
| #!/usr/bin/env python3 | ||||||||||||||
| # Copyright (C) 2026 Intel Corporation | ||||||||||||||
| # SPDX-License-Identifier: Apache-2.0 | ||||||||||||||
|
|
||||||||||||||
| import torch | ||||||||||||||
| import torchaudio | ||||||||||||||
| from speechbrain.inference.speaker import EncoderClassifier | ||||||||||||||
| import sys | ||||||||||||||
|
|
||||||||||||||
| file = sys.argv[1] | ||||||||||||||
|
Collaborator
check arg size
||||||||||||||
| signal, fs = torchaudio.load(file) | ||||||||||||||
|
Comment on lines
+14
to
+15
|
||||||||||||||
| file = sys.argv[1] | |
| signal, fs = torchaudio.load(file) | |
| input_audio_file = sys.argv[1] | |
| signal, fs = torchaudio.load(input_audio_file) |
check shape size
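One way to read the "check shape size" request is to validate the tensor returned by `torchaudio.load` before encoding it. By default that call returns a tensor shaped `(channels, samples)`. The sketch below is an assumption about what the reviewer has in mind, not code from this PR: `check_audio_shape` and its `max_channels` limit are hypothetical names, and it works on a plain shape tuple so it can run without torch installed.

```python
def check_audio_shape(shape, max_channels=2):
    """Validate a (channels, samples) audio shape and return it as a tuple.

    Hypothetical helper: the name and the max_channels limit are
    illustrative assumptions, not part of the PR.
    """
    shape = tuple(shape)
    # torchaudio.load returns a 2-D tensor (channels, samples) by default.
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D (channels, samples) shape, got {shape}")
    channels, samples = shape
    if channels < 1 or channels > max_channels:
        raise ValueError(f"unsupported channel count: {channels}")
    if samples == 0:
        raise ValueError("audio file contains no samples")
    return channels, samples
```

In the script this could be called as `check_audio_shape(signal.shape)` right after loading, since `torch.Size` behaves like a tuple.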
Copilot
AI
Feb 19, 2026
Speaker embedding extraction runs with autograd enabled. Wrapping the encode/normalize steps in torch.no_grad() (and optionally setting the model to eval mode) will reduce memory usage and speed up the script for large inputs.
| embedding = classifier.encode_batch(signal) | |
| embedding = torch.nn.functional.normalize(embedding, dim=2) | |
| classifier.eval() | |
| with torch.no_grad(): | |
| embedding = classifier.encode_batch(signal) | |
| embedding = torch.nn.functional.normalize(embedding, dim=2) |
check arg size
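A minimal sketch of the argument check requested here, assuming the script takes exactly one positional argument (the input audio file). The real script may accept more arguments, for example an output path, so the expected count and the usage text are assumptions; `parse_args` is a hypothetical helper name.

```python
import sys


def parse_args(argv):
    """Return the input audio path, or exit with a usage message.

    Assumption: exactly one positional argument is expected. Adjust the
    count if the script also takes an output path.
    """
    if len(argv) != 2:
        prog = argv[0] if argv else "script"
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"Usage: {prog} <input_audio_file>")
    return argv[1]
```

The script would then start with `input_audio_file = parse_args(sys.argv)` instead of indexing `sys.argv[1]` directly, which raises an `IndexError` when the argument is missing.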
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| --extra-index-url "https://download.pytorch.org/whl/cpu" | ||
| torch==2.9.1+cpu | ||
| torchaudio==2.9.1+cpu | ||
| speechbrain | ||
|
||
| openai | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -91,6 +91,9 @@ def add_common_arguments(parser): | |
| add_common_arguments(parser_text2speech) | ||
| parser_text2speech.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams') | ||
| parser_text2speech.add_argument('--vocoder', type=str, help='The vocoder model to use for text2speech. For example microsoft/speecht5_hifigan', dest='vocoder') | ||
| parser_text2speech.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name') | ||
|
Collaborator
When making changes to export_model we should also incorporate them into ovms --pull and ovms parameters. If this applies here please file a jira to plan changes in ovms cpp.
Collaborator
Author
CVS-181526
||
| parser_text2speech.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path') | ||
|
Comment on lines
+94
to
+95
|
||
|
|
||
|
|
||
| parser_speech2text = subparsers.add_parser('speech2text', help='export model for speech2text endpoint') | ||
| add_common_arguments(parser_speech2text) | ||
|
|
@@ -110,7 +113,14 @@ def add_common_arguments(parser): | |
| [type.googleapis.com / mediapipe.T2sCalculatorOptions]: { | ||
| models_path: "{{model_path}}", | ||
| plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }', | ||
| target_device: "{{target_device|default("CPU", true)}}" | ||
| target_device: "{{target_device|default("CPU", true)}}", | ||
| {%- if speaker_name and speaker_path %} | ||
| voices: [ | ||
| { | ||
| name: "{{speaker_name}}", | ||
| path: "{{speaker_path}}" | ||
| } | ||
| ]{% endif %} | ||
| } | ||
| } | ||
| } | ||
|
|
||
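The template above emits the `voices` block only when both `speaker_name` and `speaker_path` are set, so passing just one of them would be dropped without any message. A possible guard, sketched here as an assumption rather than as part of this PR, is to reject a lone option at parse time. The two `add_argument` calls mirror the diff; `build_parser` and `parse_speaker_args` are hypothetical names.

```python
import argparse


def build_parser():
    # Stand-in for the text2speech subparser; only the two new options are shown.
    parser = argparse.ArgumentParser()
    parser.add_argument('--speaker_name', type=str, help='Name of the speaker', dest='speaker_name')
    parser.add_argument('--speaker_path', type=str, help='Path to the speaker.bin file.', dest='speaker_path')
    return parser


def parse_speaker_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Exactly one of the two options set means the voices block would be skipped.
    if bool(args.speaker_name) != bool(args.speaker_path):
        parser.error('--speaker_name and --speaker_path must be given together')
    return args
```

`parser.error` prints the message with the usage line and exits with status 2, so a misconfigured export fails early instead of producing a graph without the custom voice.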