Commit ea90796

Add speech processing use case
1 parent af0e449 commit ea90796

6 files changed: +266 −5 lines changed

site/docs/use-cases/speech-processing.md

Lines changed: 0 additions & 5 deletions
This file was deleted.
Lines changed: 19 additions & 0 deletions
import CodeBlock from '@theme/CodeBlock';

<CodeBlock language="cpp" showLineNumbers>
{`#include "openvino/genai/whisper_pipeline.hpp"
#include "audio_utils.hpp"
#include <filesystem>
#include <iostream>

int main(int argc, char* argv[]) {
    std::filesystem::path models_path = argv[1];
    std::string wav_file_path = argv[2];

    ov::genai::RawSpeechInput raw_speech = utils::audio::read_wav(wav_file_path);

    ov::genai::WhisperPipeline pipe(models_path, "${props.device || 'CPU'}");
    auto result = pipe.generate(raw_speech, ov::genai::max_new_tokens(100));
    std::cout << result << std::endl;
}
`}
</CodeBlock>
Lines changed: 17 additions & 0 deletions
import CodeBlock from '@theme/CodeBlock';

<CodeBlock language="python" showLineNumbers>
{`import openvino_genai as ov_genai
import librosa

def read_wav(filepath):
    raw_speech, samplerate = librosa.load(filepath, sr=16000)
    return raw_speech.tolist()

raw_speech = read_wav('sample.wav')

pipe = ov_genai.WhisperPipeline(model_path, "${props.device || 'CPU'}")
result = pipe.generate(raw_speech, max_new_tokens=100)
print(result)
`}
</CodeBlock>
Lines changed: 41 additions & 0 deletions
import CodeExampleCPP from './_code_example_cpp.mdx';
import CodeExamplePython from './_code_example_python.mdx';

## Run Model Using OpenVINO GenAI

OpenVINO GenAI provides the [`WhisperPipeline`](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.WhisperPipeline.html) for inference of speech processing Whisper models.
You can construct it directly from the folder containing the converted model; it automatically loads the model, tokenizer, detokenizer, and the default generation configuration.

:::info
`WhisperPipeline` expects normalized audio in WAV format with a 16 kHz sampling rate as input.
:::

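Audio at other sampling rates should be resampled before inference; one way to catch a mismatch early is to inspect the WAV header up front. A minimal sketch using only the Python standard library (the helper `check_whisper_input` is ours, not part of the GenAI API):

```python
import wave

def check_whisper_input(path):
    """Verify a WAV file matches the expected 16 kHz input; return its duration."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if rate != 16000:
        raise ValueError(f"expected 16 kHz, got {rate} Hz: resample before inference")
    return duration  # length in seconds
```
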
<LanguageTabs>
  <TabItemPython>
    <Tabs groupId="device">
      <TabItem label="CPU" value="cpu">
        <CodeExamplePython device="CPU" />
      </TabItem>
      <TabItem label="GPU" value="gpu">
        <CodeExamplePython device="GPU" />
      </TabItem>
    </Tabs>
  </TabItemPython>
  <TabItemCpp>
    <Tabs groupId="device">
      <TabItem label="CPU" value="cpu">
        <CodeExampleCPP device="CPU" />
      </TabItem>
      <TabItem label="GPU" value="gpu">
        <CodeExampleCPP device="GPU" />
      </TabItem>
    </Tabs>
  </TabItemCpp>
</LanguageTabs>

:::tip
Switch between CPU and GPU devices without any other code changes.
:::
Lines changed: 168 additions & 0 deletions
import BasicGenerationConfiguration from '@site/docs/use-cases/_shared/_basic_generation_configuration.mdx';
import GenerationConfigurationWorkflow from '@site/docs/use-cases/_shared/_generation_configuration_workflow.mdx';

## Additional Usage Options

:::tip
Check out [Python](https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/whisper_speech_recognition) and [C++](https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/cpp/whisper_speech_recognition) Whisper speech recognition samples.
:::

### Use Different Generation Parameters

<GenerationConfigurationWorkflow />

:::info
For the full list of generation parameters, refer to the [Whisper Generation Config API](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.WhisperGenerationConfig.html).
:::

### Transcription

Whisper models can automatically detect the language of the input audio, or you can specify the language explicitly to improve accuracy:

<LanguageTabs>
  <TabItemPython>
    ```python
    pipe = ov_genai.WhisperPipeline(model_path, "CPU")

    # Automatic language detection
    raw_speech = read_wav("speech_sample.wav")
    result = pipe.generate(raw_speech)

    # Explicitly specify language (English)
    result = pipe.generate(raw_speech, language="<|en|>")

    # French speech sample
    raw_speech = read_wav("french_sample.wav")
    result = pipe.generate(raw_speech, language="<|fr|>")
    ```
  </TabItemPython>
  <TabItemCpp>
    ```cpp
    int main() {
        ov::genai::WhisperPipeline pipe(model_path, "CPU");

        // Automatic language detection
        auto raw_speech = utils::audio::read_wav("speech_sample.wav");
        auto result = pipe.generate(raw_speech);

        // Explicitly specify language (English)
        result = pipe.generate(raw_speech, ov::genai::language("<|en|>"));

        // French speech sample
        raw_speech = utils::audio::read_wav("french_sample.wav");
        result = pipe.generate(raw_speech, ov::genai::language("<|fr|>"));
    }
    ```
  </TabItemCpp>
</LanguageTabs>

### Translation

By default, Whisper performs transcription, keeping the output in the same language as the input.
To translate non-English speech to English, use the `translate` task:

<LanguageTabs>
  <TabItemPython>
    ```python
    pipe = ov_genai.WhisperPipeline(model_path, "CPU")

    # Translate French audio to English
    raw_speech = read_wav("french_sample.wav")
    result = pipe.generate(raw_speech, task="translate")
    ```
  </TabItemPython>
  <TabItemCpp>
    ```cpp
    int main() {
        ov::genai::WhisperPipeline pipe(model_path, "CPU");

        // Translate French audio to English
        auto raw_speech = utils::audio::read_wav("french_sample.wav");
        auto result = pipe.generate(raw_speech, ov::genai::task("translate"));
    }
    ```
  </TabItemCpp>
</LanguageTabs>

### Timestamps Prediction

Whisper can predict timestamps for each segment of speech, which is useful for synchronization or creating subtitles:

<LanguageTabs>
  <TabItemPython>
    ```python
    pipe = ov_genai.WhisperPipeline(model_path, "CPU")

    # Enable timestamp prediction
    result = pipe.generate(raw_speech, return_timestamps=True)

    # Print timestamps and text segments
    for chunk in result.chunks:
        print(f"timestamps: [{chunk.start_ts:.2f}, {chunk.end_ts:.2f}] text: {chunk.text}")
    ```
  </TabItemPython>
  <TabItemCpp>
    ```cpp
    int main() {
        ov::genai::WhisperPipeline pipe(model_path, "CPU");

        // Enable timestamp prediction
        auto result = pipe.generate(raw_speech, ov::genai::return_timestamps(true));

        // Print timestamps and text segments
        for (auto& chunk : *result.chunks) {
            std::cout << "timestamps: [" << chunk.start_ts << ", " << chunk.end_ts
                      << "] text: " << chunk.text << "\n";
        }
    }
    ```
  </TabItemCpp>
</LanguageTabs>

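Because each chunk carries start and end times in seconds, turning a result into subtitles is a short post-processing step. A sketch of an SRT formatter over chunk objects shaped like those above (`start_ts`, `end_ts`, `text`); the `to_srt` helper is ours, not part of the API:

```python
def to_srt(chunks):
    """Format timestamped chunks (start_ts/end_ts in seconds) as SRT subtitles."""
    def fmt(t):
        ms = round(t * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    entries = []
    for i, chunk in enumerate(chunks, start=1):
        entries.append(f"{i}\n{fmt(chunk.start_ts)} --> {fmt(chunk.end_ts)}\n{chunk.text.strip()}\n")
    return "\n".join(entries)
```
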
### Long-Form Audio Processing

Whisper models are designed for audio segments up to 30 seconds in length.
For longer audio, the OpenVINO GenAI Whisper pipeline automatically handles the processing using a sequential chunking algorithm ("sliding window"):

1. The audio is divided into 30-second segments.
2. Each segment is processed sequentially.
3. Results are combined to produce the complete transcription.

This happens automatically when you input longer audio files.

### Using Initial Prompts and Hotwords

You can improve transcription quality and guide the model's output style by providing initial prompts or hotwords using the following parameters:

- `initial_prompt`: initial prompt tokens passed as a previous transcription (after the `<|startofprev|>` token) to the first processing window.
- `hotwords`: hotword tokens passed as a previous transcription (after the `<|startofprev|>` token) to all processing windows.

Whisper models can use this context to better understand the speech and maintain a consistent writing style.
However, prompts do not need to be genuine transcripts from prior audio segments; they can also steer the model toward particular spellings or styles:

<LanguageTabs>
  <TabItemPython>
    ```python
    pipe = ov_genai.WhisperPipeline(model_path, "CPU")

    result = pipe.generate(raw_speech)
    # He has gone and gone for good answered Paul Icrom who...

    result = pipe.generate(raw_speech, initial_prompt="Polychrome")
    # He has gone and gone for good answered Polychrome who...
    ```
  </TabItemPython>
  <TabItemCpp>
    ```cpp
    int main() {
        ov::genai::WhisperPipeline pipe(model_path, "CPU");

        auto result = pipe.generate(raw_speech);
        // He has gone and gone for good answered Paul Icrom who...

        result = pipe.generate(raw_speech, ov::genai::initial_prompt("Polychrome"));
        // He has gone and gone for good answered Polychrome who...
    }
    ```
  </TabItemCpp>
</LanguageTabs>
Lines changed: 21 additions & 0 deletions
---
sidebar_position: 3
---
import OptimumCLI from '@site/src/components/OptimumCLI';
import ConvertModelSection from '../_shared/_convert_model.mdx';
import RunModelSection from './_sections/_run_model/index.mdx';
import UsageOptionsSection from './_sections/_usage_options/index.mdx';

# Speech Processing Using Whisper

<ConvertModelSection>
  Download and convert a model (e.g. [openai/whisper-base](https://huggingface.co/openai/whisper-base)) to OpenVINO format from Hugging Face:

  <OptimumCLI model='openai/whisper-base' outputDir='whisper_ov' trustRemoteCode />

  See all supported [Speech Processing Models](/docs/supported-models/#speech-processing-models-whisper-based).
</ConvertModelSection>

<RunModelSection />

<UsageOptionsSection />
