Commit 3e6f904

Support Huggingface ASR model
1 parent 7e5dc48 commit 3e6f904

11 files changed

Lines changed: 105 additions & 14 deletions

README.md

Lines changed: 7 additions & 0 deletions
@@ -132,6 +132,12 @@ The commands on Colab [![Open In Colab](https://colab.research.google.com/assets

 ```stream-translator-gpt {URL} --language {input_language} --use_openai_transcription_api --openai_api_key {your_openai_key}```

+- Transcribe by a **HuggingFace ASR** model (requires `pip install stream-translator-gpt[hf_asr]`):
+
+```stream-translator-gpt {URL} --model openai/whisper-large-v3-turbo --language {input_language} --use_hf_asr```
+
+Only models with `pipeline_tag: automatic-speech-recognition` on Hugging Face Hub are supported.
+
 - Translate to other language by **Gemini**:

 ```stream-translator-gpt {URL} --model large --language ja --translation_prompt "Translate from Japanese to Chinese" --google_api_key {your_google_key}```
@@ -206,6 +212,7 @@ The commands on Colab [![Open In Colab](https://colab.research.google.com/assets
 | `--use_faster_whisper` | | Set this flag to use Faster-Whisper instead of Whisper. If used with --use_simul_streaming, SimulStreaming with Faster-Whisper as the encoder will be used. |
 | `--use_simul_streaming` | | Set this flag to use SimulStreaming instead of Whisper. If used with --use_faster_whisper, SimulStreaming with Faster-Whisper as the encoder will be used. |
 | `--use_openai_transcription_api` | | Set this flag to use OpenAI transcription API instead of the original local Whisper. |
+| `--use_hf_asr` | | Set this flag to use a HuggingFace ASR model. Use `--model` to specify the model ID. Requires `pip install stream-translator-gpt[hf_asr]`. |
 | `--transcription_filters` | emoji_filter,repetition_filter | Filters apply to transcription results, separated by ",". We provide emoji_filter, repetition_filter and japanese_stream_filter. |
 | `--transcription_initial_prompt` | | General purpose prompt/glossary for transcription. Format: "Word1, Word2, Word3, ...". This text is always included in the prompt passed to the model. |
 | `--disable_transcription_context` | | Set this flag to disable context (previous sentence) propagation in transcription. |
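
The README note above restricts `--use_hf_asr` to models tagged `automatic-speech-recognition` on the Hub. The following is a minimal sketch of checking that tag before launching a stream. It assumes `huggingface_hub` is installed for the lookup; `is_asr_tag` and `lookup_pipeline_tag` are hypothetical helper names and are not part of this commit.

```python
from typing import Optional

ASR_TAG = "automatic-speech-recognition"


def is_asr_tag(tag: Optional[str]) -> bool:
    """Return True only when a Hub pipeline_tag marks a speech-recognition model."""
    return tag == ASR_TAG


def lookup_pipeline_tag(model_id: str) -> Optional[str]:
    """Fetch pipeline_tag from the Hub (needs network access and huggingface_hub)."""
    # Imported lazily so the pure check above works without the optional package.
    from huggingface_hub import model_info
    return model_info(model_id).pipeline_tag


# Example (requires network access):
#   is_asr_tag(lookup_pipeline_tag("openai/whisper-large-v3-turbo"))
```

This sketch is stricter than the check added in `audio_transcriber.py`, which lets a model with no `pipeline_tag` through and only rejects models carrying a different tag.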

README_CN.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -132,6 +132,12 @@ Colab上的命令 [![Open In Colab](https://colab.research.google.com/assets/col
132132
133133
```stream-translator-gpt {网址} --language {输入语言} --use_openai_transcription_api --openai_api_key {您的 OpenAI 密钥}```
134134
135+
- 使用 **HuggingFace ASR** 模型进行转录(需要先执行 `pip install stream-translator-gpt[hf_asr]`):
136+
137+
```stream-translator-gpt {网址} --model openai/whisper-large-v3-turbo --language {输入语言} --use_hf_asr```
138+
139+
仅支持在 Hugging Face Hub 上 `pipeline_tag` 为 `automatic-speech-recognition` 的模型。
140+
135141
- 使用 **Gemini** 翻译成其他语言:
136142
137143
```stream-translator-gpt {网址} --model large --language ja --translation_prompt "翻译以下日语为中文,只输出译文,不要输出原文,在一行内输出" --google_api_key {您的 Google 密钥}```
@@ -206,6 +212,7 @@ Colab上的命令 [![Open In Colab](https://colab.research.google.com/assets/col
206212
| `--use_faster_whisper` | | 设置此标志以使用 Faster-Whisper 进行语音转文字,而不是原始的 OpenAI Whisper。如果与 --use_simul_streaming 一起使用,将使用以 Faster-Whisper 作为编码器的 SimulStreaming。 |
207213
| `--use_simul_streaming` | | 设置此标志以使用 SimulStreaming 进行语音转文字,而不是原始的 OpenAI Whisper。如果与 --use_faster_whisper 一起使用,将使用以 Faster-Whisper 作为编码器的 SimulStreaming。 |
208214
| `--use_openai_transcription_api` | | 设置此标志以使用 OpenAI transcription API,而不是原始的本地 Whisper。 |
215+
| `--use_hf_asr` | | 设置此标志以使用 HuggingFace ASR 模型。通过 `--model` 指定模型 ID。需要先执行 `pip install stream-translator-gpt[hf_asr]`。 |
209216
| `--transcription_filters` | emoji_filter,repetition_filter | 应用于语音转文字结果的过滤器,用 "," 分隔。我们提供 emoji_filter、repetition_filter 和 japanese_stream_filter。 |
210217
| `--transcription_initial_prompt` | | 通用的转录固定提示词/术语表。格式:"提示词1, 提示词2, ..."。此文本将始终包含在传递给模型的提示词中。 |
211218
| `--disable_transcription_context` | | 设置此标志以禁用转录中的上下文(上一句)传递。 |

README_PyPI.md

Lines changed: 6 additions & 0 deletions
@@ -72,6 +72,12 @@ The commands on Colab [![Open In Colab](https://colab.research.google.com/assets

 ```stream-translator-gpt {URL} --language {input_language} --use_openai_transcription_api --openai_api_key {your_openai_key}```

+- Transcribe by a **HuggingFace ASR** model (requires `pip install stream-translator-gpt[hf_asr]`):
+
+```stream-translator-gpt {URL} --model openai/whisper-large-v3-turbo --language {input_language} --use_hf_asr```
+
+Only models with `pipeline_tag: automatic-speech-recognition` on Hugging Face Hub are supported.
+
 - Translate to other language by **Gemini**:

 ```stream-translator-gpt {URL} --model large --language ja --translation_prompt "Translate from Japanese to Chinese" --google_api_key {your_google_key}```

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ dependencies = [

 [project.optional-dependencies]
 webui = ["gradio>=5.0,<6.0", "platformdirs>=4.0"]
+hf_asr = ["transformers>=4.40.0"]

 [project.scripts]
 stream-translator-gpt = "stream_translator_gpt.main:cli"
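
Because `transformers` is declared only under the `hf_asr` extra, a plain install will not have it. One way to surface that early is sketched below, using the standard library's `importlib.util.find_spec`. `missing_extra_hint` is a hypothetical helper name, and the commit itself does not include such a check (it imports `transformers` directly inside `HFTranscriber.__init__`).

```python
import importlib.util
from typing import Optional


def missing_extra_hint(module_name: str, extra: str,
                       package: str = "stream-translator-gpt") -> Optional[str]:
    """Return an install hint if `module_name` is not importable, else None."""
    # find_spec returns None for a top-level module that cannot be located.
    if importlib.util.find_spec(module_name) is not None:
        return None
    return f'"{module_name}" is not installed. Run: pip install "{package}[{extra}]"'


# Possible use before constructing the transcriber:
#   hint = missing_extra_hint("transformers", "hf_asr")
#   if hint:
#       print(hint)
```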

requirements_hf_asr.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+-r requirements.txt
+transformers>=4.40.0

stream_translator_gpt/audio_transcriber.py

Lines changed: 38 additions & 0 deletions
@@ -230,3 +230,41 @@ def transcribe(self, audio: np.array, initial_prompt: str = None) -> tuple[str,
         client = OpenAI(api_key=api_key, http_client=httpx.Client(proxy=self.proxy, verify=False))
         result = client.audio.transcriptions.create(**call_args).text
         return result, None
+
+
+class HFTranscriber(AudioTranscriber):
+
+    def __init__(self, model: str, language: str, proxy: str, **kwargs) -> None:
+        super().__init__(**kwargs)
+        from transformers import pipeline
+
+        if proxy:
+            _apply_hf_proxy(proxy)
+
+        if not os.path.exists(model):
+            try:
+                from huggingface_hub import model_info
+                info = model_info(model)
+                tag = info.pipeline_tag
+                if tag and tag != 'automatic-speech-recognition':
+                    raise ValueError(
+                        f'Model "{model}" has pipeline_tag="{tag}", not "automatic-speech-recognition". '
+                        f'It is not compatible with --use_hf_asr. '
+                        f'Please choose a model with pipeline_tag="automatic-speech-recognition" on HuggingFace Hub.'
+                    )
+            except ImportError:
+                pass
+
+        print(f'{INFO}Loading HuggingFace ASR model: {model}')
+        self.language = language
+        self.pipe = pipeline('automatic-speech-recognition', model=model, device_map='auto')
+
+    def transcribe(self, audio: np.array, initial_prompt: str = None) -> tuple[str, list | None]:
+        generate_kwargs = {}
+        if self.language:
+            generate_kwargs['language'] = self.language
+        result = self.pipe(
+            {'array': audio, 'sampling_rate': SAMPLE_RATE},
+            generate_kwargs=generate_kwargs or None,
+        )
+        return result['text'], None
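
The `transcribe` method above passes the pipeline a dict with `array` and `sampling_rate` keys and forwards `language` through `generate_kwargs` only when one is set. The sketch below isolates that argument assembly so it can be exercised without loading a model. `build_asr_call`, `transcribe_with` and `fake_pipe` are hypothetical names, and `SAMPLE_RATE = 16000` is an assumption about the module constant, which this diff does not show.

```python
from typing import Any, Callable, Dict, Optional, Tuple

SAMPLE_RATE = 16000  # assumption: the module-level constant is not shown in the diff


def build_asr_call(audio: Any, language: Optional[str],
                   sampling_rate: int = SAMPLE_RATE) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build the (inputs, keyword arguments) pair handed to an ASR pipeline."""
    inputs = {"array": audio, "sampling_rate": sampling_rate}
    generate_kwargs = {"language": language} if language else {}
    # As in the diff: pass None rather than an empty dict when no language is set.
    return inputs, {"generate_kwargs": generate_kwargs or None}


def transcribe_with(pipe: Callable[..., Dict[str, Any]], audio: Any,
                    language: Optional[str] = None) -> str:
    """Call any pipeline-like callable and return its 'text' field."""
    inputs, call_kwargs = build_asr_call(audio, language)
    return pipe(inputs, **call_kwargs)["text"]


def fake_pipe(inputs: Dict[str, Any], generate_kwargs: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Stand-in for a transformers pipeline; it only echoes what it received."""
    lang = (generate_kwargs or {}).get("language", "auto")
    return {"text": f"{len(inputs['array'])} samples @ {inputs['sampling_rate']} Hz, language={lang}"}
```

Whether a model accepts a `language` generation argument depends on the model. Whisper-family checkpoints do, but other ASR architectures may not, so guarding or dropping that key for non-Whisper models may be worth considering.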

stream_translator_gpt/main.py

Lines changed: 13 additions & 4 deletions
@@ -16,7 +16,7 @@
 from .common import ApiKeyPool, start_daemon_thread, is_url, WARNING, ERROR, INFO
 from .audio_getter import StreamAudioGetter, LocalFileAudioGetter, DeviceAudioGetter
 from .audio_slicer import AudioSlicer
-from .audio_transcriber import OpenaiWhisper, FasterWhisper, SimulStreaming, RemoteOpenaiTranscriber
+from .audio_transcriber import OpenaiWhisper, FasterWhisper, SimulStreaming, RemoteOpenaiTranscriber, HFTranscriber
 from .llm_translator import LLMClient, ParallelTranslator, SerialTranslator
 from .result_exporter import ResultExporter
 from . import __version__
@@ -25,7 +25,7 @@
 def main(url, openai_api_key, google_api_key, openai_base_url, google_base_url, proxy, format, cookies, input_proxy,
          device_index, device_recording_interval, mic, min_audio_length, max_audio_length, target_audio_length,
          continuous_no_speech_threshold, disable_dynamic_no_speech_threshold, prefix_retention_length, vad_threshold,
-         disable_dynamic_vad_threshold, model, language, use_faster_whisper, use_simul_streaming,
+         disable_dynamic_vad_threshold, model, language, use_faster_whisper, use_simul_streaming, use_hf_asr,
          use_openai_transcription_api, openai_transcription_model, transcription_filters, disable_transcription_context,
          transcription_initial_prompt, gpt_model, gemini_model, translation_prompt, translation_history_size,
          translation_timeout, use_json_result, retry_if_translation_fails, temperature, top_p, top_k, prompt_cache_key,
@@ -97,6 +97,8 @@ def init_transcriber():
                                            language=language,
                                            proxy=processing_proxy,
                                            **common_args)
+        elif use_hf_asr:
+            return HFTranscriber(model=model, language=language, proxy=processing_proxy, **common_args)
         else:
             return OpenaiWhisper(model=model, language=language, **common_args)

@@ -334,6 +336,10 @@ def cli():
         type=str,
         default='gpt-4o-mini-transcribe',
         help='OpenAI\'s transcription model name, whisper-1 / gpt-4o-mini-transcribe / gpt-4o-transcribe')
+    parser.add_argument(
+        '--use_hf_asr',
+        action='store_true',
+        help='Set this flag to use a HuggingFace ASR model (via transformers pipeline) specified by --model.')
     parser.add_argument(
         '--transcription_filters',
         type=str,
@@ -541,11 +547,14 @@ def cli():
     if args['use_openai_transcription_api']:
         transcription_encoder_flag_num += 1
         transcription_decoder_flag_num += 1
+    if args['use_hf_asr']:
+        transcription_encoder_flag_num += 1
+        transcription_decoder_flag_num += 1
     if transcription_encoder_flag_num > 1:
-        print(f'{ERROR}Cannot use Faster Whisper or OpenAI Transcription API at the same time')
+        print(f'{ERROR}Cannot use Faster Whisper, OpenAI Transcription API or HuggingFace ASR at the same time')
         sys.exit(0)
     if transcription_decoder_flag_num > 1:
-        print(f'{ERROR}Cannot use Simul Streaming or OpenAI Transcription API at the same time')
+        print(f'{ERROR}Cannot use Simul Streaming, OpenAI Transcription API or HuggingFace ASR at the same time')
         sys.exit(0)

     if args['use_openai_transcription_api'] and not args['openai_api_key']:
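
The counters in the last hunk treat `--use_hf_asr` like `--use_openai_transcription_api`: it occupies both the encoder slot and the decoder slot, so it cannot be combined with any other backend flag. A minimal sketch of that rule as a pure function follows. It assumes, based on the README text and the error messages rather than on lines shown in this diff, that `use_faster_whisper` counts only as an encoder and `use_simul_streaming` only as a decoder. `backend_conflicts` is a hypothetical name.

```python
from typing import Dict, List

# Assumption: which flags occupy which slot is inferred from the README and error messages.
ENCODER_FLAGS = ("use_faster_whisper", "use_openai_transcription_api", "use_hf_asr")
DECODER_FLAGS = ("use_simul_streaming", "use_openai_transcription_api", "use_hf_asr")


def backend_conflicts(args: Dict[str, bool]) -> List[str]:
    """Return one error message per slot claimed by more than one backend flag."""
    errors = []
    if sum(bool(args.get(flag)) for flag in ENCODER_FLAGS) > 1:
        errors.append("Cannot use Faster Whisper, OpenAI Transcription API "
                      "or HuggingFace ASR at the same time")
    if sum(bool(args.get(flag)) for flag in DECODER_FLAGS) > 1:
        errors.append("Cannot use Simul Streaming, OpenAI Transcription API "
                      "or HuggingFace ASR at the same time")
    return errors
```

Returning the messages instead of printing and exiting keeps the rule testable. A caller could then print each message and exit with a non-zero status, whereas the code in this commit exits with status 0.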

webui/locales/en.json

Lines changed: 3 additions & 1 deletion
@@ -93,5 +93,7 @@
     "program_exited": "Program exited. You can close this tab now.",
     "delete_confirmation": "Are you sure you want to delete this preset?",
     "extra_cli_args": "Extra Arguments",
-    "extra_cli_args_ph": "CLI arguments not available in the UI. They will be appended to the command as-is."
+    "extra_cli_args_ph": "CLI arguments not available in the UI. They will be appended to the command as-is.",
+    "hf_model_name": "Model Name",
+    "hf_model_name_ph": "e.g. openai/whisper-large-v3-turbo"
 }

webui/locales/ja.json

Lines changed: 3 additions & 1 deletion
@@ -93,5 +93,7 @@
     "program_exited": "プログラムは終了しました。このタブを閉じることができます。",
     "delete_confirmation": "このプリセットを削除してもよろしいですか?",
     "extra_cli_args": "追加引数",
-    "extra_cli_args_ph": "UI にない CLI 引数。コマンドにそのまま追加されます。"
+    "extra_cli_args_ph": "UI にない CLI 引数。コマンドにそのまま追加されます。",
+    "hf_model_name": "モデル名",
+    "hf_model_name_ph": "例:openai/whisper-large-v3-turbo"
 }

webui/locales/zh.json

Lines changed: 3 additions & 1 deletion
@@ -93,5 +93,7 @@
     "program_exited": "程序已退出。您现在可以关闭此标签页。",
     "delete_confirmation": "确定要删除此预设吗?",
     "extra_cli_args": "额外参数",
-    "extra_cli_args_ph": "WebUI 中没有的 CLI 参数,将原样追加到命令中。"
+    "extra_cli_args_ph": "WebUI 中没有的 CLI 参数,将原样追加到命令中。",
+    "hf_model_name": "模型名称",
+    "hf_model_name_ph": "例:openai/whisper-large-v3-turbo"
 }
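
The three locale hunks add the same two keys (`hf_model_name`, `hf_model_name_ph`) to `en.json`, `ja.json` and `zh.json`. A small parity check along the following lines could catch a locale that misses a key. `missing_locale_keys` is a hypothetical helper, not something this commit adds, and it takes already-parsed dictionaries so that it makes no assumption about how the web UI loads its files.

```python
from typing import Dict, Set


def missing_locale_keys(locales: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each locale name to the keys it lacks relative to the union of all locales."""
    all_keys: Set[str] = set()
    for entries in locales.values():
        all_keys.update(entries)
    report = {}
    for name, entries in locales.items():
        missing = all_keys - set(entries)
        if missing:  # only report locales that actually lack something
            report[name] = missing
    return report
```

In practice the input could be built by calling `json.load` on each file under `webui/locales/` and keying the result by file name.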
