Support Huggingface ASR model

ionic-bond · ionic-bond · commit 3e6f90443d0a · 2026-05-09T23:39:59.000+08:00
diff --git a/README.md b/README.md
@@ -132,6 +132,12 @@ The commands on Colab [![Open In Colab](https://colab.research.google.com/assets
 
     ```stream-translator-gpt {URL} --language {input_language} --use_openai_transcription_api --openai_api_key {your_openai_key}```
 
+- Transcribe by a **HuggingFace ASR** model (requires `pip install stream-translator-gpt[hf_asr]`):
+
+    ```stream-translator-gpt {URL} --model openai/whisper-large-v3-turbo --language {input_language} --use_hf_asr```
+
+    Only models with `pipeline_tag: automatic-speech-recognition` on Hugging Face Hub are supported.
+
 - Translate to other language by **Gemini**:
 
     ```stream-translator-gpt {URL} --model large --language ja --translation_prompt "Translate from Japanese to Chinese" --google_api_key {your_google_key}```
@@ -206,6 +212,7 @@ The commands on Colab [![Open In Colab](https://colab.research.google.com/assets
 | `--use_faster_whisper`                  |                                | Set this flag to use Faster-Whisper instead of Whisper. If used with --use_simul_streaming, SimulStreaming with Faster-Whisper as the encoder will be used.                                                        |
 | `--use_simul_streaming`                 |                                | Set this flag to use SimulStreaming instead of Whisper. If used with --use_faster_whisper, SimulStreaming with Faster-Whisper as the encoder will be used.                                                         |
 | `--use_openai_transcription_api`        |                                | Set this flag to use OpenAI transcription API instead of the original local Whipser.                                                                                                                               |
+| `--use_hf_asr`                          |                                | Set this flag to use a HuggingFace ASR model. Use `--model` to specify the model ID. Requires `pip install stream-translator-gpt[hf_asr]`.                                                                         |
 | `--transcription_filters`               | emoji_filter,repetition_filter | Filters apply to transcription results, separated by ",". We provide emoji_filter, repetition_filter and japanese_stream_filter.                                                                                   |
 | `--transcription_initial_prompt`        |                                | General purpose prompt/glossary for transcription. Format: "Word1, Word2, Word3, ...". This text is always included in the prompt passed to the model.                                                             |
 | `--disable_transcription_context`       |                                | Set this flag to disable context (previous sentence) propagation in transcription.                                                                                                                                 |
diff --git a/README_CN.md b/README_CN.md
@@ -132,6 +132,12 @@ Colab上的命令 [![Open In Colab](https://colab.research.google.com/assets/col
 
     ```stream-translator-gpt {网址} --language {输入语言} --use_openai_transcription_api --openai_api_key {您的 OpenAI 密钥}```
 
+- 使用 **HuggingFace ASR** 模型进行转录（需要先执行 `pip install stream-translator-gpt[hf_asr]`）：
+
+    ```stream-translator-gpt {网址} --model openai/whisper-large-v3-turbo --language {输入语言} --use_hf_asr```
+
+    仅支持在 Hugging Face Hub 上 `pipeline_tag` 为 `automatic-speech-recognition` 的模型。
+
 - 使用 **Gemini** 翻译成其他语言:
 
     ```stream-translator-gpt {网址} --model large --language ja --translation_prompt "翻译以下日语为中文，只输出译文，不要输出原文，在一行内输出" --google_api_key {您的 Google 密钥}```
@@ -206,6 +212,7 @@ Colab上的命令 [![Open In Colab](https://colab.research.google.com/assets/col
 | `--use_faster_whisper`                  |                                | 设置此标志以使用 Faster-Whisper 进行语音转文字，而不是原始的 OpenAI Whisper。如果与 --use_simul_streaming 一起使用，将使用以 Faster-Whisper 作为编码器的 SimulStreaming。 |
 | `--use_simul_streaming`                 |                                | 设置此标志以使用 SimulStreaming 进行语音转文字，而不是原始的 OpenAI Whisper。如果与 --use_faster_whisper 一起使用，将使用以 Faster-Whisper 作为编码器的 SimulStreaming。  |
 | `--use_openai_transcription_api`        |                                | 设置此标志以使用 OpenAI transcription API，而不是原始的本地 Whisper。                                                                                                     |
+| `--use_hf_asr`                          |                                | 设置此标志以使用 HuggingFace ASR 模型。通过 `--model` 指定模型 ID。需要先执行 `pip install stream-translator-gpt[hf_asr]`。                                                |
 | `--transcription_filters`               | emoji_filter,repetition_filter | 应用于语音转文字结果的过滤器，用 "," 分隔。我们提供 emoji_filter、repetition_filter 和 japanese_stream_filter。                                                           |
 | `--transcription_initial_prompt`        |                                | 通用的转录固定提示词/术语表。格式："提示词1, 提示词2, ..."。此文本将始终包含在传递给模型的提示词中。                                                                      |
 | `--disable_transcription_context`       |                                | 设置此标志以禁用转录中的上下文（上一句）传递。                                                                                                                            |
diff --git a/README_PyPI.md b/README_PyPI.md
@@ -72,6 +72,12 @@ The commands on Colab [![Open In Colab](https://colab.research.google.com/assets
 
     ```stream-translator-gpt {URL} --language {input_language} --use_openai_transcription_api --openai_api_key {your_openai_key}```
 
+- Transcribe by a **HuggingFace ASR** model (requires `pip install stream-translator-gpt[hf_asr]`):
+
+    ```stream-translator-gpt {URL} --model openai/whisper-large-v3-turbo --language {input_language} --use_hf_asr```
+
+    Only models with `pipeline_tag: automatic-speech-recognition` on Hugging Face Hub are supported.
+
 - Translate to other language by **Gemini**:
 
     ```stream-translator-gpt {URL} --model large --language ja --translation_prompt "Translate from Japanese to Chinese" --google_api_key {your_google_key}```
diff --git a/pyproject.toml b/pyproject.toml
@@ -53,6 +53,7 @@ dependencies = [
 
 [project.optional-dependencies]
 webui = ["gradio>=5.0,<6.0", "platformdirs>=4.0"]
+hf_asr = ["transformers>=4.40.0"]
 
 [project.scripts]
 stream-translator-gpt = "stream_translator_gpt.main:cli"
diff --git a/requirements_hf_asr.txt b/requirements_hf_asr.txt
@@ -0,0 +1,2 @@
+-r requirements.txt
+transformers>=4.40.0
diff --git a/stream_translator_gpt/audio_transcriber.py b/stream_translator_gpt/audio_transcriber.py
@@ -230,3 +230,41 @@ def transcribe(self, audio: np.array, initial_prompt: str = None) -> tuple[str,
         client = OpenAI(api_key=api_key, http_client=httpx.Client(proxy=self.proxy, verify=False))
         result = client.audio.transcriptions.create(**call_args).text
         return result, None
+
+
+class HFTranscriber(AudioTranscriber):
+
+    def __init__(self, model: str, language: str, proxy: str, **kwargs) -> None:
+        super().__init__(**kwargs)
+        from transformers import pipeline
+
+        if proxy:
+            _apply_hf_proxy(proxy)
+
+        if not os.path.exists(model):
+            try:
+                from huggingface_hub import model_info
+                info = model_info(model)
+                tag = info.pipeline_tag
+                if tag and tag != 'automatic-speech-recognition':
+                    raise ValueError(
+                        f'Model "{model}" has pipeline_tag="{tag}", not "automatic-speech-recognition". '
+                        f'It is not compatible with --use_hf_asr. '
+                        f'Please choose a model with pipeline_tag="automatic-speech-recognition" on HuggingFace Hub.'
+                    )
+            except ImportError:
+                pass
+
+        print(f'{INFO}Loading HuggingFace ASR model: {model}')
+        self.language = language
+        self.pipe = pipeline('automatic-speech-recognition', model=model, device_map='auto')
+
+    def transcribe(self, audio: np.array, initial_prompt: str = None) -> tuple[str, list | None]:
+        generate_kwargs = {}
+        if self.language:
+            generate_kwargs['language'] = self.language
+        result = self.pipe(
+            {'array': audio, 'sampling_rate': SAMPLE_RATE},
+            generate_kwargs=generate_kwargs or None,
+        )
+        return result['text'], None
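Review note: the `pipeline_tag` check in `HFTranscriber.__init__` above can be read in isolation. The following is a minimal sketch, not the commit's code: `check_asr_tag` and `ASR_TAG` are hypothetical names, and the tag is passed in directly instead of being fetched with `huggingface_hub.model_info`, so the snippet runs without network access. It mirrors the diff's condition, in which a missing tag is let through and only a different, non-empty tag is rejected.

```python
from typing import Optional

# Hypothetical constant; the diff inlines this string.
ASR_TAG = 'automatic-speech-recognition'


def check_asr_tag(model: str, tag: Optional[str]) -> None:
    """Raise ValueError when the Hub reports a non-ASR pipeline tag.

    A missing tag (None or empty string) passes, matching the diff's
    `if tag and tag != 'automatic-speech-recognition'` condition.
    """
    if tag and tag != ASR_TAG:
        raise ValueError(f'Model "{model}" has pipeline_tag="{tag}", not "{ASR_TAG}".')


# An ASR model and an untagged model both pass silently.
check_asr_tag('openai/whisper-large-v3-turbo', 'automatic-speech-recognition')
check_asr_tag('some/untagged-model', None)
```

Note that the diff only catches `ImportError` around the lookup, so any other exception raised by `model_info` (for example a network failure) would propagate out of the constructor.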
diff --git a/stream_translator_gpt/main.py b/stream_translator_gpt/main.py
@@ -16,7 +16,7 @@
 from .common import ApiKeyPool, start_daemon_thread, is_url, WARNING, ERROR, INFO
 from .audio_getter import StreamAudioGetter, LocalFileAudioGetter, DeviceAudioGetter
 from .audio_slicer import AudioSlicer
-from .audio_transcriber import OpenaiWhisper, FasterWhisper, SimulStreaming, RemoteOpenaiTranscriber
+from .audio_transcriber import OpenaiWhisper, FasterWhisper, SimulStreaming, RemoteOpenaiTranscriber, HFTranscriber
 from .llm_translator import LLMClient, ParallelTranslator, SerialTranslator
 from .result_exporter import ResultExporter
 from . import __version__
@@ -25,7 +25,7 @@
 def main(url, openai_api_key, google_api_key, openai_base_url, google_base_url, proxy, format, cookies, input_proxy,
          device_index, device_recording_interval, mic, min_audio_length, max_audio_length, target_audio_length,
          continuous_no_speech_threshold, disable_dynamic_no_speech_threshold, prefix_retention_length, vad_threshold,
-         disable_dynamic_vad_threshold, model, language, use_faster_whisper, use_simul_streaming,
+         disable_dynamic_vad_threshold, model, language, use_faster_whisper, use_simul_streaming, use_hf_asr,
          use_openai_transcription_api, openai_transcription_model, transcription_filters, disable_transcription_context,
          transcription_initial_prompt, gpt_model, gemini_model, translation_prompt, translation_history_size,
          translation_timeout, use_json_result, retry_if_translation_fails, temperature, top_p, top_k, prompt_cache_key,
@@ -97,6 +97,8 @@ def init_transcriber():
                                                language=language,
                                                proxy=processing_proxy,
                                                **common_args)
+            elif use_hf_asr:
+                return HFTranscriber(model=model, language=language, proxy=processing_proxy, **common_args)
             else:
                 return OpenaiWhisper(model=model, language=language, **common_args)
 
@@ -334,6 +336,10 @@ def cli():
         type=str,
         default='gpt-4o-mini-transcribe',
         help='OpenAI\'s transcription model name, whisper-1 / gpt-4o-mini-transcribe / gpt-4o-transcribe')
+    parser.add_argument(
+        '--use_hf_asr',
+        action='store_true',
+        help='Set this flag to use a HuggingFace ASR model (via transformers pipeline) specified by --model.')
     parser.add_argument(
         '--transcription_filters',
         type=str,
@@ -541,11 +547,14 @@ def cli():
     if args['use_openai_transcription_api']:
         transcription_encoder_flag_num += 1
         transcription_decoder_flag_num += 1
+    if args['use_hf_asr']:
+        transcription_encoder_flag_num += 1
+        transcription_decoder_flag_num += 1
     if transcription_encoder_flag_num > 1:
-        print(f'{ERROR}Cannot use Faster Whisper or OpenAI Transcription API at the same time')
+        print(f'{ERROR}Cannot use Faster Whisper, OpenAI Transcription API or HuggingFace ASR at the same time')
         sys.exit(0)
     if transcription_decoder_flag_num > 1:
-        print(f'{ERROR}Cannot use Simul Streaming or OpenAI Transcription API at the same time')
+        print(f'{ERROR}Cannot use Simul Streaming, OpenAI Transcription API or HuggingFace ASR at the same time')
         sys.exit(0)
 
     if args['use_openai_transcription_api'] and not args['openai_api_key']:
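Review note: the two counters in the hunk above decide which backend flags may be combined. A minimal sketch of that logic follows, assuming that `--use_faster_whisper` increments only the encoder counter and `--use_simul_streaming` only the decoder counter; those two increments are outside this hunk and are inferred from the error messages. `count_backend_flags` is a hypothetical helper, not a function in the repository.

```python
def count_backend_flags(args: dict) -> tuple:
    """Return (encoder_count, decoder_count) for the selected backend flags."""
    encoder = decoder = 0
    if args.get('use_faster_whisper'):
        encoder += 1  # assumption: not shown in this hunk
    if args.get('use_simul_streaming'):
        decoder += 1  # assumption: not shown in this hunk
    # Shown in the hunk: these backends occupy both the encoder and decoder slot.
    for flag in ('use_openai_transcription_api', 'use_hf_asr'):
        if args.get(flag):
            encoder += 1
            decoder += 1
    return encoder, decoder


# Faster-Whisper plus SimulStreaming is a valid pair; adding HF ASR is not.
ok = count_backend_flags({'use_faster_whisper': True, 'use_simul_streaming': True})
bad = count_backend_flags({'use_faster_whisper': True, 'use_hf_asr': True})
print(ok, bad)
```

Under these assumptions a count above 1 in either slot corresponds to the `ERROR` branches in the diff, so `--use_hf_asr` cannot be combined with any other backend flag.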
diff --git a/webui/locales/en.json b/webui/locales/en.json
@@ -93,5 +93,7 @@
     "program_exited": "Program exited. You can close this tab now.",
     "delete_confirmation": "Are you sure you want to delete this preset?",
     "extra_cli_args": "Extra Arguments",
-    "extra_cli_args_ph": "CLI arguments not available in the UI. They will be appended to the command as-is."
+    "extra_cli_args_ph": "CLI arguments not available in the UI. They will be appended to the command as-is.",
+    "hf_model_name": "Model Name",
+    "hf_model_name_ph": "e.g. openai/whisper-large-v3-turbo"
 }
diff --git a/webui/locales/ja.json b/webui/locales/ja.json
@@ -93,5 +93,7 @@
     "program_exited": "プログラムは終了しました。このタブを閉じることができます。",
     "delete_confirmation": "このプリセットを削除してもよろしいですか？",
     "extra_cli_args": "追加引数",
-    "extra_cli_args_ph": "UI にない CLI 引数。コマンドにそのまま追加されます。"
+    "extra_cli_args_ph": "UI にない CLI 引数。コマンドにそのまま追加されます。",
+    "hf_model_name": "モデル名",
+    "hf_model_name_ph": "例：openai/whisper-large-v3-turbo"
 }
diff --git a/webui/locales/zh.json b/webui/locales/zh.json
@@ -93,5 +93,7 @@
     "program_exited": "程序已退出。您现在可以关闭此标签页。",
     "delete_confirmation": "确定要删除此预设吗？",
     "extra_cli_args": "额外参数",
-    "extra_cli_args_ph": "WebUI 中没有的 CLI 参数，将原样追加到命令中。"
+    "extra_cli_args_ph": "WebUI 中没有的 CLI 参数，将原样追加到命令中。",
+    "hf_model_name": "模型名称",
+    "hf_model_name_ph": "例：openai/whisper-large-v3-turbo"
 }
diff --git a/webui/webui.py b/webui/webui.py
@@ -81,7 +81,7 @@ def get(self, key):
 INPUT_KEYS = [
     "input_type", "input_url", "device_rec_interval", "audio_source", "input_file", "input_format", "input_cookies",
     "input_proxy", "openai_key", "google_key", "openai_base_url", "google_base_url", "overall_proxy", "model_size",
-    "language", "whisper_backend", "openai_transcription_model", "vad_threshold", "min_audio_len", "max_audio_len",
+    "hf_model_name", "language", "whisper_backend", "openai_transcription_model", "vad_threshold", "min_audio_len", "max_audio_len",
     "target_audio_len", "silence_threshold", "disable_dynamic_vad", "disable_dynamic_silence", "prefix_retention_len",
     "filter_emoji", "filter_repetition", "filter_japanese_stream", "disable_transcription_context",
     "transcription_initial_prompt", "translation_prompt", "translation_provider", "gpt_model", "gemini_model",
@@ -234,6 +234,7 @@ def build_translator_command(
         language,
         whisper_backend,
         openai_transcription_model,
+        hf_model_name,
         vad_threshold,
         min_audio_len,
         max_audio_len,
@@ -357,11 +358,16 @@ def add_arg(flag, value, default_key=None):
     elif whisper_backend == "Faster-Whisper & Simul-Streaming":
         cmd.append("--use_faster_whisper")
         cmd.append("--use_simul_streaming")
+    elif whisper_backend == "HuggingFace ASR":
+        cmd.append("--use_hf_asr")
     elif whisper_backend == "OpenAI Transcription API":
         cmd.append("--use_openai_transcription_api")
         add_arg("--openai_transcription_model", openai_transcription_model, "openai_transcription_model")
 
-    add_arg("--model", model_size, "model_size")
+    if whisper_backend == "HuggingFace ASR":
+        add_arg("--model", hf_model_name)
+    else:
+        add_arg("--model", model_size, "model_size")
     add_arg("--language", language, "language")
     if disable_transcription_context:
         cmd.append("--disable_transcription_context")
@@ -467,6 +473,7 @@ def run_translator(
         language,
         whisper_backend,
         openai_transcription_model,
+        hf_model_name,
         vad_threshold,
         min_audio_len,
         max_audio_len,
@@ -542,6 +549,7 @@ def run_translator(
                                           language=language,
                                           whisper_backend=whisper_backend,
                                           openai_transcription_model=openai_transcription_model,
+                                          hf_model_name=hf_model_name,
                                           vad_threshold=vad_threshold,
                                           min_audio_len=min_audio_len,
                                           max_audio_len=max_audio_len,
@@ -725,6 +733,7 @@ def run_list_formats(url, cookies, input_proxy):
             whisper_backend = gr.Radio(choices=[
                 ("Whisper", "Whisper"), ("Faster-Whisper", "Faster-Whisper"), ("Simul-Streaming", "Simul-Streaming"),
                 ("Faster-Whisper & Simul-Streaming", "Faster-Whisper & Simul-Streaming"),
+                ("HuggingFace ASR", "HuggingFace ASR"),
                 (i18n.get("openai_transcription_api_option"), "OpenAI Transcription API")
             ],
                                        label=i18n.get("transcription_type"),
@@ -749,6 +758,10 @@ def run_list_formats(url, cookies, input_proxy):
                                                          value=get_default("openai_transcription_model"),
                                                          visible=False,
                                                          allow_custom_value=True)
+                hf_model_name = gr.Textbox(label=i18n.get("hf_model_name"),
+                                           placeholder=i18n.get("hf_model_name_ph"),
+                                           visible=False,
+                                           value=get_default("hf_model_name"))
                 language = gr.Dropdown(
                     [
                         "auto", "af", "am", "ar", "as", "az", "ba", "be", "bg", "bn", "bo", "br", "bs", "ca", "cs",
@@ -914,14 +927,16 @@ def update_input_visibility(choice):
     # Whisper Backend Visibility
     def update_backend_visibility(choice):
         openai_visible = (choice == "OpenAI Transcription API")
+        hf_visible = (choice == "HuggingFace ASR")
         return {
             openai_transcription_model: gr.update(visible=openai_visible),
-            model_size: gr.update(visible=not openai_visible),
-            openai_transcription_group: gr.update(visible=openai_visible)
+            model_size: gr.update(visible=not openai_visible and not hf_visible),
+            openai_transcription_group: gr.update(visible=openai_visible),
+            hf_model_name: gr.update(visible=hf_visible),
         }
 
     whisper_backend.change(update_backend_visibility, whisper_backend,
-                           [openai_transcription_model, model_size, openai_transcription_group])
+                           [openai_transcription_model, model_size, openai_transcription_group, hf_model_name])
 
     # Translation Visibility
     def update_translation_visibility(choice):
@@ -977,8 +992,8 @@ def kill():
     start_btn.click(run_translator,
                     inputs=[
                         input_type, input_url, device_rec_interval, audio_source, input_file, input_format,
-                        input_cookies, input_proxy, openai_key, google_key, overall_proxy, model_size, language,
-                        whisper_backend, openai_transcription_model, vad_threshold, min_audio_len, max_audio_len,
+                        input_cookies, input_proxy, openai_key, google_key, overall_proxy, model_size,
+                        language, whisper_backend, openai_transcription_model, hf_model_name, vad_threshold, min_audio_len, max_audio_len,
                         target_audio_len, silence_threshold, disable_dynamic_vad, disable_dynamic_silence,
                         prefix_retention_len, filter_emoji, filter_repetition, filter_japanese_stream,
                         disable_transcription_context, transcription_initial_prompt, translation_prompt,
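Review note: in `build_translator_command` the WebUI now picks the value for `--model` from one of two widgets depending on the selected backend. A minimal sketch of that selection follows. `model_args` is a hypothetical helper, and skipping empty values is an assumption about how `add_arg` behaves, since its body is not part of this diff.

```python
def model_args(whisper_backend: str, model_size: str, hf_model_name: str) -> list:
    """Return the `--model` arguments for the chosen transcription backend."""
    # The HuggingFace ASR backend reads the free-text model ID;
    # every other backend keeps using the model size selector.
    value = hf_model_name if whisper_backend == "HuggingFace ASR" else model_size
    # Assumption: an empty value adds no argument at all.
    return ["--model", value] if value else []


print(model_args("HuggingFace ASR", "small", "openai/whisper-large-v3-turbo"))
print(model_args("Faster-Whisper", "small", "openai/whisper-large-v3-turbo"))
```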
