Add support for partial transcription prefix in the prompt#15449

Open
azziko wants to merge 6 commits into NVIDIA-NeMo:main from azziko:add-decoder-prefix

Conversation


@azziko azziko commented Feb 27, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Adds support for passing a partial transcription of the current audio input as a prompt prefix. This is especially useful in streaming scenarios.

Collection: [ASR]

Changelog

  • Adds a user turn in the Canary2PromptFormatter.

Usage

The prompt can be passed to the top-level .transcribe() function. The partially transcribed part is omitted from the returned hypothesis. The prefix must be used as the last turn:

from nemo.collections.asr.models import ASRModel
from nemo.collections.asr.models.aed_multitask_models import (
    MultiTaskTranscriptionConfig,
    parse_multitask_prompt,
)
from nemo.collections.asr.parts.submodules.multitask_decoding import MultiTaskDecodingConfig

model = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2")
decoding_config = MultiTaskDecodingConfig()
model.change_decoding_strategy(decoding_config)

# The "user_prefix" turn carries the partial transcription and must come last.
turns = [
    {
        "role": "user",
        "slots": {
            "source_lang": "<|en|>",
            "target_lang": "<|en|>",
            "task": "<|transcribe|>",
            "pnc": "<|pnc|>",
        },
    },
    {
        "role": "user_prefix",
        "slots": {
            "prefix": "Partial transcription."
        },
    },
]

prompt = parse_multitask_prompt({"turns": turns})

config = MultiTaskTranscriptionConfig(
    batch_size=1,
    return_hypotheses=True,
    num_workers=0,
    verbose=False,
    prompt=prompt,
    enable_chunking=False,
)

output = model.transcribe("/path/to/your/audio", override_config=config)
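Since the "user_prefix" turn must come last, a caller can guard against malformed turn lists before building the prompt. A pure-Python sketch of that check (illustrative only; this helper is not part of the NeMo API):

```python
def prefix_turn_is_last(turns: list) -> bool:
    """Return True if no 'user_prefix' turn appears before the final turn."""
    return all(turn.get("role") != "user_prefix" for turn in turns[:-1])

good = [
    {"role": "user", "slots": {"task": "<|transcribe|>"}},
    {"role": "user_prefix", "slots": {"prefix": "Partial transcription."}},
]
bad = list(reversed(good))

print(prefix_turn_is_last(good))  # True
print(prefix_turn_is_last(bad))   # False
```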

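Because the prompted prefix is omitted from the returned hypothesis, a streaming caller only needs to append each new hypothesis to its running transcript. A minimal pure-Python sketch of that bookkeeping (variable names are illustrative, not NeMo API):

```python
def update_transcript(running: str, new_hypothesis: str) -> str:
    """Append newly decoded text; the model already excluded the prefix."""
    if not running:
        return new_hypothesis
    return running + " " + new_hypothesis

# Simulated streaming loop: each decode call returns only the new words.
transcript = ""
for chunk_hypothesis in ["Partial transcription.", "And the rest", "of the utterance."]:
    transcript = update_transcript(transcript, chunk_hypothesis)

print(transcript)  # Partial transcription. And the rest of the utterance.
```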
GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list the specific people who can review PRs to various areas.

Additional Information

Signed-off-by: azziko <sharipov.wdev@gmail.com>
@pzelasko
Collaborator

Thank you for a very clean usage example! Does this approach work well with the pretrained canary-v2, or did you train your own model with some modifications for streaming? If it's possible to share any numbers, I'd be curious to learn more.

Can you add the tests either to tests/collections/common/prompt_formatters/test_canary_prompt_formatter.py or create a new test_canary2_prompt_formatter.py?

@pzelasko pzelasko self-requested a review February 27, 2026 15:50
@pzelasko pzelasko self-assigned this Feb 27, 2026
azziko and others added 2 commits February 27, 2026 21:32
Signed-off-by: azziko <sharipov.wdev@gmail.com>
Signed-off-by: azziko <azziko@users.noreply.github.com>
@azziko
Author

azziko commented Feb 27, 2026

Thank you for the quick review!
I have added a set of separate unit tests for the Canary2PromptFormatter.

For my purposes and tests I have been using the pretrained canary-v2 model. My decoding parameters were as follows (let me know if there are any specific numbers I might have missed; I will happily share them too):

    strategy: beam
    compute_hypothesis_token_set: true
    preserve_alignments: null
    confidence_cfg:
      preserve_frame_confidence: false
      preserve_token_confidence: false
      preserve_word_confidence: false
      exclude_blank: true
      aggregation: min
      tdt_include_duration: false
      method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
    compute_langs: false
    greedy:
      temperature: null
      max_generation_delta: -1
      preserve_alignments: false
      preserve_token_confidence: false
      confidence_method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
      n_samples: 1
    beam:
      beam_size: 5
      search_type: default
      len_pen: 1.0
      max_generation_delta: -1
      return_best_hypothesis: true
      preserve_alignments: false
      ngram_lm_model: null
      ngram_lm_alpha: 0.0
      boosting_tree:
        model_path: null
        key_phrases_file: null
        key_phrases_list: null
        key_phrase_items_list: null
        context_score: 1.0
        depth_scaling: 1.0
        unk_score: 0.0
        final_eos_score: 1.0
        score_per_phrase: 0.0
        source_lang: en
        use_triton: true
        uniform_weights: false
        use_bpe_dropout: false
        num_of_transcriptions: 5
        bpe_alpha: 0.3
      boosting_tree_alpha: 0.0
    temperature: 1.0
    return_xattn_scores: true

@pzelasko
Collaborator

Thanks. I was just wondering if you have any WER comparison to other approaches or models - I would have expected canary2 to degrade with this technique.

@azziko
Author

azziko commented Feb 28, 2026

No, not yet. I will share them here once I do.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Mar 1, 2026
@azziko
Author

azziko commented Mar 19, 2026

Hi @pzelasko, I ran some ASR tests on the https://huggingface.co/datasets/ymoslem/acl-6060 dataset in simultaneous mode with alignatt. Here are the average WER results I got, testing on 4 long-form audios from the dataset:

Condition                  avg WER
-----------------------------------
ch-0.1.fr-1                 0.3517
ch-0.1.fr-10                0.1003
ch-0.1.fr-5                 0.1914
ch-0.25.fr-1                0.2722
ch-0.25.fr-10               0.0917
ch-0.25.fr-5                0.1347
ch-0.5.fr-1                 0.2010
ch-0.5.fr-10                0.0922
ch-0.5.fr-5                 0.1154
ch-1.0.fr-1                 0.1379
ch-1.0.fr-10                0.0894
ch-1.0.fr-5                 0.1020

For Whisper it was:

Condition                  avg WER
-----------------------------------
ch-0.1.fr-1                 9.1769
ch-0.1.fr-10                1.4646
ch-0.1.fr-5                 4.6227
ch-0.25.fr-1                2.6347
ch-0.25.fr-10               1.0373
ch-0.25.fr-5                1.4472
ch-0.5.fr-1                 0.5155
ch-0.5.fr-10                0.2890
ch-0.5.fr-5                 0.3273
ch-1.0.fr-1                 0.1575
ch-1.0.fr-10                0.1492
ch-1.0.fr-5                 0.1524

ch is the chunk size in seconds and fr is the alignatt frame threshold. I must admit, however, that I noticed late that these fr numbers are somewhat skewed: Whisper's subsampling factor is 2 as opposed to Canary's 8, so the same fr threshold does not correspond to the same amount of audio, and the results are not directly comparable one-to-one. Still, I wanted to post them here.
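On the subsampling point: assuming a 10 ms encoder input frame shift for both models (an assumption, not stated above), the same fr threshold covers four times more audio for Canary (subsampling factor 8) than for Whisper (factor 2). A quick sketch of the conversion:

```python
FRAME_SHIFT_MS = 10  # assumed 10 ms input frame shift for both models

def frames_to_ms(fr: int, subsampling: int) -> int:
    """Audio duration in ms covered by `fr` post-subsampling encoder frames."""
    return fr * subsampling * FRAME_SHIFT_MS

print(frames_to_ms(10, subsampling=2))  # 200 (Whisper, fr=10)
print(frames_to_ms(10, subsampling=8))  # 800 (Canary, fr=10)
```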

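For readers reproducing these numbers: WER here is the standard word error rate, i.e. word-level edit distance divided by reference length. A minimal self-contained sketch (not the evaluation script actually used above):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            cur = row[j]
            row[j] = min(row[j] + 1,        # deletion
                         row[j - 1] + 1,    # insertion
                         prev + (r != h))   # substitution or match
            prev = cur
    return row[-1] / max(len(ref), 1)

print(word_error_rate("a b c", "a b c"))  # 0.0
print(round(word_error_rate("a b c", "a x c"), 4))  # 0.3333
```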


Development

Successfully merging this pull request may close these issues.

Manual decoding_input_ids/prefix for the decoder injection
