Add support for partial transcription prefix in the prompt#15449

Open
azziko wants to merge 6 commits into NVIDIA-NeMo:main from azziko:add-decoder-prefix

Conversation


@azziko azziko commented Feb 27, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Adds support for passing a partial transcription of the current audio input as a prompt prefix. This is especially useful in streaming scenarios.

Collection: [ASR]

Changelog

  • Adds a user turn in the Canary2PromptFormatter.

Usage

The prompt can be passed to the top-level .transcribe() function. The partially transcribed part is omitted from the returned hypothesis. The prefix must be used as the last turn:

from nemo.collections.asr.models import ASRModel
from nemo.collections.asr.models.aed_multitask_models import (
    MultiTaskTranscriptionConfig,
    parse_multitask_prompt,
)
from nemo.collections.asr.parts.submodules.multitask_decoding import MultiTaskDecodingConfig

model = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2")
decoding_config = MultiTaskDecodingConfig()
model.change_decoding_strategy(decoding_config)

# The "user_prefix" turn carries the partial transcription and must come last.
turns = [
    {
        "role": "user",
        "slots": {
            "source_lang": "<|en|>",
            "target_lang": "<|en|>",
            "task": "<|transcribe|>",
            "pnc": "<|pnc|>",
        },
    },
    {
        "role": "user_prefix",
        "slots": {
            "prefix": "Partial transcription."
        },
    },
]

prompt = parse_multitask_prompt({"turns": turns})

config = MultiTaskTranscriptionConfig(
    batch_size=1,
    return_hypotheses=True,
    num_workers=0,
    verbose=False,
    prompt=prompt,
    enable_chunking=False,
)

output = model.transcribe("/path/to/your/audio", override_config=config)
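Since the "user_prefix" turn must come last, a caller can guard against malformed turn lists before building the prompt. A pure-Python sketch of that check (illustrative only; this helper is not part of the NeMo API):

```python
def prefix_turn_is_last(turns: list) -> bool:
    """Return True if no 'user_prefix' turn appears before the final turn."""
    return all(turn.get("role") != "user_prefix" for turn in turns[:-1])

good = [
    {"role": "user", "slots": {"task": "<|transcribe|>"}},
    {"role": "user_prefix", "slots": {"prefix": "Partial transcription."}},
]
bad = list(reversed(good))

print(prefix_turn_is_last(good))  # True
print(prefix_turn_is_last(bad))   # False
```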

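Because the prompted prefix is omitted from the returned hypothesis, a streaming caller only needs to append each new hypothesis to its running transcript. A minimal pure-Python sketch of that bookkeeping (variable names are illustrative, not NeMo API):

```python
def update_transcript(running: str, new_hypothesis: str) -> str:
    """Append newly decoded text; the model already excluded the prefix."""
    if not running:
        return new_hypothesis
    return running + " " + new_hypothesis

# Simulated streaming loop: each decode call returns only the new words.
transcript = ""
for chunk_hypothesis in ["Partial transcription.", "And the rest", "of the utterance."]:
    transcript = update_transcript(transcript, chunk_hypothesis)

print(transcript)  # Partial transcription. And the rest of the utterance.
```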
GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list the specific people who can review PRs to various areas.

Additional Information

Signed-off-by: azziko <sharipov.wdev@gmail.com>
@pzelasko
Collaborator

Thank you for a very clean usage example! Does this approach work well with the pretrained canary-v2, or did you train your own model with some modifications for streaming? If it's possible to share any numbers, I'd be curious to learn more.

Can you add the tests either to tests/collections/common/prompt_formatters/test_canary_prompt_formatter.py or create a new test_canary2_prompt_formatter.py?

@pzelasko pzelasko self-requested a review February 27, 2026 15:50
@pzelasko pzelasko self-assigned this Feb 27, 2026
azziko and others added 2 commits February 27, 2026 21:32
Signed-off-by: azziko <sharipov.wdev@gmail.com>
Signed-off-by: azziko <azziko@users.noreply.github.com>
@azziko
Author

azziko commented Feb 27, 2026

Thank you for the quick review!
I have added a set of separate unit tests for the Canary2PromptFormatter.

For my purposes and tests I have been using the pretrained canary-v2 model. My decoding parameters were as follows (let me know if there are any specific numbers I might have missed; I will happily share them too):

    strategy: beam
    compute_hypothesis_token_set: true
    preserve_alignments: null
    confidence_cfg:
      preserve_frame_confidence: false
      preserve_token_confidence: false
      preserve_word_confidence: false
      exclude_blank: true
      aggregation: min
      tdt_include_duration: false
      method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
    compute_langs: false
    greedy:
      temperature: null
      max_generation_delta: -1
      preserve_alignments: false
      preserve_token_confidence: false
      confidence_method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
      n_samples: 1
    beam:
      beam_size: 5
      search_type: default
      len_pen: 1.0
      max_generation_delta: -1
      return_best_hypothesis: true
      preserve_alignments: false
      ngram_lm_model: null
      ngram_lm_alpha: 0.0
      boosting_tree:
        model_path: null
        key_phrases_file: null
        key_phrases_list: null
        key_phrase_items_list: null
        context_score: 1.0
        depth_scaling: 1.0
        unk_score: 0.0
        final_eos_score: 1.0
        score_per_phrase: 0.0
        source_lang: en
        use_triton: true
        uniform_weights: false
        use_bpe_dropout: false
        num_of_transcriptions: 5
        bpe_alpha: 0.3
      boosting_tree_alpha: 0.0
    temperature: 1.0
    return_xattn_scores: true

@pzelasko
Collaborator

Thanks. I was just wondering if you have any WER comparison to other approaches or models - I would have expected canary2 to degrade with this technique.

@azziko
Author

azziko commented Feb 28, 2026

No, not yet. I will share them here once I do.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Mar 1, 2026
@azziko
Author

azziko commented Mar 19, 2026

Hi @pzelasko, I ran some ASR tests on the https://huggingface.co/datasets/ymoslem/acl-6060 dataset in simultaneous mode with alignatt. Here are the average WER results I got, testing on 4 long-form audios from the dataset:

Condition                  avg WER
-----------------------------------
ch-0.1.fr-1                 0.3517
ch-0.1.fr-10                0.1003
ch-0.1.fr-5                 0.1914
ch-0.25.fr-1                0.2722
ch-0.25.fr-10               0.0917
ch-0.25.fr-5                0.1347
ch-0.5.fr-1                 0.2010
ch-0.5.fr-10                0.0922
ch-0.5.fr-5                 0.1154
ch-1.0.fr-1                 0.1379
ch-1.0.fr-10                0.0894
ch-1.0.fr-5                 0.1020

For Whisper it was:

Condition                  avg WER
-----------------------------------
ch-0.1.fr-1                 9.1769
ch-0.1.fr-10                1.4646
ch-0.1.fr-5                 4.6227
ch-0.25.fr-1                2.6347
ch-0.25.fr-10               1.0373
ch-0.25.fr-5                1.4472
ch-0.5.fr-1                 0.5155
ch-0.5.fr-10                0.2890
ch-0.5.fr-5                 0.3273
ch-1.0.fr-1                 0.1575
ch-1.0.fr-10                0.1492
ch-1.0.fr-5                 0.1524

ch is the chunk size in seconds and fr is the alignatt frame threshold. I must admit, however, that I noticed late that these fr numbers are somewhat skewed: Whisper's subsampling factor is 2 as opposed to Canary's 8, so the same fr threshold does not correspond to the same amount of audio, and the results are not directly comparable one-to-one. Still, I wanted to post them here.
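On the subsampling point: assuming a 10 ms encoder input frame shift for both models (an assumption, not stated above), the same fr threshold covers four times more audio for Canary (subsampling factor 8) than for Whisper (factor 2). A quick sketch of the conversion:

```python
FRAME_SHIFT_MS = 10  # assumed 10 ms input frame shift for both models

def frames_to_ms(fr: int, subsampling: int) -> int:
    """Audio duration in ms covered by `fr` post-subsampling encoder frames."""
    return fr * subsampling * FRAME_SHIFT_MS

print(frames_to_ms(10, subsampling=2))  # 200 (Whisper, fr=10)
print(frames_to_ms(10, subsampling=8))  # 800 (Canary, fr=10)
```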

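For readers reproducing these numbers: WER here is the standard word error rate, i.e. word-level edit distance divided by reference length. A minimal self-contained sketch (not the evaluation script actually used above):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            cur = row[j]
            row[j] = min(row[j] + 1,        # deletion
                         row[j - 1] + 1,    # insertion
                         prev + (r != h))   # substitution or match
            prev = cur
    return row[-1] / max(len(ref), 1)

print(word_error_rate("a b c", "a b c"))  # 0.0
print(round(word_error_rate("a b c", "a x c"), 4))  # 0.3333
```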


Development

Successfully merging this pull request may close these issues.

Manual decoding_input_ids/prefix for the decoder injection
