
Conversation

@urroxyz

@urroxyz urroxyz commented Nov 13, 2025

This PR introduces two requested features to the ASRInferencePipeline: accurate timestamp extraction (word/character level) and automatic audio chunking for long-form files. Closes #3.

Summary of changes

  • transcribe() now returns a tuple (transcripts, timestamps).
    • Uses frame boundary analysis from the emission logits for CTC models.
    • Uses a novel Dynamic Time Warping (DTW) approach on the decoder's Cross-Attention maps (Text Query $\to$ Audio Key) to enforce monotonic alignment and prevent attention collapse for LLM-based models.
    • Includes language-agnostic logic to automatically detect non-spaced scripts (CJK, Thai, Lao, Khmer, Myanmar) and switch from word-level to character-level timestamps (a sketch of this heuristic follows this list).
  • Input audio exceeding the model's limit of 40s is now automatically chunked. The chunk_len parameter allows users to define this window (default None preserves strict behavior, setting it enables chunking).
  • All implemented using pure torch and numpy without adding heavy dependencies like scipy or librosa.
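
For illustration, the non-spaced script detection can be thought of as a Unicode-range check. A minimal sketch, assuming a simple majority threshold; the function name, exact ranges, and threshold below are illustrative, not the PR's actual code:

```py
# Illustrative sketch only: the function name, Unicode ranges, and the 0.5
# threshold are assumptions, not the implementation in this PR.
def _is_non_spaced_script(text: str) -> bool:
    """Return True if most characters fall in scripts written without spaces."""
    ranges = [
        (0x4E00, 0x9FFF),   # CJK Unified Ideographs
        (0x3040, 0x30FF),   # Hiragana + Katakana
        (0xAC00, 0xD7AF),   # Hangul syllables
        (0x0E00, 0x0E7F),   # Thai
        (0x0E80, 0x0EFF),   # Lao
        (0x1780, 0x17FF),   # Khmer
        (0x1000, 0x109F),   # Myanmar
    ]
    letters = [ch for ch in text if not ch.isspace()]
    if not letters:
        return False
    hits = sum(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in letters)
    return hits / len(letters) > 0.5
```

Text that passes the check is timestamped per character; everything else falls back to whitespace-delimited words.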

Why?

This addresses an important utility gap compared to similar libraries like OpenAI's Whisper. Community demand for timestamp alignment and long-form transcription has been high.

Relevant @mentions: @ijean, @artemru, @d-cota, @huanglizhuo, @wikioai, @nikopartanen, @Aunali321, @marcelgoya, and some others.

How?

  • Created a new module to handle alignment strategies.
    • Maps greedy decoding paths to time frames and interpolates text units between non-blank token transitions for CTC.
    • Performs a secondary "teacher-forced" forward pass with hooks on StandardMultiheadAttention for LLMs. It captures QK similarity matrices, aggregates them across layers, and applies a custom NumPy-based DTW algorithm to find the optimal alignment path (see the DTW sketch after this list).
  • Updated transcribe to handle waveform loading and splitting based on chunk_len.
    • It processes chunks sequentially, aligns them individually, and recalculates global time offsets before returning the merged result.
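
For reference, a minimal NumPy sketch of such a monotonic DTW pass over an aggregated attention map; shapes, the cost construction, and names are illustrative assumptions, not the PR's actual code:

```py
import numpy as np

# Illustrative sketch of monotonic DTW over an aggregated cross-attention map.
# `attn` is assumed to be shaped (num_text_tokens, num_audio_frames); higher
# values mean stronger text-to-audio association.
def dtw_align(attn: np.ndarray) -> list[tuple[int, int]]:
    cost = -attn  # turn similarity into a cost to minimize
    T, F = cost.shape
    acc = np.full((T, F), np.inf)
    acc[0, 0] = cost[0, 0]
    for t in range(T):
        for f in range(F):
            if t == 0 and f == 0:
                continue
            prev = min(
                acc[t - 1, f - 1] if t > 0 and f > 0 else np.inf,  # advance both
                acc[t, f - 1] if f > 0 else np.inf,                # stay on token
                acc[t - 1, f] if t > 0 else np.inf,                # stay on frame
            )
            acc[t, f] = cost[t, f] + prev
    # Backtrack from the bottom-right corner to recover the alignment path.
    t, f = T - 1, F - 1
    path = [(t, f)]
    while t > 0 or f > 0:
        candidates = []
        if t > 0 and f > 0:
            candidates.append((acc[t - 1, f - 1], (t - 1, f - 1)))
        if f > 0:
            candidates.append((acc[t, f - 1], (t, f - 1)))
        if t > 0:
            candidates.append((acc[t - 1, f], (t - 1, f)))
        _, (t, f) = min(candidates, key=lambda c: c[0])
        path.append((t, f))
    return path[::-1]
```

Each text token's start and end can then be read off as the first and last audio frame assigned to it along the path.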

Test plan

Verified on both CTC and LLM architectures using English and Japanese.

1. CTC model for word-level alignment

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load CTC model
pipeline = ASRInferencePipeline(model_card="omniASR_CTC_300M")

audio_files = ["IMDA_conversation.wav"] # English audio example
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2) # Request chunking at 40s intervals

print(transcriptions)
print(timestamps)

Output:

['and then we will need some documents from you...']
[[{'word': 'and', 'start': 8.21, 'end': 8.44}, {'word': 'then', 'start': 8.42, 'end': 8.77}, ...(TRUNCATED)...]]

2. LLM model and script detection

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load LLM model
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_300M")

audio_files = ["ado.wav"] # Japanese audio example
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2)

print(transcriptions)
print(timestamps)

Output:

['まずどんな勉強してたか...']
[[{'char': 'ま', 'start': 0.0, 'end': 0.32}, {'char': 'ず', 'start': 0.32, 'end': 0.42}, {'char': 'ど', 'start': 0.42, 'end': 0.54}, ...(TRUNCATED)...]] # Accurate character-level timestamps for CJK scripts via LLM attention

- Timestamps are extracted by default and can be collected as the second element of the tuple returned by `transcribe()`.
- Input audio is automagically chunked into 40 s windows. Users may adjust this with the `chunk_len` parameter.

## Why?
Other similar libraries, such as OpenAI's Whisper, have been extended to perform the same operations in response to community demand.

## How?
Per #3, the CTC models are already capable of mapping their outputs to time. Chunking is simple, and automating it takes the burden off the user, since most audio inputs will exceed 40 seconds.
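
As a rough illustration, converting CTC emission frames to seconds is a single multiplication by the encoder stride. A minimal sketch, assuming the ~20 ms stride (320 samples at 16 kHz) typical of wav2vec 2.0-style encoders rather than a value read from this repository:

```py
# Illustrative only: a 20 ms frame stride (320 samples at 16 kHz) is an
# assumption typical of wav2vec 2.0-style encoders, not a value taken from
# this repository.
FRAME_STRIDE_S = 320 / 16_000  # seconds of audio per emission frame

def frame_span_to_seconds(start_frame: int, end_frame: int) -> tuple[float, float]:
    # The end frame is inclusive, so extend it by one stride.
    return start_frame * FRAME_STRIDE_S, (end_frame + 1) * FRAME_STRIDE_S
```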

## Test plan
Run:
```py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_300M")

audio_files = ["ado.wav"]
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2)

print(transcriptions)
print(timestamps)
```

Output:
```
parameter load: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   423/423 100% 0:00:00
['まずどんな勉強してたか ...(TRUNCATED)...'] # Transcript is untouched.
[[{'char': 'ま', 'start': 0.0, 'end': 0.32}, {'char': 'ず', 'start': 0.32, 'end': 0.42}, {'char': 'ど', 'start': 0.42, 'end': 0.54}, {'char': 'ん', 'start': 0.54, 'end': 0.64}, {'char': 'な', 'start': 0.64, 'end': 0.68}, {'char': '勉', 'start': 0.68, 'end': 0.82}, {'char': '強', 'start': 0.82, 'end': 1.12}, {'char': 'し', 'start': 1.12, 'end': 1.2}, {'char': 'て', 'start': 1.2, 'end': 1.28}, {'char': 'た', 'start': 1.28, 'end': 1.4}, {'char': 'か', 'start': 1.4, 'end': 2.72}, ...(TRUNCATED)...]] # Accurate word-level timestamps for CTC models.
```

Run:
```py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_300M")  # non-CTC model

audio_files = ["ado.wav"]
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2)

print(transcriptions)
print(timestamps)
```

Output:
```
parameter load: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   535/535 100% 0:00:00
['まずどんな勉強してたか ...(TRUNCATED)...'] # Transcript is untouched.
[[]] # No timestamps for non-CTC models.
```
@urroxyz urroxyz requested a review from artemru as a code owner November 13, 2025 18:27
@meta-cla

meta-cla bot commented Nov 13, 2025

Hi @urroxyz!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@meta-cla meta-cla bot added the CLA Signed label Nov 13, 2025
@meta-cla

meta-cla bot commented Nov 13, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@urroxyz
Author

urroxyz commented Nov 17, 2025

@artemru

@seohyunjun

@urroxyz
Is it possible to receive the output in segment units as well?
The word-level chunks are too detailed to use as subtitles.
Thank you for developing such good code.

@urroxyz
Author

urroxyz commented Nov 18, 2025

@seohyunjun It all depends on how your boundaries are determined.

It's important to note that word-level approximations are derived from token-level alignments and grouped afterwards. Thus, any custom grouping is possible, but you need a way to establish where one group begins and the next ends.

Perhaps I could implement logic similar to VAD that detects significant pauses (or the like) and groups words based on those.
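
For example, a minimal sketch of that idea over word-level entries shaped like the example output earlier in this PR; the 0.5 s pause threshold and the function name are arbitrary assumptions:

```py
# Illustrative sketch: group word-level timestamps into segments at pauses.
# The 0.5 s threshold and the dict keys ("word", "start", "end") mirror the
# example output above but are otherwise assumptions.
def group_into_segments(words: list[dict], max_pause: float = 0.5) -> list[dict]:
    segments: list[dict] = []
    current: list[dict] = []
    for w in words:
        if current and w["start"] - current[-1]["end"] > max_pause:
            segments.append({
                "text": " ".join(x["word"] for x in current),
                "start": current[0]["start"],
                "end": current[-1]["end"],
            })
            current = []
        current.append(w)
    if current:
        segments.append({
            "text": " ".join(x["word"] for x in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })
    return segments
```

Anything fancier (punctuation restoration, VAD scores) could replace the simple pause test.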

Also...

Word-level timestamps are extraordinarily useful for subtitles. In fact, that is one of the reasons they exist. As someone who transcribes CC and translates subtitles professionally, word timings increase my productivity a lot. If I had only segment-level timestamps, and I wanted to move just one or two words to the next cue, I'd have to guess the timings or manually determine them. More granular definitions are always useful.

With that, I emphasize that you can definitely run your transcriptions through a punctuation restoration model and then regroup words into sentences based on Unicode information or a specialized tokenization pipeline. But I'll look into it nonetheless.

Let me know if you have any specific ideas!

…ipeline

- `transcribe()` now returns a tuple `(transcripts, timestamps)` instead of just `transcripts`. Timestamps contain start/end times and word/character units.
- Integrates the new `align_ctc` and `align_llm` modules to extract alignment data during inference for both architecture types.
- Adds the `chunk_len` parameter. Audio files exceeding the model's maximum duration (default 40s) are now automatically split, processed in batches, and stitched back together with corrected time offsets (see the offset sketch after this list).
- Switches the internal audio loading mechanism from a streaming `DataPipeline` to an eager loading sequence to allow for dynamic chunking and waveform access required for alignment, while preserving all original `fairseq2` preprocessing steps (decoding, resampling, normalization).
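
A minimal sketch of the offset correction mentioned above, assuming per-chunk timestamp lists shaped like the pipeline output shown earlier; the names and rounding are illustrative:

```py
# Illustrative sketch: shift per-chunk timestamps by each chunk's global offset
# and concatenate. `chunk_len` is in seconds; the structure of the entries
# mirrors the per-file output shown earlier.
def merge_chunk_timestamps(chunk_timestamps: list[list[dict]], chunk_len: float) -> list[dict]:
    merged: list[dict] = []
    for i, chunk in enumerate(chunk_timestamps):
        offset = i * chunk_len
        for entry in chunk:
            shifted = dict(entry)
            shifted["start"] = round(entry["start"] + offset, 2)
            shifted["end"] = round(entry["end"] + offset, 2)
            merged.append(shifted)
    return merged
```
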
…ction

- Implements logic to map greedy decoding paths to emission frame boundaries for `Wav2Vec2AsrModel` (a minimal sketch of this mapping follows below).
- Implements a Dynamic Time Warping (DTW) strategy for `Wav2Vec2LlamaModel`. It performs a secondary teacher-forced forward pass, hooks into `StandardMultiheadAttention` to capture cross-attention matrices, and aligns text tokens to audio frames monotonically.
- Adds language-agnostic heuristics using Unicode ranges to automatically determine if timestamps should be generated at the word level (whitespace-separated) or character level (CJK, Thai, Lao, Khmer, Myanmar).

(All logic is implemented using standard `numpy` and `torch` operations without introducing external dependencies like `scipy` or `librosa`.)
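
A minimal sketch of the greedy-path mapping from the first bullet above; the blank id, function name, and span representation are assumptions, and the actual module additionally interpolates text units between non-blank transitions:

```py
# Illustrative sketch of mapping a greedy CTC path to per-token frame spans:
# collapse repeated predictions and blanks, keeping the first and last frame
# at which each emitted token was active. `blank_id = 0` is an assumption.
def greedy_path_to_spans(frame_ids: list[int], blank_id: int = 0) -> list[tuple[int, int, int]]:
    spans: list[tuple[int, int, int]] = []  # (token_id, start_frame, end_frame)
    prev = blank_id
    for frame, tok in enumerate(frame_ids):
        if tok != blank_id and tok != prev:
            spans.append((tok, frame, frame))      # a new token begins
        elif tok != blank_id and tok == prev:
            tok_id, start, _ = spans[-1]
            spans[-1] = (tok_id, start, frame)     # extend the current token
        prev = tok
    return spans
```
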
@urroxyz urroxyz changed the title CTC timestamps and automagic chunking Timestamp extraction and automagic chunking Nov 18, 2025
@seohyunjun

@urroxyz

It seems that, unlike Whisper, this method doesn’t use time tokens to store the segment’s end time.

As you mentioned, it looks like I could infer silent sections with a VAD filter to separate segments, or handle sentence endings with a rule-based approach. Thank you for the idea. I’ve used a diarization model before to split segments, but I’ll need to test whether it will be effective in this case as well.

Even with just the code you provided, it has already been extremely helpful for the project I’m working on.

Thank you!!

@urroxyz
Author

urroxyz commented Nov 19, 2025

@seohyunjun Yes, it is unfortunate that the OmniASR models lack punctuation and native "time tokens" for boundary detection. I'm trying to see if there is a low-level hack to reconcile these things, but it's unlikely I'll discover anything reliable.

Glad that it's still a help, though. Let me know if you need any additional information or support for your project, I'm a student of computational linguistics and love all things ASR and subtitling!

@seohyunjun

@urroxyz

I work at a small CDN company and use ASR models for generating video subtitles. Our existing model is also based on wav2vec, but its trained languages were limited, so I was really excited to discover the recently released omnilingual-asr model. You may already know this, but I’d also recommend the nvidia/canary-1b-v2 model, which supports both segments and words.

I really appreciate that the new model comes with various size options and even provides the dataset. 🎉

It sounds like Meta is planning to offer word- and segment-level outputs in the future as well, so I’ll be waiting for that!

I’ll analyze the code you shared too. 👍

…atement

- Fix `transcribe_with_context` missing a return statement; the pipeline builder was defined but never executed.
- Fix MyPy error in `_process_context_audio` by casting input list to `AudioInput` before passing to `_build_audio_wavform_pipeline`.
- Remove unused imports (`typing.Optional`, `typing.Union`, `torchaudio`).
- Apply `isort` and `black` formatting to resolve linting errors.
- Fix MyPy errors accessing `pipeline.model` attributes by casting the model to `Wav2Vec2LlamaModel` inside `align_llm`.
- Add missing type annotations for `AttentionStore.weights`, `token_frames`, and `current_group`.
- Remove extensive trailing whitespace and blank lines causing flake8 failures.
- Sort imports to satisfy `isort`.
@urroxyz
Author

urroxyz commented Nov 24, 2025

@jeanm Sorry about all that. I'm new to GitHub. Should be good now.

@Fannovel16

@urroxyz Is it possible to do forced alignment (get timestamps from a ground-truth transcription) with your approach?

@urroxyz
Author

urroxyz commented Nov 25, 2025

@Fannovel16 Yes, it is. I was thinking of implementing a transcript parameter for transcribe() that would align the given text instead of generating a new one. But I thought this PR would get accepted sooner, so I haven't touched it much.

return []

# Narrow type for MyPy
model = cast(Wav2Vec2LlamaModel, pipeline.model)


I get `RuntimeError: 'Wav2Vec2LlamaModel' is not defined` from this line

Author

@urroxyz urroxyz Nov 29, 2025


It should now be fixed with the latest commit d65fe74.

@d-cota

d-cota commented Nov 28, 2025

Any ETA regarding the merge?

Wav2Vec2LlamaModel is imported inside a TYPE_CHECKING block. Since typing.cast evaluates its arguments at runtime, passing the class object directly causes a NameError/RuntimeError because the class is not defined in the runtime namespace.

This change switches to a string forward reference in `align.py`, which satisfies static analysis (MyPy) without triggering a runtime lookup failure.
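
A minimal sketch of that pattern; the import path and helper name below are assumptions:

```py
from typing import TYPE_CHECKING, cast

if TYPE_CHECKING:
    # Type-checking-only import; the class is not defined at runtime here.
    # NOTE: the module path is assumed for illustration.
    from omnilingual_asr.models.wav2vec2_llama import Wav2Vec2LlamaModel

def narrow_model(pipeline) -> "Wav2Vec2LlamaModel":
    # Passing the class name as a string keeps MyPy satisfied while avoiding a
    # runtime NameError, because cast() never evaluates its first argument.
    return cast("Wav2Vec2LlamaModel", pipeline.model)
```
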
@cirquit cirquit self-requested a review December 1, 2025 17:53
@cirquit
Contributor

cirquit commented Dec 1, 2025

Hi @urroxyz, thanks for this contribution! At first glance this looks nice. I've allowed the Lint & Test GitHub action; please try to fix the errors first, while I find the time to fully review your PR.

@urroxyz
Author

urroxyz commented Dec 3, 2025

@cirquit I'm getting would reformat annotations for line 0 in both files — not really sure what this means or how to fix it.

I'm also new to GitHub and don't use git locally. Do you have any guidance for me? I appreciate your patience.

@cirquit
Contributor

cirquit commented Dec 3, 2025

> @cirquit I'm getting would reformat annotations for line 0 in both files — not really sure what this means or how to fix it.
>
> I'm also new to GitHub and don't use git locally. Do you have any guidance for me? I appreciate your patience.

No worries! I meant the "Lint and Test" GitHub workflow that runs automatically on every commit. You can see it at the end of this page with a big red X next to it. Navigate to the "..." on the right and view the details, which show the concrete actions that ran; from there you can see the issues you need to address to satisfy mypy and black, the two linters that are complaining.

@seohyunjun

seohyunjun commented Dec 17, 2025

@urroxyz
Please check this PR:
Apply ruff format
urroxyz#1

@Teeeto

Teeeto commented Dec 18, 2025

Longer-form audio fails at `assert_max_length`. I believe this happens because the preprocessing pipeline (builder) is engaged before the internal chunking. After commenting out the length check in `assert_max_length`, everything worked.

@urroxyz urroxyz requested a review from Fannovel16 December 20, 2025 03:19
@urroxyz
Author

urroxyz commented Dec 20, 2025

So sorry for the mess this has been. Still new to contributing on GitHub and am not fully used to Git in general.

I ran the checks locally (black, mypy, and the Lint & Test workflow) and everything passes without error.

The assertion that blocked long-form audio has also been removed. Thanks for pointing it out, @Teeeto.

@rocety27

Any update on this?

