
Conversation

@urroxyz

@urroxyz urroxyz commented Nov 13, 2025

This PR introduces two requested features to the ASRInferencePipeline: accurate timestamp extraction (word/character level) and automatic audio chunking for long-form files. Closes #3.

Summary of changes

  • transcribe() now returns a tuple (transcripts, timestamps).
    • Uses frame boundary analysis from the emission logits for CTC models.
    • Uses a novel Dynamic Time Warping (DTW) approach on the decoder's Cross-Attention maps (Text Query $\to$ Audio Key) to enforce monotonic alignment and prevent attention collapse for LLM-based models.
    • Includes language-agnostic logic to automatically detect non-spaced scripts (CJK, Thai, Lao, Khmer, Myanmar) and switch from word-level to character-level timestamps (a sketch of this heuristic follows this list).
  • Input audio exceeding the model's limit of 40s is now automatically chunked. The chunk_len parameter allows users to define this window (default None preserves strict behavior, setting it enables chunking).
  • All implemented using pure torch and numpy without adding heavy dependencies like scipy or librosa.
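
For illustration, the non-spaced script detection can be thought of as a Unicode-range check. A minimal sketch, assuming a simple majority threshold; the function name, exact ranges, and threshold below are illustrative, not the PR's actual code:

```py
# Illustrative sketch only: the function name, Unicode ranges, and the 0.5
# threshold are assumptions, not the implementation in this PR.
def _is_non_spaced_script(text: str) -> bool:
    """Return True if most characters fall in scripts written without spaces."""
    ranges = [
        (0x4E00, 0x9FFF),   # CJK Unified Ideographs
        (0x3040, 0x30FF),   # Hiragana + Katakana
        (0xAC00, 0xD7AF),   # Hangul syllables
        (0x0E00, 0x0E7F),   # Thai
        (0x0E80, 0x0EFF),   # Lao
        (0x1780, 0x17FF),   # Khmer
        (0x1000, 0x109F),   # Myanmar
    ]
    letters = [ch for ch in text if not ch.isspace()]
    if not letters:
        return False
    hits = sum(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in letters)
    return hits / len(letters) > 0.5
```

Text that passes the check is timestamped per character; everything else falls back to whitespace-delimited words.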

Why?

This addresses an important utility gap compared to similar libraries like OpenAI's Whisper. Community demand for timestamp alignment and long-form transcription has been high.

Relevant @mentions: @ijean, @artemru, @d-cota, @huanglizhuo, @wikioai, @nikopartanen, @Aunali321, @marcelgoya, and some others.

How?

  • Created a new module to handle alignment strategies.
    • Maps greedy decoding paths to time frames and interpolates text units between non-blank token transitions for CTC.
    • Performs a secondary "teacher-forced" forward pass with hooks on StandardMultiheadAttention for LLMs. It captures QK similarity matrices, aggregates them across layers, and applies a custom NumPy-based DTW algorithm to find the optimal alignment path (see the DTW sketch after this list).
  • Updated transcribe to handle waveform loading and splitting based on chunk_len.
    • It processes chunks sequentially, aligns them individually, and recalculates global time offsets before returning the merged result.
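
For reference, a minimal NumPy sketch of such a monotonic DTW pass over an aggregated attention map; shapes, the cost construction, and names are illustrative assumptions, not the PR's actual code:

```py
import numpy as np

# Illustrative sketch of monotonic DTW over an aggregated cross-attention map.
# `attn` is assumed to be shaped (num_text_tokens, num_audio_frames); higher
# values mean stronger text-to-audio association.
def dtw_align(attn: np.ndarray) -> list[tuple[int, int]]:
    cost = -attn  # turn similarity into a cost to minimize
    T, F = cost.shape
    acc = np.full((T, F), np.inf)
    acc[0, 0] = cost[0, 0]
    for t in range(T):
        for f in range(F):
            if t == 0 and f == 0:
                continue
            prev = min(
                acc[t - 1, f - 1] if t > 0 and f > 0 else np.inf,  # advance both
                acc[t, f - 1] if f > 0 else np.inf,                # stay on token
                acc[t - 1, f] if t > 0 else np.inf,                # stay on frame
            )
            acc[t, f] = cost[t, f] + prev
    # Backtrack from the bottom-right corner to recover the alignment path.
    t, f = T - 1, F - 1
    path = [(t, f)]
    while t > 0 or f > 0:
        candidates = []
        if t > 0 and f > 0:
            candidates.append((acc[t - 1, f - 1], (t - 1, f - 1)))
        if f > 0:
            candidates.append((acc[t, f - 1], (t, f - 1)))
        if t > 0:
            candidates.append((acc[t - 1, f], (t - 1, f)))
        _, (t, f) = min(candidates, key=lambda c: c[0])
        path.append((t, f))
    return path[::-1]
```

Each text token's start and end can then be read off as the first and last audio frame assigned to it along the path.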

Test plan

Verified on both CTC and LLM architectures using English and Japanese.

1. CTC model for word-level alignment

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load CTC model
pipeline = ASRInferencePipeline(model_card="omniASR_CTC_300M")

audio_files = ["IMDA_conversation.wav"] # English audio example
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2) # Request chunking at 40s intervals

print(transcriptions)
print(timestamps)

Output:

['and then we will need some documents from you...']
[[{'word': 'and', 'start': 8.21, 'end': 8.44}, {'word': 'then', 'start': 8.42, 'end': 8.77}, ...(TRUNCATED)...]]

2. LLM model and script detection

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load LLM model
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_300M")

audio_files = ["ado.wav"] # Japanese audio example
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2)

print(transcriptions)
print(timestamps)

Output:

['まずどんな勉強してたか...']
[[{'char': 'ま', 'start': 0.0, 'end': 0.32}, {'char': 'ず', 'start': 0.32, 'end': 0.42}, {'char': 'ど', 'start': 0.42, 'end': 0.54}, ...(TRUNCATED)...]] # Accurate character-level timestamps for CJK scripts via LLM attention

- Timestamps are extracted by default and can be collected as the second element of the tuple returned by `transcribe()`.
- Input audio is automagically chunked into 40 s windows. Users may adjust this with the `chunk_len` parameter.

## Why?
Other similar libraries, such as OpenAI's Whisper, have been extended to perform the same operations in response to community demand.

## How?
Per #3, the CTC models are already capable of mapping their outputs to time. Chunking is simple, and automating it takes the burden off the user, since most audio inputs will exceed 40 seconds.
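
As a rough illustration, converting CTC emission frames to seconds is a single multiplication by the encoder stride. A minimal sketch, assuming the ~20 ms stride (320 samples at 16 kHz) typical of wav2vec 2.0-style encoders rather than a value read from this repository:

```py
# Illustrative only: a 20 ms frame stride (320 samples at 16 kHz) is an
# assumption typical of wav2vec 2.0-style encoders, not a value taken from
# this repository.
FRAME_STRIDE_S = 320 / 16_000  # seconds of audio per emission frame

def frame_span_to_seconds(start_frame: int, end_frame: int) -> tuple[float, float]:
    # The end frame is inclusive, so extend it by one stride.
    return start_frame * FRAME_STRIDE_S, (end_frame + 1) * FRAME_STRIDE_S
```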

## Test plan
Run:
```py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_300M")

audio_files = ["ado.wav"]
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2)

print(transcriptions)
print(timestamps)
```

Output:
```
parameter load: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   423/423 100% 0:00:00
['まずどんな勉強してたか ...(TRUNCATED)...'] # Transcript is untouched.
[[{'char': 'ま', 'start': 0.0, 'end': 0.32}, {'char': 'ず', 'start': 0.32, 'end': 0.42}, {'char': 'ど', 'start': 0.42, 'end': 0.54}, {'char': 'ん', 'start': 0.54, 'end': 0.64}, {'char': 'な', 'start': 0.64, 'end': 0.68}, {'char': '勉', 'start': 0.68, 'end': 0.82}, {'char': '強', 'start': 0.82, 'end': 1.12}, {'char': 'し', 'start': 1.12, 'end': 1.2}, {'char': 'て', 'start': 1.2, 'end': 1.28}, {'char': 'た', 'start': 1.28, 'end': 1.4}, {'char': 'か', 'start': 1.4, 'end': 2.72}, ...(TRUNCATED)...]] # Accurate word-level timestamps for CTC models.
```

Run:
```py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_300M")  # non-CTC model

audio_files = ["ado.wav"]
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2)

print(transcriptions)
print(timestamps)
```

Output:
```
parameter load: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   535/535 100% 0:00:00
['まずどんな勉強してたか ...(TRUNCATED)...'] # Transcript is untouched.
[[]] # No timestamps for non-CTC models.
```
@urroxyz urroxyz requested a review from artemru as a code owner November 13, 2025 18:27
@meta-cla

meta-cla bot commented Nov 13, 2025

Hi @urroxyz!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@meta-cla meta-cla bot added the CLA Signed label Nov 13, 2025
@meta-cla

meta-cla bot commented Nov 13, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@urroxyz
Author

urroxyz commented Nov 17, 2025

@artemru

@seohyunjun

@urroxyz
Is it possible to receive the output in segment units as well?
The word-level chunks are too detailed to use as subtitles.
Thank you for developing such good code.

@urroxyz
Author

urroxyz commented Nov 18, 2025

@seohyunjun It all depends on how your boundaries are determined.

It's important to note that word-level approximations are derived from token-level alignments and grouped afterwards. Thus, any custom grouping is possible, but you need a way to establish where one group begins and the next ends.

Perhaps I could implement logic similar to VAD that detects significant pauses (or the like) and groups words based on those.
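
For example, a minimal sketch of that idea over word-level entries shaped like the example output earlier in this PR; the 0.5 s pause threshold and the function name are arbitrary assumptions:

```py
# Illustrative sketch: group word-level timestamps into segments at pauses.
# The 0.5 s threshold and the dict keys ("word", "start", "end") mirror the
# example output above but are otherwise assumptions.
def group_into_segments(words: list[dict], max_pause: float = 0.5) -> list[dict]:
    segments: list[dict] = []
    current: list[dict] = []
    for w in words:
        if current and w["start"] - current[-1]["end"] > max_pause:
            segments.append({
                "text": " ".join(x["word"] for x in current),
                "start": current[0]["start"],
                "end": current[-1]["end"],
            })
            current = []
        current.append(w)
    if current:
        segments.append({
            "text": " ".join(x["word"] for x in current),
            "start": current[0]["start"],
            "end": current[-1]["end"],
        })
    return segments
```

Anything fancier (punctuation restoration, VAD scores) could replace the simple pause test.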

Also...

Word-level timestamps are extraordinarily useful for subtitles. In fact, that is one of the reasons they exist. As someone who transcribes CC and translates subtitles professionally, word timings increase my productivity a lot. If I had only segment-level timestamps, and I wanted to move just one or two words to the next cue, I'd have to guess the timings or manually determine them. More granular definitions are always useful.

With that, I emphasize that you can definitely run your transcriptions through a punctuation restoration model and then regroup words into sentences based on Unicode information or a specialized tokenization pipeline. But I'll look into it nonetheless.

Let me know if you have any specific ideas!

…ipeline

- `transcribe()` now returns a tuple `(transcripts, timestamps)` instead of just `transcripts`. Timestamps contain start/end times and word/character units.
- Integrates the new `align_ctc` and `align_llm` modules to extract alignment data during inference for both architecture types.
- Adds the `chunk_len` parameter. Audio files exceeding the model's maximum duration (default 40s) are now automatically split, processed in batches, and stitched back together with corrected time offsets (see the offset sketch after this list).
- Switches the internal audio loading mechanism from a streaming `DataPipeline` to an eager loading sequence to allow for dynamic chunking and waveform access required for alignment, while preserving all original `fairseq2` preprocessing steps (decoding, resampling, normalization).
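
A minimal sketch of the offset correction mentioned above, assuming per-chunk timestamp lists shaped like the pipeline output shown earlier; the names and rounding are illustrative:

```py
# Illustrative sketch: shift per-chunk timestamps by each chunk's global offset
# and concatenate. `chunk_len` is in seconds; the structure of the entries
# mirrors the per-file output shown earlier.
def merge_chunk_timestamps(chunk_timestamps: list[list[dict]], chunk_len: float) -> list[dict]:
    merged: list[dict] = []
    for i, chunk in enumerate(chunk_timestamps):
        offset = i * chunk_len
        for entry in chunk:
            shifted = dict(entry)
            shifted["start"] = round(entry["start"] + offset, 2)
            shifted["end"] = round(entry["end"] + offset, 2)
            merged.append(shifted)
    return merged
```
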
…ction

- Implements logic to map greedy decoding paths to emission frame boundaries for `Wav2Vec2AsrModel` (a minimal sketch of this mapping follows below).
- Implements a Dynamic Time Warping (DTW) strategy for `Wav2Vec2LlamaModel`. It performs a secondary teacher-forced forward pass, hooks into `StandardMultiheadAttention` to capture cross-attention matrices, and aligns text tokens to audio frames monotonically.
- Adds language-agnostic heuristics using Unicode ranges to automatically determine if timestamps should be generated at the word level (whitespace-separated) or character level (CJK, Thai, Lao, Khmer, Myanmar).

(All logic is implemented using standard `numpy` and `torch` operations without introducing external dependencies like `scipy` or `librosa`.)
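
A minimal sketch of the greedy-path mapping from the first bullet above; the blank id, function name, and span representation are assumptions, and the actual module additionally interpolates text units between non-blank transitions:

```py
# Illustrative sketch of mapping a greedy CTC path to per-token frame spans:
# collapse repeated predictions and blanks, keeping the first and last frame
# at which each emitted token was active. `blank_id = 0` is an assumption.
def greedy_path_to_spans(frame_ids: list[int], blank_id: int = 0) -> list[tuple[int, int, int]]:
    spans: list[tuple[int, int, int]] = []  # (token_id, start_frame, end_frame)
    prev = blank_id
    for frame, tok in enumerate(frame_ids):
        if tok != blank_id and tok != prev:
            spans.append((tok, frame, frame))      # a new token begins
        elif tok != blank_id and tok == prev:
            tok_id, start, _ = spans[-1]
            spans[-1] = (tok_id, start, frame)     # extend the current token
        prev = tok
    return spans
```
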
@urroxyz urroxyz changed the title CTC timestamps and automagic chunking Timestamp extraction and automagic chunking Nov 18, 2025
@seohyunjun

@urroxyz

It seems that, unlike Whisper, this method doesn’t use time tokens to store the segment’s end time.

As you mentioned, it looks like I could infer silent sections with a VAD filter to separate segments, or handle sentence endings with a rule-based approach. Thank you for the idea. I’ve used a diarization model before to split segments, but I’ll need to test whether it will be effective in this case as well.

Even with just the code you provided, it has already been extremely helpful for the project I’m working on.

Thank you!!

@urroxyz
Author

urroxyz commented Nov 19, 2025

@seohyunjun Yes, it is unfortunate that the OmniASR models lack punctuation and native "time tokens" for boundary detection. I'm trying to see if there is a low-level hack to reconcile these things, but it's unlikely I'll discover anything reliable.

Glad that it's still a help, though. Let me know if you need any additional information or support for your project, I'm a student of computational linguistics and love all things ASR and subtitling!

@seohyunjun

@urroxyz

I work at a small CDN company and use ASR models for generating video subtitles. Our existing model is also based on wav2vec, but its trained languages were limited, so I was really excited to discover the recently released omnilingual-asr model. You may already know this, but I’d also recommend the nvidia/canary-1b-v2 model, which supports both segments and words.

I really appreciate that the new model comes with various size options and even provides the dataset. 🎉

It sounds like Meta is planning to offer word- and segment-level outputs in the future as well, so I’ll be waiting for that!

I’ll analyze the code you shared too. 👍

…atement

- Fix `transcribe_with_context` missing a return statement; the pipeline builder was defined but never executed.
- Fix MyPy error in `_process_context_audio` by casting input list to `AudioInput` before passing to `_build_audio_wavform_pipeline`.
- Remove unused imports (`typing.Optional`, `typing.Union`, `torchaudio`).
- Apply `isort` and `black` formatting to resolve linting errors.
- Fix MyPy errors accessing `pipeline.model` attributes by casting the model to `Wav2Vec2LlamaModel` inside `align_llm`.
- Add missing type annotations for `AttentionStore.weights`, `token_frames`, and `current_group`.
- Remove extensive trailing whitespace and blank lines causing flake8 failures.
- Sort imports to satisfy `isort`.
@urroxyz
Author

urroxyz commented Nov 24, 2025

@jeanm Sorry about all that. I'm new to GitHub. Should be good now.

@Fannovel16

@urroxyz Is it possible to do forced alignment (get timestamps from a ground-truth transcription) with your approach?

@urroxyz
Author

urroxyz commented Nov 25, 2025

@Fannovel16 Yes, it is. I was thinking of implementing a transcript parameter for transcribe() that would align the given text instead of generating a new one. But I thought this PR would get accepted sooner, so I haven't touched it much.

return []

# Narrow type for MyPy
model = cast(Wav2Vec2LlamaModel, pipeline.model)


I get `RuntimeError: 'Wav2Vec2LlamaModel' is not defined` from this line

Author

@urroxyz urroxyz Nov 29, 2025


It should now be fixed with the latest commit d65fe74.

@d-cota

d-cota commented Nov 28, 2025

Any ETA regarding the merge?

Wav2Vec2LlamaModel is imported inside a TYPE_CHECKING block. Since typing.cast evaluates its arguments at runtime, passing the class object directly causes a NameError/RuntimeError because the class is not defined in the runtime namespace.

This change switches to a string forward reference in `align.py`, which satisfies static analysis (MyPy) without triggering a runtime lookup failure.
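
A minimal sketch of that pattern; the import path and helper name below are assumptions:

```py
from typing import TYPE_CHECKING, cast

if TYPE_CHECKING:
    # Type-checking-only import; the class is not defined at runtime here.
    # NOTE: the module path is assumed for illustration.
    from omnilingual_asr.models.wav2vec2_llama import Wav2Vec2LlamaModel

def narrow_model(pipeline) -> "Wav2Vec2LlamaModel":
    # Passing the class name as a string keeps MyPy satisfied while avoiding a
    # runtime NameError, because cast() never evaluates its first argument.
    return cast("Wav2Vec2LlamaModel", pipeline.model)
```
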
@cirquit cirquit self-requested a review December 1, 2025 17:53
@cirquit
Contributor

cirquit commented Dec 1, 2025

Hi @urroxyz, thanks for this contribution! At first glance this looks nice. I've allowed the Lint & Test GitHub action; please try to fix the errors first, while I find the time to fully review your PR.

@urroxyz
Author

urroxyz commented Dec 3, 2025

@cirquit I'm getting would reformat annotations for line 0 in both files — not really sure what this means or how to fix it.

I'm also new to GitHub and don't use git locally. Do you have any guidance for me? I appreciate your patience.

@cirquit
Contributor

cirquit commented Dec 3, 2025

> @cirquit I'm getting would reformat annotations for line 0 in both files — not really sure what this means or how to fix it.
>
> I'm also new to GitHub and don't use git locally. Do you have any guidance for me? I appreciate your patience.

No worries! I meant the "Lint and Test" GitHub workflow that runs automatically on every commit. You can see it at the end of this page with a big red X next to it. Navigate to the "..." on the right and view the details, which show the concrete actions that ran; from there you can see the issues you need to address to satisfy mypy and black, the two linters that are complaining.

@seohyunjun

seohyunjun commented Dec 17, 2025

@urroxyz
Please check this PR:
Apply ruff format
urroxyz#1

@Teeeto

Teeeto commented Dec 18, 2025

Longer-form audio fails at `assert_max_length`. I believe this happens because the preprocessing pipeline (builder) is engaged before the internal chunking. After commenting out the length check in `assert_max_length`, everything worked.

@urroxyz urroxyz requested a review from Fannovel16 December 20, 2025 03:19
@urroxyz
Author

urroxyz commented Dec 20, 2025

So sorry for the mess this has been. Still new to contributing on GitHub and am not fully used to Git in general.

I ran the checks locally (black, mypy, and the Lint & Test workflow) and everything passes without error.

The assertion that blocked long-form audio has also been removed. Thanks for pointing it out, @Teeeto.

@rocety27

Any update on this?

