Timestamp extraction and automagic chunking #22
Conversation
- Timestamps are extracted by default and can be collected as a second return value from the `transcribe()` call.
- Input audio is automagically chunked into groups of 40 s. The user may adjust this with the `chunk_len` parameter.

## Why?

Other similar libraries, such as OpenAI's Whisper, have been modified by the community to perform the same operations, reflecting strong demand.

## How?

Per #3, the CTC models are already capable of mapping their outputs to time. Chunking is simple, and automating it takes the burden off the user, since most audio inputs will exceed 40 seconds.

## Test plan

Run:

```py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_300M")
audio_files = ["ado.wav"]
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2)

print(transcriptions)
print(timestamps)
```

Output:

```
parameter load: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 423/423 100% 0:00:00
['まずどんな勉強してたか ...(TRUNCATED)...'] # Transcript is untouched.
[[{'char': 'ま', 'start': 0.0, 'end': 0.32}, {'char': 'ず', 'start': 0.32, 'end': 0.42}, {'char': 'ど', 'start': 0.42, 'end': 0.54}, {'char': 'ん', 'start': 0.54, 'end': 0.64}, {'char': 'な', 'start': 0.64, 'end': 0.68}, {'char': '勉', 'start': 0.68, 'end': 0.82}, {'char': '強', 'start': 0.82, 'end': 1.12}, {'char': 'し', 'start': 1.12, 'end': 1.2}, {'char': 'て', 'start': 1.2, 'end': 1.28}, {'char': 'た', 'start': 1.28, 'end': 1.4}, {'char': 'か', 'start': 1.4, 'end': 2.72}, ...(TRUNCATED)...]] # Accurate word-level timestamps for CTC models.
```

Run:

```py
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_CTC_300M")
audio_files = ["ado.wav"]
transcriptions, timestamps = pipeline.transcribe(audio_files, chunk_len=40, batch_size=2)

print(transcriptions)
print(timestamps)
```

Output:

```
parameter load: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 535/535 100% 0:00:00
['まずどんな勉強してたか ...(TRUNCATED)...'] # Transcript is untouched.
[[]] # No timestamps for non-CTC models.
```
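For readers new to the approach, here is a minimal sketch of the kind of fixed-window chunking described above. The function name `chunk_waveform` and the 16 kHz mono-waveform assumption are illustrative, not part of the pipeline's API.

```py
import torch


def chunk_waveform(waveform: torch.Tensor, sample_rate: int = 16_000,
                   chunk_len: float = 40.0) -> list[tuple[torch.Tensor, float]]:
    """Split a mono waveform into fixed-length windows.

    Returns (chunk, start_offset_seconds) pairs so that timestamps computed
    inside each chunk can later be shifted back into the coordinate system
    of the full recording.
    """
    samples_per_chunk = int(chunk_len * sample_rate)
    chunks: list[tuple[torch.Tensor, float]] = []
    for start in range(0, waveform.numel(), samples_per_chunk):
        chunk = waveform[start : start + samples_per_chunk]
        chunks.append((chunk, start / sample_rate))
    return chunks
```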
Hi @urroxyz! Thank you for your pull request and welcome to our community.

**Action Required:** In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process:** In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations, and the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
@urroxyz |
@seohyunjun It all depends on how your boundaries are determined. It's important to note that word-level approximations derive from associations internal to the tokens and are grouped afterwards. Thus, any unique classification is possible, but you need a way to establish where one group begins and the next ends. Perhaps I could implement logic similar to VAD that reveals significant pauses (or the like) and groups based on that.

Also, word-level timestamps are extraordinarily useful for subtitles; in fact, that is one of the reasons they exist. As someone who transcribes closed captions and translates subtitles professionally, word timings increase my productivity a lot. If I had only segment-level timestamps and wanted to move just one or two words to the next cue, I'd have to guess the timings or determine them manually. More granular definitions are always useful.

With that said, you can definitely run your transcriptions through a punctuation restoration model and then regroup words into sentences based on Unicode information or a specialized tokenization pipeline. But I'll look into it nonetheless. Let me know if you have any specific ideas!
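As a concrete illustration of the pause-based grouping idea mentioned above, here is a minimal sketch that splits the pipeline's word/character timestamp entries into segments wherever the silence between consecutive units exceeds a threshold. The function name and the 0.5 s gap are illustrative, not part of this PR.

```py
def group_by_pauses(units: list[dict], max_gap: float = 0.5) -> list[list[dict]]:
    """Group word/character timestamps into segments wherever the gap
    between consecutive units exceeds `max_gap` seconds."""
    segments: list[list[dict]] = []
    current: list[dict] = []
    for unit in units:
        if current and unit["start"] - current[-1]["end"] > max_gap:
            segments.append(current)
            current = []
        current.append(unit)
    if current:
        segments.append(current)
    return segments
```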
…ipeline

- `transcribe()` now returns a tuple `(transcripts, timestamps)` instead of just `transcripts`. Timestamps contain start/end times and word/character units.
- Integrates the new `align_ctc` and `align_llm` modules to extract alignment data during inference for both architecture types.
- Adds the `chunk_len` parameter. Audio files exceeding the model's maximum duration (default 40 s) are now automatically split, processed in batches, and stitched back together with corrected time offsets (see the sketch below).
- Switches the internal audio loading mechanism from a streaming `DataPipeline` to an eager loading sequence to allow the dynamic chunking and waveform access required for alignment, while preserving all original `fairseq2` preprocessing steps (decoding, resampling, normalization).
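The offset correction mentioned in the chunking bullet can be pictured roughly as follows. The function name and the `(offset, units)` pairing are illustrative assumptions, not the PR's internal types.

```py
def merge_chunk_timestamps(chunk_results: list[tuple[float, list[dict]]]) -> list[dict]:
    """Shift each chunk's local timestamps by that chunk's start offset
    (in seconds) and concatenate them into a single global timeline."""
    merged: list[dict] = []
    for offset, units in chunk_results:
        for unit in units:
            merged.append({**unit,
                           "start": unit["start"] + offset,
                           "end": unit["end"] + offset})
    return merged
```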
…ction

- Implements logic to map greedy decoding paths to emission frame boundaries for `Wav2Vec2AsrModel`.
- Implements a Dynamic Time Warping (DTW) strategy for `Wav2Vec2LlamaModel`. It performs a secondary teacher-forced forward pass, hooks into `StandardMultiheadAttention` to capture cross-attention matrices, and aligns text tokens to audio frames monotonically.
- Adds language-agnostic heuristics using Unicode ranges to automatically determine whether timestamps should be generated at the word level (whitespace-separated) or the character level (CJK, Thai, Lao, Khmer, Myanmar); a sketch of this kind of check follows below.

(All logic is implemented using standard `numpy` and `torch` operations without introducing external dependencies like `scipy` or `librosa`.)
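Here is a rough sketch of one way such a script check could look, using Unicode character names rather than raw code-point ranges. The 50% threshold and the function name are illustrative assumptions, not the PR's exact heuristic.

```py
import unicodedata

# Script name prefixes that are typically written without spaces, so
# character-level timestamps are more useful than word-level ones.
_NO_SPACE_SCRIPTS = ("CJK", "HIRAGANA", "KATAKANA", "THAI", "LAO", "KHMER", "MYANMAR")


def use_char_level(text: str) -> bool:
    """Heuristic: emit character-level timestamps when most characters
    belong to scripts that do not separate words with whitespace."""
    hits = sum(
        1 for ch in text
        if ch.strip() and unicodedata.name(ch, "").startswith(_NO_SPACE_SCRIPTS)
    )
    letters = sum(1 for ch in text if ch.strip())
    return letters > 0 and hits / letters > 0.5
```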
It seems that, unlike Whisper, this method doesn't use time tokens to store the segment's end time. As you mentioned, it looks like I could infer silent sections with a VAD filter to separate segments, or handle sentence endings with a rule-based approach. Thank you for the idea. I've used a diarization model before to split segments, but I'll need to test whether it will be effective in this case as well. Even with just the code you provided, it has already been extremely helpful for the project I'm working on. Thank you!!
@seohyunjun Yes, it is unfortunate that the OmniASR models lack punctuation and native "time tokens" for boundary detection. I'm trying to see if there is a low-level hack to reconcile these things, but it's unlikely I'll discover anything reliable. Glad that it's still a help, though. Let me know if you need any additional information or support for your project; I'm a student of computational linguistics and love all things ASR and subtitling!
I work at a small CDN company and use ASR models for generating video subtitles. Our existing model is also based on wav2vec, but its trained languages were limited, so I was really excited to discover the recently released omnilingual-asr model. You may already know this, but I'd also recommend the nvidia/canary-1b-v2 model, which supports both segment- and word-level output. I really appreciate that the new model comes with various size options and even provides the dataset. 🎉 It sounds like Meta is planning to offer word- and segment-level outputs in the future as well, so I'll be waiting for that! I'll analyze the code you shared too. 👍
…atement

- Fix `transcribe_with_context` missing a return statement; the pipeline builder was defined but never executed.
- Fix a MyPy error in `_process_context_audio` by casting the input list to `AudioInput` before passing it to `_build_audio_wavform_pipeline`.
- Remove unused imports (`typing.Optional`, `typing.Union`, `torchaudio`).
- Apply `isort` and `black` formatting to resolve linting errors.
- Fix MyPy errors when accessing `pipeline.model` attributes by casting the model to `Wav2Vec2LlamaModel` inside `align_llm`.
- Add missing type annotations for `AttentionStore.weights`, `token_frames`, and `current_group`.
- Remove extensive trailing whitespace and blank lines causing flake8 failures.
- Sort imports to satisfy `isort`.
@jeanm Sorry about all that. I'm new to GitHub. Should be good now.
@urroxyz Is it possible to do forced alignment (getting timestamps from a ground-truth transcription) with your approach?
@Fannovel16 Yes, it is. I was thinking of implementing a
The review comments below refer to these lines in the diff:

```py
return []

# ...

# Narrow type for MyPy
model = cast(Wav2Vec2LlamaModel, pipeline.model)
```
I get `RuntimeError: 'Wav2Vec2LlamaModel' is not defined` from this line.
It should now be fixed with the latest commit d65fe74.
Any ETA regarding the merge?
`Wav2Vec2LlamaModel` is imported inside a `TYPE_CHECKING` block. Since `typing.cast` evaluates its arguments at runtime, passing the class object directly causes a `NameError`/`RuntimeError` because the class is not defined in the runtime namespace. This change switches to a string forward reference in `align.py`, which satisfies static analysis (MyPy) without triggering a runtime lookup failure.
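A minimal sketch of the pattern this commit describes; the import path and the `narrow_model` helper are placeholders, not the actual layout of `align.py`.

```py
from typing import TYPE_CHECKING, Any, cast

if TYPE_CHECKING:
    # Placeholder import path: only the type checker ever evaluates this.
    from omnilingual_asr.models import Wav2Vec2LlamaModel


def narrow_model(model: Any) -> "Wav2Vec2LlamaModel":
    # cast(Wav2Vec2LlamaModel, model) would look the class up at runtime and
    # raise NameError, because the import above is skipped when running.
    # A string forward reference is resolved by MyPy only, so at runtime the
    # call is just a no-op that returns `model`.
    return cast("Wav2Vec2LlamaModel", model)
```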
Hi @urroxyz, thanks for this contribution! From a first glance, this looks nice. I've allowed the lint & test GitHub Action; please try to fix the errors first, until I find the time to fully review your PR.
@cirquit I'm getting … I'm also new to GitHub and don't use …
No worries! I meant the "Lint and Test" GitHub workflow that runs automatically on every commit. You can see it at the end of this page, with a big red X next to it. Navigate to the "..." on the right and view the details, which list the concrete actions that ran; there you can see the issues you need to address to align with
@urroxyz
Longer-form audio fails at `assert_max_length`. I believe this happens because the preprocessing pipeline (builder) is engaged before internal chunking. After commenting out the length check in `assert_max_length`, everything worked.
So sorry for the mess this has been. I'm still new to contributing on GitHub and not fully used to Git in general. I ran the tests locally. The long-form audio assert blockage has also been removed. Thanks for pointing it out, @Teeeto.
Any update on this?
This PR introduces two requested features to the `ASRInferencePipeline`: accurate timestamp extraction (word/character level) and automatic audio chunking for long-form files. Closes #3.

## Summary of changes

- `transcribe()` now returns a tuple `(transcripts, timestamps)`.
- The `chunk_len` parameter allows users to define the chunking window (default `None` preserves strict behavior; setting it enables chunking).
- Implemented with `torch` and `numpy`, without adding heavy dependencies like `scipy` or `librosa`.

## Why?

This addresses an important utility gap compared to similar libraries like OpenAI's Whisper. Community demand for timestamp alignment and long-form transcription has been high.

Relevant @mentions: @ijean, @artemru, @d-cota, @huanglizhuo, @wikioai, @nikopartanen, @Aunali321, @marcelgoya, and some others.

## How?

- Hooks into `StandardMultiheadAttention` for LLMs. It captures QK similarity matrices, aggregates them across layers, and applies a custom NumPy-based DTW algorithm to find the optimal alignment path (sketched below).
- Updates `transcribe` to handle waveform loading and splitting based on `chunk_len`.
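The DTW step above can be pictured with a small NumPy sketch over a (tokens × frames) similarity matrix. This is an illustrative dynamic-programming alignment, assuming each frame maps to exactly one token and tokens advance monotonically; it is not the PR's exact implementation.

```py
import numpy as np


def monotonic_dtw_path(similarity: np.ndarray) -> list[tuple[int, int]]:
    """Find a monotonic (token, frame) alignment path through a
    (tokens x frames) similarity matrix via dynamic programming."""
    cost = -similarity                     # higher similarity -> lower cost
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens, n_frames), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n_tokens):
        for j in range(1, n_frames):
            best_prev = acc[i, j - 1]                          # stay on the same token
            if i > 0:
                best_prev = min(best_prev, acc[i - 1, j - 1])  # advance one token
            acc[i, j] = cost[i, j] + best_prev
    # Backtrack from the last token/frame pair to recover the path.
    i, j = n_tokens - 1, n_frames - 1
    path = [(i, j)]
    while j > 0:
        if i > 0 and acc[i - 1, j - 1] < acc[i, j - 1]:
            i -= 1
        j -= 1
        path.append((i, j))
    return path[::-1]
```

Frame indices on the path can then be converted to seconds with the model's frame stride, and the first and last frames assigned to a token give that token's start and end.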
## Test plan

Verified on both CTC and LLM architectures using English and Japanese.

1. CTC model for word-level alignment

   Output:

2. LLM model and script detection

   Output: