Thanks for open-sourcing this project — it's been very helpful!
Currently the Whisper implementation uses `<|notimestamps|>` and filters out all special tokens, so `TimestampedResult.timestamps` is always `None`.
For segment-level timestamps, it seems like removing `<|notimestamps|>` and parsing `<|xx.xx|>` tokens would be a straightforward fix.
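
A minimal sketch of the segment-level idea, assuming the decoder output is available as a string with the `<|xx.xx|>` tokens left in and the other special tokens already stripped. The name `parse_segments` and the `(start, end, text)` tuple shape are hypothetical and not part of this project's API. As far as I understand, the times are relative to the current 30-second window, so a chunk offset would still need to be added.

```python
import re
from typing import List, Tuple

# Matches Whisper-style timestamp tokens such as <|0.00|> or <|12.34|>.
TIMESTAMP_RE = re.compile(r"<\|(\d+\.\d+)\|>")


def parse_segments(decoded: str) -> List[Tuple[float, float, str]]:
    """Split decoded text into (start, end, text) segments.

    Hypothetical helper: assumes timestamp tokens come in start/end pairs
    around each piece of text, which is how Whisper normally emits them.
    """
    segments: List[Tuple[float, float, str]] = []
    start = None      # start time of the segment currently being read
    text_start = 0    # index in `decoded` where the current text begins
    for match in TIMESTAMP_RE.finditer(decoded):
        t = float(match.group(1))
        if start is None:
            start = t
        else:
            text = decoded[text_start:match.start()].strip()
            if text:
                segments.append((start, t, text))
                start = None
            else:
                # Two timestamps with no text between them:
                # treat the later one as the new segment start.
                start = t
        text_start = match.end()
    return segments


segments = parse_segments("<|0.00|> Hello world.<|2.50|><|2.50|> How are you?<|4.80|>")
```

A trailing start token without a matching end token (text cut off at the window boundary) is dropped here; a real implementation would probably want to close that segment at the end of the window instead.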
For word-level timestamps, I'm not sure what the best approach would be — would re-exporting the ONNX model with cross-attention weights and applying DTW be a viable path? Or is there a better way to achieve this?
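
To make the DTW part of the question concrete, here is a rough sketch of the alignment step only. It assumes a cross-attention matrix of shape tokens x audio frames has already been obtained from the re-exported model; how to export it, which heads to use, and how frames map to seconds are all open questions here. The names `dtw_path` and `token_start_frames` are hypothetical, and the cost is simply the negated attention weight.

```python
from typing import List, Tuple


def dtw_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the minimum-cost monotonic path through a tokens x frames cost matrix."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = best accumulated cost of a path ending at cell (i-1, j-1).
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrack from the last cell to the first, following the cheapest predecessor.
    i, j = n, m
    path: List[Tuple[int, int]] = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def token_start_frames(path: List[Tuple[int, int]]) -> List[int]:
    """First audio frame aligned to each token along the path."""
    starts: List[int] = []
    for token, frame in path:
        if token == len(starts):
            starts.append(frame)
    return starts


# Toy attention matrix (2 tokens x 4 frames); values are made up for illustration.
attention = [[0.9, 0.8, 0.2, 0.1],
             [0.1, 0.3, 0.9, 0.8]]
cost = [[-w for w in row] for row in attention]
path = dtw_path(cost)
starts = token_start_frames(path)
```

Multiplying each start frame by the frame duration would give token times, and tokens would then need to be merged into words. Whether this is accurate enough without the head selection and filtering that other implementations apply is something I'm unsure about.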
If you could point me in the right direction, I'd be happy to give it a try and submit a PR.