Summary
The frame-level output array model_output["note"] (shape (T, 88)) has a growing temporal misalignment between frame index and original audio position. By the end of a ~530-second file the drift reaches −2.7 seconds. The note-level output (note_events, midi_data) is not affected because it is decoded from seconds-stamped onset/offset detections rather than from the raw frame array.
Who is affected
Anyone using model_output["note"] (or "onset", "contour") as a frame-level feature — e.g. for multi-pitch estimation evaluation, training a downstream model, or aligning with a piano-roll ground truth — will observe results that degrade over the course of the track. Users who only use the decoded note_events / MIDI output are unaffected.
Root Cause
In basic_pitch/inference.py, audio is split into overlapping windows of AUDIO_N_SAMPLES = 43844 samples with overlap_len = 30 × FFT_HOP = 7680 samples. After inference, n_overlapping_frames // 2 = 15 frames are stripped from each side of each window, leaving:
kept = ANNOT_N_FRAMES - 2 × 15 = 172 - 30 = 142 frames/window
hop_size = AUDIO_N_SAMPLES - overlap_len = 43844 - 7680 = 36164 samples
For the concatenated frame array to be temporally consistent, the hop in samples must equal exactly kept × FFT_HOP. But:
kept × FFT_HOP = 142 × 256 = 36352 samples
hop_size = 36164 samples
mismatch = −188 samples/window ← bug
This means every window boundary introduces a −188-sample drift between what frame index F implies (F × FFT_HOP samples into the audio) and where that frame actually came from. Over 321 windows (527-second file) the error accumulates to ~60 000 samples ≈ 2.7 seconds.
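The accumulation can be checked with a few lines of arithmetic. The sketch below hardcodes the constants quoted in this issue so it runs without basic_pitch installed. AUDIO_SAMPLE_RATE = 22050 is an assumption implied by the figures above (60 000 samples ≈ 2.7 s), and accumulated_drift_seconds is a hypothetical helper, not part of the library.

```python
# Constants as quoted in this issue. AUDIO_SAMPLE_RATE is an assumption
# implied by "60 000 samples ≈ 2.7 seconds".
AUDIO_SAMPLE_RATE = 22050
FFT_HOP = 256
AUDIO_N_SAMPLES = 43844
ANNOT_N_FRAMES = 172
N_OVERLAPPING_FRAMES = 30

hop_size = AUDIO_N_SAMPLES - N_OVERLAPPING_FRAMES * FFT_HOP  # 36164 samples
kept = ANNOT_N_FRAMES - N_OVERLAPPING_FRAMES                 # 142 frames
drift_per_window = hop_size - kept * FFT_HOP                 # -188 samples


def accumulated_drift_seconds(duration_s: float) -> float:
    """Approximate drift (in seconds) at the end of a file of this duration.

    Hypothetical helper: counts whole window hops that fit in the file and
    multiplies by the per-window mismatch.
    """
    n_windows = int(duration_s * AUDIO_SAMPLE_RATE) // hop_size
    return n_windows * drift_per_window / AUDIO_SAMPLE_RATE


# 321 full hops fit in 527 s, giving roughly -2.74 s of drift.
print(f"{accumulated_drift_seconds(527.0):.2f} s")
```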
To Reproduce
import basic_pitch.constants as C
n_overlapping_frames = 30
overlap_len = n_overlapping_frames * C.FFT_HOP # 7680
hop_size = C.AUDIO_N_SAMPLES - overlap_len # 36164
kept = C.ANNOT_N_FRAMES - n_overlapping_frames # 142
print(f"hop_size = {hop_size}") # 36164
print(f"kept × HOP = {kept * C.FFT_HOP}") # 36352 ← mismatch
print(f"drift/win = {hop_size - kept * C.FFT_HOP} samples") # −188
Expected behavior
unwrap_output should document the per-window drift and expose a utility to convert frame indices to original-audio sample positions.
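A minimal sketch of what such a utility could look like is given below. The names frame_to_sample and frame_to_seconds are hypothetical and do not exist in basic_pitch. The mapping rests on two assumptions: a frame at index i inside a window sits i × FFT_HOP samples after the window start, and the 15 frames stripped from the front of each window (15 × 256 = 3840 samples) cancel exactly against a front padding of overlap_len / 2 = 3840 samples. AUDIO_SAMPLE_RATE = 22050 is also assumed.

```python
# Constants as quoted in this issue; AUDIO_SAMPLE_RATE is an assumption.
AUDIO_SAMPLE_RATE = 22050
FFT_HOP = 256
AUDIO_N_SAMPLES = 43844
ANNOT_N_FRAMES = 172
N_OVERLAPPING_FRAMES = 30

HOP_SIZE = AUDIO_N_SAMPLES - N_OVERLAPPING_FRAMES * FFT_HOP  # 36164 samples
KEPT = ANNOT_N_FRAMES - N_OVERLAPPING_FRAMES                 # 142 frames


def frame_to_sample(frame_idx: int) -> int:
    """Map an index into the concatenated frame array to a sample position
    in the original audio (hypothetical helper, see assumptions above)."""
    window, offset = divmod(frame_idx, KEPT)
    # Each window advances by HOP_SIZE samples, not by KEPT * FFT_HOP.
    return window * HOP_SIZE + offset * FFT_HOP


def frame_to_seconds(frame_idx: int) -> float:
    """Same mapping, expressed in seconds."""
    return frame_to_sample(frame_idx) / AUDIO_SAMPLE_RATE


# The first frame of the second window sits 188 samples earlier than the
# naive frame_idx * FFT_HOP estimate suggests.
print(142 * FFT_HOP - frame_to_sample(142))
```

Under these assumptions the naive estimate F × FFT_HOP overshoots the true position by 188 × (F // 142) samples, which matches the per-window mismatch derived in the root-cause section.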
Screenshots
These two figures show the ground truth (green), the prediction (red), and the corrected prediction (blue) for a test sample. In the first figure, which covers 0 s to 30 s, the ground truth and the prediction are roughly aligned. In the second figure, they are clearly misaligned.
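This report does not say how the corrected prediction was produced. One possible approach, sketched here under stated assumptions, is to resample the drifted frame array onto a uniform FFT_HOP grid by picking, for each grid position, the source frame whose true position is nearest. The helper names are hypothetical, the constants are the ones quoted in this issue, and the frame-position model assumes each window advances by hop_size samples with frames FFT_HOP samples apart inside a window.

```python
# Constants as quoted in this issue.
FFT_HOP = 256
AUDIO_N_SAMPLES = 43844
ANNOT_N_FRAMES = 172
N_OVERLAPPING_FRAMES = 30

HOP_SIZE = AUDIO_N_SAMPLES - N_OVERLAPPING_FRAMES * FFT_HOP  # 36164 samples
KEPT = ANNOT_N_FRAMES - N_OVERLAPPING_FRAMES                 # 142 frames


def frame_to_sample(frame_idx: int) -> int:
    """Assumed true sample position of a frame in the concatenated array."""
    window, offset = divmod(frame_idx, KEPT)
    return window * HOP_SIZE + offset * FFT_HOP


def nearest_source_frame(sample_pos: int, n_frames: int) -> int:
    """Index of the source frame whose assumed position is closest."""
    window, rem = divmod(sample_pos, HOP_SIZE)
    candidate = window * KEPT + min(round(rem / FFT_HOP), KEPT - 1)
    candidate = min(candidate, n_frames - 1)
    nxt = candidate + 1  # may be the first frame of the next window
    if nxt < n_frames and abs(frame_to_sample(nxt) - sample_pos) < abs(
        frame_to_sample(candidate) - sample_pos
    ):
        return nxt
    return candidate


def uniform_index_map(n_frames: int) -> list[int]:
    """For each uniform grid frame k (at k * FFT_HOP samples), the index of
    the drifted source frame to copy from."""
    n_out = frame_to_sample(n_frames - 1) // FFT_HOP + 1
    return [nearest_source_frame(k * FFT_HOP, n_frames) for k in range(n_out)]


index_map = uniform_index_map(300)
print(index_map[140:145])
```

With NumPy, the returned list could then be used as an integer index along the time axis, for example note[index_map], to obtain an array whose row k corresponds to k × FFT_HOP samples. The output is slightly shorter than the input because the drifted frames cover less audio than their count implies.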
Environment
basic-pitch 0.4.0
Python 3.10
Additional information
Issue written by Claude and manually verified :)