Frame-level model_output["note"] array has growing temporal drift due to hop_size ≠ kept_frames × FFT_HOP #190

@suncerock

Summary
The frame-level output array model_output["note"] (shape (T, 88)) has a growing temporal misalignment between frame index and original audio position. By the end of a ~530-second file the drift reaches −2.7 seconds. The note-level output (note_events, midi_data) is not affected because it is decoded from seconds-stamped onset/offset detections rather than from the raw frame array.

Who is affected
Anyone using model_output["note"] (or "onset", "contour") as a frame-level feature — e.g. for multi-pitch estimation evaluation, training a downstream model, or aligning with a piano-roll ground truth — will observe results that degrade over the course of the track. Users who only use the decoded note_events / MIDI output are unaffected.

Root Cause
In basic_pitch/inference.py, audio is split into overlapping windows of AUDIO_N_SAMPLES = 43844 samples with overlap_len = 30 × FFT_HOP = 7680 samples. After inference, n_overlapping_frames // 2 = 15 frames are stripped from each side of each window, leaving:

kept     = ANNOT_N_FRAMES - 2 × 15 = 172 - 30 = 142 frames/window
hop_size = AUDIO_N_SAMPLES - overlap_len = 43844 - 7680 = 36164 samples

For the concatenated frame array to be temporally consistent, the hop in samples must equal exactly kept × FFT_HOP. But:

kept × FFT_HOP = 142 × 256 = 36352 samples
hop_size       =              36164 samples
mismatch       =               −188 samples/window  ← bug

This means every window boundary introduces a −188-sample drift between what frame index F implies (F × FFT_HOP samples into the audio) and where that frame actually came from. Over 321 windows (a 527-second file at 22 050 Hz) the error accumulates to 321 × 188 = 60 348 samples ≈ 2.7 seconds.
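The accumulation above can be checked with a short sketch. The constants are copied from this issue rather than imported; SAMPLE_RATE = 22050 and the helper name accumulated_drift_seconds are assumptions made for illustration and are not part of basic-pitch.

```python
# Constants copied from the issue text (not imported from basic_pitch).
FFT_HOP = 256
AUDIO_N_SAMPLES = 43844
ANNOT_N_FRAMES = 172
N_OVERLAPPING_FRAMES = 30
SAMPLE_RATE = 22050  # assumption: sample rate in Hz

HOP_SIZE = AUDIO_N_SAMPLES - N_OVERLAPPING_FRAMES * FFT_HOP  # 36164 samples
KEPT = ANNOT_N_FRAMES - N_OVERLAPPING_FRAMES                 # 142 frames
DRIFT_PER_WINDOW = HOP_SIZE - KEPT * FFT_HOP                 # -188 samples


def accumulated_drift_seconds(duration_s: float) -> float:
    """Hypothetical helper: drift (in seconds) after all full windows."""
    n_windows = int(duration_s * SAMPLE_RATE) // HOP_SIZE
    return n_windows * DRIFT_PER_WINDOW / SAMPLE_RATE


drift_527 = accumulated_drift_seconds(527)  # about -2.74 seconds
```

The negative sign means the real audio position lags behind F × FFT_HOP, so frame-indexed features appear progressively later than the audio they describe.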

To Reproduce

import basic_pitch.constants as C

n_overlapping_frames = 30
overlap_len = n_overlapping_frames * C.FFT_HOP          # 7680
hop_size    = C.AUDIO_N_SAMPLES - overlap_len           # 36164
kept        = C.ANNOT_N_FRAMES - n_overlapping_frames   # 142

print(f"hop_size   = {hop_size}")           # 36164
print(f"kept × HOP = {kept * C.FFT_HOP}")  # 36352  ← mismatch
print(f"drift/win  = {hop_size - kept * C.FFT_HOP} samples")  # -188

Expected behavior
unwrap_output should document the per-window drift and expose a utility to convert frame indices to original-audio sample positions.
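A minimal sketch of what such a utility could look like is given below. The function names frame_to_sample and naive_frame_to_sample are hypothetical and do not exist in basic-pitch. The sketch assumes that the first kept frame of window w corresponds to sample w × hop_size of the original audio and that frames within a window are spaced FFT_HOP samples apart, which is the drift model described under Root Cause.

```python
# Constants copied from the issue text (not imported from basic_pitch).
FFT_HOP = 256
HOP_SIZE = 36164  # AUDIO_N_SAMPLES - overlap_len
KEPT = 142        # frames kept per window after trimming


def frame_to_sample(frame_idx: int) -> int:
    """Hypothetical: map a concatenated frame index to an audio sample.

    Assumes window w starts at sample w * HOP_SIZE and that frames inside
    a window are FFT_HOP samples apart.
    """
    window, offset = divmod(frame_idx, KEPT)
    return window * HOP_SIZE + offset * FFT_HOP


def naive_frame_to_sample(frame_idx: int) -> int:
    """Position implied by treating the array as uniformly spaced."""
    return frame_idx * FFT_HOP


# Difference after 321 full windows: 321 * 188 = 60348 samples.
last = KEPT * 321
gap = naive_frame_to_sample(last) - frame_to_sample(last)
```

Under these assumptions the two mappings agree inside the first window and diverge by 188 samples at each later window boundary.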

Screenshots

[Image 1] [Image 2]

The two figures show the ground truth (green), the prediction (red), and the corrected prediction (blue) for a test sample. The first figure covers 0 s to 30 s, where the ground truth and the prediction are roughly aligned; the second figure shows the ground truth and the prediction clearly misaligned.

Environment

basic-pitch  0.4.0
Python       3.10

Additional information
Issue written by claude and manually verified :)
