Frame-level model_output["note"] array has growing temporal drift due to hop_size ≠ kept_frames × FFT_HOP

**Summary**
The **frame-level** output array `model_output["note"]` (shape (T, 88)) has a growing temporal misalignment between frame index and original audio position. By the end of a ~530-second file the drift reaches −2.7 seconds. The note-level output (`note_events`, `midi_data`) is not affected because it is decoded from seconds-stamped onset/offset detections rather than from the raw frame array.

**Who is affected**
Anyone using `model_output["note"]` (or `"onset"`, `"contour"`) as a frame-level feature — e.g. for multi-pitch estimation evaluation, training a downstream model, or aligning with a piano-roll ground truth — will observe results that degrade over the course of the track. Users who only use the decoded `note_events` / MIDI output are unaffected.

**Root Cause**
In `basic_pitch/inference.py`, audio is split into overlapping windows of `AUDIO_N_SAMPLES = 43844` samples with `overlap_len = 30 × FFT_HOP = 7680` samples. After inference, `n_overlapping_frames // 2 = 15` frames are stripped from each side of each window's output, leaving:

```
kept     = ANNOT_N_FRAMES - 2 × 15 = 172 - 30 = 142 frames/window
hop_size = AUDIO_N_SAMPLES - overlap_len = 43844 - 7680 = 36164 samples
```

For the concatenated frame array to be temporally consistent, the hop in samples must equal exactly `kept × FFT_HOP`. But:

```
kept × FFT_HOP = 142 × 256 = 36352 samples
hop_size       =              36164 samples
mismatch       =               −188 samples/window  ← bug
```

This means every window boundary introduces a −188-sample drift between what frame index `F` implies (`F × FFT_HOP` samples into the audio) and where that frame actually came from. Over 321 windows (527-second file) the error accumulates to ~60 000 samples ≈ 2.7 seconds.
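The accumulation can be checked with a few lines of arithmetic. The sketch below hardcodes the values quoted above so it runs without basic-pitch installed; `AUDIO_SAMPLE_RATE = 22050` is an assumption on my part (it is what the "60 000 samples ≈ 2.7 seconds" figure implies), and `accumulated_drift` is a hypothetical helper, not a library function.

```python
# Values quoted in this issue; AUDIO_SAMPLE_RATE is an assumption.
AUDIO_SAMPLE_RATE = 22050
FFT_HOP = 256
HOP_SIZE = 36164       # AUDIO_N_SAMPLES - overlap_len
KEPT_FRAMES = 142      # ANNOT_N_FRAMES - n_overlapping_frames


def accumulated_drift(duration_s: float) -> tuple[int, int, float]:
    """Return (full windows, drift in samples, drift in seconds) for a file."""
    n_windows = int(duration_s * AUDIO_SAMPLE_RATE // HOP_SIZE)
    drift_samples = n_windows * (HOP_SIZE - KEPT_FRAMES * FFT_HOP)
    return n_windows, drift_samples, drift_samples / AUDIO_SAMPLE_RATE


for duration in (30, 120, 527):
    windows, samples, seconds = accumulated_drift(duration)
    print(f"{duration:>4} s: {windows:>3} windows, {samples:>7} samples, {seconds:+.2f} s")
```

Under these assumptions the drift stays below a quarter of a second for the first 30 seconds, which would explain why short clips look fine.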

**To Reproduce**
```python
import basic_pitch.constants as C

n_overlapping_frames = 30
overlap_len = n_overlapping_frames * C.FFT_HOP          # 7680
hop_size    = C.AUDIO_N_SAMPLES - overlap_len           # 36164
kept        = C.ANNOT_N_FRAMES - n_overlapping_frames   # 142

print(f"hop_size   = {hop_size}")           # 36164
print(f"kept × HOP = {kept * C.FFT_HOP}")  # 36352  ← mismatch
print(f"drift/win  = {hop_size - kept * C.FFT_HOP} samples")  # -188
```

**Expected behavior**
`unwrap_output` should document the per-window drift, and the library should expose a utility that converts frame indices to original-audio sample positions.
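As a rough illustration of what such a utility might look like, here is a minimal sketch. `frame_to_sample` and `frame_drift_samples` are hypothetical names, not part of basic-pitch, and the mapping assumes that the first kept frame of window `w` sits exactly `w × hop_size` samples into the original audio (that is, that any front padding cancels the 15 trimmed frames). That assumption matches the −188 samples/window figure above but has not been checked against every code path.

```python
# Values quoted in this issue, hardcoded so the sketch is self-contained.
FFT_HOP = 256
HOP_SIZE = 36164       # AUDIO_N_SAMPLES - overlap_len
KEPT_FRAMES = 142      # frames kept per window after trimming


def frame_to_sample(frame_index: int) -> int:
    """Map an index in the concatenated frame array to an audio sample position.

    Hypothetical helper: assumes window w starts at w * HOP_SIZE samples.
    """
    window, offset = divmod(frame_index, KEPT_FRAMES)
    return window * HOP_SIZE + offset * FFT_HOP


def frame_drift_samples(frame_index: int) -> int:
    """Difference between the actual position and the naive frame_index * FFT_HOP."""
    return frame_to_sample(frame_index) - frame_index * FFT_HOP


last_frame = KEPT_FRAMES * 321  # first frame of window 321
print(frame_to_sample(last_frame), frame_drift_samples(last_frame))
```

Dividing the result by the sample rate gives a timestamp that can be compared directly with a piano-roll ground truth.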

**Screenshots**

<img width="613" height="306" alt="Image" src="https://github.com/user-attachments/assets/a8ea3a2e-33e7-41cb-9692-8d21a87a2504" />

<img width="613" height="305" alt="Image" src="https://github.com/user-attachments/assets/488fdc17-5251-4c7a-afce-ce0bb227e133" />

Both figures plot the ground truth (green), the raw prediction (red), and the drift-corrected prediction (blue) for one test sample. The first figure covers 0 s to 30 s, where the ground truth and the raw prediction are roughly aligned. In the second figure the ground truth and the raw prediction are clearly misaligned.

**Environment**
```
basic-pitch  0.4.0
Python       3.10
```

**Additional information**
Issue written by Claude and manually verified :)

