
Conversation

@a5hun
Contributor

@a5hun a5hun commented Jun 15, 2025

Replaced the Harmonic Product Spectrum (HPS) pitch estimator with an ultra-slim implementation of SWIPE (Sawtooth Waveform Inspired Pitch Estimator), adapted from libf0 and based on Arturo Camacho's original algorithm:

https://ufdcimages.uflib.ufl.edu/UF/E0/02/15/89/00001/camacho_a.pdf

SWIPE is an extremely accurate and performant pitch estimator, working on log-spaced frequency bins for precision, and it works well for most voices. Arturo's original implementation and both libf0 versions (full and slim) use multiple FFT sizes [128, 256, 512, 1024, 2048, and 4096] to reduce the pitch strength loss for different frequencies when using suboptimal window sizes, but pitch estimation is still very good with just a single FFT size (4096 for sr > 44100). I've also slightly modified the kernels to weigh the fundamental and first harmonic evenly, which helps prevent cases where a higher strength first harmonic (as in a sung or voiced "eh" ⟨e⟩) is falsely detected as the pitch.
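The kernel idea can be sketched roughly as follows. This is a minimal, illustrative Python sketch, not the friture implementation: the cosine-lobe shapes follow Camacho's thesis, but the harmonic count, the 1/√h decay, and the equal weighting of the fundamental and first harmonic are simplified assumptions based on the description above.

```python
import numpy as np

def swipe_kernel(f0, freqs, n_harmonics=8):
    """Toy SWIPE-style kernel sampled at `freqs` (Hz) for candidate pitch f0.

    Cosine lobes are centered on each harmonic of f0; weights decay roughly
    as 1/sqrt(h), except that h=1 and h=2 are weighted equally (a hypothetical
    version of the tweak described above, not the exact friture code).
    """
    r = freqs / f0                              # frequency in units of f0
    k = np.zeros_like(r, dtype=float)
    for h in range(1, n_harmonics + 1):
        w = 1.0 if h <= 2 else 1.0 / np.sqrt(h)  # equal weight for f0 and 2*f0
        d = np.abs(r - h)                        # distance to harmonic h
        lobe = np.cos(2 * np.pi * r)
        # Full lobe near the harmonic, half-height side lobes, zero elsewhere.
        k += w * np.where(d < 0.25, lobe, np.where(d < 0.75, 0.5 * lobe, 0.0))
    return k

def pitch_strength(f0, freqs, mag):
    """Inner product of the sqrt-magnitude spectrum with the kernel,
    normalized by the spectrum energy under the kernel's support."""
    k = swipe_kernel(f0, freqs)
    s = np.sqrt(mag)
    denom = np.linalg.norm(s * (k != 0)) or 1.0
    return float(np.dot(k, s) / denom)
```

In use, the strength would be evaluated over a grid of candidate pitches and the maximum taken per frame; the real algorithm also interpolates the strength curve for sub-bin precision.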

To better align with other voice pitch trackers, a new frequency scale has been added, centered around C and with ticks for each note on the chromatic scale.
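The tick frequencies of such a scale follow the equal-tempered relation f(n) = C0 · 2^(n/12). A minimal sketch of how the ticks could be generated (the function name and frequency range are illustrative, not the actual friture code):

```python
import numpy as np

def chromatic_ticks(f_lo=60.0, f_hi=1000.0, a4=440.0):
    """Frequencies and names of chromatic-scale notes within [f_lo, f_hi].

    Notes are spaced a factor of 2**(1/12) apart; C is derived from the
    A4 reference (C4 = a4 * 2**(-9/12)), so ticks land on every semitone.
    """
    names = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
    c0 = a4 * 2.0 ** (-9 / 12) / 16.0   # C0, four octaves below C4 (~16.35 Hz)
    ticks = []
    n = 0
    while True:
        f = c0 * 2.0 ** (n / 12)        # n semitones above C0
        if f > f_hi:
            break
        if f >= f_lo:
            ticks.append((f'{names[n % 12]}{n // 12}', f))
        n += 1
    return ticks
```

On a log-frequency axis these ticks are evenly spaced, which is what makes a note-centered scale convenient for reading vocal pitch.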

@tlecomte
Owner

tlecomte commented Aug 4, 2025

Thank you very much @a5hun for your contribution, it's very much appreciated.

I have a couple questions:

  • should we keep both pitch estimators behind a setting? Or is SWIPE strictly always better than HPS?
  • the code uses scipy for interp1d and CubicSpline. Unfortunately, we cannot use scipy directly because that would make the friture package much bigger. Possible alternatives would be to use methods that already exist in the friture source (I think there may be one for interp1d), to copy the relevant piece of code from scipy, or to reimplement it independently.

@a5hun
Contributor Author

a5hun commented Aug 25, 2025

Apologies for the delayed response! Here's the current HPS pitch tracker compared with the SWIPE-like algo on Salvatore Fisichella's "O muto asil del pianto":

[Image: friture-pt-comp — HPS vs. SWIPE pitch-track comparison]

Pitch and time resolution are much better with my SWIPE implementation, so my vote would be to replace HPS.

Scipy's interp1d and CubicSpline can be replaced with numpy's np.interp without hurting the resolution of the pitch tracker, so I'll make the changes. Is there any way to not plot zero/NaN pitch estimates? That would clean up the plot a little bit for unvoiced sections.
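The substitution is a one-liner; np.interp does linear interpolation with no scipy dependency (the variable names below are illustrative, not the actual friture code):

```python
import numpy as np

# Hypothetical pitch-strength samples at a few candidate pitches.
candidates = np.array([100.0, 200.0, 400.0, 800.0])   # candidate pitches (Hz)
strength = np.array([0.1, 0.9, 0.4, 0.05])            # strength per candidate

# scipy.interpolate.interp1d(candidates, strength)(xs) can be replaced by
# np.interp(xs, candidates, strength) for the linear case:
xs = np.array([150.0, 300.0])
ys = np.interp(xs, candidates, strength)              # → [0.5, 0.65]
```

np.interp is linear only, so it trades the smoothness of CubicSpline for zero extra dependencies; per the comment above, that trade did not hurt the tracker's resolution in practice.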

@tlecomte
Owner

Thanks for the comparison screenshots; it does indeed look more precise.

Thanks also for the move from scipy to numpy.

Regarding hiding zeroes or NaN in plots, I think this is something that could be addressed. It would also allow cleaning up other plots, like the long-time level widget.
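One common way to get this behavior is to map unvoiced estimates to NaN before plotting, since many plotting backends break the line at NaN samples (e.g. matplotlib does this by default, and pyqtgraph does with connect='finite'; whether friture's own renderer behaves the same is an assumption to verify). A minimal sketch:

```python
import numpy as np

def mask_unvoiced(pitch):
    """Replace zero or invalid pitch estimates with NaN so that line plots
    show gaps at unvoiced frames instead of dropping to 0 Hz."""
    pitch = np.asarray(pitch, dtype=float)
    return np.where(pitch > 0, pitch, np.nan)
```

Applied just before the data reaches the plot, this would leave the estimator itself untouched.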

@tlecomte
Owner

@a5hun Would you be able to rebase your changes please? There are merge conflicts following changes I made in pitch_tracker.py... Sorry about that!

@tlecomte
Owner

(I'm also curious if it could be realistic to use CREPE or SwiftF0)

Pitch tracker now works with the latest friture changes. Kernel generation is slightly different, tuned with vocal stems (male/female) to return fewer spurious voiced pitches.
@a5hun
Contributor Author

a5hun commented Sep 22, 2025

Not sure this was what you asked, exactly. A git rebase seems to be more difficult than writing a SWIPE-like pitch tracker! Everything on my branch should now be current with friture's master.

CREPE is too slow for realtime. I've spent a lot of time testing out various pitch tracking algorithms with high quality isolated vocals (https://cambridge-mt.com/ms3/mtk/), and I settled on a customized SWIPE-ish one because it's fast, very precise, and can work on independent audio frames. I really like pYIN, but it's also too slow. AI models can produce good results (CREPE is good, SPICE not so much, and I'm playing with SwiftF0 now), but they're either too slow or don't work well on 4096 samples.

https://github.com/lars76/pitch-benchmark/
I know SWIPE looks bad in this (not sure what SPTK is doing exactly with their SWIPE algo), but my version performs nearly identically to SwiftF0 on most voices. Here's a 16 second clip from the vocal stem on Jesse Joy's 'Release':

[Image: swift-swipe-comp — SwiftF0 vs. SWIPE pitch-track comparison]

For this clip, calc time for my SWIPE algo is 0.122 seconds. SwiftF0 takes 1.52 seconds. That's at 4096 FFT length, hop at 1/4 that.
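A timing like this can be reproduced with a simple frame loop; the sketch below is a hypothetical harness (the estimator callable and clip are placeholders), using the same framing as above: 4096-sample frames with a hop of one quarter of that.

```python
import time
import numpy as np

def time_per_clip(estimate, signal, fft_len=4096, hop=1024):
    """Time a frame-based pitch estimator over a whole clip.

    `estimate` is any callable taking one frame of samples; hop defaults
    to fft_len // 4 as in the comparison above.
    """
    frames = [signal[i:i + fft_len]
              for i in range(0, len(signal) - fft_len + 1, hop)]
    t0 = time.perf_counter()
    for frame in frames:
        estimate(frame)
    return time.perf_counter() - t0
```

The 0.122 s vs. 1.52 s figures above are of course specific to the author's machine and the 16-second clip.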

I tried to integrate SwiftF0 into Friture (it's very probably the better pitch estimator), but I ran into a lot of issues. It needs librosa and scipy (and more!), and I can't get onnxruntime to work without crashing. If you can untangle the dependencies, it's really easy to integrate in code, but I don't know enough about which versions of what friture requires.

@a5hun a5hun reopened this Sep 22, 2025
@tlecomte
Owner

tlecomte commented Oct 6, 2025

Thank you very much @a5hun, that looks great!

I will merge as is.

I am also curious whether there are algorithms capable of handling multiple sources, because I can imagine that being interesting in a setting with several voices and instruments.

@tlecomte tlecomte merged commit 24661ae into tlecomte:master Oct 7, 2025
8 checks passed