Remove openai-whisper dependency for log_mel_spectrogram #1846
Open
musselmanjoey wants to merge 1 commit into FunAudioLLM:main from
openai-whisper is a heavy (~1.5GB) speech recognition package, but CosyVoice only uses whisper.log_mel_spectrogram(), a standard audio preprocessing utility. This causes widespread installation failures (see FunAudioLLM#1844, FunAudioLLM#1266, FunAudioLLM#249, FunAudioLLM#316) due to dependency conflicts, especially on platforms with pre-installed PyTorch (Kaggle, Colab).

Replace all whisper.log_mel_spectrogram() calls with a lightweight implementation in cosyvoice/utils/audio_utils.py that uses only torch and torchaudio (already required dependencies). The output is numerically equivalent.

The legacy get_tokenizer() function (CosyVoice v1) still needs whisper.tokenizer.Tokenizer, so that import is moved to a lazy import inside the function body and only triggers if you actually use the v1 tokenizer path. CosyVoice2/3 tokenizers are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
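The lazy-import idea in the commit message can be illustrated with a minimal sketch. The helper name `load_whisper_tokenizer_cls` and the error message are hypothetical and are not taken from the PR; only the `whisper.tokenizer.Tokenizer` import path comes from the description above.

```python
import sys


def load_whisper_tokenizer_cls():
    """Hypothetical helper: return whisper.tokenizer.Tokenizer on demand."""
    try:
        # Deferred import: evaluated only when this function is called,
        # so importing the enclosing module never requires openai-whisper.
        from whisper.tokenizer import Tokenizer
    except ImportError as exc:
        raise ImportError(
            "openai-whisper is required only for the CosyVoice v1 tokenizer path"
        ) from exc
    return Tokenizer


# Defining the helper has not imported whisper.tokenizer; this prints
# whether some other code in the process has already loaded it.
print("whisper.tokenizer" in sys.modules)
```

With this pattern, users on the CosyVoice2/3 paths never hit the import, while v1 users get an explicit error if the package is missing.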
Summary
- Replace `whisper.log_mel_spectrogram()` calls with a lightweight implementation using `torch` + `torchaudio` (already required deps)
- Add `cosyvoice/utils/audio_utils.py` with a drop-in `log_mel_spectrogram()` function
- Remove `openai-whisper==20231117` from `requirements.txt`
- Move the `whisper.tokenizer.Tokenizer` import (used only by CosyVoice v1's `get_tokenizer()`) to a lazy import so it doesn't break module loading

Motivation
`openai-whisper` is a ~1.5GB speech recognition package, but CosyVoice only uses one utility function from it: `whisper.log_mel_spectrogram()`. This causes widespread installation failures due to dependency conflicts, especially on platforms with pre-installed PyTorch (Kaggle, Colab, etc.).

Related issues: #1844, #1266, #249, #316
Details
`log_mel_spectrogram` is a standard audio preprocessing operation (STFT → mel filterbank → log scaling). The replacement in `audio_utils.py` uses `torch.stft` and `torchaudio.functional.melscale_fbanks` with the same parameters as Whisper (n_fft=400, hop_length=160, 16kHz sample rate), producing numerically equivalent output.

Files changed:
- `cosyvoice/utils/audio_utils.py` (new): shared `log_mel_spectrogram` implementation
- `cosyvoice/cli/frontend.py`: use `audio_utils.log_mel_spectrogram` instead of `whisper`
- `cosyvoice/dataset/processor.py`: same replacement
- `tools/extract_speech_token.py`: same replacement
- `cosyvoice/tokenizer/tokenizer.py`: lazy import of `whisper.tokenizer.Tokenizer` (only needed for v1 tokenizer path)
- `requirements.txt`: remove `openai-whisper`

🤖 Generated with Claude Code