change prepare_csv_wavs from relative path to absolute path and get d…#1256
change prepare_csv_wavs from relative path to absolute path and get d…#1256SWivid merged 3 commits intoSWivid:mainfrom
Conversation
…uration info with soundfile and torchaudio
|
cc @SWivid |
There was a problem hiding this comment.
Pull request overview
This PR refactors the CSV-based dataset preparation to accept a direct CSV file path with absolute audio paths instead of expecting a directory containing metadata.csv and a wavs subdirectory. The PR also improves audio duration extraction by trying soundfile first before falling back to ffprobe and torchaudio.info.
Changes:
- Changed input format from directory structure to direct CSV file path with absolute audio file paths
- Added soundfile as the primary method for extracting audio duration with fallback chain
- Enhanced CSV validation with strict header checking and absolute path requirement
- Updated documentation to reflect the new CSV-based workflow
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| src/f5_tts/train/datasets/prepare_csv_wavs.py | Refactored to accept CSV file with absolute paths; added soundfile for duration extraction; improved validation and error handling |
| src/f5_tts/train/README.md | Updated usage instructions to show CSV file path instead of directory path |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| parser.add_argument( | ||
| "inp_dir", | ||
| type=str, | ||
| help="Input CSV with header 'audio_file|text' and absolute wav paths.", |
There was a problem hiding this comment.
The help text states "absolute wav paths" but the code accepts any audio file format that soundfile, ffprobe, or torchaudio can handle (not just WAV files). Consider using more generic terminology like "absolute audio file paths" to avoid confusion.
| audio_file = row[0].strip() | ||
| text = row[1].strip() | ||
| if not audio_file: | ||
| continue |
There was a problem hiding this comment.
The code validates that audio_file is not empty (line 221-222) but does not validate that text is not empty. Empty text entries could cause issues downstream in the text processing pipeline. Consider adding validation for empty text values similar to the audio_file check.
| continue | |
| continue | |
| if not text: | |
| continue |
| ### 2. Create custom dataset with metadata.csv | ||
| ### 2. Create custom dataset with CSV | ||
| Prepare a CSV with two columns using a required header: `audio_file|text`. Audio paths must be absolute. | ||
| Use guidance see [#57 here](https://github.com/SWivid/F5-TTS/discussions/57#discussioncomment-10959029). |
There was a problem hiding this comment.
The sentence "Use guidance see" is grammatically incorrect. Consider revising to "For guidance, see" or "See usage guidance at".
| Use guidance see [#57 here](https://github.com/SWivid/F5-TTS/discussions/57#discussioncomment-10959029). | |
| For guidance, see [#57 here](https://github.com/SWivid/F5-TTS/discussions/57#discussioncomment-10959029). |
| Get the duration of an audio file in seconds using ffmpeg's ffprobe. | ||
| Falls back to torchaudio.load() if ffprobe fails. | ||
| """ | ||
| """Get the duration of an audio file in seconds with fallbacks.""" |
There was a problem hiding this comment.
The timeout parameter is still defined in the function signature but is now only used for the ffprobe fallback (not for soundfile). The soundfile.info() call has no timeout, which means it could potentially hang indefinitely on corrupted or problematic audio files. Consider either removing the timeout parameter if it's no longer needed, or documenting that it only applies to the ffprobe fallback.
| """Get the duration of an audio file in seconds with fallbacks.""" | |
| """ | |
| Get the duration of an audio file in seconds with fallbacks. | |
| Note: | |
| The ``timeout`` parameter applies only to the ``ffprobe`` subprocess | |
| fallback, not to the initial ``soundfile.info()`` call (which has no | |
| built-in timeout) or the ``torchaudio.info()`` fallback. | |
| """ |
| audio_path = Path(audio_file).expanduser() | ||
| if not audio_path.is_absolute(): | ||
| raise ValueError(f"audio_file must be an absolute path (row {row_idx}): {audio_file}") |
There was a problem hiding this comment.
The absolute path validation occurs after calling expanduser() on the path. This means that paths like "~/audio.wav" will first be expanded to an absolute path (e.g., "/home/user/audio.wav") and will pass the is_absolute() check, even though the original path in the CSV was not absolute. This contradicts the stated requirement that audio paths in the CSV must be absolute. Consider checking if the path is absolute before calling expanduser() to enforce the documented requirement strictly.
| audio_path = Path(audio_file).expanduser() | |
| if not audio_path.is_absolute(): | |
| raise ValueError(f"audio_file must be an absolute path (row {row_idx}): {audio_file}") | |
| audio_path = Path(audio_file) | |
| if not audio_path.is_absolute(): | |
| raise ValueError(f"audio_file must be an absolute path (row {row_idx}): {audio_file}") | |
| audio_path = audio_path.expanduser() |
| /path/to/wavs/audio_0002.wav|Hi, how are you doing today? I want to go shopping and buy me some lemons. | ||
|
|
||
| Notes: | ||
| - audio_file must be an absolute path. |
There was a problem hiding this comment.
The module docstring states "audio_file must be an absolute path" but the actual code allows tilde expansion (expanduser). This creates an inconsistency between the documentation and implementation. If the intent is to allow tilde paths, the documentation should reflect this (e.g., "audio_file must be an absolute path or use tilde notation"). If not, the code should validate before expansion.
| - audio_file must be an absolute path. | |
| - audio_file should be a valid filesystem path (e.g., absolute, relative to the current working directory, or using ~ for the home directory). |
| try: | ||
| return sf.info(audio_path).duration |
There was a problem hiding this comment.
The new implementation tries soundfile first, then ffprobe, then torchaudio.info as fallbacks. However, soundfile.info() may load or partially decode the audio file, which could be slower than ffprobe for certain formats. The original implementation used ffprobe first (which is typically fast as it just reads metadata) before falling back to loading the actual audio. Consider whether soundfile.info() is actually faster than ffprobe for the expected audio formats, or if the fallback order should be reconsidered.
| from importlib.resources import files | ||
| from pathlib import Path | ||
|
|
||
| import soundfile as sf |
There was a problem hiding this comment.
The new soundfile import is added at line 32 but there is no validation that soundfile is installed. While soundfile will be tried first in get_audio_duration, if it's not installed, every single audio file will trigger an exception and print a warning before falling back to ffprobe. This could result in significant console spam for large datasets. Consider adding an import check at module level or checking once at the start of processing whether soundfile is available.
| import soundfile as sf | |
| try: | |
| import soundfile as sf | |
| _SOUND_FILE_AVAILABLE = True | |
| except ImportError: | |
| sf = None # type: ignore[assignment] | |
| _SOUND_FILE_AVAILABLE = False |
|
|
||
|
|
||
| def prepare_csv_wavs_dir(input_dir, num_workers=None): | ||
| def prepare_csv_wavs_dir(input_path, num_workers=None): |
There was a problem hiding this comment.
The function is named prepare_csv_wavs_dir which suggests it expects a directory, but it now accepts a CSV file path and the parameter is named input_path. Consider renaming the function to better reflect its actual purpose, such as prepare_csv_wavs or prepare_csv_dataset, since it no longer operates on a directory containing metadata.csv and wavs subdirectory.
| return info.num_frames / info.sample_rate | ||
| raise ValueError("Invalid sample_rate from torchaudio.info.") | ||
| except Exception as e: | ||
| raise RuntimeError(f"failed to get duration for {audio_path}: {e}") |
There was a problem hiding this comment.
The error message uses lowercase "failed" but similar RuntimeError messages in Python typically start with an uppercase letter for consistency with standard exception formatting. Consider capitalizing the first letter: "Failed to get duration for".
| raise RuntimeError(f"failed to get duration for {audio_path}: {e}") | |
| raise RuntimeError(f"Failed to get duration for {audio_path}: {e}") |
…uration info with soundfile and torchaudio