-
Notifications
You must be signed in to change notification settings - Fork 2.1k
change prepare_csv_wavs from relative path to absolute path and get d… #1256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,9 +1,22 @@ | ||||||||||||||||||||
| """ | ||||||||||||||||||||
| Usage: | ||||||||||||||||||||
| python prepare_csv_wavs.py /path/to/metadata.csv /output/dataset/path [--pretrain] [--workers N] | ||||||||||||||||||||
|
|
||||||||||||||||||||
| CSV format (header required, "|" delimiter): | ||||||||||||||||||||
| audio_file|text | ||||||||||||||||||||
| /path/to/wavs/audio_0001.wav|Yo! Hello? Hello? | ||||||||||||||||||||
| /path/to/wavs/audio_0002.wav|Hi, how are you doing today? I want to go shopping and buy me some lemons. | ||||||||||||||||||||
|
|
||||||||||||||||||||
| Notes: | ||||||||||||||||||||
| - audio_file must be an absolute path. | ||||||||||||||||||||
|
||||||||||||||||||||
| - audio_file must be an absolute path. | |
| - audio_file should be a valid filesystem path (e.g., absolute, relative to the current working directory, or using ~ for the home directory). |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new soundfile import is added at line 32 but there is no validation that soundfile is installed. While soundfile will be tried first in get_audio_duration, if it's not installed, every single audio file will trigger an exception and print a warning before falling back to ffprobe. This could result in significant console spam for large datasets. Consider adding an import check at module level or checking once at the start of processing whether soundfile is available.
| import soundfile as sf | |
| try: | |
| import soundfile as sf | |
| _SOUND_FILE_AVAILABLE = True | |
| except ImportError: | |
| sf = None # type: ignore[assignment] | |
| _SOUND_FILE_AVAILABLE = False |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function is named prepare_csv_wavs_dir which suggests it expects a directory, but it now accepts a CSV file path and the parameter is named input_path. Consider renaming the function to better reflect its actual purpose, such as prepare_csv_wavs or prepare_csv_dataset, since it no longer operates on a directory containing metadata.csv and wavs subdirectory.
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout parameter is still defined in the function signature but is now only used for the ffprobe fallback (not for soundfile). The soundfile.info() call has no timeout, which means it could potentially hang indefinitely on corrupted or problematic audio files. Consider either removing the timeout parameter if it's no longer needed, or documenting that it only applies to the ffprobe fallback.
| """Get the duration of an audio file in seconds with fallbacks.""" | |
| """ | |
| Get the duration of an audio file in seconds with fallbacks. | |
| Note: | |
| The ``timeout`` parameter applies only to the ``ffprobe`` subprocess | |
| fallback, not to the initial ``soundfile.info()`` call (which has no | |
| built-in timeout) or the ``torchaudio.info()`` fallback. | |
| """ |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new implementation tries soundfile first, then ffprobe, then torchaudio.info as fallbacks. However, soundfile.info() may load or partially decode the audio file, which could be slower than ffprobe for certain formats. The original implementation used ffprobe first (which is typically fast as it just reads metadata) before falling back to loading the actual audio. Consider whether soundfile.info() is actually faster than ffprobe for the expected audio formats, or if the fallback order should be reconsidered.
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message uses lowercase "failed" but similar RuntimeError messages in Python typically start with an uppercase letter for consistency with standard exception formatting. Consider capitalizing the first letter: "Failed to get duration for".
| raise RuntimeError(f"failed to get duration for {audio_path}: {e}") | |
| raise RuntimeError(f"Failed to get duration for {audio_path}: {e}") |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code validates that audio_file is not empty (line 221-222) but does not validate that text is not empty. Empty text entries could cause issues downstream in the text processing pipeline. Consider adding validation for empty text values similar to the audio_file check.
| continue | |
| continue | |
| if not text: | |
| continue |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The absolute path validation occurs after calling expanduser() on the path. This means that paths like "~/audio.wav" will first be expanded to an absolute path (e.g., "/home/user/audio.wav") and will pass the is_absolute() check, even though the original path in the CSV was not absolute. This contradicts the stated requirement that audio paths in the CSV must be absolute. Consider checking if the path is absolute before calling expanduser() to enforce the documented requirement strictly.
| audio_path = Path(audio_file).expanduser() | |
| if not audio_path.is_absolute(): | |
| raise ValueError(f"audio_file must be an absolute path (row {row_idx}): {audio_file}") | |
| audio_path = Path(audio_file) | |
| if not audio_path.is_absolute(): | |
| raise ValueError(f"audio_file must be an absolute path (row {row_idx}): {audio_file}") | |
| audio_path = audio_path.expanduser() |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The help text states "absolute wav paths" but the code accepts any audio file format that soundfile, ffprobe, or torchaudio can handle (not just WAV files). Consider using more generic terminology like "absolute audio file paths" to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sentence "Use guidance see" is grammatically incorrect. Consider revising to "For guidance, see" or "See usage guidance at".