change prepare_csv_wavs from relative path to absolute path and get d… #1256

Merged
SWivid merged 3 commits into SWivid:main from ZhikangNiu:main on Jan 22, 2026

@ZhikangNiu
Collaborator

…uration info with soundfile and torchaudio

ZhikangNiu requested review from SWivid and Copilot on January 22, 2026 04:28
@ZhikangNiu
Collaborator Author

cc @SWivid

Copilot AI left a comment

Pull request overview

This PR refactors the CSV-based dataset preparation to accept a direct CSV file path with absolute audio paths instead of expecting a directory containing metadata.csv and a wavs subdirectory. The PR also improves audio duration extraction by trying soundfile first before falling back to ffprobe and torchaudio.info.

Changes:

  • Changed input format from directory structure to direct CSV file path with absolute audio file paths
  • Added soundfile as the primary method for extracting audio duration with fallback chain
  • Enhanced CSV validation with strict header checking and absolute path requirement
  • Updated documentation to reflect the new CSV-based workflow
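
The fallback chain described in this overview can be sketched generically. This is not the PR's code: `first_successful_duration` and `wave_duration` are hypothetical names, and the standard-library `wave` module stands in for soundfile, ffprobe, and torchaudio so the sketch runs without extra dependencies.

```python
import wave
from typing import Callable, Sequence


def wave_duration(path: str) -> float:
    """Stand-in probe: read the duration of a PCM WAV file from its header."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def first_successful_duration(
    path: str, probes: Sequence[Callable[[str], float]]
) -> float:
    """Try each probe in order and return the first positive duration."""
    errors = []
    for probe in probes:
        try:
            duration = float(probe(path))
            if duration > 0:
                return duration
            errors.append(f"{probe.__name__}: non-positive duration")
        except Exception as exc:  # any probe failure moves on to the next one
            errors.append(f"{probe.__name__}: {exc}")
    raise RuntimeError(f"Failed to get duration for {path}: {'; '.join(errors)}")
```

In the PR's setting the probe list would presumably be ordered soundfile, ffprobe, torchaudio; collecting the per-probe errors keeps the final message informative when every backend fails.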

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/f5_tts/train/datasets/prepare_csv_wavs.py | Refactored to accept CSV file with absolute paths; added soundfile for duration extraction; improved validation and error handling |
| src/f5_tts/train/README.md | Updated usage instructions to show CSV file path instead of directory path |

parser.add_argument(
"inp_dir",
type=str,
help="Input CSV with header 'audio_file|text' and absolute wav paths.",

Copilot AI Jan 22, 2026

The help text states "absolute wav paths" but the code accepts any audio file format that soundfile, ffprobe, or torchaudio can handle (not just WAV files). Consider using more generic terminology like "absolute audio file paths" to avoid confusion.
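
A minimal sketch of how the argument could read with format-neutral wording. The positional name `inp_csv` is hypothetical (the PR keeps `inp_dir`), and `build_parser` is introduced only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose help text does not imply WAV-only input."""
    parser = argparse.ArgumentParser(description="Prepare a dataset from a CSV file.")
    parser.add_argument(
        "inp_csv",  # hypothetical name; the PR keeps "inp_dir"
        type=str,
        help="Input CSV with header 'audio_file|text' and absolute audio file paths.",
    )
    return parser
```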

Copilot uses AI. Check for mistakes.
audio_file = row[0].strip()
text = row[1].strip()
if not audio_file:
continue

Copilot AI Jan 22, 2026

The code validates that audio_file is not empty (line 221-222) but does not validate that text is not empty. Empty text entries could cause issues downstream in the text processing pipeline. Consider adding validation for empty text values similar to the audio_file check.

Suggested change
-     continue
+     continue
+ if not text:
+     continue
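
For context, a self-contained sketch of row validation that skips entries with either field empty. It assumes the file is parsed with the standard `csv` module and a `|` delimiter, which may differ from the PR's actual parsing; `read_rows` and `EXPECTED_HEADER` are hypothetical names.

```python
import csv
import io

EXPECTED_HEADER = ["audio_file", "text"]


def read_rows(csv_text: str) -> list[tuple[str, str]]:
    """Parse 'audio_file|text' rows, skipping rows with an empty field."""
    reader = csv.reader(io.StringIO(csv_text), delimiter="|")
    header = next(reader, None)
    if header is None or [h.strip() for h in header] != EXPECTED_HEADER:
        raise ValueError(f"Expected header 'audio_file|text', got: {header}")
    rows = []
    for row in reader:
        if len(row) < 2:
            continue  # malformed or blank line
        audio_file, text = row[0].strip(), row[1].strip()
        if not audio_file or not text:
            continue  # skip entries missing either the path or the transcript
        rows.append((audio_file, text))
    return rows
```

Silently skipping rows mirrors the existing `audio_file` check; counting or logging the skipped rows would make data loss easier to notice.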

- ### 2. Create custom dataset with metadata.csv
+ ### 2. Create custom dataset with CSV
Prepare a CSV with two columns using a required header: `audio_file|text`. Audio paths must be absolute.
Use guidance see [#57 here](https://github.com/SWivid/F5-TTS/discussions/57#discussioncomment-10959029).

Copilot AI Jan 22, 2026

The sentence "Use guidance see" is grammatically incorrect. Consider revising to "For guidance, see" or "See usage guidance at".

Suggested change
- Use guidance see [#57 here](https://github.com/SWivid/F5-TTS/discussions/57#discussioncomment-10959029).
+ For guidance, see [#57 here](https://github.com/SWivid/F5-TTS/discussions/57#discussioncomment-10959029).
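
Since the README now requires absolute audio paths, datasets prepared in the earlier layout (a directory with metadata.csv and a wavs subdirectory) would need converting. The sketch below is one way to do that, under the assumption that the old file uses the same `audio_file|text` header with paths relative to the CSV's directory; `to_absolute_csv` is a hypothetical helper, not part of the PR.

```python
import csv
from pathlib import Path


def to_absolute_csv(old_csv: str, new_csv: str) -> int:
    """Rewrite relative audio paths as absolute ones; return rows written."""
    base_dir = Path(old_csv).resolve().parent
    written = 0
    with open(old_csv, newline="", encoding="utf-8") as src, \
         open(new_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="|")
        writer = csv.writer(dst, delimiter="|")
        header = next(reader, None)
        if header is None:
            return 0  # empty input file
        writer.writerow(header)  # copy the header unchanged
        for row in reader:
            if len(row) < 2:
                continue  # malformed or blank line
            audio = Path(row[0].strip())
            if not audio.is_absolute():
                audio = base_dir / audio
            writer.writerow([str(audio), row[1].strip()])
            written += 1
    return written
```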

- Get the duration of an audio file in seconds using ffmpeg's ffprobe.
- Falls back to torchaudio.load() if ffprobe fails.
- """
+ """Get the duration of an audio file in seconds with fallbacks."""

Copilot AI Jan 22, 2026

The timeout parameter is still defined in the function signature but is now only used for the ffprobe fallback (not for soundfile). The soundfile.info() call has no timeout, which means it could potentially hang indefinitely on corrupted or problematic audio files. Consider either removing the timeout parameter if it's no longer needed, or documenting that it only applies to the ffprobe fallback.

Suggested change
- """Get the duration of an audio file in seconds with fallbacks."""
+ """
+ Get the duration of an audio file in seconds with fallbacks.
+ Note:
+     The ``timeout`` parameter applies only to the ``ffprobe`` subprocess
+     fallback, not to the initial ``soundfile.info()`` call (which has no
+     built-in timeout) or the ``torchaudio.info()`` fallback.
+ """

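
To illustrate why the timeout can only bound the subprocess step, here is a sketch of an ffprobe call guarded by `subprocess.run(..., timeout=...)`. The exact ffprobe arguments used in the PR are not shown in this thread, so the command below is an assumption, and both helper names are hypothetical.

```python
import subprocess
from typing import Optional, Sequence


def ffprobe_command(audio_path: str) -> list[str]:
    """Build an ffprobe call that prints only the container duration."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        audio_path,
    ]


def run_duration_command(cmd: Sequence[str], timeout: float) -> Optional[float]:
    """Run a command that prints a duration; return None on failure or timeout."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=timeout
        )
        return float(result.stdout.strip())
    except (subprocess.SubprocessError, OSError, ValueError):
        # Covers a timeout, a non-zero exit, a missing binary, or unparsable output.
        return None
```

An in-process call such as `sf.info()` has no equivalent switch, which is the gap the comment above points out.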
Comment on lines +223 to +225
audio_path = Path(audio_file).expanduser()
if not audio_path.is_absolute():
raise ValueError(f"audio_file must be an absolute path (row {row_idx}): {audio_file}")

Copilot AI Jan 22, 2026

The absolute path validation occurs after calling expanduser() on the path. This means that paths like "~/audio.wav" will first be expanded to an absolute path (e.g., "/home/user/audio.wav") and will pass the is_absolute() check, even though the original path in the CSV was not absolute. This contradicts the stated requirement that audio paths in the CSV must be absolute. Consider checking if the path is absolute before calling expanduser() to enforce the documented requirement strictly.

Suggested change
- audio_path = Path(audio_file).expanduser()
- if not audio_path.is_absolute():
-     raise ValueError(f"audio_file must be an absolute path (row {row_idx}): {audio_file}")
+ audio_path = Path(audio_file)
+ if not audio_path.is_absolute():
+     raise ValueError(f"audio_file must be an absolute path (row {row_idx}): {audio_file}")
+ audio_path = audio_path.expanduser()
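
A standalone sketch of the strict check, assuming POSIX-style paths; `validate_audio_path` is a hypothetical helper name. Note that once `is_absolute()` has passed, a trailing `expanduser()` no longer changes anything on POSIX, because an absolute path starts with `/` rather than `~`.

```python
from pathlib import Path


def validate_audio_path(audio_file: str, row_idx: int) -> Path:
    """Reject anything that is not absolute as written, including '~' paths."""
    audio_path = Path(audio_file)
    if not audio_path.is_absolute():
        raise ValueError(
            f"audio_file must be an absolute path (row {row_idx}): {audio_file}"
        )
    return audio_path
```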

/path/to/wavs/audio_0002.wav|Hi, how are you doing today? I want to go shopping and buy me some lemons.

Notes:
- audio_file must be an absolute path.

Copilot AI Jan 22, 2026

The module docstring states "audio_file must be an absolute path" but the actual code allows tilde expansion (expanduser). This creates an inconsistency between the documentation and implementation. If the intent is to allow tilde paths, the documentation should reflect this (e.g., "audio_file must be an absolute path or use tilde notation"). If not, the code should validate before expansion.

Suggested change
- - audio_file must be an absolute path.
+ - audio_file should be a valid filesystem path (e.g., absolute, relative to the current working directory, or using ~ for the home directory).

Comment on lines +170 to +171
try:
return sf.info(audio_path).duration

Copilot AI Jan 22, 2026

The new implementation tries soundfile first, then ffprobe, then torchaudio.info as fallbacks. However, soundfile.info() may load or partially decode the audio file, which could be slower than ffprobe for certain formats. The original implementation used ffprobe first (which is typically fast as it just reads metadata) before falling back to loading the actual audio. Consider whether soundfile.info() is actually faster than ffprobe for the expected audio formats, or if the fallback order should be reconsidered.
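
One way to settle the ordering question empirically is to time each probe over a sample of the dataset. The harness below is a generic sketch; `time_probe` is a hypothetical name and nothing here comes from the PR.

```python
import time
from typing import Callable, Iterable


def time_probe(probe: Callable[[str], float], paths: Iterable[str]) -> float:
    """Return the total wall-clock seconds spent calling probe on each path."""
    start = time.perf_counter()
    for path in paths:
        probe(path)
    return time.perf_counter() - start
```

For example, passing `lambda p: sf.info(p).duration` and an ffprobe wrapper over the same few hundred files would show which backend is cheaper for the formats actually in use.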

from importlib.resources import files
from pathlib import Path

import soundfile as sf

Copilot AI Jan 22, 2026

The new soundfile import is added at line 32 but there is no validation that soundfile is installed. While soundfile will be tried first in get_audio_duration, if it's not installed, every single audio file will trigger an exception and print a warning before falling back to ffprobe. This could result in significant console spam for large datasets. Consider adding an import check at module level or checking once at the start of processing whether soundfile is available.

Suggested change
- import soundfile as sf
+ try:
+     import soundfile as sf
+     _SOUND_FILE_AVAILABLE = True
+ except ImportError:
+     sf = None  # type: ignore[assignment]
+     _SOUND_FILE_AVAILABLE = False
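
An alternative sketch for the "check once at the start of processing" option: `importlib.util.find_spec` reports whether a module can be imported without importing it. `available_backends` is a hypothetical helper, and the default candidate names are only an assumption about which backends matter here.

```python
import importlib.util


def available_backends(candidates=("soundfile", "torchaudio")) -> list[str]:
    """Return the candidate modules that can be imported, checked once up front."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]
```

Calling this once before the worker pool starts would allow a single warning about a missing backend instead of one per audio file.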



- def prepare_csv_wavs_dir(input_dir, num_workers=None):
+ def prepare_csv_wavs_dir(input_path, num_workers=None):

Copilot AI Jan 22, 2026

The function is named prepare_csv_wavs_dir which suggests it expects a directory, but it now accepts a CSV file path and the parameter is named input_path. Consider renaming the function to better reflect its actual purpose, such as prepare_csv_wavs or prepare_csv_dataset, since it no longer operates on a directory containing metadata.csv and wavs subdirectory.
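
If the rename is adopted, existing callers could be kept working with a deprecated alias. The sketch below assumes the new name `prepare_csv_dataset` (one of the two suggestions above); the function body is a placeholder, not the PR's implementation.

```python
import warnings


def prepare_csv_dataset(input_path, num_workers=None):
    """Placeholder for the renamed function; the real body lives in the PR."""
    return input_path, num_workers


def prepare_csv_wavs_dir(input_path, num_workers=None):
    """Deprecated alias kept so existing imports keep working."""
    warnings.warn(
        "prepare_csv_wavs_dir is deprecated; use prepare_csv_dataset instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return prepare_csv_dataset(input_path, num_workers=num_workers)
```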

return info.num_frames / info.sample_rate
raise ValueError("Invalid sample_rate from torchaudio.info.")
except Exception as e:
raise RuntimeError(f"failed to get duration for {audio_path}: {e}")

Copilot AI Jan 22, 2026

The error message uses lowercase "failed" but similar RuntimeError messages in Python typically start with an uppercase letter for consistency with standard exception formatting. Consider capitalizing the first letter: "Failed to get duration for".

Suggested change
- raise RuntimeError(f"failed to get duration for {audio_path}: {e}")
+ raise RuntimeError(f"Failed to get duration for {audio_path}: {e}")

SWivid merged commit c279a2b into SWivid:main on Jan 22, 2026
1 check passed

3 participants