This package assists in generating training data for fine-tuning Whisper by:
- Synthesizing long-form audio: Concatenating multiple sentence-level audio clips into longer recordings with matching SRT subtitles, simulating real-world long-form transcription scenarios
- Processing existing SRTs: Cutting and segmenting audio based on existing SRT/VTT transcripts into Whisper-compatible 30-second chunks
- Netflix-style SRT normalization: Merging short captions to meet Netflix subtitle guidelines (max 42 chars, max 7 seconds)
- Filtering and quality control: Removing problematic samples (high compression ratio, too few words, French content, specific keywords)
When working with sentence-level datasets (e.g., Common Voice), the tool concatenates multiple short audio clips into longer recordings while:
- Maintaining speaker consistency with configurable probability
- Generating matching SRT files with accurate timestamps
- Supporting audio overlap between segments (for realistic speech patterns)
- Using Voice Activity Detection (VAD) to determine optimal overlap points
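As a rough illustration of the fusion step (not the package's internal code), concatenating clips while tracking cue boundaries for the synthetic SRT could look like this; pydub is assumed to be installed, and paths and sentences are placeholders:

```python
# Illustrative only: fuse sentence clips and record cue times for a synthetic SRT.
from pydub import AudioSegment

clips = ["clips/abc123.mp3", "clips/def456.mp3"]          # placeholder paths
sentences = ["The quick brown fox", "jumps over the lazy dog"]

combined = AudioSegment.empty()
cues = []  # (start_ms, end_ms, text) entries for the generated SRT
for path, text in zip(clips, sentences):
    clip = AudioSegment.from_file(path)
    start_ms = len(combined)          # pydub lengths are in milliseconds
    combined += clip
    cues.append((start_ms, start_ms + len(clip), text))

combined.export("fused.mp3", format="mp3")
```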
For existing long-form audio with SRT/VTT transcripts, the tool:
- Segments audio into 30-second chunks (Whisper's optimal input length)
- Preserves timestamp information in the output format (`<|0.00|>text<|1.50|>`)
- Handles overlapping utterances and invalid timestamps
- Supports optional trimming of initial silence (`cut_initial_audio`)
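To make the chunk format concrete, here is a small hypothetical helper showing how a caption's absolute times could map to Whisper-style timestamp tokens relative to a 30-second chunk (the package's actual code may differ):

```python
# Hypothetical helper: map a caption's absolute times to timestamp tokens
# relative to the 30-second chunk it falls into.
def to_token(seconds: float) -> str:
    # Whisper timestamps use 20 ms steps
    return f"<|{round(seconds / 0.02) * 0.02:.2f}|>"

chunk_start = 30.0                                  # chunk begins 30 s into the recording
cue_start, cue_end, text = 31.2, 33.74, "Hello world"
print(f"{to_token(cue_start - chunk_start)} {text} {to_token(cue_end - chunk_start)}")
# -> <|1.20|> Hello world <|3.74|>
```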
Optionally merge consecutive SRT captions to meet Netflix subtitle guidelines:
- Maximum 42 characters per caption
- Maximum 7 seconds duration
- Configurable skip words to prevent merging specific content
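A toy sketch of the merge rule under these limits, using plain (start, end, text) tuples instead of real SRT cues; the actual implementation is more complete:

```python
# Merge consecutive cues while the combined caption stays within the limits.
MAX_CHARS, MAX_SECONDS = 42, 7.0

def can_merge(a, b, skip_words=()):
    text = f"{a[2]} {b[2]}"
    too_long = len(text) > MAX_CHARS or (b[1] - a[0]) > MAX_SECONDS
    skipped = any(w in a[2] or w in b[2] for w in skip_words)
    return not too_long and not skipped

cues = [(0.0, 1.2, "Hello"), (1.3, 2.4, "there,"),
        (2.5, 9.9, "this caption is already long enough")]
merged = [cues[0]]
for cue in cues[1:]:
    if can_merge(merged[-1], cue):
        prev = merged.pop()
        merged.append((prev[0], cue[1], f"{prev[2]} {cue[2]}"))
    else:
        merged.append(cue)
print(merged)  # first two cues merge; the long one stays separate
```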
Automatic filtering of problematic samples:
- High compression ratio (> 2.4) indicating repetitive/garbage text
- Too few words (≤ 8 words)
- French language detection and filtering
- Custom word filtering via the `filter_words` configuration
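A minimal sketch of these checks (language detection omitted). The compression ratio here follows Whisper's convention of text bytes divided by zlib-compressed bytes, which is an assumption about this package's exact formula:

```python
# Illustrative filter: drop repetitive, too-short, or keyword-flagged samples.
import zlib

def compression_ratio(text: str) -> float:
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def keep(text: str, filter_words=("[MUSIC]", "[NOISE]")) -> bool:
    if compression_ratio(text) > 2.4:          # repetitive / garbage text
        return False
    if len(text.split()) <= 8:                 # too few words
        return False
    if any(w in text for w in filter_words):   # unwanted markers
        return False
    return True

print(keep("la la la " * 20))                  # False: highly compressible
```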
Create a .tsv file with columns:
| Column | Required | Description |
|---|---|---|
| `path` | Yes | Relative path to the .mp3 file |
| `sentence` | Yes | The text corresponding to the audio |
| `client_id` | No | Speaker ID (increases probability of consecutive same-speaker utterances) |
Create a .tsv file with columns:
| Column | Required | Description |
|---|---|---|
| `srt_path` | Yes | Path to the .srt or .vtt file |
| `audio_path` | Yes | Path to the corresponding audio file |
| `language` | Yes | ISO language code (e.g., de, en) |
| `id` | No | Unique identifier (defaults to audio filename) |
Use this with the `transcripts_tsv` config option.
Place audio files in one folder and matching SRT/VTT files in another folder (with the same stem name). The tool will automatically match them.
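The pairing can be reproduced with a few lines of pathlib, shown here only to clarify the matching rule (folder names are placeholders):

```python
# Pair audio files with subtitle files that share the same stem.
from pathlib import Path

audio_dir, srt_dir = Path("audio"), Path("subtitles")
pairs = []
for audio_file in sorted(audio_dir.glob("*.mp3")):
    for ext in (".srt", ".vtt"):
        candidate = srt_dir / (audio_file.stem + ext)
        if candidate.exists():
            pairs.append((audio_file, candidate))
            break
print(pairs)
```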
Specify dataset identifiers via `hu_datasets`. Supports:
- Datasets with `audio` and `srt` columns (processed directly)
- Datasets with `audio` and `sentence`/`text` columns (generates synthetic SRTs)
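Which path a configured dataset takes depends on its columns; a quick, hypothetical check with the datasets library:

```python
# Inspect a dataset's columns to see how it would be handled (identifier is a placeholder).
from datasets import load_dataset

ds = load_dataset("username/dataset", split="train")
if "srt" in ds.column_names:
    print("processed directly from existing SRTs")
elif {"sentence", "text"} & set(ds.column_names):
    print("synthetic SRTs will be generated")
```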
Set up a .yaml configuration file. See example.yaml for a complete example.
# Output structure: out_folder_base/dataset_name/split_name
dataset_name: my_dataset
split_name: train
out_folder_base: /path/to/output
# Data Sources (choose one or more)
tsv_paths: ["data/sentences.tsv"] # Sentence-level TSV files
clips_folders: ["data/clips"] # Folders containing audio clips
partials: [1.0] # Proportion of each dataset to use
transcripts_tsv: "data/transcripts.tsv" # TSV mapping SRTs to audio files
hu_datasets: ["username/dataset"] # HuggingFace dataset identifiersmaintain_speaker_chance: 0.5 # Probability of keeping same speaker
n_samples_per_srt: 16 # Number of sentences per generated SRT
normalize_text: true # Apply text normalization rules
# Audio overlap settings (for realistic speech)
overlap_chance: 0.5 # Probability of overlap between clips
max_overlap_chance: 0.2 # Probability of maximum overlap
max_overlap_duration: 0.2 # Max overlap duration in seconds

netflix_normalize: true # Apply Netflix-style caption merging
cut_initial_audio: true # Trim audio to 1 second before first subtitle
filter_french: true # Remove French language samples
filter_words: ["[MUSIC]", "[NOISE]"] # Remove samples containing these wordsupload_to_hu: true
hu_repo: "username/repo_name"
hu_private: true

whisper_prep -c config.yaml

This will:
- Download HuggingFace datasets (if configured)
- Generate synthetic SRTs from sentences OR process existing SRTs
- Apply Netflix normalization (if enabled)
- Segment audio into 30-second chunks with timestamps
- Filter problematic samples
- Convert to HuggingFace dataset format
- Upload to HuggingFace Hub (if configured)
To normalize SRT files in a folder without running the full pipeline:
from whisper_prep.utils import netflix_normalize_all_srts_in_folder
# Normalize all SRTs in a folder
netflix_normalize_all_srts_in_folder("/path/to/srt/folder")
# With skip words (cues containing these won't be merged)
netflix_normalize_all_srts_in_folder("/path/to/srt/folder", skip_words=["[MUSIC]"])Upload a TSV as ASR Dataset:
python upload_asr_dataset.py --tsv path/to/data.tsv \
--repo_id username/dataset_name --split train

Upload to HuggingFace Hub: See https://huggingface.co/docs/datasets/v1.16.0/upload_dataset.html
After running the pipeline, the output folder contains:
out_folder_base/dataset_name/split_name/
├── audios/ # Downloaded/generated audio files
├── transcripts/ # Downloaded/generated SRT files
├── created_dataset/
│ ├── data.ljson # Processed records (JSON lines)
│ └── dump/ # 30-second audio segments
│ └── <audio_id>/
│ ├── 0.mp3 # Segment starting at 0ms
│ ├── 30000.mp3 # Segment starting at 30000ms
│ └── ...
├── hf/ # HuggingFace dataset format
├── bad_examples.csv # Filtered high-compression/short samples
├── french_examples.csv # Filtered French samples (if enabled)
└── filtered_<word>_examples.csv # Filtered samples by word
Each output record contains:
- `audio_path`: Path to the 30-second audio segment
- `text`: Transcription with timestamps (e.g., `<|0.00|> Hello world <|1.50|>`)
- `language`: ISO language code
- `prompt`: Previous text for context (from prior segments)
Timestamps are quantized to 20ms resolution (Whisper's native resolution).
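To inspect the processed records, data.ljson can be read line by line; the snippet below only assumes the field names listed above (path is relative to the output split folder):

```python
# Peek at the first processed record in data.ljson (JSON lines format).
import json

with open("created_dataset/data.ljson") as f:
    record = json.loads(next(f))
print(record["audio_path"])   # path to a 30-second segment
print(record["text"])         # e.g. <|0.00|> Hello world <|1.50|>
print(record["language"], record.get("prompt", ""))
```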
- All audio is resampled to 16kHz mono
- Segments are saved as MP3 format
- Maximum segment duration: 30 seconds
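For reference, the same conventions expressed with pydub (illustrative, not the package's code):

```python
# Resample to 16 kHz mono, cap at 30 seconds, and save as MP3.
from pydub import AudioSegment

audio = AudioSegment.from_file("input.wav")
audio = audio.set_frame_rate(16000).set_channels(1)
audio = audio[: 30 * 1000]                      # pydub slices in milliseconds
audio.export("segment.mp3", format="mp3")
```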
The examples/ folder contains two complete configuration examples for the two main use cases.
File: examples/config_sentence_fusion.yaml
Use this when you have short sentence-level audio clips (like Common Voice) and want to combine them into longer, more realistic training data.
Workflow:
Input: 16 short audio clips (each 2-3s) + transcriptions
↓
[Sentence Concatenation]
↓
Output: 1 long audio file (30-60s) + matched SRT with timestamps
Key Features:
- Maintains speaker consistency (can force consecutive utterances from same speaker)
- Applies Voice Activity Detection (VAD) to find natural overlap points
- Generates synthetic SRT files with accurate timestamps
- Normalizes text (removes URLs, extra spaces, etc.)
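As an example of the text normalization step, a simplified version of the URL and whitespace cleanup might look like this (the package's actual rules may differ):

```python
# Toy normalization: strip URLs and collapse repeated whitespace.
import re

def normalize(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(normalize("Visit  https://example.com   for more   info"))  # "Visit for more info"
```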
Quick Start:
# Edit the configuration
cp examples/config_sentence_fusion.yaml my_config.yaml
# Update: tsv_paths, clips_folders, dataset_name
# Run
whisper_prep -c my_config.yaml

Example TSV format (sentences.tsv):
path sentence client_id
clips/abc123.mp3 The quick brown fox speaker_001
clips/def456.mp3 jumps over the lazy dog speaker_001
clips/ghi789.mp3 in the green forest speaker_002
File: examples/config_srt_splitting.yaml
Use this when you have long-form audio with existing SRT/VTT subtitles (like movies, podcasts, audiobooks) and want to segment them into Whisper-compatible chunks.
Workflow:
Input: movie_001.mp3 (2 hours) + movie_001.srt (with timestamps)
↓
[Netflix Normalization] ← optional
↓
[Audio Segmentation]
↓
Output: 240 × 30-second segments with timestamp tokens
movie_001/0.mp3, movie_001/30000.mp3, movie_001/60000.mp3, ...
Key Features:
- Automatically segments into 30-second chunks (optimal for Whisper)
- Preserves timestamps in the output format: `<|0.00|> Text <|1.50|>`
- Netflix normalization merges captions (42 chars max, 7 seconds max)
- Handles overlapping subtitles and invalid timestamps
- Filters problematic content (high compression, noise markers, etc.)
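To illustrate the overlap and invalid-timestamp handling in the simplest possible terms (the real logic is more involved):

```python
# Drop cues with invalid timestamps and clip overlaps with the previous cue.
def clean(cues):
    cleaned = []
    for start, end, text in sorted(cues):
        if end <= start:                      # invalid timestamp
            continue
        if cleaned and start < cleaned[-1][1]:
            start = cleaned[-1][1]            # clip overlap with previous cue
            if start >= end:
                continue
        cleaned.append((start, end, text))
    return cleaned

print(clean([(0.0, 2.0, "a"), (1.5, 3.0, "b"), (5.0, 4.0, "bad")]))
# -> [(0.0, 2.0, 'a'), (2.0, 3.0, 'b')]
```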
Quick Start:
# Edit the configuration
cp examples/config_srt_splitting.yaml my_config.yaml
# Update: transcripts_tsv, dataset_name, filter_words
# Run
whisper_prep -c my_config.yaml

Example TSV format (transcripts_mapping.tsv):
srt_path audio_path language id
subtitles/movie_001.srt audio/movie_001.mp3 de movie_001
subtitles/podcast_002.srt audio/podcast_002.mp3 en podcast_002
Or alternatively, create a folder structure:
audio/
├── movie_001.mp3
└── podcast_002.mp3
subtitles/
├── movie_001.srt
└── podcast_002.srt
After running whisper_prep, the dataset is automatically in HuggingFace format. Upload it with:
# Enable in config.yaml:
upload_to_hu: true
hu_repo: "username/my_dataset"
hu_private: true
# Then run:
whisper_prep -c config.yaml

Or manually push an existing dataset:
huggingface-cli upload username/my_dataset ./datasets/my_dataset/train/hf/ --repo-type dataset

The upload_asr_dataset.py script converts a simple TSV file directly to HuggingFace format and uploads it.
Usage:
python upload_asr_dataset.py \
--tsv path/to/data.tsv \
--repo_id username/dataset_name \
--split train \
[--sampling_rate 16000]

Required TSV Columns:
- `path`: Path to audio file (MP3, WAV, FLAC, OGG, etc.)
- `sentence` or `text`: Transcription text
- Optional: `srt_path` (for SRT subtitles)
Example:
# Create data.tsv:
# path sentence
# clips/abc.mp3 The quick brown fox
# clips/def.mp3 Jumps over the lazy dog
python upload_asr_dataset.py \
--tsv data.tsv \
--repo_id myuser/my_asr_data \
--split train

Features:
- Automatically loads audio files
- Casts audio to 16kHz sampling rate (or custom)
- Handles SRT files if an `srt_path` column exists
- Drops path columns (keeps only audio data)
- Requires authentication: `huggingface-cli login`
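Conceptually, the script does something like the following with the datasets library (illustrative, not its actual code):

```python
# Sketch of the conversion: read the TSV, attach audio at 16 kHz, and push to the Hub.
# Requires `huggingface-cli login` beforehand; the repo ID matches the example above.
from datasets import Dataset, Audio
import pandas as pd

df = pd.read_csv("data.tsv", sep="\t")
ds = Dataset.from_pandas(df)
ds = ds.cast_column("path", Audio(sampling_rate=16000))  # decode audio from the path column
ds = ds.rename_column("path", "audio")
ds.push_to_hub("myuser/my_asr_data", split="train", private=True)
```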
Command-line Options:
--tsv PATH # Path to input TSV file (required)
--repo_id REPO_ID # HuggingFace repo ID (required)
--split SPLIT # Dataset split name (default: train)
--audio_column COLUMN # TSV column name for audio paths (default: path)
--sampling_rate RATE # Target sampling rate (default: 16000)

Vincenzo Timmel - [email protected]
Distributed under the MIT License. See LICENSE for more information.