VoxBox

A large-scale speech corpus introduced in Spark-TTS, built from diverse open-source datasets for training text-to-speech (TTS) systems.

📊 VoxBox Dataset Overview

Chinese Datasets

| Data | Language | #Utterance | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|---|
| AISHELL-3 | Chinese | 88,035 | 16.01 | 69.61 | 85.62 |
| CASIA | Chinese | 857 | 0.25 | 0.2 | 0.44 |
| Emilia-CN | Chinese | 15,629,241 | 22,017.56 | 12,741.89 | 34,759.45 |
| ESD | Chinese | 16,101 | 6.69 | 7.68 | 14.37 |
| HQ-Conversations | Chinese | 50,982 | 35.77 | 64.23 | 100 |
| M3ED | Chinese | 253 | 0.04 | 0.06 | 0.1 |
| MAGICDATA | Chinese | 609,474 | 360.31 | 393.81 | 754.13 |
| MER2023 | Chinese | 1,667 | 0.86 | 1.07 | 1.93 |
| NCSSD-CL-CN | Chinese | 98,628 | 53.83 | 59.21 | 113.04 |
| NCSSD-RC-CN | Chinese | 21,688 | 7.05 | 22.53 | 29.58 |
| WenetSpeech4TTS | Chinese | 8,856,480 | 7,504.19 | 4,264.3 | 11,768.49 |
| Total (Chinese) | | 25,373,406 | 30,002.56 | 17,624.59 | 47,627.15 |

English Datasets

| Data | Language | #Utterance | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|---|
| CREMA-D | English | 809 | 0.3 | 0.27 | 0.57 |
| Dailytalk | English | 23,754 | 10.79 | 10.86 | 21.65 |
| Emilia-EN | English | 8,303,103 | 13,724.76 | 6,573.22 | 20,297.98 |
| EMNS | English | 918 | 0 | 1.49 | 1.49 |
| EmoV-DB | English | 3,647 | 2.22 | 2.79 | 5 |
| Expresso | English | 11,595 | 5.47 | 5.39 | 10.86 |
| Gigaspeech | English | 6,619,339 | 4,310.19 | 2,885.66 | 7,195.85 |
| Hi-Fi TTS | English | 323,911 | 133.31 | 158.38 | 291.68 |
| IEMOCAP | English | 2,423 | 1.66 | 1.31 | 2.97 |
| JL-Corpus | English | 893 | 0.26 | 0.26 | 0.52 |
| Librispeech | English | 230,865 | 393.95 | 367.67 | 761.62 |
| LibriTTS-R | English | 363,270 | 277.87 | 283.03 | 560.9 |
| MEAD | English | 3,767 | 2.26 | 2.42 | 4.68 |
| MELD | English | 5,100 | 2.14 | 1.94 | 4.09 |
| MLS-English | English | 6,319,002 | 14,366.25 | 11,212.92 | 25,579.18 |
| MSP-Podcast | English | 796 | 0.76 | 0.56 | 1.32 |
| NCSSD-CL-EN | English | 62,107 | 36.84 | 32.93 | 69.77 |
| NCSSD-RL-EN | English | 10,032 | 4.18 | 14.92 | 19.09 |
| RAVDESS | English | 950 | 0.49 | 0.48 | 0.97 |
| SAVEE | English | 286 | 0.15 | 0.15 | 0.31 |
| TESS | English | 1,956 | 0 | 1.15 | 1.15 |
| VCTK | English | 44,283 | 16.95 | 24.51 | 41.46 |
| Total (English) | | 22,332,806 | 33,290.8 | 21,582.31 | 54,873.11 |

Overall

| Data | #Utterance | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|
| Overall Total | 47,706,212 | 63,293.36 | 39,206.9 | 102,500.26 |
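
As a quick sanity check on the tables above, the following sketch re-adds the per-language hour totals and derives the male/female share of the corpus. The figures are copied from the "Total" rows; the names `totals`, `overall`, `summed`, and `share` are illustrative only.

```python
# Hours copied from the "Total (Chinese)", "Total (English)" and
# "Overall Total" rows of the tables above.
totals = {
    "Chinese": {"male": 30002.56, "female": 17624.59, "total": 47627.15},
    "English": {"male": 33290.80, "female": 21582.31, "total": 54873.11},
}
overall = {"male": 63293.36, "female": 39206.90, "total": 102500.26}


def summed(column):
    """Add one column across the per-language total rows."""
    return round(sum(row[column] for row in totals.values()), 2)


def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


if __name__ == "__main__":
    for column in ("male", "female", "total"):
        print(f"{column}: summed={summed(column):,.2f} reported={overall[column]:,.2f}")
    print(f"male share:   {share(overall['male'], overall['total']):.1f}%")
    print(f"female share: {share(overall['female'], overall['total']):.1f}%")
```

The per-language rows add up to the reported overall totals, and male speech accounts for a little under 62% of the total hours.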

Dataset Structure

The dataset is organized as follows:

.
├── audios/
│   ├── aishell-3/                      # Audio files (organized by sub-corpus)
│   └── ...
└── metadata/
    ├── aishell-3.jsonl
    ├── casia.jsonl
    ├── commonvoice_cn.jsonl
    ├── ...
    └── wenetspeech4tts.jsonl          # JSONL metadata files

Each JSONL file corresponds to a specific sub-corpus and contains metadata records for individual audio samples.

Metadata Format

Each line in the JSONL files is a JSON object detailing an audio sample. For example:

{
  "index": "VCTK_0000044280",
  "split": "train",
  "language": "en",
  "age": "Youth-Adult",
  "gender": "female",
  "emotion": "UNKNOWN",
  "pitch": 180.626,
  "pitch_std": 0.158,
  "speed": 4.2,
  "duration": 3.84,
  "speech_duration": 3.843,
  "syllable_num": 16,
  "text": "Clearly, the need for a personal loan is written in the stars.",
  "syllables": "K-L-IH1-R L-IY0 DH-AH0 N-IY1-D F-AO1 R-AH0 P-ER1 S-IH0 N-IH0-L L-OW1 N-IH1 Z-R-IH1 T-AH0 N-IH0-N DH-AH0-S T-AA1-R-Z",
  "wav_path": "vctk/VCTK_0000044280.flac"
}

Key fields include:

  • index: Unique identifier for the audio sample.
  • split: Dataset split (e.g., train, test).
  • language: Language of the audio sample (e.g., "en" for English, "zh" for Chinese).
  • age, gender, emotion: Speaker attributes.
  • pitch, pitch_std, speed: Acoustic features.
  • duration: Total duration of the audio sample in seconds.
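  • speech_duration: Duration of the audio sample in seconds, excluding silence at both ends.
  • syllable_num: Number of syllables in the utterance.
  • text: Transcription of the utterance.
  • syllables: Syllable-level transcription. In the English example above, each space-separated group is one syllable, written as hyphen-joined ARPAbet phonemes with stress digits.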
  • wav_path: Relative path to the audio file within the dataset.
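
As a minimal sketch of how these records might be consumed, the snippet below reads a metadata file line by line and sums `duration` per `gender`. Only the field names come from the format described above; the helper names `iter_records` and `hours_by_gender` and the example path are assumptions for illustration, not part of the VoxBox tooling.

```python
import json
from collections import defaultdict
from pathlib import Path


def iter_records(jsonl_path):
    """Yield one dict per non-empty line of a JSONL metadata file."""
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def hours_by_gender(records):
    """Sum the `duration` field (seconds) per `gender` value, in hours."""
    seconds = defaultdict(float)
    for rec in records:
        seconds[rec.get("gender", "UNKNOWN")] += float(rec["duration"])
    return {gender: total / 3600.0 for gender, total in seconds.items()}


if __name__ == "__main__":
    # Hypothetical location: adjust to wherever the metadata was downloaded.
    path = Path("voxbox_subset/metadata/casia.jsonl")
    if path.exists():
        stats = hours_by_gender(iter_records(path))
        for gender, hours in sorted(stats.items()):
            print(f"{gender}: {hours:.2f} h")
```

In the example record, `syllable_num / speech_duration` is 16 / 3.843 ≈ 4.16, close to the stored `speed` of 4.2. This suggests that `speed` is measured in syllables per second, although the README does not state the unit.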

📥 Download Data

You can download the VoxBox dataset via the Hugging Face Datasets Hub:

1️⃣ Download the Full Dataset

You can clone the entire dataset repository (metadata + all audio files):

git lfs install
git clone https://huggingface.co/datasets/SparkAudio/voxbox

⚠️ The full dataset is large (5.82 TB); downloading it takes considerable time and storage.

2️⃣ Download Specific Subsets

from huggingface_hub import HfApi, hf_hub_download

# ✅ Specify the subsets you want to download
target_subsets = ['casia', 'cremad', 'emns']

REPO_ID = "SparkAudio/voxbox"
REPO_TYPE = "dataset"

api = HfApi()
dataset_info = api.dataset_info(repo_id=REPO_ID)

# Get all available file paths (rfilename)
all_paths = [s.rfilename for s in dataset_info.siblings]

for subset in target_subsets:
    print(f"\n🔽 Downloading subset: {subset}")

    # Download metadata file
    metadata_path = f"metadata/{subset}.jsonl"
    if metadata_path in all_paths:
        print(f"📄 Metadata found: {metadata_path}")
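        # With local_dir set, recent huggingface_hub releases write real files
        # there; the deprecated local_dir_use_symlinks argument is omitted.
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=metadata_path,
            local_dir="./voxbox_subset",
        )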
    else:
        print(f"⚠️ Metadata not found: {metadata_path}")

    # Match all audio tar.gz files for the subset
    audio_tars = [f for f in all_paths if f.startswith(f"audios/{subset}/") and f.endswith(".tar.gz")]
    if not audio_tars:
        print(f"⚠️ No audio files found for {subset}")
        continue

    for tar_file in audio_tars:
        print(f"🎧 Downloading audio: {tar_file}")
        # The deprecated local_dir_use_symlinks argument is omitted; recent
        # huggingface_hub releases write real files into local_dir.
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=tar_file,
            local_dir="./voxbox_subset",
        )
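
The same selection can probably be expressed more compactly with `snapshot_download` and its `allow_patterns` filter. The sketch below is an alternative under that assumption, not the authors' script. The helpers `subset_patterns` and `download_subsets` are hypothetical names, and the patterns assume the `metadata/<subset>.jsonl` and `audios/<subset>/*.tar.gz` layout used by the script above.

```python
from fnmatch import fnmatch


def subset_patterns(subsets):
    """Build glob patterns matching the metadata and audio archives of each subset."""
    patterns = []
    for name in subsets:
        patterns.append(f"metadata/{name}.jsonl")
        patterns.append(f"audios/{name}/*.tar.gz")
    return patterns


def download_subsets(subsets, local_dir="./voxbox_subset"):
    """Download only the files matching the requested subsets."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="SparkAudio/voxbox",
        repo_type="dataset",
        allow_patterns=subset_patterns(subsets),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Preview which paths the patterns would accept before downloading anything.
    example = "audios/casia/casia_0000.tar.gz"
    matched = any(fnmatch(example, p) for p in subset_patterns(["casia"]))
    print(f"{example} matched: {matched}")
    # download_subsets(["casia", "cremad", "emns"])  # uncomment to download
```

This variant does not report missing subsets the way the explicit loop does, so the loop remains the better choice when that feedback matters.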

✅ Directory Structure of the Downloaded Results

voxbox_subset/
├── audios
│   ├── casia
│   │   └── casia_0000.tar.gz
│   ├── cremad
│   │   └── cremad_0000.tar.gz
│   └── emns
│       └── emns_0000.tar.gz
└── metadata
    ├── casia.jsonl
    ├── cremad.jsonl
    └── emns.jsonl
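
The audio arrives as `.tar.gz` archives, so it has to be unpacked before the `wav_path` entries in the metadata can be resolved. The following is a minimal sketch, assuming each archive should be extracted next to itself. The function name `extract_archives` is hypothetical, and the internal layout of the archives is not documented here, so check where the files land relative to `wav_path` after extraction.

```python
import tarfile
from pathlib import Path


def extract_archives(root="voxbox_subset"):
    """Extract every audios/<subset>/*.tar.gz under `root` into its own folder.

    Returns the list of archives that were extracted.
    """
    extracted = []
    for archive in sorted(Path(root).glob("audios/*/*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Only extract archives from a source you trust. On Python 3.12+
            # you may want to pass filter="data" for safer extraction.
            tar.extractall(path=archive.parent)
        extracted.append(archive)
    return extracted


if __name__ == "__main__":
    for archive in extract_archives():
        print(f"extracted {archive}")
```

Extraction roughly doubles the disk space in use until the archives are deleted, which matters for the larger sub-corpora.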

Label Your Own Data

python -m tools.annotation \
    --audio_path 'path to the audio' \
    --text 'transcription of the audio'

License

Please refer to the original licenses of each sub-corpus. This dataset aggregates and annotates the metadata in a unified structure for research purposes.

Citation

If you use this dataset in your research, please consider citing:

@article{wang2025spark,
  title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
  author={Wang, Xinsheng and Jiang, Mingqi and Ma, Ziyang and Zhang, Ziyu and Liu, Songxiang and Li, Linqin and Liang, Zheng and Zheng, Qixi and Wang, Rui and Feng, Xiaoqin and others},
  journal={arXiv preprint arXiv:2503.01710},
  year={2025}
}
