VoxBox is a large-scale speech corpus introduced with Spark-TTS, built from diverse open-source datasets for training text-to-speech (TTS) systems.
| Data | Language | #Utterances | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|---|
| AISHELL-3 | Chinese | 88,035 | 16.01 | 69.61 | 85.62 |
| CASIA | Chinese | 857 | 0.25 | 0.2 | 0.44 |
| Emilia-CN | Chinese | 15,629,241 | 22,017.56 | 12,741.89 | 34,759.45 |
| ESD | Chinese | 16,101 | 6.69 | 7.68 | 14.37 |
| HQ-Conversations | Chinese | 50,982 | 35.77 | 64.23 | 100 |
| M3ED | Chinese | 253 | 0.04 | 0.06 | 0.1 |
| MAGICDATA | Chinese | 609,474 | 360.31 | 393.81 | 754.13 |
| MER2023 | Chinese | 1,667 | 0.86 | 1.07 | 1.93 |
| NCSSD-CL-CN | Chinese | 98,628 | 53.83 | 59.21 | 113.04 |
| NCSSD-RC-CN | Chinese | 21,688 | 7.05 | 22.53 | 29.58 |
| WenetSpeech4TTS | Chinese | 8,856,480 | 7,504.19 | 4,264.3 | 11,768.49 |
| **Total (Chinese)** | | 25,373,406 | 30,002.56 | 17,624.59 | 47,627.15 |
| Data | Language | #Utterances | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|---|
| CREMA-D | English | 809 | 0.3 | 0.27 | 0.57 |
| Dailytalk | English | 23,754 | 10.79 | 10.86 | 21.65 |
| Emilia-EN | English | 8,303,103 | 13,724.76 | 6,573.22 | 20,297.98 |
| EMNS | English | 918 | 0 | 1.49 | 1.49 |
| EmoV-DB | English | 3,647 | 2.22 | 2.79 | 5 |
| Expresso | English | 11,595 | 5.47 | 5.39 | 10.86 |
| Gigaspeech | English | 6,619,339 | 4,310.19 | 2,885.66 | 7,195.85 |
| Hi-Fi TTS | English | 323,911 | 133.31 | 158.38 | 291.68 |
| IEMOCAP | English | 2,423 | 1.66 | 1.31 | 2.97 |
| JL-Corpus | English | 893 | 0.26 | 0.26 | 0.52 |
| Librispeech | English | 230,865 | 393.95 | 367.67 | 761.62 |
| LibriTTS-R | English | 363,270 | 277.87 | 283.03 | 560.9 |
| MEAD | English | 3,767 | 2.26 | 2.42 | 4.68 |
| MELD | English | 5,100 | 2.14 | 1.94 | 4.09 |
| MLS-English | English | 6,319,002 | 14,366.25 | 11,212.92 | 25,579.18 |
| MSP-Podcast | English | 796 | 0.76 | 0.56 | 1.32 |
| NCSSD-CL-EN | English | 62,107 | 36.84 | 32.93 | 69.77 |
| NCSSD-RL-EN | English | 10,032 | 4.18 | 14.92 | 19.09 |
| RAVDESS | English | 950 | 0.49 | 0.48 | 0.97 |
| SAVEE | English | 286 | 0.15 | 0.15 | 0.31 |
| TESS | English | 1,956 | 0 | 1.15 | 1.15 |
| VCTK | English | 44,283 | 16.95 | 24.51 | 41.46 |
| **Total (English)** | | 22,332,806 | 33,290.8 | 21,582.31 | 54,873.11 |
| Data | #Utterances | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|
| **Overall Total** | 47,706,212 | 63,293.36 | 39,206.9 | 102,500.26 |
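These totals can be sanity-checked from the metadata alone. Below is a minimal sketch, assuming the JSONL metadata files have been downloaded to `voxbox_subset/metadata/` as described later in this card, and assuming per-gender hours are summed over the `duration` field (an assumption; the published totals may instead use `speech_duration`):

```python
import json
from pathlib import Path

def corpus_hours(jsonl_path):
    """Sum per-gender hours from one VoxBox metadata file."""
    hours = {"male": 0.0, "female": 0.0}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            gender = rec.get("gender", "").lower()
            if gender in hours:
                hours[gender] += rec["duration"] / 3600.0
    return hours

# Print one line per sub-corpus, mirroring the tables above.
for path in sorted(Path("voxbox_subset/metadata").glob("*.jsonl")):
    h = corpus_hours(path)
    print(f"{path.stem}: male {h['male']:.2f} h, female {h['female']:.2f} h")
```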
The dataset is organized as follows:

```
.
├── audios/                    # Audio files (organized by sub-corpus)
│   ├── aishell-3/
│   └── ...
└── metadata/                  # JSONL metadata files, one per sub-corpus
    ├── aishell-3.jsonl
    ├── casia.jsonl
    ├── commonvoice_cn.jsonl
    ├── ...
    └── wenetspeech4tts.jsonl
```
Each JSONL file corresponds to one sub-corpus, and each of its lines is a JSON object describing a single audio sample. For example:
```json
{
  "index": "VCTK_0000044280",
  "split": "train",
  "language": "en",
  "age": "Youth-Adult",
  "gender": "female",
  "emotion": "UNKNOWN",
  "pitch": 180.626,
  "pitch_std": 0.158,
  "speed": 4.2,
  "duration": 3.84,
  "speech_duration": 3.843,
  "syllable_num": 16,
  "text": "Clearly, the need for a personal loan is written in the stars.",
  "syllables": "K-L-IH1-R L-IY0 DH-AH0 N-IY1-D F-AO1 R-AH0 P-ER1 S-IH0 N-IH0-L L-OW1 N-IH1 Z-R-IH1 T-AH0 N-IH0-N DH-AH0-S T-AA1-R-Z",
  "wav_path": "vctk/VCTK_0000044280.flac"
}
```
Key fields include:

- `index`: Unique identifier for the audio sample.
- `split`: Dataset split (e.g., train, test).
- `language`: Language of the audio sample (e.g., "en" for English, "zh" for Chinese).
- `age`, `gender`, `emotion`: Speaker attributes.
- `pitch`, `pitch_std`, `speed`: Acoustic features.
- `duration`: Total duration of the audio sample in seconds.
- `speech_duration`: Duration excluding silence at both ends.
- `syllable_num`: Number of syllables in the utterance.
- `text`: Transcription of the utterance.
- `syllables`: Syllable-level transcription.
- `wav_path`: Relative path to the audio file within the dataset.
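For illustration, here is a minimal sketch for reading one of these metadata files and filtering records on the fields above (the file path and thresholds are arbitrary examples, not part of the dataset):

```python
import json

def load_metadata(jsonl_path):
    """Yield one metadata record per line of a VoxBox JSONL file."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: keep female utterances with 3-10 seconds of actual speech.
selected = [
    rec for rec in load_metadata("metadata/vctk.jsonl")
    if rec["gender"] == "female" and 3.0 <= rec["speech_duration"] <= 10.0
]
print(f"{len(selected)} samples selected")
```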
You can download the VoxBox dataset from the Hugging Face Hub.

To clone the entire dataset repository (metadata + all audio files):

```bash
git lfs install
git clone https://huggingface.co/datasets/SparkAudio/voxbox
```
Alternatively, you can download only the subsets you need with `huggingface_hub`:

```python
from huggingface_hub import HfApi, hf_hub_download

# ✅ Specify the subsets you want to download
target_subsets = ['casia', 'cremad', 'emns']

REPO_ID = "SparkAudio/voxbox"
REPO_TYPE = "dataset"

api = HfApi()
dataset_info = api.dataset_info(repo_id=REPO_ID)

# Get all available file paths (rfilename)
all_paths = [s.rfilename for s in dataset_info.siblings]

for subset in target_subsets:
    print(f"\n🔽 Downloading subset: {subset}")

    # Download the metadata file
    metadata_path = f"metadata/{subset}.jsonl"
    if metadata_path in all_paths:
        print(f"📄 Metadata found: {metadata_path}")
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=metadata_path,
            local_dir="./voxbox_subset",
            local_dir_use_symlinks=False,
        )
    else:
        print(f"⚠️ Metadata not found: {metadata_path}")

    # Match all audio tar.gz archives for the subset
    audio_tars = [f for f in all_paths if f.startswith(f"audios/{subset}/") and f.endswith(".tar.gz")]
    if not audio_tars:
        print(f"⚠️ No audio files found for {subset}")
        continue

    for tar_file in audio_tars:
        print(f"🎧 Downloading audio: {tar_file}")
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=tar_file,
            local_dir="./voxbox_subset",
            local_dir_use_symlinks=False,
        )
```
✅ Directory structure of the downloaded results:

```
voxbox_subset/
├── audios
│   ├── casia
│   │   └── casia_0000.tar.gz
│   ├── cremad
│   │   └── cremad_0000.tar.gz
│   └── emns
│       └── emns_0000.tar.gz
└── metadata
    ├── casia.jsonl
    ├── cremad.jsonl
    └── emns.jsonl
```
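The downloaded audio archives still need to be unpacked before use. Below is a minimal sketch using Python's standard `tarfile` module; extracting each archive next to where it was saved is an assumption, so check the archives' internal layout on your copy:

```python
import tarfile
from pathlib import Path

audio_root = Path("voxbox_subset/audios")

# Extract every downloaded .tar.gz archive in place, next to the archive file.
for archive in audio_root.rglob("*.tar.gz"):
    print(f"Extracting {archive}")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=archive.parent)
```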
Metadata annotations in this format can be generated for new audio with the annotation tool:

```bash
python -m tools.annotation \
    --audio_path 'path to the audio' \
    --text 'transcription of the audio'
```
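For many files, the CLI can be driven in a loop. Here is a minimal sketch using `subprocess`; the (audio, transcript) pairs are hypothetical placeholders:

```python
import subprocess
import sys

# Hypothetical (audio, transcript) pairs to annotate in sequence.
samples = [
    ("clips/utt_0001.wav", "Hello there."),
    ("clips/utt_0002.wav", "How are you today?"),
]

for audio_path, text in samples:
    subprocess.run(
        [
            sys.executable, "-m", "tools.annotation",
            "--audio_path", audio_path,
            "--text", text,
        ],
        check=True,  # raise if the annotation tool exits with an error
    )
```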
Please refer to the original license of each sub-corpus. This dataset aggregates the sub-corpora and annotates their metadata in a unified structure for research purposes.
If you use this dataset in your research, please consider citing:
```bibtex
@article{wang2025spark,
  title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
  author={Wang, Xinsheng and Jiang, Mingqi and Ma, Ziyang and Zhang, Ziyu and Liu, Songxiang and Li, Linqin and Liang, Zheng and Zheng, Qixi and Wang, Rui and Feng, Xiaoqin and others},
  journal={arXiv preprint arXiv:2503.01710},
  year={2025}
}
```