VoxBox is a large-scale speech corpus introduced with Spark-TTS, built from diverse open-source datasets for training text-to-speech (TTS) systems.
| Data | Language | #Utterances | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|---|
| AISHELL-3 | Chinese | 88,035 | 16.01 | 69.61 | 85.62 |
| CASIA | Chinese | 857 | 0.25 | 0.2 | 0.44 |
| Emilia-CN | Chinese | 15,629,241 | 22,017.56 | 12,741.89 | 34,759.45 |
| ESD | Chinese | 16,101 | 6.69 | 7.68 | 14.37 |
| HQ-Conversations | Chinese | 50,982 | 35.77 | 64.23 | 100 |
| M3ED | Chinese | 253 | 0.04 | 0.06 | 0.1 |
| MAGICDATA | Chinese | 609,474 | 360.31 | 393.81 | 754.13 |
| MER2023 | Chinese | 1,667 | 0.86 | 1.07 | 1.93 |
| NCSSD-CL-CN | Chinese | 98,628 | 53.83 | 59.21 | 113.04 |
| NCSSD-RC-CN | Chinese | 21,688 | 7.05 | 22.53 | 29.58 |
| WenetSpeech4TTS | Chinese | 8,856,480 | 7,504.19 | 4,264.3 | 11,768.49 |
| **Total (Chinese)** | | 25,373,406 | 30,002.56 | 17,624.59 | 47,627.15 |
| Data | Language | #Utterances | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|---|
| CREMA-D | English | 809 | 0.3 | 0.27 | 0.57 |
| Dailytalk | English | 23,754 | 10.79 | 10.86 | 21.65 |
| Emilia-EN | English | 8,303,103 | 13,724.76 | 6,573.22 | 20,297.98 |
| EMNS | English | 918 | 0 | 1.49 | 1.49 |
| EmoV-DB | English | 3,647 | 2.22 | 2.79 | 5 |
| Expresso | English | 11,595 | 5.47 | 5.39 | 10.86 |
| Gigaspeech | English | 6,619,339 | 4,310.19 | 2,885.66 | 7,195.85 |
| Hi-Fi TTS | English | 323,911 | 133.31 | 158.38 | 291.68 |
| IEMOCAP | English | 2,423 | 1.66 | 1.31 | 2.97 |
| JL-Corpus | English | 893 | 0.26 | 0.26 | 0.52 |
| Librispeech | English | 230,865 | 393.95 | 367.67 | 761.62 |
| LibriTTS-R | English | 363,270 | 277.87 | 283.03 | 560.9 |
| MEAD | English | 3,767 | 2.26 | 2.42 | 4.68 |
| MELD | English | 5,100 | 2.14 | 1.94 | 4.09 |
| MLS-English | English | 6,319,002 | 14,366.25 | 11,212.92 | 25,579.18 |
| MSP-Podcast | English | 796 | 0.76 | 0.56 | 1.32 |
| NCSSD-CL-EN | English | 62,107 | 36.84 | 32.93 | 69.77 |
| NCSSD-RL-EN | English | 10,032 | 4.18 | 14.92 | 19.09 |
| RAVDESS | English | 950 | 0.49 | 0.48 | 0.97 |
| SAVEE | English | 286 | 0.15 | 0.15 | 0.31 |
| TESS | English | 1,956 | 0 | 1.15 | 1.15 |
| VCTK | English | 44,283 | 16.95 | 24.51 | 41.46 |
| **Total (English)** | | 22,332,806 | 33,290.8 | 21,582.31 | 54,873.11 |
| Data | #Utterances | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|
| **Overall Total** | 47,706,212 | 63,293.36 | 39,206.9 | 102,500.26 |
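These totals can be sanity-checked from the metadata alone. Below is a minimal sketch, assuming the JSONL metadata files have been downloaded to `voxbox_subset/metadata/` as described later in this card, and assuming per-gender hours are summed over the `duration` field (an assumption; the published totals may instead use `speech_duration`):

```python
import json
from pathlib import Path

def corpus_hours(jsonl_path):
    """Sum per-gender hours from one VoxBox metadata file."""
    hours = {"male": 0.0, "female": 0.0}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            gender = rec.get("gender", "").lower()
            if gender in hours:
                hours[gender] += rec["duration"] / 3600.0
    return hours

# Print one line per sub-corpus, mirroring the tables above.
for path in sorted(Path("voxbox_subset/metadata").glob("*.jsonl")):
    h = corpus_hours(path)
    print(f"{path.stem}: male {h['male']:.2f} h, female {h['female']:.2f} h")
```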
The dataset is organized as follows:

```
.
├── audios/                    # Audio files (organized by sub-corpus)
│   ├── aishell-3/
│   └── ...
└── metadata/                  # JSONL metadata files, one per sub-corpus
    ├── aishell-3.jsonl
    ├── casia.jsonl
    ├── commonvoice_cn.jsonl
    ├── ...
    └── wenetspeech4tts.jsonl
```
Each JSONL file corresponds to one sub-corpus, and each of its lines is a JSON object describing a single audio sample. For example:
```json
{
  "index": "VCTK_0000044280",
  "split": "train",
  "language": "en",
  "age": "Youth-Adult",
  "gender": "female",
  "emotion": "UNKNOWN",
  "pitch": 180.626,
  "pitch_std": 0.158,
  "speed": 4.2,
  "duration": 3.84,
  "speech_duration": 3.843,
  "syllable_num": 16,
  "text": "Clearly, the need for a personal loan is written in the stars.",
  "syllables": "K-L-IH1-R L-IY0 DH-AH0 N-IY1-D F-AO1 R-AH0 P-ER1 S-IH0 N-IH0-L L-OW1 N-IH1 Z-R-IH1 T-AH0 N-IH0-N DH-AH0-S T-AA1-R-Z",
  "wav_path": "vctk/VCTK_0000044280.flac"
}
```
Key fields include:

- `index`: Unique identifier for the audio sample.
- `split`: Dataset split (e.g., train, test).
- `language`: Language of the audio sample (e.g., "en" for English, "zh" for Chinese).
- `age`, `gender`, `emotion`: Speaker attributes.
- `pitch`, `pitch_std`, `speed`: Acoustic features.
- `duration`: Total duration of the audio sample in seconds.
- `speech_duration`: Duration excluding silence at both ends.
- `syllable_num`: Number of syllables in the utterance.
- `text`: Transcription of the utterance.
- `syllables`: Syllable-level transcription.
- `wav_path`: Relative path to the audio file within the dataset.
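For illustration, here is a minimal sketch for reading one of these metadata files and filtering records on the fields above (the file path and thresholds are arbitrary examples, not part of the dataset):

```python
import json

def load_metadata(jsonl_path):
    """Yield one metadata record per line of a VoxBox JSONL file."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: keep female utterances with 3-10 seconds of actual speech.
selected = [
    rec for rec in load_metadata("metadata/vctk.jsonl")
    if rec["gender"] == "female" and 3.0 <= rec["speech_duration"] <= 10.0
]
print(f"{len(selected)} samples selected")
```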
You can download the VoxBox dataset from the Hugging Face Hub.

To clone the entire dataset repository (metadata + all audio files):

```bash
git lfs install
git clone https://huggingface.co/datasets/SparkAudio/voxbox
```
Alternatively, you can download only the subsets you need with `huggingface_hub`:

```python
from huggingface_hub import HfApi, hf_hub_download

# ✅ Specify the subsets you want to download
target_subsets = ['casia', 'cremad', 'emns']

REPO_ID = "SparkAudio/voxbox"
REPO_TYPE = "dataset"

api = HfApi()
dataset_info = api.dataset_info(repo_id=REPO_ID)

# Get all available file paths (rfilename)
all_paths = [s.rfilename for s in dataset_info.siblings]

for subset in target_subsets:
    print(f"\n🔽 Downloading subset: {subset}")

    # Download the metadata file
    metadata_path = f"metadata/{subset}.jsonl"
    if metadata_path in all_paths:
        print(f"📄 Metadata found: {metadata_path}")
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=metadata_path,
            local_dir="./voxbox_subset",
            local_dir_use_symlinks=False,
        )
    else:
        print(f"⚠️ Metadata not found: {metadata_path}")

    # Match all audio tar.gz archives for the subset
    audio_tars = [f for f in all_paths if f.startswith(f"audios/{subset}/") and f.endswith(".tar.gz")]
    if not audio_tars:
        print(f"⚠️ No audio files found for {subset}")
        continue

    for tar_file in audio_tars:
        print(f"🎧 Downloading audio: {tar_file}")
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=tar_file,
            local_dir="./voxbox_subset",
            local_dir_use_symlinks=False,
        )
```
✅ Directory structure of the downloaded results:

```
voxbox_subset/
├── audios
│   ├── casia
│   │   └── casia_0000.tar.gz
│   ├── cremad
│   │   └── cremad_0000.tar.gz
│   └── emns
│       └── emns_0000.tar.gz
└── metadata
    ├── casia.jsonl
    ├── cremad.jsonl
    └── emns.jsonl
```
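The downloaded audio archives still need to be unpacked before use. Below is a minimal sketch using Python's standard `tarfile` module; extracting each archive next to where it was saved is an assumption, so check the archives' internal layout on your copy:

```python
import tarfile
from pathlib import Path

audio_root = Path("voxbox_subset/audios")

# Extract every downloaded .tar.gz archive in place, next to the archive file.
for archive in audio_root.rglob("*.tar.gz"):
    print(f"Extracting {archive}")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=archive.parent)
```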
Metadata annotations in this format can be generated for new audio with the annotation tool:

```bash
python -m tools.annotation \
    --audio_path 'path to the audio' \
    --text 'transcription of the audio'
```
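For many files, the CLI can be driven in a loop. Here is a minimal sketch using `subprocess`; the (audio, transcript) pairs are hypothetical placeholders:

```python
import subprocess
import sys

# Hypothetical (audio, transcript) pairs to annotate in sequence.
samples = [
    ("clips/utt_0001.wav", "Hello there."),
    ("clips/utt_0002.wav", "How are you today?"),
]

for audio_path, text in samples:
    subprocess.run(
        [
            sys.executable, "-m", "tools.annotation",
            "--audio_path", audio_path,
            "--text", text,
        ],
        check=True,  # raise if the annotation tool exits with an error
    )
```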
Please refer to the original license of each sub-corpus. This dataset aggregates the sub-corpora and annotates their metadata in a unified structure for research purposes.
If you use this dataset in your research, please consider citing:
```bibtex
@article{wang2025spark,
  title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
  author={Wang, Xinsheng and Jiang, Mingqi and Ma, Ziyang and Zhang, Ziyu and Liu, Songxiang and Li, Linqin and Liang, Zheng and Zheng, Qixi and Wang, Rui and Feng, Xiaoqin and others},
  journal={arXiv preprint arXiv:2503.01710},
  year={2025}
}
```