A large-scale, cleaned and annotated speech dataset covering Hindi, Hinglish (Hindi–English code-switching), and Indian English — compiled from 14 public corpora and original custom recordings.
The full dataset (Parquet + audio) is hosted on Hugging Face: 👉 huggingface.co/datasets/agarwalayushi/hinglish
| Stat | Value |
|---|---|
| Total clips | 815,171 |
| Total duration | ~2,264 hours |
| Unique speakers | 6,304 |
| Raw audio size | ~237 GB |
| Languages | Hindi · Hinglish · Indian English |
| Audio format | WAV (native sample rates) |
| Dataset format | Parquet with embedded audio (HF-compatible) |
| Primary tasks | ASR · TTS fine-tuning · Voice cloning |

| Column | Type | Description |
|---|---|---|
| `source` | string | Speaker or source identifier |
| `text` | string | Cleaned transcript (includes `<hi-en>` tags for code-switched utterances) |
| `audio` | Audio | Waveform bytes + sample rate, embedded in Parquet |
| `quality` | string | Sample rate (Hz) or MOS quality score where available |
| `duration` | string | Clip duration in seconds |
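
Because `duration` is typed as a string and `text` may carry `<hi-en>` tags, most pipelines need a small conversion step before doing arithmetic or training on the transcripts. The following is a minimal sketch, not part of the dataset tooling. It assumes that `duration` holds a plain decimal number of seconds (for example `"3.42"`) and that a closing `</hi-en>` tag may or may not be present; the helper names `parse_duration`, `is_code_switched`, `strip_language_tags`, and `total_hours` are hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches <hi-en> and, if the corpus uses one, a closing </hi-en>.
# The closing form is an assumption; the schema above only mentions <hi-en>.
_TAG_RE = re.compile(r"</?hi-en>")


def parse_duration(value) -> Optional[float]:
    """Convert a duration string to seconds, or None if it is not numeric."""
    try:
        return float(value.strip())
    except (ValueError, AttributeError):
        return None


def is_code_switched(text: str) -> bool:
    """True if the transcript contains a <hi-en> tag anywhere."""
    return "<hi-en>" in text


def strip_language_tags(text: str) -> str:
    """Remove language tags and collapse the whitespace they leave behind."""
    return " ".join(_TAG_RE.sub(" ", text).split())


def total_hours(durations: Iterable) -> float:
    """Sum the parseable duration strings and return the total in hours."""
    seconds = [d for d in map(parse_duration, durations) if d is not None]
    return sum(seconds) / 3600.0
```

Returning `None` for unparseable values, rather than raising, keeps a long streaming pass from aborting on a single malformed row; callers can count the `None` results to see how often that happens.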
This dataset aggregates the following public corpora alongside original custom recordings:
| Source | Clips | Description | License |
|---|---|---|---|
| NPTEL Hindi Spoken Tutorial (AI4Bharat) | 521,028 | Educational lectures in Hindi | CC BY 4.0 |
| AI4Bharat Kathbath | 94,903 | Multi-speaker Hindi speech benchmark | CC BY 4.0 |
| AI4Bharat IndicTTS | 36,613 | Studio-quality Hindi TTS corpus | CC BY 4.0 |
| Hinglish — ujs | 25,378 | Code-switched Hindi–English speech | See source |
| Mozilla Common Voice 17 (Hindi) | 24,643 | Community-contributed Hindi speech | CC0 1.0 |
| AI4Bharat Mann Ki Baat | 22,483 | Hindi broadcast speech, En-Indic aligned | CC BY 4.0 |
| Mann Ki Baat (English) | 22,477 | English side of Mann Ki Baat parallel corpus | CC BY 4.0 |
| Hindi Female Single Speaker HQ — Shekharmeena | 22,058 | High-quality single female speaker Hindi | See source |
| Orpheus TTS Indian English — ar17to | 18,238 | Multi-speaker Indian English TTS | See source |
| Indic TTS Hindi — SPRINGLab | 11,825 | Multi-speaker Hindi TTS | CC BY 4.0 |
| Indian English — krishan23 | 6,765 | Indian-accented English speech | See source |
| Hinglish Test TTS — Shekharmeena | 3,136 | Custom Hinglish TTS recordings | See source |
| Custom female TTS recordings | 2,843 | Original studio recordings | CC BY 4.0 |
| Orpheus TTS Shaurya — prashantarya | 1,419 | Male Hindi/English TTS voice | See source |
| Anika Voice — Shekharmeena | ~5 | Custom female Hindi/Hinglish voice | See source |
- All clips re-segmented to remove silence and cross-talk
- Transcripts normalised to Unicode NFC; `<hi-en>` language tags added for code-switched utterances
- Duplicate and near-duplicate clips removed across sources
- Consistent 5-column schema applied before concatenation (`source`, `text`, `audio`, `quality`, `duration`)
- Audio stored as WAV; original sample rates preserved per speaker/source
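
Three of the steps above can be approximated with the Python standard library. This is an illustrative sketch only: the actual processing scripts are not documented here, near-duplicate detection is not shown, and the names `normalise_transcript`, `has_expected_schema`, and `drop_exact_duplicates` are hypothetical.

```python
import hashlib
import unicodedata

EXPECTED_COLUMNS = ("source", "text", "audio", "quality", "duration")


def normalise_transcript(text: str) -> str:
    """Return the transcript in Unicode NFC form."""
    return unicodedata.normalize("NFC", text)


def has_expected_schema(row: dict) -> bool:
    """True if the row has exactly the five expected columns, in any order."""
    return set(row) == set(EXPECTED_COLUMNS)


def drop_exact_duplicates(clips):
    """Yield (audio_bytes, text) pairs, skipping audio already seen byte-for-byte.

    Only exact duplicates are caught; re-encoded or trimmed copies of the
    same recording would need an acoustic fingerprint instead.
    """
    seen = set()
    for audio_bytes, text in clips:
        digest = hashlib.sha256(audio_bytes).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield audio_bytes, text
```

When comparing your own transcripts against the dataset (for example, to check for train/test leakage), normalise them to NFC first so that visually identical strings also compare equal.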
```python
from datasets import load_dataset

ds = load_dataset("agarwalayushi/hinglish", split="train", streaming=True)

for sample in ds:
    audio = sample["audio"]  # {"array": ..., "sampling_rate": ...}
    text = sample["text"]
    print(text, audio["sampling_rate"])
```

```python
# Code-switched Hinglish utterances only
hi_en = ds.filter(lambda x: x["text"].startswith("<hi-en>"))
```

Released under CC BY 4.0.
- ✅ Free for research and commercial use
- ✅ Redistribution and adaptation allowed
- 📌 Attribution required — cite this dataset and link to upstream sources
- 🚫 Do not use to generate non-consensual synthetic voices of real individuals
Constituent datasets retain their original licenses. Users are responsible for compliance with each upstream source's terms.
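
One way to keep track of upstream terms is a small lookup table built from the source table in this README. The sketch below is an assumption-laden convenience, not legal guidance: the keys are shortened labels for the table rows, the mapping from the `source` column's actual values to these labels is not documented and must be established by the user, and the helper names are hypothetical.

```python
# License per upstream source, as listed in the source table of this README.
# "See source" means no license is stated here; check the upstream page.
SOURCE_LICENSES = {
    "NPTEL Hindi Spoken Tutorial (AI4Bharat)": "CC BY 4.0",
    "AI4Bharat Kathbath": "CC BY 4.0",
    "AI4Bharat IndicTTS": "CC BY 4.0",
    "ujs Hinglish": "See source",
    "Mozilla Common Voice 17 (Hindi)": "CC0 1.0",
    "AI4Bharat Mann Ki Baat": "CC BY 4.0",
    "Mann Ki Baat (English)": "CC BY 4.0",
    "Shekharmeena Hindi Female Single Speaker HQ": "See source",
    "ar17to Orpheus TTS Indian English": "See source",
    "SPRINGLab Indic TTS Hindi": "CC BY 4.0",
    "krishan23 Indian English": "See source",
    "Shekharmeena Hinglish Test TTS": "See source",
    "Custom female TTS recordings": "CC BY 4.0",
    "prashantarya Orpheus TTS Shaurya": "See source",
    "Shekharmeena Anika Voice": "See source",
}


def needs_manual_review(label: str) -> bool:
    """True when this README gives no explicit license for the source."""
    return SOURCE_LICENSES.get(label, "See source") == "See source"


def sources_with_license(allowed: set) -> list:
    """Return the labels whose stated license is in `allowed`, sorted."""
    return sorted(name for name, lic in SOURCE_LICENSES.items() if lic in allowed)
```

Unknown labels are treated as needing review, so a typo or a new source fails safe rather than being silently assumed permissive.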
```bibtex
@dataset{agarwal2026hinglish,
  author    = {Agarwal, Ayushi},
  title     = {Hinglish Concatenated Audio Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/agarwalayushi/hinglish},
  note      = {Aggregated from Mozilla Common Voice, AI4Bharat (Kathbath, Mann Ki Baat,
               Spoken Tutorial, IndicTTS), SPRINGLab IndicTTS, and custom recordings}
}
```

To report transcription errors, missing attributions, or audio quality issues, please open an issue or start a discussion on Hugging Face.