A large-scale, cleaned and annotated speech dataset covering Hindi, Hinglish (Hindi–English code-switching), and Indian English — compiled from 14 public corpora and original custom recordings, unified into a single Parquet dataset with consistent schema.
At a Glance
| Stat | Value |
| --- | --- |
| Total clips | 815,171 |
| Total duration | ~2,500 hours |
| Unique speakers | 6,304 |
| Raw audio size | ~237 GB |
| Languages | Hindi (hi), Hinglish (hi-en), Indian English (en-IN) |
| Format | Parquet with embedded audio (HF dataset viewer compatible) |
| Tasks | ASR, TTS fine-tuning, voice cloning, speech research |
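As a rough sanity check, the averages implied by the figures above can be worked out directly. This sketch uses only the rounded totals from the table, so the results are approximate; the variable names are illustrative, and treating "GB" as decimal gigabytes is an assumption.

```python
# Rounded totals copied from the "At a Glance" table.
TOTAL_CLIPS = 815_171
TOTAL_HOURS = 2_500        # approximate ("~2,500 hours")
UNIQUE_SPEAKERS = 6_304
RAW_SIZE_GB = 237          # approximate; assumed to mean decimal gigabytes

# Mean clip length in seconds.
avg_clip_seconds = TOTAL_HOURS * 3600 / TOTAL_CLIPS

# Mean number of clips and minutes of audio per speaker.
avg_clips_per_speaker = TOTAL_CLIPS / UNIQUE_SPEAKERS
avg_minutes_per_speaker = TOTAL_HOURS * 60 / UNIQUE_SPEAKERS

# Mean raw size per clip in kilobytes (1 GB = 1,000,000 kB).
avg_clip_kilobytes = RAW_SIZE_GB * 1_000_000 / TOTAL_CLIPS

print(f"average clip length:      {avg_clip_seconds:.1f} s")
print(f"average clips per speaker: {avg_clips_per_speaker:.1f}")
print(f"average audio per speaker: {avg_minutes_per_speaker:.1f} min")
print(f"average raw clip size:     {avg_clip_kilobytes:.0f} kB")
```

These are means only: clips average roughly 11 seconds, with about 129 clips per speaker. The table does not say how clips are distributed across speakers or sources.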
Dataset Schema
| Column | Type | Description |
| --- | --- | --- |
| source | string | Speaker or source identifier |
| text | string | Cleaned transcript (may include `<hi-en>` language tags) |
| audio | Audio | WAV audio embedded in Parquet with native sample rate |
| quality | string | Sample rate (Hz) or MOS quality score where available |
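The schema above can be consumed row by row. The following is a minimal sketch, not an official loader: the repository id is taken from the citation URL, while the split name `train`, the exact tag syntax (`<hi-en>` with an optional closing `</hi-en>`), and the rule for telling a sample rate from a MOS score in `quality` are all assumptions. The helper names are hypothetical.

```python
import re
from typing import Iterator, Optional

# Assumption: language tags appear literally as <hi-en>, <hi> or <en-IN>,
# possibly with a closing form such as </hi-en>.
TAG_PATTERN = re.compile(r"</?(?:hi-en|hi|en-IN)>")


def strip_language_tags(text: str) -> str:
    """Remove the assumed language tags and collapse leftover whitespace."""
    return " ".join(TAG_PATTERN.sub(" ", text).split())


def parse_quality(quality: Optional[str]) -> dict:
    """Interpret the `quality` string as either a sample rate or a MOS score.

    Assumption: the field holds a single plain number. Values from 1 to 5
    are read as MOS, values of 1000 or more as a sample rate in Hz.
    Anything else is left unset.
    """
    result = {"sample_rate_hz": None, "mos": None}
    if not quality:
        return result
    try:
        value = float(quality.strip())
    except ValueError:
        return result
    if 1.0 <= value <= 5.0:
        result["mos"] = value
    elif value >= 1000:
        result["sample_rate_hz"] = int(value)
    return result


def summarize(row: dict) -> dict:
    """Return the text metadata of one row; the audio column is not touched."""
    return {
        "source": row["source"],
        "text": strip_language_tags(row["text"]),
        **parse_quality(row.get("quality")),
    }


def iter_rows(repo_id: str = "agarwalayushi/hinglish", split: str = "train") -> Iterator[dict]:
    """Stream rows from the Hub instead of downloading the full dataset.

    Needs the `datasets` package; the split name is an assumption.
    """
    from datasets import load_dataset  # imported lazily so the helpers above work without it

    yield from load_dataset(repo_id, split=split, streaming=True)


# Hand-written example row (not taken from the dataset).
example = {"source": "speaker_0001", "text": "<hi-en> mujhe ek coffee chahiye </hi-en>", "quality": "16000"}
print(summarize(example))
```

Streaming avoids fetching the full ~237 GB up front. How the `audio` column is decoded may differ between versions of the `datasets` library, so the sketch leaves it alone.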
You are free to:
Use for research, commercial, or personal projects
Share and redistribute in any medium
Build upon and adapt the data
Under these conditions:
Attribution — Credit this dataset and link to the original source datasets listed above
No misrepresentation — Do not use this data to generate non-consensual synthetic voices of real individuals, or for fraud, harassment, or disinformation
⚠️ Upstream licenses: Each constituent dataset retains its original license. Users are responsible for complying with the terms of the sources they use. Mozilla Common Voice is CC0; all AI4Bharat and SPRINGLab sources are CC BY 4.0. Sources marked "See source" should be checked individually.
Citation
If you use this dataset in your work, please cite it and the relevant upstream sources:
@dataset{agarwal2026hinglish,
  author    = {Agarwal, Ayushi},
  title     = {Hinglish Concatenated Audio Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/agarwalayushi/hinglish},
  note      = {Aggregated from Mozilla Common Voice, AI4Bharat (Kathbath, Mann Ki Baat, Spoken Tutorial, IndicTTS), SPRINGLab IndicTTS, and custom recordings}
}
Contact
To report transcription errors, audio quality issues, or attribution concerns, please open a Discussion on this page.