Hinglish Concatenated Audio Dataset

A large-scale, cleaned and annotated speech dataset covering Hindi, Hinglish (Hindi–English code-switching), and Indian English. It is compiled from 14 public corpora plus original custom recordings and unified into a single Parquet dataset with a consistent schema.

At a Glance

Stat	Value
Total clips	815,171
Total duration	~2,500 hours
Unique speakers	6,304
Raw audio size	~237 GB
Languages	Hindi (`hi`), Hinglish (`hi-en`), Indian English (`en-IN`)
Format	Parquet with embedded audio (HF dataset viewer compatible)
Tasks	ASR, TTS fine-tuning, voice cloning, speech research
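
As a quick sanity check on the figures above, the average clip length follows from dividing total duration by clip count. This is only a rough sketch: it uses the rounded figure of 2,500 hours, so the result is approximate.

```python
# Figures copied from the "At a Glance" table; the duration is a rounded estimate.
TOTAL_CLIPS = 815_171
TOTAL_HOURS = 2_500

# Convert hours to seconds, then divide by the number of clips.
mean_clip_seconds = TOTAL_HOURS * 3600 / TOTAL_CLIPS

print(f"Mean clip length: about {mean_clip_seconds:.1f} s")
```

An average of roughly 11 seconds per clip is in the range typically used for ASR and TTS fine-tuning utterances.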

Dataset Schema

Column	Type	Description
`source`	`string`	Speaker or source identifier
`text`	`string`	Cleaned transcript (may include `<hi-en>` language tags)
`audio`	`Audio`	WAV audio embedded in Parquet at its native sample rate
`quality`	`string`	Sample rate (Hz) or MOS quality score, where available
`duration`	`string`	Clip duration in seconds (stored as a string)
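
Because `quality` and `duration` are typed as strings, they usually need converting before filtering or sorting. The sketch below shows one way to do that and to stream rows with the Hugging Face `datasets` library. Several things are assumptions rather than documented facts: the repository id is taken from the citation URL, the split name `train` is a guess, and the rule used to tell a sample rate from a MOS score (values of 5 or less are treated as MOS) is a heuristic of this example, not a documented convention of the dataset.

```python
from typing import Optional, Tuple

REPO_ID = "agarwalayushi/hinglish"  # taken from the citation URL


def parse_duration(value: str) -> float:
    """Convert the string-typed `duration` column to seconds."""
    return float(value.strip())


def parse_quality(value: str) -> Tuple[str, Optional[float]]:
    """Guess whether `quality` holds a sample rate or a MOS score.

    Assumption: MOS scores lie in 1-5 and sample rates are far larger,
    so any numeric value <= 5 is treated as MOS.
    """
    try:
        number = float(value.strip())
    except ValueError:
        return ("unknown", None)
    if number <= 5:
        return ("mos", number)
    return ("sample_rate_hz", number)


def stream_rows(split: str = "train"):
    """Stream rows without downloading the whole dataset.

    The split name is an assumption; check the dataset viewer for the real one.
    """
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`

    return load_dataset(REPO_ID, split=split, streaming=True)


if __name__ == "__main__":
    # Print the converted fields for the first three streamed rows.
    for i, row in enumerate(stream_rows()):
        print(row["source"], parse_duration(row["duration"]), parse_quality(row["quality"]))
        if i == 2:
            break
```

Streaming is used here because the raw audio is about 237 GB, so a full download is rarely what you want for a first look.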

Source Breakdown

Source Dataset	Clips	Description	License
NPTEL Hindi Spoken Tutorial	521,028	Educational lectures in Hindi — AI4Bharat / NPTEL	CC BY 4.0
AI4Bharat Kathbath	94,903	Multi-speaker Hindi speech benchmark	CC BY 4.0
AI4Bharat IndicTTS	36,613	Studio-quality Hindi TTS corpus	CC BY 4.0
Hinglish — ujs	25,378	Code-switched Hindi–English speech	See source
Mozilla Common Voice 17 (Hindi)	24,643	Community-contributed Hindi speech	CC0 1.0
AI4Bharat Mann Ki Baat	22,483	Broadcast Hindi speech, En-Indic aligned	CC BY 4.0
Mann Ki Baat (English)	22,477	English side of Mann Ki Baat parallel corpus	CC BY 4.0
Hindi Female Single Speaker HQ — Shekharmeena	22,058	High-quality single female speaker Hindi	See source
Orpheus TTS Indian English	18,238	Multi-speaker Indian English TTS	See source
Indic TTS Hindi — SPRINGLab	11,825	Multi-speaker Hindi TTS	CC BY 4.0
Indian English — krishan23	6,765	Indian-accented English speech	See source
Hinglish Test TTS — Shekharmeena	3,136	Custom Hinglish TTS recordings	See source
Custom female TTS recordings	2,843	Original studio recordings	CC BY 4.0
Orpheus TTS Shaurya — prashantarya	1,419	Male Hindi/English TTS voice	See source
Anika Voice — Shekharmeena	~5	Custom female Hindi/Hinglish voice	See source
Total	815,171

Curation & Processing

All clips re-segmented to remove silence and cross-talk
Transcripts normalised to Unicode NFC; language tags added (`<hi-en>` for code-switched utterances)
Duplicate and near-duplicate clips removed
Consistent CSV schema applied across all sources before concatenation
Audio stored as WAV, original sample rates preserved
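
The text-side steps above can be illustrated with a minimal sketch using only the Python standard library. It is not the pipeline that built this dataset: the function names are hypothetical, the position of the `<hi-en>` tag inside a transcript is invented for the example, and only exact duplicates are handled (near-duplicate detection needs audio or fuzzy text matching and is out of scope here).

```python
import unicodedata
from typing import Dict, Iterable, List

CODE_SWITCH_TAG = "<hi-en>"  # tag named in the schema description


def normalise_transcript(text: str) -> str:
    """Return the transcript in Unicode NFC with outer whitespace trimmed."""
    return unicodedata.normalize("NFC", text).strip()


def is_code_switched(text: str) -> bool:
    """True if the transcript carries the code-switching tag."""
    return CODE_SWITCH_TAG in text


def drop_exact_duplicates(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first row for each (source, normalised text, duration) key."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["source"], normalise_transcript(row["text"]), row["duration"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Toy rows: the first two differ only in Unicode form (decomposed vs composed "é").
rows = [
    {"source": "spk_a", "text": "cafe\u0301 chalein <hi-en>", "duration": "2.10"},
    {"source": "spk_a", "text": "caf\u00e9 chalein <hi-en>", "duration": "2.10"},
    {"source": "spk_b", "text": "namaste", "duration": "1.00"},
]
unique_rows = drop_exact_duplicates(rows)
print(len(unique_rows))
```

Normalising before comparing matters because two transcripts that look identical on screen can differ byte for byte when one uses combining marks and the other uses precomposed characters.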

License

This dataset is released under Creative Commons Attribution 4.0 International (CC BY 4.0).

You are free to:

Use for research, commercial, or personal projects
Share and redistribute in any medium
Build upon and adapt the data

Under these conditions:

Attribution — Credit this dataset and link to the original source datasets listed above
No misrepresentation — Do not use this data to generate non-consensual synthetic voices of real individuals, or for fraud, harassment, or disinformation

⚠️ Upstream licenses: Each constituent dataset retains its original license. Users are responsible for complying with the terms of the sources they use. Mozilla Common Voice is CC0; all AI4Bharat and SPRINGLab sources are CC BY 4.0. Sources marked "See source" should be checked individually.

Citation

If you use this dataset in your work, please cite it and the relevant upstream sources:

@dataset{agarwal2026hinglish,
  author    = {Agarwal, Ayushi},
  title     = {Hinglish Concatenated Audio Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/agarwalayushi/hinglish},
  note      = {Aggregated from Mozilla Common Voice, AI4Bharat (Kathbath, Mann Ki Baat,
               Spoken Tutorial, IndicTTS), SPRINGLab IndicTTS, and custom recordings}
}

Contact

To report transcription errors, audio quality issues, or attribution concerns, please open a Discussion on this page.
