ayushi-agarwall/hinglish-audio-dataset

Hinglish Audio Dataset


A large-scale, cleaned and annotated speech dataset covering Hindi, Hinglish (Hindi–English code-switching), and Indian English — compiled from 14 public corpora and original custom recordings.

The full dataset (Parquet + audio) is hosted on Hugging Face: 👉 huggingface.co/datasets/agarwalayushi/hinglish


Overview

| Stat | Value |
| --- | --- |
| Total clips | 815,171 |
| Total duration | ~2,264 hours |
| Unique speakers | 6,304 |
| Raw audio size | ~237 GB |
| Languages | Hindi · Hinglish · Indian English |
| Audio format | WAV (native sample rates) |
| Dataset format | Parquet with embedded audio (HF-compatible) |
| Primary tasks | ASR · TTS fine-tuning · Voice cloning |

Dataset Schema

| Column | Type | Description |
| --- | --- | --- |
| source | string | Speaker or source identifier |
| text | string | Cleaned transcript (includes <hi-en> tags for code-switched utterances) |
| audio | Audio | Waveform bytes + sample rate, embedded in Parquet |
| quality | string | Sample rate (Hz) or MOS quality score where available |
| duration | string | Clip duration in seconds |
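
Because `duration` is stored as a string, it has to be converted before any numeric filtering. The following is a minimal sketch, assuming the strings are plain decimal numbers such as "3.42" (the exact formatting is not documented here). `parse_duration` and `within_duration` are hypothetical helper names, not part of the dataset or of the `datasets` library, and the 1 to 15 second bounds are arbitrary illustrative defaults.

```python
from typing import Optional


def parse_duration(value: str) -> Optional[float]:
    """Convert a `duration` string to seconds; return None if it is not numeric."""
    try:
        return float(value.strip())
    except (AttributeError, ValueError):
        # AttributeError covers a missing (None) value, ValueError a non-numeric one.
        return None


def within_duration(sample: dict, min_s: float = 1.0, max_s: float = 15.0) -> bool:
    """True if the sample's duration parses and lies in [min_s, max_s] seconds."""
    seconds = parse_duration(sample.get("duration"))
    return seconds is not None and min_s <= seconds <= max_s


# Made-up rows that only mimic the schema; they are not taken from the dataset.
rows = [{"duration": "3.42"}, {"duration": "0.20"}, {"duration": "n/a"}]
kept = [r for r in rows if within_duration(r)]
print(kept)
```

A predicate like this could be passed directly to `ds.filter(within_duration)` on the streaming dataset.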

Source Datasets

This dataset aggregates the following public corpora alongside original custom recordings:

| Source | Clips | Description | License |
| --- | --- | --- | --- |
| NPTEL Hindi Spoken Tutorial (AI4Bharat) | 521,028 | Educational lectures in Hindi | CC BY 4.0 |
| AI4Bharat Kathbath | 94,903 | Multi-speaker Hindi speech benchmark | CC BY 4.0 |
| AI4Bharat IndicTTS | 36,613 | Studio-quality Hindi TTS corpus | CC BY 4.0 |
| Hinglish — ujs | 25,378 | Code-switched Hindi–English speech | See source |
| Mozilla Common Voice 17 (Hindi) | 24,643 | Community-contributed Hindi speech | CC0 1.0 |
| AI4Bharat Mann Ki Baat | 22,483 | Hindi broadcast speech, En-Indic aligned | CC BY 4.0 |
| Mann Ki Baat (English) | 22,477 | English side of Mann Ki Baat parallel corpus | CC BY 4.0 |
| Hindi Female Single Speaker HQ — Shekharmeena | 22,058 | High-quality single female speaker Hindi | See source |
| Orpheus TTS Indian English — ar17to | 18,238 | Multi-speaker Indian English TTS | See source |
| Indic TTS Hindi — SPRINGLab | 11,825 | Multi-speaker Hindi TTS | CC BY 4.0 |
| Indian English — krishan23 | 6,765 | Indian-accented English speech | See source |
| Hinglish Test TTS — Shekharmeena | 3,136 | Custom Hinglish TTS recordings | See source |
| Custom female TTS recordings | 2,843 | Original studio recordings | CC BY 4.0 |
| Orpheus TTS Shaurya — prashantarya | 1,419 | Male Hindi/English TTS voice | See source |
| Anika Voice — Shekharmeena | ~5 | Custom female Hindi/Hinglish voice | See source |

Curation & Processing

  • All clips re-segmented to remove silence and cross-talk
  • Transcripts normalised to Unicode NFC; <hi-en> language tags added for code-switched utterances
  • Duplicate and near-duplicate clips removed across sources
  • Consistent 5-column schema applied before concatenation (source, text, audio, quality, duration)
  • Audio stored as WAV; original sample rates preserved per speaker/source
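
The NFC step in the list above can be illustrated with Python's standard `unicodedata` module. This is only a sketch of what such a normalisation pass might look like, not the pipeline actually used to build the dataset; `normalise_transcript` is a hypothetical name.

```python
import unicodedata


def normalise_transcript(text: str) -> str:
    """Return the transcript in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


# Two encodings of the same Devanagari letter (qa): a single code point
# versus KA followed by a combining nukta. After NFC they compare equal.
single = "\u0958"
combined = "\u0915\u093c"
assert single != combined
assert normalise_transcript(single) == normalise_transcript(combined)

# The result of the pass is itself in NFC (requires Python 3.8+).
assert unicodedata.is_normalized("NFC", normalise_transcript(combined))
```

Normalising before comparison matters for deduplication and for tokenizers, since visually identical transcripts can otherwise differ byte for byte.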

Usage

Load from Hugging Face

from datasets import load_dataset

ds = load_dataset("agarwalayushi/hinglish", split="train", streaming=True)

for sample in ds:
    audio = sample["audio"]        # {"array": ..., "sampling_rate": ...}
    text  = sample["text"]
    print(text, audio["sampling_rate"])

Filter by language

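# Code-switched Hinglish utterances only. The tag is matched anywhere in the
# transcript, since the schema only says that `text` "includes" <hi-en> tags.
hi_en = ds.filter(lambda x: "<hi-en>" in x["text"])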
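
For ASR or TTS training it can be useful to strip the language tags once they have served their purpose for filtering. The following is a minimal sketch, assuming the tag appears as the literal `<hi-en>`; an optional closing `</hi-en>` is also handled, although the source does not say whether one is used. `has_code_switch_tag` and `strip_language_tags` are hypothetical helpers, not part of the dataset tooling.

```python
import re

# Matches "<hi-en>" and, as an assumption, a closing "</hi-en>".
_TAG = re.compile(r"</?hi-en>")


def has_code_switch_tag(sample: dict) -> bool:
    """True if the transcript contains a <hi-en> tag anywhere."""
    return "<hi-en>" in (sample.get("text") or "")


def strip_language_tags(text: str) -> str:
    """Remove <hi-en> tags and collapse the whitespace they leave behind."""
    return " ".join(_TAG.sub(" ", text).split())


# Made-up transcript for illustration; it is not taken from the dataset.
example = {"text": "<hi-en> kal meeting hai"}
print(has_code_switch_tag(example))
print(strip_language_tags(example["text"]))
```

On the streaming dataset, the same functions could be applied with `ds.filter(has_code_switch_tag)` and `ds.map(lambda s: {"text": strip_language_tags(s["text"])})`.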

License

Released under CC BY 4.0.

  • ✅ Free for research and commercial use
  • ✅ Redistribution and adaptation allowed
  • 📌 Attribution required — cite this dataset and link to upstream sources
  • 🚫 Do not use to generate non-consensual synthetic voices of real individuals

Constituent datasets retain their original licenses. Users are responsible for compliance with each upstream source's terms.


Citation

@dataset{agarwal2026hinglish,
  author    = {Agarwal, Ayushi},
  title     = {Hinglish Concatenated Audio Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/agarwalayushi/hinglish},
  note      = {Aggregated from Mozilla Common Voice, AI4Bharat (Kathbath, Mann Ki Baat,
               Spoken Tutorial, IndicTTS), SPRINGLab IndicTTS, and custom recordings}
}

Contributing & Issues

To report transcription errors, missing attributions, or audio quality issues, please open an issue or start a discussion on Hugging Face.
