Hindi Second-Person Honorifics Benchmark

Do LLMs understand Hindi social dynamics? This benchmark evaluates how well language models handle Hindi's three-tier second-person pronoun system (तू/तुम/आप) — a core sociopragmatic competence required for respectful Hindi conversation.

Hindi has 600M+ speakers. Using तू when you should use आप is genuinely rude. If LLMs are deployed in Hindi conversations (chatbots, translation, creative writing), they need to get direct address right.

Why This Matters

Mukherjee et al. (EMNLP 2025) showed that LLMs mishandle Hindi honorifics in third-person reference. We extend this to second-person conversational usage, which is the harder and more socially consequential case. Our benchmark tests both comprehension (cloze task) and production (generation task) using only real dialogue data.

Data

All evaluation data is derived from IndicDialogue — a corpus of real Hindi film subtitles. No synthetic scenarios or researcher-authored stimuli are used.

Pipeline:

Extract pronoun-cloze probes from 17 high-quality Hindi-original films (indicdialogue_extract_probes.py → filter_probes_clean17.py)
Deduplicate on (masked_line, gold_pronoun) — 4,382 → 4,271 unique probes
Stratified sampling balanced by tier (T/TUM/AAP) and proportional by movie (sample_stratified.py, sample_generation.py)
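
The deduplication and sampling steps above can be sketched roughly as follows. This is a minimal illustration, not the code in sample_stratified.py: the `tier` and `movie` field names, the `TIERS` order, and the largest-remainder allocation across movies are assumptions made for the example. Only `masked_line` and `gold_pronoun` come from the pipeline description.

```python
import random
from collections import defaultdict

# Assumed tier labels; the order decides which tiers receive the remainder.
TIERS = ("T", "TUM", "AAP")


def dedupe(probes):
    """Keep the first probe seen for each (masked_line, gold_pronoun) pair."""
    seen, unique = set(), []
    for probe in probes:
        key = (probe["masked_line"], probe["gold_pronoun"])
        if key not in seen:
            seen.add(key)
            unique.append(probe)
    return unique


def allocate(counts, quota):
    """Split quota across movies in proportion to counts (largest remainder)."""
    total = sum(counts.values())
    alloc = {m: quota * c // total for m, c in counts.items()}
    remainders = {m: quota * c % total for m, c in counts.items()}
    leftover = quota - sum(alloc.values())
    # Hand the remaining slots to the movies with the largest remainders.
    for m in sorted(counts, key=lambda m: remainders[m], reverse=True)[:leftover]:
        alloc[m] += 1
    return alloc


def stratified_sample(probes, n, seed=42):
    """Balance the sample across tiers, then spread each tier across movies."""
    rng = random.Random(seed)
    base, extra = divmod(n, len(TIERS))
    sample = []
    for i, tier in enumerate(TIERS):
        by_movie = defaultdict(list)
        for probe in probes:
            if probe["tier"] == tier:
                by_movie[probe["movie"]].append(probe)
        counts = {movie: len(rows) for movie, rows in by_movie.items()}
        quota = min(base + (1 if i < extra else 0), sum(counts.values()))
        if quota == 0:
            continue
        for movie, k in allocate(counts, quota).items():
            sample.extend(rng.sample(by_movie[movie], k))
    return sample
```

With n = 500 and three tiers, `divmod(500, 3)` gives a base of 166 with a remainder of 2, so the first two tiers in `TIERS` receive 167 probes and the last receives 166.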

Known limitation: IndicDialogue lacks speaker diarization. For the generation task, we mitigate this by restricting to single-register contexts (where only one honorific tier appears in the dialogue context).

Tasks

Task 1: Cloze (Comprehension)

Fill-in-the-blank with the correct Hindi 2nd-person pronoun. 500 probes stratified by tier (167 T / 167 TUM / 166 AAP) and proportional by movie across all 17 films. 18 possible pronoun forms across 3 tiers.

Data source: IndicDialogue film subtitles (non-synthetic)
Sample: probes_stratified_500.csv (from sample_stratified.py)

Task 2: Generation — Dialogue Continuation (Production)

Given real film dialogue context, generate the next line. Tests whether models produce appropriate honorific forms in free generation — the harder, more ecologically valid task.

Data source: IndicDialogue film subtitles (non-synthetic)
Sample: probes_generation_200.csv (from sample_generation.py)
Filtering: Only probes where context contains a single honorific tier (no speaker-switch ambiguity) and gold line is substantial (>= 15 chars)
Sample size: 200 probes (67 T / 66 TUM / 67 AAP), all 17 movies
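
The single-register filter described above might look something like the sketch below. It is an illustration under stated assumptions, not the logic in sample_generation.py: the `TIER_OF` table lists only a few pronoun forms (the benchmark's full 18-form inventory is not reproduced here), and the function names and whitespace tokenization are hypothetical.

```python
# Illustrative subset of second-person forms (assumption, not the full inventory).
TIER_OF = {
    "तू": "T", "तुझे": "T", "तेरा": "T",
    "तुम": "TUM", "तुम्हें": "TUM", "तुम्हारा": "TUM",
    "आप": "AAP", "आपको": "AAP", "आपका": "AAP",
}
# Punctuation stripped from token edges, including the Devanagari danda.
PUNCT = "।,.?!\"'-:;…"


def tiers_in(lines):
    """Return the set of honorific tiers whose pronouns appear in the lines."""
    found = set()
    for line in lines:
        # Split on whitespace instead of using regex word boundaries,
        # which can break inside Devanagari words at vowel signs.
        for token in line.split():
            tier = TIER_OF.get(token.strip(PUNCT))
            if tier is not None:
                found.add(tier)
    return found


def keep_probe(context_lines, gold_line, min_chars=15):
    """Keep probes with exactly one tier in context and a substantial gold line."""
    single_register = len(tiers_in(context_lines)) == 1
    return single_register and len(gold_line.strip()) >= min_chars
```

A context that mixes आप and तू lines is rejected because, without speaker labels, it is unclear which register the next speaker should use.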

	Cloze Task	Generation Task
Input	Context + masked line	Context only
Model does	Selects a pronoun	Generates free-form dialogue
Gold	The exact pronoun	The tier of the original line
Tests	Recognition/selection	Production/pragmatic inference
Data source	IndicDialogue (real)	IndicDialogue (real)

Results: GPT-5-mini

Cloze (Comprehension)

Metric	Value
Exact accuracy	75.0%
Tier accuracy	81.4%
Valid form rate	100%

Per-Tier	तू (intimate)	तुम (familiar)	आप (formal)
Tier accuracy	80.8% (n=167)	87.4% (n=167)	75.9% (n=166)
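
To make the difference between the three cloze metrics concrete, here is a small sketch. The record fields (`gold`, `pred`), the function name, and the partial `TIER_OF` table are assumptions for illustration; the actual scoring lives in cloze_eval.py and tier_classifier.py and may differ, in particular in how a "valid form" is defined.

```python
# Illustrative subset of second-person forms (assumption, not the full inventory).
TIER_OF = {
    "तू": "T", "तुझे": "T", "तेरा": "T",
    "तुम": "TUM", "तुम्हें": "TUM", "तुम्हारा": "TUM",
    "आप": "AAP", "आपको": "AAP", "आपका": "AAP",
}


def score_cloze(records):
    """Compute exact accuracy, tier accuracy, and valid form rate."""
    n = len(records)
    # Exact: the predicted form is identical to the gold form.
    exact = sum(r["pred"] == r["gold"] for r in records)
    # Tier: the predicted form belongs to the same tier as the gold form.
    tier = sum(TIER_OF.get(r["pred"]) == TIER_OF[r["gold"]] for r in records)
    # Valid: the prediction is a known second-person form at all.
    valid = sum(r["pred"] in TIER_OF for r in records)
    return {
        "exact_accuracy": exact / n,
        "tier_accuracy": tier / n,
        "valid_form_rate": valid / n,
    }


records = [
    {"gold": "तुम", "pred": "तुम"},      # exact match
    {"gold": "तुम", "pred": "तुम्हें"},   # right tier, wrong form
    {"gold": "आप", "pred": "तू"},        # valid form, wrong tier
    {"gold": "तू", "pred": "वह"},        # not a second-person form
]
print(score_cloze(records))
```

Tier accuracy is always at least as high as exact accuracy, since every exact match is also a tier match.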

Generation (Production)

Metric	Value
Tier accuracy	40.5%
Avoidance rate	54.0%
Formality bias (AAP ratio)	0.47
Verb agreement	72.7%

Per-Tier	तू (intimate)	तुम (familiar)	आप (formal)
Tier accuracy	17.9% (n=67)	65.1% (n=66)	38.8% (n=67)
Avoidance rate	50.7%	59.1%	52.2%
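
As a consistency check, each overall tier accuracy should equal the sample-size-weighted mean of its per-tier values. The short sketch below uses only the figures reported in the tables above; the helper name `pooled_accuracy` is ours.

```python
def pooled_accuracy(per_tier):
    """Weighted mean of (accuracy_percent, n) pairs, weighted by n."""
    total = sum(n for _, n in per_tier)
    return sum(acc * n for acc, n in per_tier) / total


# Per-tier tier accuracy and sample sizes as reported above.
cloze = pooled_accuracy([(80.8, 167), (87.4, 167), (75.9, 166)])
generation = pooled_accuracy([(17.9, 67), (65.1, 66), (38.8, 67)])

print(round(cloze, 1), round(generation, 1))
```

Both pooled values round to the reported overall figures of 81.4% and 40.5%.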

Key Findings

Comprehension ≠ Production. GPT-5-mini reaches 81.4% tier accuracy on cloze but only 40.5% on generation: it can recognize the correct honorific tier but struggles to produce it in free dialogue.
54% avoidance rate in generation. The model dodges 2nd-person pronouns entirely in over half its continuations, using passive constructions or dropping subjects.
तू is rarely generated. Tier accuracy on the intimate tier is only 17.9%: the model seldom produces तू-tier forms freely, even when the dialogue context is entirely in that register.
तुम is the default. The confusion matrix shows the model collapses toward तुम: 36/67 gold-तू probes and 31/67 gold-आप probes get predicted as तुम.
Verb agreement is decent at 72.7% — when the model does use pronouns, it mostly gets the conjugation right.
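
The findings above rest on per-probe tier labels for each generated line. A minimal sketch of how such labels could be tallied is shown below. The record fields (`gold_tier`, `pred_tier`) and the convention that an avoided continuation is stored as `None` are assumptions for illustration; how generation_eval.py treats avoided outputs when computing tier accuracy is not stated here.

```python
from collections import Counter


def summarize_generation(records):
    """Tally tier accuracy, avoidance rate, and a gold-vs-predicted confusion count."""
    n = len(records)
    correct = sum(r["pred_tier"] == r["gold_tier"] for r in records)
    # Assumption: pred_tier is None when no second-person form was produced.
    avoided = sum(r["pred_tier"] is None for r in records)
    confusion = Counter(
        (r["gold_tier"], r["pred_tier"])
        for r in records
        if r["pred_tier"] is not None
    )
    return {
        "tier_accuracy": correct / n,
        "avoidance_rate": avoided / n,
        "confusion": confusion,
    }


records = [
    {"gold_tier": "T", "pred_tier": "TUM"},    # collapsed toward TUM
    {"gold_tier": "T", "pred_tier": None},     # pronoun avoided
    {"gold_tier": "TUM", "pred_tier": "TUM"},  # correct tier
    {"gold_tier": "AAP", "pred_tier": "TUM"},  # collapsed toward TUM
]
summary = summarize_generation(records)
print(summary["tier_accuracy"], summary["avoidance_rate"])
print(summary["confusion"][("T", "TUM")])
```

Off-diagonal entries such as `("T", "TUM")` and `("AAP", "TUM")` are the cells that would reveal a collapse toward तुम.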

How to Run

Prerequisites

pip install aiohttp
export OPENAI_API_KEY=your_key_here

Cloze Evaluation

# Run with any OpenAI model
python scripts/cloze_eval.py --method mc --backend openai --model gpt-5-mini \
  --probes probes_stratified_500.csv --output results/gpt5_mini_mc.jsonl

# Baselines
python scripts/cloze_eval.py --method baseline-majority --probes probes_stratified_500.csv \
  --output results/baseline_majority.jsonl
python scripts/cloze_eval.py --method baseline-random --probes probes_stratified_500.csv \
  --output results/baseline_random.jsonl

Generation Evaluation (Dialogue Continuation)

# Run generation eval
python scripts/generation_eval.py --model gpt-5-mini \
  --probes probes_generation_200.csv --output results/gen_gpt5_mini.jsonl

# Limit the run to 10 probes for a quick test
python scripts/generation_eval.py --model gpt-4o-mini \
  --probes probes_generation_200.csv --limit 10 --output results/gen_test.jsonl

Sampling (regenerate samples)

# Cloze sample: 500 probes, stratified by tier and movie
python scripts/sample_stratified.py --n 500 --seed 42

# Generation sample: 200 probes, single-tier contexts only
python scripts/sample_generation.py --n 200 --seed 99

Visualization

python scripts/plot_results.py

File Structure

├── README.md                          # This file
├── charter.md                         # Project scope and status
├── PROBE.md                           # Probe methodology and design
├── probes_clean17_ctx5.csv            # Full cleaned probes (4,271 after dedup)
├── probes_stratified_500.csv          # Cloze evaluation sample (stratified)
├── probes_generation_200.csv          # Generation evaluation sample
├── scripts/
│   ├── cloze_eval.py                  # Cloze evaluation pipeline
│   ├── generation_eval.py             # Generation (dialogue continuation) pipeline
│   ├── tier_classifier.py             # Pronoun/tier classification
│   ├── sample_stratified.py           # Stratified sampling for cloze task
│   ├── sample_generation.py           # Filtered sampling for generation task
│   ├── plot_results.py                # Visualization
│   ├── indicdialogue_extract_probes.py # Probe extraction from IndicDialogue
│   └── filter_probes_clean17.py       # Probe filtering to 17 clean films
├── results/                           # Evaluation results (JSONL + metrics JSON)
├── plots/                             # Generated visualizations
└── modules/                           # Git submodules
    ├── IndicDialogue/                 # Hindi film subtitle dialogues
    ├── hindi-politeness/              # Reference corpus
    └── honorific-wiki-llm/            # Mukherjee et al. dataset

References

Mukherjee, S., Mehta, A., & Saha, S. (2025). Women, Infamous, and Exotic Beings: Honorific Usages in Wikipedia and LLMs for Bengali and Hindi. EMNLP 2025.
Farhansyah, M. R. et al. (2025). Do Language Models Understand Honorific Systems in Javanese? ACL 2025.
Zhao, H. & Hawkins, R. D. (2025). Comparing human and LLM politeness strategies in free production. EMNLP 2025.
Kumar, R. (2014). Politeness in Hindi: A Corpus-Based Study. LREC 2014.
Brown, P. & Levinson, S. (1987). Politeness: Some universals in language usage. Cambridge University Press.

Citation

@misc{karode2026hindihonorificsbenchmark,
  title={Hindi Second-Person Honorifics Benchmark: Evaluating LLM Sociopragmatic Competence},
  author={Karode, Adit},
  year={2026},
  url={https://github.com/AKarode/hindi-honorifics-benchmark}
}

License

Research use. See individual data sources for their respective licenses.
