updated the code to avoid hardcoded Spanish, all outputs are in English by DavidTorresLeon · Pull Request #2 · PovertyAction/llm-quali-coding

DavidTorresLeon · 2026-05-20T03:28:29Z

Pull Request Summary 🚀

What does this PR do? 📝

Fixes mixed-language outputs across the pipeline (scripts 02–07) so that all structural text (HTML reports, console labels, chart titles, section headers) is always in English, while content (quotes, transcript excerpts) stays in the language of the source transcript. Also introduces a TRANSCRIPT_LANGUAGE environment variable so users can explicitly select which transcript language to work in.

Why is this change needed? 🤔

When running the pipeline with a Spanish transcript, outputs were a mix of Spanish and English: HTML report titles and stat labels were hardcoded in Spanish, console summaries used Spanish section headers, and LLM calls returned Spanish metadata (e.g. cue_type: "Risas" instead of "Laughter"). This made outputs inconsistent and hard to read, and made the repo confusing for participants working across both language versions. A broken file path in 05_extract_themes_llm.py also caused that script to silently fall back to the Spanish transcript, regardless of what the user intended.

How was this implemented? 🛠️

Changes span four areas:

Language selection via .env

Added TRANSCRIPT_LANGUAGE=en to .env.example (with full documentation of all configurable env vars)
Added transcript_language field to ModelConfig in src/openai_client.py, loaded from the env var
Added get_transcript_path(lang) helper: maps en → sample_english.md, es → sample_spanish.md, and any other code → sample_<lang>.md for extensibility
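The language-to-file mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code from src/openai_client.py: the `data` directory, the reduced `ModelConfig` with a single field, and the `"en"` default when the variable is unset are all assumptions made for the example.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

# Assumption: transcripts live in a "data" directory; the PR does not state the real location.
TRANSCRIPT_DIR = Path("data")

# Mapping described in the PR; any other code falls through to sample_<lang>.md.
_KNOWN_TRANSCRIPTS = {"en": "sample_english.md", "es": "sample_spanish.md"}


def get_transcript_path(lang: str) -> Path:
    """Return the transcript path for a language code such as "en" or "es"."""
    code = lang.strip().lower()
    filename = _KNOWN_TRANSCRIPTS.get(code, f"sample_{code}.md")
    return TRANSCRIPT_DIR / filename


@dataclass
class ModelConfig:
    """Hypothetical, reduced stand-in for the real ModelConfig."""

    # Read at instantiation time so a changed environment is picked up.
    # Assumption: default to "en" when TRANSCRIPT_LANGUAGE is not set.
    transcript_language: str = field(
        default_factory=lambda: os.getenv("TRANSCRIPT_LANGUAGE", "en")
    )
```

With this shape, a script would call `get_transcript_path(ModelConfig().transcript_language)` instead of carrying its own fallback logic.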

Transcript file selection

02_create_embeddings.py: replaced hardcoded English/Spanish fallback logic with get_transcript_path()
05_extract_themes_llm.py: fixed the wrong path (outputs/02_translated_english.md never existed) by replacing it with get_transcript_path()

LLM prompt fixes (src/llm_tasks.py)

extract_candidate_themes, extract_general_themes: added instruction to always return theme names and section headings in English, while quotes may remain in the transcript's original language
code_nonverbal_cues: added instruction to always return cue_type in English (e.g. "Laughter", not "Risas")
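One way to picture the prompt changes above is a small helper that appends language rules to an existing prompt. The rule wording, the constant names, and `with_language_rules` are hypothetical; the PR describes what the instructions ask for, not their exact text or how src/llm_tasks.py assembles prompts.

```python
# Hypothetical wording; the PR describes the intent of each instruction, not its exact text.
ENGLISH_STRUCTURE_RULE = (
    "Always write theme names and section headings in English. "
    "Quotes may stay in the transcript's original language."
)
ENGLISH_CUE_RULE = (
    'Always return cue_type in English (for example "Laughter", not "Risas").'
)


def with_language_rules(base_prompt: str, *rules: str) -> str:
    """Append language rules to a prompt, one per line, after a blank line."""
    if not rules:
        return base_prompt
    return base_prompt.rstrip() + "\n\n" + "\n".join(rules)
```

Keeping the rules as named constants would let the theme-extraction and nonverbal-coding prompts share one wording, so the instructions cannot drift apart between tasks.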

Structural UI strings

04_theme_classification_embeddings.py: HTML lang attribute, page title, all stat labels, section headings, buttons, and console print headers translated to English
06_nonverbal_coding_llm.py: same scope as above
07_inductive_clustering.py: console section headers and labels translated to English
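A quick way to confirm the HTML lang change after a run is to read the attribute back from a generated report. The sketch below uses only the standard library; `report_lang` is a hypothetical helper, and the location of the generated reports is not stated in this PR, so the caller has to supply the HTML text.

```python
from html.parser import HTMLParser
from typing import Optional


class _LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None
        self._seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and not self._seen_html:
            self._seen_html = True
            self.lang = dict(attrs).get("lang")


def report_lang(html_text: str) -> Optional[str]:
    """Return the <html lang="..."> value, or None if it is absent."""
    finder = _LangFinder()
    finder.feed(html_text)
    return finder.lang
```

For example, `report_lang(Path(report_file).read_text(encoding="utf-8"))` should return `"en"` for reports produced by scripts 04 and 06, where `report_file` is whatever path the script wrote to.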

How to test or reproduce? 🧪

Copy .env.example to .env and set your API key

Run the full pipeline with TRANSCRIPT_LANGUAGE=es:

python examples/02_create_embeddings.py
python examples/03_relevance_filtering.py
python examples/04_theme_classification_embeddings.py
python examples/05_extract_themes_llm.py
python examples/06_nonverbal_coding_llm.py
python examples/07_inductive_clustering.py
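The same run can be scripted so the language is set once for every step. This is a sketch under two assumptions: the scripts read TRANSCRIPT_LANGUAGE from the process environment, and a value already present in the environment is not overridden by the one in .env. `build_env` and `run_pipeline` are hypothetical names, not part of the repo.

```python
import os
import subprocess
import sys

# Script order as listed in the test steps above.
PIPELINE_SCRIPTS = [
    "examples/02_create_embeddings.py",
    "examples/03_relevance_filtering.py",
    "examples/04_theme_classification_embeddings.py",
    "examples/05_extract_themes_llm.py",
    "examples/06_nonverbal_coding_llm.py",
    "examples/07_inductive_clustering.py",
]


def build_env(lang: str) -> dict[str, str]:
    """Copy the current environment and set TRANSCRIPT_LANGUAGE."""
    env = dict(os.environ)
    env["TRANSCRIPT_LANGUAGE"] = lang
    return env


def run_pipeline(lang: str = "es") -> None:
    """Run each script in order, stopping at the first failure."""
    env = build_env(lang)
    for script in PIPELINE_SCRIPTS:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run([sys.executable, script], env=env, check=True)
```

Calling `run_pipeline("es")` from the repository root, and then `run_pipeline("en")`, would exercise both transcripts; in each case the structural text should come out in English while quotes follow the transcript.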

DavidTorresLeon and others added 3 commits May 19, 2026 22:25

updated the code to avoid hardcore spanish, all outputs are in english

948178d

Changed last script to do model comparisons

b84fd05

Merge pull request #3 from PovertyAction:update/indluctive-cluseting-…

5352806

…modification Changed last script to do model comparisons

DavidTorresLeon merged commit 742c180 into main May 20, 2026
1 check passed

DavidTorresLeon merged 3 commits into main from update/fix-language-issue
