Updated the code to avoid hardcoded Spanish; all outputs are in English #2

Merged: DavidTorresLeon merged 3 commits into main from update/fix-language-issue on May 20, 2026
@DavidTorresLeon

Pull Request Summary 🚀

What does this PR do? 📝

Fixes mixed-language outputs across the pipeline (scripts 02–07) so that all structural text (HTML reports, console labels, chart titles, section headers) is always in English, while content (quotes, transcript excerpts) stays in the language of the source transcript. Also introduces a TRANSCRIPT_LANGUAGE environment variable so users can explicitly select which transcript language to work in.

Why is this change needed? 🤔

When running the pipeline with a Spanish transcript, outputs were a mix of Spanish and English: HTML report titles and stat labels were hardcoded in Spanish, console summaries used Spanish section headers, and LLM calls returned Spanish metadata (e.g. cue_type: "Risas" instead of "Laughter"). This created inconsistent, hard-to-read outputs and made the repo confusing for participants working across both language versions. There was also a broken file path in 05_extract_themes_llm.py that silently caused it to always fall back to the Spanish transcript regardless of what the user intended.

How was this implemented? 🛠️

Changes span four areas:

Language selection via .env

  • Added TRANSCRIPT_LANGUAGE=en to .env.example (with full documentation of all configurable env vars)
  • Added transcript_language field to ModelConfig in src/openai_client.py, loaded from the env var
  • Added get_transcript_path(lang) helper: maps en → sample_english.md, es → sample_spanish.md, and any other code → sample_<lang>.md for extensibility
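
As a rough illustration of the language-selection pieces listed above, here is a minimal sketch. It is not the code from src/openai_client.py: the `data` directory, the `_KNOWN_TRANSCRIPTS` table, and the exact shape of `ModelConfig` are assumptions made for the example; only the mapping rules and the TRANSCRIPT_LANGUAGE variable come from this PR.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

# Assumption: the PR does not say where the sample transcripts live.
DATA_DIR = Path("data")

# Language codes with non-generic file names, as described in the PR.
_KNOWN_TRANSCRIPTS = {"en": "sample_english.md", "es": "sample_spanish.md"}


def get_transcript_path(lang: str, data_dir: Path = DATA_DIR) -> Path:
    """Map a language code to its transcript file.

    Known codes use their dedicated file names; any other code falls
    back to the generic sample_<lang>.md pattern.
    """
    code = lang.strip().lower()
    filename = _KNOWN_TRANSCRIPTS.get(code, f"sample_{code}.md")
    return data_dir / filename


@dataclass
class ModelConfig:
    # Only the field added by this PR is sketched; the real class
    # presumably carries other settings as well.
    transcript_language: str = field(
        default_factory=lambda: os.getenv("TRANSCRIPT_LANGUAGE", "en")
    )
```

With this sketch, `get_transcript_path(ModelConfig().transcript_language)` resolves to the English sample unless TRANSCRIPT_LANGUAGE says otherwise.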

Transcript file selection

  • 02_create_embeddings.py: replaced hardcoded English/Spanish fallback logic with get_transcript_path()
  • 05_extract_themes_llm.py: fixed wrong path (outputs/02_translated_english.md never existed) and replaced with get_transcript_path()
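
The broken path in 05_extract_themes_llm.py went unnoticed because the script fell back to another file without complaint. One way to guard against that class of bug is to fail loudly when the resolved transcript is missing. The sketch below is an illustration only; `load_transcript` is a hypothetical helper, not a function in this repo.

```python
from pathlib import Path


def load_transcript(path: Path) -> str:
    """Read a transcript, raising instead of silently switching files."""
    if not path.is_file():
        raise FileNotFoundError(
            f"Transcript not found: {path}. Check TRANSCRIPT_LANGUAGE in .env."
        )
    return path.read_text(encoding="utf-8")
```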

LLM prompt fixes (src/llm_tasks.py)

  • extract_candidate_themes, extract_general_themes: added instruction to always return theme names and section headings in English, while quotes may remain in the transcript's original language
  • code_nonverbal_cues: added instruction to always return cue_type in English (e.g. "Laughter", not "Risas")
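
The prompt changes above amount to appending one output-language rule to each task prompt. A minimal sketch of that idea follows; the rule wording and the names `ENGLISH_LABELS_RULE` and `with_language_rule` are illustrative assumptions, not the actual text or helpers in src/llm_tasks.py.

```python
# Hypothetical wording; the real instruction in src/llm_tasks.py may differ.
ENGLISH_LABELS_RULE = (
    "Always write theme names, section headings, and cue_type values in English. "
    "Quotes and transcript excerpts must stay in the transcript's original language."
)


def with_language_rule(base_prompt: str) -> str:
    """Append the output-language rule to a task prompt exactly once."""
    if ENGLISH_LABELS_RULE in base_prompt:
        return base_prompt
    return f"{base_prompt.rstrip()}\n\n{ENGLISH_LABELS_RULE}"
```

Keeping the rule in one constant means all three tasks (extract_candidate_themes, extract_general_themes, code_nonverbal_cues) would share the same wording.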

Structural UI strings

  • 04_theme_classification_embeddings.py: HTML lang attribute, page title, all stat labels, section headings, buttons, and console print headers translated to English
  • 06_nonverbal_coding_llm.py: same scope as above
  • 07_inductive_clustering.py: console section headers and labels translated to English

How to test or reproduce? 🧪

  1. Copy .env.example to .env and set your API key
  2. Run the full pipeline with TRANSCRIPT_LANGUAGE=es:
    python examples/02_create_embeddings.py
    python examples/03_relevance_filtering.py
    python examples/04_theme_classification_embeddings.py
    python examples/05_extract_themes_llm.py
    python examples/06_nonverbal_coding_llm.py
    python examples/07_inductive_clustering.py
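
For convenience, the commands above can be chained in a small runner. This is a sketch, not part of the PR: `run_pipeline` and `build_env` are hypothetical names, and it assumes each script reads TRANSCRIPT_LANGUAGE from the process environment (if the scripts load .env in a way that overrides existing variables, set the value in .env instead).

```python
import os
import subprocess
import sys

# Script paths taken from the test steps in this PR.
PIPELINE_SCRIPTS = [
    "examples/02_create_embeddings.py",
    "examples/03_relevance_filtering.py",
    "examples/04_theme_classification_embeddings.py",
    "examples/05_extract_themes_llm.py",
    "examples/06_nonverbal_coding_llm.py",
    "examples/07_inductive_clustering.py",
]


def build_env(lang: str) -> dict:
    """Copy the current environment and set the transcript language."""
    env = dict(os.environ)
    env["TRANSCRIPT_LANGUAGE"] = lang
    return env


def run_pipeline(lang: str = "es") -> None:
    """Run each script in order, stopping at the first failure."""
    env = build_env(lang)
    for script in PIPELINE_SCRIPTS:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run([sys.executable, script], env=env, check=True)
```

Calling `run_pipeline("es")` from the repository root runs the six scripts against the Spanish transcript; `run_pipeline("en")` does the same for English.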

@DavidTorresLeon DavidTorresLeon merged commit 742c180 into main May 20, 2026
1 check passed