Updated the code to avoid hardcoded Spanish; all outputs are in English #2
Merged
…modification Changed last script to do model comparisons
Pull Request Summary 🚀
What does this PR do? 📝
Fixes mixed-language outputs across the pipeline (scripts 02–07) so that all structural text (HTML reports, console labels, chart titles, section headers) is always in English, while content (quotes, transcript excerpts) respects the language of the source transcript. Also introduces a `TRANSCRIPT_LANGUAGE` environment variable so users can explicitly select which transcript language to work in.
Why is this change needed? 🤔
When running the pipeline with a Spanish transcript, outputs were a mix of Spanish and English: HTML report titles and stat labels were hardcoded in Spanish, console summaries used Spanish section headers, and LLM calls returned Spanish metadata (e.g. `cue_type: "Risas"` instead of `"Laughter"`). This created inconsistent, hard-to-read outputs and made the repo confusing for participants working across both language versions. There was also a broken file path in `05_extract_themes_llm.py` that silently caused it to always fall back to the Spanish transcript regardless of what the user intended.
How was this implemented? 🛠️
Changes span four areas:
1. Language selection via `.env`
   - Added `TRANSCRIPT_LANGUAGE=en` to `.env.example` (with full documentation of all configurable env vars)
   - Added `transcript_language` field to `ModelConfig` in `src/openai_client.py`, loaded from the env var
   - Added `get_transcript_path(lang)` helper: maps `en` → `sample_english.md`, `es` → `sample_spanish.md`, and any other code → `sample_<lang>.md` for extensibility
2. Transcript file selection
   - `02_create_embeddings.py`: replaced hardcoded English/Spanish fallback logic with `get_transcript_path()`
   - `05_extract_themes_llm.py`: fixed wrong path (`outputs/02_translated_english.md` never existed) and replaced it with `get_transcript_path()`
3. LLM prompt fixes (`src/llm_tasks.py`)
   - `extract_candidate_themes`, `extract_general_themes`: added instruction to always return theme names and section headings in English, while quotes may remain in the transcript's original language
   - `code_nonverbal_cues`: added instruction to always return `cue_type` in English (e.g. `"Laughter"`, not `"Risas"`)
4. Structural UI strings
   - `04_theme_classification_embeddings.py`: HTML `lang` attribute, page title, all stat labels, section headings, buttons, and console print headers translated to English
   - `06_nonverbal_coding_llm.py`: same scope as above
   - `07_inductive_clustering.py`: console section headers and labels translated to English
How to test or reproduce? 🧪
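Before running the full pipeline, the language-to-file mapping described for `get_transcript_path(lang)` can be checked in isolation, without an API key. The snippet below is only a minimal sketch of that mapping and not the code merged in this PR: the `base_dir` parameter, the `transcript_language_from_env` helper, the lowercasing of the code, and the `en` default when the variable is unset are all assumptions made for illustration.

```python
import os
from pathlib import Path

# Hypothetical stand-in for the helper described in this PR; the real
# implementation and its location in the repo may differ.
_KNOWN_TRANSCRIPTS = {
    "en": "sample_english.md",
    "es": "sample_spanish.md",
}


def get_transcript_path(lang: str, base_dir: Path = Path(".")) -> Path:
    """Map a language code to a transcript file name.

    `base_dir` is an assumption: the PR does not say where transcripts live.
    """
    code = lang.strip().lower()
    # Any code other than en/es falls through to sample_<lang>.md.
    filename = _KNOWN_TRANSCRIPTS.get(code, f"sample_{code}.md")
    return base_dir / filename


def transcript_language_from_env(default: str = "en") -> str:
    """Read TRANSCRIPT_LANGUAGE; defaulting to "en" is an assumption."""
    return os.environ.get("TRANSCRIPT_LANGUAGE", default).strip().lower()


if __name__ == "__main__":
    # Prints the transcript file selected by the current environment.
    print(get_transcript_path(transcript_language_from_env()))
```

Running it with `TRANSCRIPT_LANGUAGE=es` set in the environment should select the Spanish sample, and an unlisted code such as `fr` should resolve to `sample_fr.md`.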
1. Copy `.env.example` to `.env` and set your API key
2. With `TRANSCRIPT_LANGUAGE=es`: