A comparative evaluation framework for testing Spanish language understanding across multiple LLMs using Ollama, with GPT-5 as the judge.
This project evaluates how well different language models understand Spanish vocabulary by:
- Generating responses using multiple Ollama models with two different prompts
- Evaluating response accuracy using GPT-5 as a judge
- Creating comparative performance reports
You will need:

- Python 3.13+
- Ollama installed and running locally
- OpenAI API key for GPT-5 judging
To set up the project:

- Clone the repository:

  ```bash
  git clone https://github.com/madebygps/palabras-perdidas.git
  cd palabras-perdidas
  ```

- Install dependencies using uv:

  ```bash
  uv sync
  ```

- Create a `.env` file with your OpenAI API key:

  ```text
  OPENAI_API_KEY=your_api_key_here
  ```

- Configure models in `suite/models_list.txt` (comment out models with `#` to skip); a parsing sketch follows the listing:

  ```text
  gemma3:12b
  llama3.1:latest
  # mixtral:latest
  ```
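The list is plain text, one model per line. A minimal sketch of how it could be parsed, assuming `#` marks a skipped model (the function name is illustrative, not taken from the repo):

```python
from pathlib import Path

def load_models(path: str = "suite/models_list.txt") -> list[str]:
    """Return active model names, skipping blank lines and '#'-commented ones."""
    models = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if name and not name.startswith("#"):
            models.append(name)
    return models

# With the listing above: ["gemma3:12b", "llama3.1:latest"]
print(load_models())
```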
Run the complete evaluation pipeline:

```bash
uv run main.py
```

This will:
- Process all active models with both prompts against the vocabulary
- Judge all responses using GPT-5
- Generate a summary report and display results (a sketch of the generate-and-judge loop follows this list)
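A minimal sketch of one generate-and-judge step, assuming the `ollama` and `openai` Python clients; the function names, judge prompt, and exact model identifier are illustrative rather than the repo's actual code:

```python
import ollama                  # local Ollama client, assumed installed
from openai import OpenAI      # reads OPENAI_API_KEY from the environment

client = OpenAI()

def generate(model: str, prompt: str) -> str:
    """Ask one local Ollama model to answer a single prompt."""
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

def judge(word: str, definition: str, answer: str) -> str:
    """Ask the judge model whether the answer matches the reference definition."""
    question = (
        f"Word: {word}\nReference definition: {definition}\nModel answer: {answer}\n"
        "Reply 'correct' or 'incorrect' with a one-sentence justification."
    )
    verdict = client.chat.completions.create(
        model="gpt-5",  # judge named in this README; exact identifier is an assumption
        messages=[{"role": "user", "content": question}],
    )
    return verdict.choices[0].message.content
```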
Each word is tested with two prompts (a template-filling sketch follows the list):

- Prompt A: Definition request - "Dime la definición de la palabra '{word}'."
- Prompt B: Contextual usage - "Escribe dos frases, una con la palabra '{word}', y otra que no contenga esa palabra, pero que esté relacionada con la primera y complemente su significado."
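Both templates carry a `{word}` placeholder, so filling them per vocabulary entry is straightforward (variable names below are illustrative):

```python
PROMPT_A = "Dime la definición de la palabra '{word}'."
PROMPT_B = (
    "Escribe dos frases, una con la palabra '{word}', y otra que no contenga esa "
    "palabra, pero que esté relacionada con la primera y complemente su significado."
)

# e.g. "Dime la definición de la palabra 'ardilla'."
prompt = PROMPT_A.format(word="ardilla")
```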
Edit vocabulary files in the suite/ directory (a loading sketch with an assumed entry format follows):

- `vocabulary_short.json` - 10 words for quick testing
- `vocabulary_complete.json` - Full vocabulary set
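The vocabulary schema isn't reproduced in this README; as a sketch, loading could look like the snippet below, assuming each entry pairs a word with its reference definition:

```python
import json

# Assumed entry shape: {"word": "...", "definition": "..."}; the actual schema
# in suite/vocabulary_*.json may differ.
with open("suite/vocabulary_short.json", encoding="utf-8") as handle:
    vocabulary = json.load(handle)

for entry in vocabulary:
    print(entry["word"], "->", entry["definition"])
```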
The pipeline writes one JSON file per prompt, model, and word under `output/`:

```text
output/
├── prompt_a/
│   ├── gemma3:12b/
│   │   └── ardilla.json
│   └── llama3.1:latest/
│       └── ardilla.json
└── prompt_b/
    ├── gemma3:12b/
    │   └── ardilla.json
    └── llama3.1:latest/
        └── ardilla.json
```
A `summary.json` file with the aggregated results is also generated.
Each word file contains (an illustrative record follows the list):
- Original word and definition
- Prompt used
- Model response
- Judge result (correct/incorrect)
- Judge reasoning
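As an illustration, one record might look like the dictionary below; the key names are assumptions, and only the listed contents come from this README:

```python
# Hypothetical shape of output/prompt_a/gemma3:12b/ardilla.json
record = {
    "word": "ardilla",
    "definition": "...",        # reference definition from the vocabulary file
    "prompt": "Dime la definición de la palabra 'ardilla'.",
    "response": "...",          # the Ollama model's answer
    "judge_result": "correct",  # "correct" or "incorrect"
    "judge_reasoning": "...",   # GPT-5's explanation
}
```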
Results are displayed in a table showing correct/total for each model and prompt type:
```text
Model Performance Summary
┌─────────────────┬─────────────────┬─────────────────┐
│ Model           │ Prompt A        │ Prompt B        │
│                 │ Correct         │ Correct         │
├─────────────────┼─────────────────┼─────────────────┤
│ gemma3:12b      │ 8/10            │ 10/10           │
│ llama3.1:latest │ 9/10            │ 7/10            │
└─────────────────┴─────────────────┴─────────────────┘
```
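A table like the one above can be produced with a console-table library; a minimal sketch assuming `rich` (the repo may render it differently):

```python
from rich.console import Console
from rich.table import Table

def print_summary(summary: dict[str, dict[str, str]]) -> None:
    """summary maps model name -> {"prompt_a": "8/10", "prompt_b": "10/10"}."""
    table = Table(title="Model Performance Summary")
    table.add_column("Model")
    table.add_column("Prompt A\nCorrect")
    table.add_column("Prompt B\nCorrect")
    for model, scores in summary.items():
        table.add_row(model, scores["prompt_a"], scores["prompt_b"])
    Console().print(table)

print_summary({
    "gemma3:12b": {"prompt_a": "8/10", "prompt_b": "10/10"},
    "llama3.1:latest": {"prompt_a": "9/10", "prompt_b": "7/10"},
})
```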
Licensed under the MIT License.