The Model Evaluation Pipeline

This is an evaluation pipeline built for a career advice chatbot. Three models, three prompt variants, twelve test cases across difficulty levels. That's 3 x 3 x 12 = 108 evaluations. Each one is scored by an LLM judge on four dimensions (Coherence, Relevance, Fluency, Consistency). Each one is tracked for actual token costs.
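As a rough illustration of where the 108 comes from, the evaluation grid is the Cartesian product of models, prompt variants, and test cases. The model and test case names below are placeholders, not the ones defined in openai_models.py or prompts.py.

```python
from itertools import product

# Placeholder names: the real model list lives in openai_models.py and the
# real test cases live in prompts.py.
models = ["model_1", "model_2", "model_3"]
variants = ["A_zero_shot", "B_few_shot", "C_custom"]
test_cases = [f"case_{i:02d}" for i in range(1, 13)]

# One tuple per (model, variant, test case) combination.
grid = list(product(models, variants, test_cases))
print(len(grid))  # 108
```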

The output that matters most is cost versus performance. Plot mean cost per evaluation against mean quality score for each model. This is where you might discover that a cheaper model delivers 90% of the quality at 20% of the cost. Or that the expensive model barely outperforms on easy questions but pulls ahead on hard ones.
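A minimal sketch of the aggregation behind that plot, using only the standard library. The column names (`model`, `cost_usd`, `score`) and the rows are assumptions made up for illustration; they are not the project's CSV schema or its results.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_model(rows):
    """Return {model: (mean_cost_usd, mean_score)} from per-evaluation rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["model"]].append(row)
    return {
        model: (
            mean(r["cost_usd"] for r in items),
            mean(r["score"] for r in items),
        )
        for model, items in grouped.items()
    }


# Made-up rows for illustration only, not results from this project.
rows = [
    {"model": "small", "cost_usd": 0.001, "score": 4.0},
    {"model": "small", "cost_usd": 0.003, "score": 3.0},
    {"model": "large", "cost_usd": 0.010, "score": 4.5},
    {"model": "large", "cost_usd": 0.030, "score": 4.5},
]
summary = summarize_by_model(rows)
```

Each entry in `summary` is one point on the scatter plot: x is the mean cost, y is the mean score.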

The framework is reusable. Swap the prompts, change the test cases, update the model list. The judge template, cost tracking, and visualization pipeline work unchanged. It doesn't care if you're building a career advisor or a customer service chatbot. It cares about measuring quality and cost systematically.

Project Structure

Pipeline Entry Point

main.py — Runs the full pipeline with a single command. Executes evaluation, visualization, and difficulty analysis in sequence, passing the output folder automatically between steps.
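One way to picture the hand-off between steps, sketched with stand-in functions. `run_pipeline` and the `fake_*` steps are hypothetical names for illustration; the actual main.py may wire the steps together differently.

```python
def run_pipeline(evaluate, later_steps):
    """Run the evaluation first, then pass its output folder to each later step."""
    output_dir = evaluate()
    for step in later_steps:
        step(output_dir)
    return output_dir


# Stand-in steps that only record what they were given.
calls = []


def fake_evaluate():
    calls.append(("evaluate", None))
    return "output/example-run"


def fake_visualize(output_dir):
    calls.append(("visualize", output_dir))


def fake_difficulty(output_dir):
    calls.append(("difficulty", output_dir))


result = run_pipeline(fake_evaluate, [fake_visualize, fake_difficulty])
```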

Core Evaluation

main_llm_models.py — The evaluation engine. Loops over every combination of model, prompt variant, and test case. Sends each question to the chatbot, judges the response with an LLM, tracks token costs, and saves all results to a timestamped CSV.
eval.py — Handles the LLM-as-judge pattern. Sends the judge prompt to the judge model, parses the structured output (scores + justifications for Coherence, Relevance, Fluency, Consistency), and returns cost info.
prompts.py — Contains the three prompt variants (A_zero_shot, B_few_shot, C_custom), the 12 test cases with difficulty levels and expected key aspects, and the judge prompt template with the 1-5 scoring rubric.
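To make the judge step concrete, here is a minimal sketch of parsing and validating a structured judge reply. It assumes the judge returns JSON with one object per dimension; that format and the `parse_judge_output` name are assumptions, not necessarily what eval.py does.

```python
import json

DIMENSIONS = ("Coherence", "Relevance", "Fluency", "Consistency")


def parse_judge_output(raw):
    """Parse a JSON judge reply into {dimension: {"score", "justification"}}."""
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        score = int(data[dim]["score"])
        # Reject anything outside the 1-5 rubric instead of silently keeping it.
        if not 1 <= score <= 5:
            raise ValueError(f"{dim} score {score} is outside the 1-5 rubric")
        scores[dim] = {
            "score": score,
            "justification": data[dim]["justification"],
        }
    return scores


# Hypothetical judge reply, built here only to exercise the parser.
example = json.dumps(
    {dim: {"score": 4, "justification": "Clear and on topic."} for dim in DIMENSIONS}
)
parsed = parse_judge_output(example)
```

Validating the score range at parse time keeps a malformed judge reply from skewing the averages later on.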

Configuration

openai_client.py — Initializes the OpenAI client using the API key from .env.
openai_models.py — Defines which models to evaluate (CHATBOT_MODELS) and which model serves as the judge (JUDGE_MODEL).
costs.py — Thin wrapper around the tokencost library that converts token counts to USD for any supported model.
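The arithmetic behind a token-to-USD conversion looks roughly like the sketch below. It does not call tokencost; the price table, model name, and `cost_usd` function are invented placeholders that stand in for the per-model prices the library looks up.

```python
from decimal import Decimal

# Placeholder prices in USD per 1M tokens, invented for illustration.
PRICES_PER_MILLION = {
    "example-model": {"input": Decimal("1.00"), "output": Decimal("4.00")},
}


def cost_usd(model, input_tokens, output_tokens):
    """Convert token counts to USD using the placeholder price table."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / Decimal(1_000_000)


example_cost = cost_usd("example-model", input_tokens=1000, output_tokens=500)
```

Using `Decimal` avoids floating-point drift when many small per-call costs are summed.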

Analysis & Visualization

visualize.py — Generates variant comparison bar charts (overall and per-dimension) and a cost-vs-performance scatter plot comparing models.
difficulty_breakdown.py — Produces a variant x difficulty heatmap, model x difficulty grouped bar chart, cost efficiency breakdowns, and prints best/worst case analysis with off-topic handling details.
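The data behind a variant x difficulty heatmap is a table of mean scores per cell. A minimal standard-library sketch follows; the row keys, the difficulty labels, and the scores are assumptions for illustration, not the project's schema or results.

```python
from collections import defaultdict
from statistics import mean


def pivot_mean(rows, row_key, col_key, value_key):
    """Return {(row_value, col_value): mean of value_key} over the given rows."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r[row_key], r[col_key])].append(r[value_key])
    return {cell: mean(values) for cell, values in cells.items()}


# Made-up rows for illustration only.
rows = [
    {"variant": "A_zero_shot", "difficulty": "easy", "score": 4},
    {"variant": "A_zero_shot", "difficulty": "easy", "score": 5},
    {"variant": "A_zero_shot", "difficulty": "hard", "score": 3},
    {"variant": "B_few_shot", "difficulty": "hard", "score": 4},
]
heatmap = pivot_mean(rows, "variant", "difficulty", "score")
```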

Usage

Run the full pipeline:

python main.py

Or run each step individually:

python main_llm_models.py          # Step 1: Run evaluations
python visualize.py                # Step 2: Generate variant/cost plots
python difficulty_breakdown.py     # Step 3: Difficulty analysis & charts

All outputs (the results CSV and every generated chart) are saved to output/<timestamp>/, for example output/2026-03-13_16-11-25/.
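A sketch of how such a run folder could be created, assuming the timestamp follows the pattern of the committed sample run (2026-03-13_16-11-25). The function names are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def run_dir_name(now):
    """Format a datetime as a folder name such as 2026-03-13_16-11-25."""
    return now.strftime("%Y-%m-%d_%H-%M-%S")


def make_run_dir(base="output", now=None):
    """Create and return base/<timestamp>/ for the current run."""
    now = now or datetime.now()
    run_dir = Path(base) / run_dir_name(now)
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Zero-padded, most-significant-first fields mean the folders sort chronologically by name.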
