An evaluation framework for Large Language Model (LLM) responses using DSPy. For a comprehensive explanation of the concepts and methodology, please read our Medium article.
This framework provides a standardized approach to evaluate LLM-generated responses across four key metrics:
- Relevancy: How well the answer aligns with the question
- Correctness: Factual accuracy compared to ground truth
- ROUGE: Text overlap with reference responses
- Toxicity: Detection of inappropriate content
Results are standardized into a traffic light system (🟢 Green, 🟡 Yellow, 🔴 Red) for intuitive interpretation. For detailed explanations of these metrics and the evaluation methodology, refer to our Medium article.
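To illustrate the idea, a metric score can be bucketed against that metric's configured threshold roughly as in the sketch below; the 80% "yellow band" is an assumption for demonstration only, not the framework's actual rule.

```python
def traffic_light(score: float, threshold: float, yellow_margin: float = 0.8) -> str:
    """Map a 0-1 metric score onto a traffic light, given that metric's threshold."""
    if score >= threshold:
        return "green"   # 🟢 meets or exceeds the threshold
    if score >= threshold * yellow_margin:
        return "yellow"  # 🟡 close to the threshold (illustrative band)
    return "red"         # 🔴 clearly below the threshold


print(traffic_light(0.75, threshold=0.7))  # green
print(traffic_light(0.60, threshold=0.7))  # yellow (0.60 >= 0.7 * 0.8)
print(traffic_light(0.40, threshold=0.7))  # red
```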
```bash
# Clone the repository
git clone https://github.com/yourusername/dspy-llm-evaluator.git
cd dspy-llm-evaluator

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Copy the provided `.env.example` file to create your own `.env` file:

```bash
cp .env.example .env
```

Key environment variables:
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | Your OpenAI API key | None (Required) |
| `LLM_PROVIDER` | The LLM provider to use | openai |
| `MODEL_NAME` | The LLM model to use | gpt-4o |
| `METRICS_THRESHOLD_RELEVANCY` | Threshold for relevancy metric | 0.7 |
| `METRICS_THRESHOLD_CORRECTNESS` | Threshold for correctness metric | 0.7 |
| `METRICS_THRESHOLD_ROUGE` | Threshold for ROUGE metric | 0.5 |
| `OUTPUT_DIR` | Directory or file path for evaluation results | evaluation_results.csv |
| `LOG_LEVEL` | Logging level | INFO |
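For reference, a `.env` file populated with the defaults above might look like this (the API key value is a placeholder):

```
OPENAI_API_KEY=sk-your-key-here
LLM_PROVIDER=openai
MODEL_NAME=gpt-4o
METRICS_THRESHOLD_RELEVANCY=0.7
METRICS_THRESHOLD_CORRECTNESS=0.7
METRICS_THRESHOLD_ROUGE=0.5
OUTPUT_DIR=evaluation_results.csv
LOG_LEVEL=INFO
```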
For supported LLM providers, see the DSPy docs.
Run an evaluation with:

```bash
python main.py --data path/to/evaluation_data.csv --output results.csv
```

Arguments:

- `--data`: Path to the evaluation data CSV (required)
- `--output`: Path to save evaluation results (default: `evaluation_results.csv`)
- `--api_key`: API key for the LLM service (can also be set via environment variable)
- `--metrics`: Comma-separated list of metrics to use (options: `relevancy`, `correctness`, `rouge`, `toxicity`, or `all`)
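For example, to evaluate only a subset of metrics (the file paths here are illustrative):

```bash
python main.py --data data/eval_set.csv --output results.csv --metrics relevancy,correctness
```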
The input CSV should contain:
- `question`: The question or prompt given to the LLM
- `response`: The LLM's response to evaluate
- `reference`: The reference or ground truth answer
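Before running the evaluator, a short pandas snippet (assuming pandas is installed; it is not necessarily a dependency of this project) can verify that an input file has the required columns:

```python
import pandas as pd

REQUIRED_COLUMNS = {"question", "response", "reference"}

df = pd.read_csv("path/to/evaluation_data.csv")
missing = REQUIRED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f"Input CSV is missing required columns: {sorted(missing)}")
print(f"OK: {len(df)} rows ready for evaluation")
```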
Example:

```csv
question,response,reference
"Who won the FIFA World Cup in 2014?","Germany won the FIFA World Cup in 2014 by defeating Argentina 1-0 in the final.","Germany won the FIFA World Cup in 2014 by defeating Argentina 1-0 in the final."
```

Example output:

```
Evaluating responses: 100%|███████████████████████████████████████████████| 11/11 [00:00<00:00, 89.32it/s]
Evaluation complete. Results saved to sample_result.csv

Evaluation Summary:
--------------------------------------------------
🎯 Relevancy: 0.55
✅ Correctness: 0.53
📝 Rouge: 0.41
🛡 Toxicity: 0.91

Overall Status Distribution:
🟢 green: 2 (18.2%)
🟡 yellow: 2 (18.2%)
🔴 red: 7 (63.6%)
```
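For ad-hoc inspection, the per-item results CSV can also be loaded directly with pandas; the `overall_status` column name below is an assumption about the output schema, so check `results.columns` against your actual file:

```python
import pandas as pd

results = pd.read_csv("sample_result.csv")

# Recreate the status distribution from the summary above.
# "overall_status" is an assumed column name; adjust to the real schema.
distribution = results["overall_status"].value_counts(normalize=True).mul(100).round(1)
print(distribution)
```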
The project includes a utility script for post-processing evaluation results:
```bash
python scripts/llm_eval_utils.py <command> [arguments]
```

Available commands:

- `check-quality`: Validates if results meet quality thresholds
- `generate-trends`: Creates trend reports from historical data
- `compare-models`: Compares results from different models
- `generate-report`: Generates HTML reports
- `check-deployment`: Checks if results meet deployment criteria
Example:
```bash
# Generate HTML report
python scripts/llm_eval_utils.py generate-report --results evaluation_results.csv --output report.html
```

This evaluator can be integrated into CI/CD pipelines to ensure consistent performance of LLM assistants; a minimal example step is sketched after the list below. See the GitLab Integration Guide for details on:
- Setting up GitLab CI/CD pipelines for automated evaluations
- Configuring quality thresholds for pipeline success/failure
- Tracking evaluation metrics over time
- Comparing different model versions
- Generating reports and visualizations
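As a rough sketch, a CI job could run the evaluator and then gate the pipeline on the quality check. The `--results` flag for `check-quality` is assumed here by analogy with `generate-report`, so confirm the exact arguments with the script's help output:

```bash
# Hypothetical CI step: run the evaluation, then fail the job if quality thresholds are not met
python main.py --data data/eval_set.csv --output evaluation_results.csv
python scripts/llm_eval_utils.py check-quality --results evaluation_results.csv
```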
The application follows a modular design for extensibility and maintainability:
- Metrics System
  - Abstract `Metric` and `DSPyMetric` base classes
  - Individual implementations for each metric type
- DSPy Integration
  - Leverages DSPy for consistent LLM-based evaluation
  - Custom DSPy signatures and programs for evaluation (see the sketch after this list)
- Scoring System
  - `TrafficLightScorer` standardizes scores (green/yellow/red)
  - Configurable thresholds for evaluation strictness
- Evaluation Pipeline
  - Orchestrates the end-to-end evaluation process
  - Handles parallel metric application and aggregation
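To make the DSPy integration concrete, an LLM-judged metric is typically expressed as a DSPy signature plus a predictor. The sketch below assumes a recent DSPy release; the class and field names are illustrative, not this project's actual API:

```python
import dspy

# Point DSPy at an LLM backend (in the real project this comes from the .env settings).
dspy.configure(lm=dspy.LM("openai/gpt-4o"))

class AssessCorrectness(dspy.Signature):
    """Judge how factually consistent a response is with the reference answer."""

    question: str = dspy.InputField()
    response: str = dspy.InputField()
    reference: str = dspy.InputField()
    score: float = dspy.OutputField(desc="Correctness score between 0 and 1")

judge = dspy.Predict(AssessCorrectness)
result = judge(
    question="Who won the FIFA World Cup in 2014?",
    response="Germany won the 2014 final against Argentina 1-0.",
    reference="Germany won the FIFA World Cup in 2014 by defeating Argentina 1-0 in the final.",
)
print(result.score)
```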
To add new metrics (a minimal sketch follows this list):

- Create a new class inheriting from `Metric` or `DSPyMetric`
- Implement the required `evaluate()` method
- Register the new metric in `__init__.py`
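The exact base-class interface lives in the project's source, so the `evaluate()` signature below is an assumption to be checked against the real `Metric` class before use:

```python
# Hypothetical example; in the real project this class would inherit from Metric
# and be registered in the metrics package's __init__.py.

class ExactMatchMetric:
    """Toy metric: 1.0 if the response matches the reference (case-insensitive), else 0.0."""

    name = "exact_match"

    def evaluate(self, question: str, response: str, reference: str) -> float:
        return float(response.strip().lower() == reference.strip().lower())
```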
For more details on the conceptual framework and methodology, please refer to our Medium article or the original "LLM Evaluator: what AI Scientist must know" article.