An evaluation framework for Large Language Model (LLM) responses using DSPy. For a comprehensive explanation of the concepts and methodology, please read our Medium article.
This framework provides a standardized approach to evaluate LLM-generated responses across four key metrics:
- Relevancy: How well the answer aligns with the question
- Correctness: Factual accuracy compared to ground truth
- ROUGE: Text overlap with reference responses
- Toxicity: Detection of inappropriate content
Results are standardized into a traffic light system (🟢 Green, 🟡 Yellow, 🔴 Red) for intuitive interpretation. For detailed explanations of these metrics and the evaluation methodology, refer to our Medium article.
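To illustrate the idea, a metric score can be bucketed against that metric's configured threshold roughly as in the sketch below; the 80% "yellow band" is an assumption for demonstration only, not the framework's actual rule.

```python
def traffic_light(score: float, threshold: float, yellow_margin: float = 0.8) -> str:
    """Map a 0-1 metric score onto a traffic light, given that metric's threshold."""
    if score >= threshold:
        return "green"   # 🟢 meets or exceeds the threshold
    if score >= threshold * yellow_margin:
        return "yellow"  # 🟡 close to the threshold (illustrative band)
    return "red"         # 🔴 clearly below the threshold


print(traffic_light(0.75, threshold=0.7))  # green
print(traffic_light(0.60, threshold=0.7))  # yellow (0.60 >= 0.7 * 0.8)
print(traffic_light(0.40, threshold=0.7))  # red
```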
```bash
# Clone the repository
git clone https://github.com/yourusername/dspy-llm-evaluator.git
cd dspy-llm-evaluator

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Copy the provided `.env.example` file to create your own `.env` file:

```bash
cp .env.example .env
```

Key environment variables:
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | Your OpenAI API key | None (Required) |
| `LLM_PROVIDER` | The LLM provider to use | openai |
| `MODEL_NAME` | The LLM model to use | gpt-4o |
| `METRICS_THRESHOLD_RELEVANCY` | Threshold for relevancy metric | 0.7 |
| `METRICS_THRESHOLD_CORRECTNESS` | Threshold for correctness metric | 0.7 |
| `METRICS_THRESHOLD_ROUGE` | Threshold for ROUGE metric | 0.5 |
| `OUTPUT_DIR` | Directory or file path for evaluation results | evaluation_results.csv |
| `LOG_LEVEL` | Logging level | INFO |
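For reference, a `.env` file populated with the defaults above might look like this (the API key value is a placeholder):

```
OPENAI_API_KEY=sk-your-key-here
LLM_PROVIDER=openai
MODEL_NAME=gpt-4o
METRICS_THRESHOLD_RELEVANCY=0.7
METRICS_THRESHOLD_CORRECTNESS=0.7
METRICS_THRESHOLD_ROUGE=0.5
OUTPUT_DIR=evaluation_results.csv
LOG_LEVEL=INFO
```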
For supported LLM providers, see the DSPy docs.
Run an evaluation with:

```bash
python main.py --data path/to/evaluation_data.csv --output results.csv
```

Arguments:

- `--data`: Path to the evaluation data CSV (required)
- `--output`: Path to save evaluation results (default: `evaluation_results.csv`)
- `--api_key`: API key for the LLM service (can also be set via environment variable)
- `--metrics`: Comma-separated list of metrics to use (options: `relevancy`, `correctness`, `rouge`, `toxicity`, or `all`)
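For example, to evaluate only a subset of metrics (the file paths here are illustrative):

```bash
python main.py --data data/eval_set.csv --output results.csv --metrics relevancy,correctness
```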
The input CSV should contain:
- `question`: The question or prompt given to the LLM
- `response`: The LLM's response to evaluate
- `reference`: The reference or ground truth answer
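Before running the evaluator, a short pandas snippet (assuming pandas is installed; it is not necessarily a dependency of this project) can verify that an input file has the required columns:

```python
import pandas as pd

REQUIRED_COLUMNS = {"question", "response", "reference"}

df = pd.read_csv("path/to/evaluation_data.csv")
missing = REQUIRED_COLUMNS - set(df.columns)
if missing:
    raise ValueError(f"Input CSV is missing required columns: {sorted(missing)}")
print(f"OK: {len(df)} rows ready for evaluation")
```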
Example:

```csv
question,response,reference
"Who won the FIFA World Cup in 2014?","Germany won the FIFA World Cup in 2014 by defeating Argentina 1-0 in the final.","Germany won the FIFA World Cup in 2014 by defeating Argentina 1-0 in the final."
```

Example output:

```
Evaluating responses: 100%|███████████████████████████████████████████████| 11/11 [00:00<00:00, 89.32it/s]
Evaluation complete. Results saved to sample_result.csv

Evaluation Summary:
--------------------------------------------------
🎯 Relevancy: 0.55
✅ Correctness: 0.53
📝 Rouge: 0.41
🛡 Toxicity: 0.91

Overall Status Distribution:
🟢 green: 2 (18.2%)
🟡 yellow: 2 (18.2%)
🔴 red: 7 (63.6%)
```
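For ad-hoc inspection, the per-item results CSV can also be loaded directly with pandas; the `overall_status` column name below is an assumption about the output schema, so check `results.columns` against your actual file:

```python
import pandas as pd

results = pd.read_csv("sample_result.csv")

# Recreate the status distribution from the summary above.
# "overall_status" is an assumed column name; adjust to the real schema.
distribution = results["overall_status"].value_counts(normalize=True).mul(100).round(1)
print(distribution)
```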
The project includes a utility script for post-processing evaluation results:
```bash
python scripts/llm_eval_utils.py <command> [arguments]
```

Available commands:

- `check-quality`: Validates if results meet quality thresholds
- `generate-trends`: Creates trend reports from historical data
- `compare-models`: Compares results from different models
- `generate-report`: Generates HTML reports
- `check-deployment`: Checks if results meet deployment criteria
Example:
```bash
# Generate HTML report
python scripts/llm_eval_utils.py generate-report --results evaluation_results.csv --output report.html
```

This evaluator can be integrated into CI/CD pipelines to ensure consistent performance of LLM assistants; a minimal example step is sketched after the list below. See the GitLab Integration Guide for details on:
- Setting up GitLab CI/CD pipelines for automated evaluations
- Configuring quality thresholds for pipeline success/failure
- Tracking evaluation metrics over time
- Comparing different model versions
- Generating reports and visualizations
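As a rough sketch, a CI job could run the evaluator and then gate the pipeline on the quality check. The `--results` flag for `check-quality` is assumed here by analogy with `generate-report`, so confirm the exact arguments with the script's help output:

```bash
# Hypothetical CI step: run the evaluation, then fail the job if quality thresholds are not met
python main.py --data data/eval_set.csv --output evaluation_results.csv
python scripts/llm_eval_utils.py check-quality --results evaluation_results.csv
```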
The application follows a modular design for extensibility and maintainability:
- Metrics System
  - Abstract `Metric` and `DSPyMetric` base classes
  - Individual implementations for each metric type
- DSPy Integration
  - Leverages DSPy for consistent LLM-based evaluation
  - Custom DSPy signatures and programs for evaluation (see the sketch after this list)
- Scoring System
  - `TrafficLightScorer` standardizes scores (green/yellow/red)
  - Configurable thresholds for evaluation strictness
- Evaluation Pipeline
  - Orchestrates the end-to-end evaluation process
  - Handles parallel metric application and aggregation
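To make the DSPy integration concrete, an LLM-judged metric is typically expressed as a DSPy signature plus a predictor. The sketch below assumes a recent DSPy release; the class and field names are illustrative, not this project's actual API:

```python
import dspy

# Point DSPy at an LLM backend (in the real project this comes from the .env settings).
dspy.configure(lm=dspy.LM("openai/gpt-4o"))

class AssessCorrectness(dspy.Signature):
    """Judge how factually consistent a response is with the reference answer."""

    question: str = dspy.InputField()
    response: str = dspy.InputField()
    reference: str = dspy.InputField()
    score: float = dspy.OutputField(desc="Correctness score between 0 and 1")

judge = dspy.Predict(AssessCorrectness)
result = judge(
    question="Who won the FIFA World Cup in 2014?",
    response="Germany won the 2014 final against Argentina 1-0.",
    reference="Germany won the FIFA World Cup in 2014 by defeating Argentina 1-0 in the final.",
)
print(result.score)
```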
To add new metrics (a minimal sketch follows this list):

- Create a new class inheriting from `Metric` or `DSPyMetric`
- Implement the required `evaluate()` method
- Register the new metric in `__init__.py`
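The exact base-class interface lives in the project's source, so the `evaluate()` signature below is an assumption to be checked against the real `Metric` class before use:

```python
# Hypothetical example; in the real project this class would inherit from Metric
# and be registered in the metrics package's __init__.py.

class ExactMatchMetric:
    """Toy metric: 1.0 if the response matches the reference (case-insensitive), else 0.0."""

    name = "exact_match"

    def evaluate(self, question: str, response: str, reference: str) -> float:
        return float(response.strip().lower() == reference.strip().lower())
```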
For more details on the conceptual framework and methodology, please refer to our Medium article or the original "LLM Evaluator: what AI Scientist must know" article.