Multi-model comparison (GPT-4o, GPT-5.1, Model-Router) using Azure AI Evaluation SDK with Entra ID authentication.
- run_local_evaluation.py - Local evaluation using azure-ai-evaluation SDK (currently working)
- run_foundry_evaluation.py - Cloud evaluation using Azure AI Foundry (the service currently returns HTTP 500 errors)
- batch_test.py - Generates model responses for all models in the evaluation dataset
- data.txt - Business report context for testing
- evaluate_test_data.jsonl - Ground truth questions and expected answers
- model_responses_gpt-4o.jsonl - GPT-4o model responses
- model_responses_gpt_5_1.jsonl - GPT-5.1 model responses
- model_responses_model_router.jsonl - Model-Router responses
- .env - Azure OpenAI endpoint configuration
- .gitignore - Git ignore patterns
- Create virtual environment:
python -m venv .venv
.venv\Scripts\activate # Windows
- Install dependencies:
pip install azure-ai-evaluation azure-identity openai python-dotenv
- Configure .env:
AZURE_OPENAI_ENDPOINT=https://your-endpoint.cognitiveservices.azure.com/
Note: batch_test.py tests all three models (gpt-4o, gpt-5.1, model-router) automatically. Ensure that a deployment for each model exists in your Azure OpenAI resource.
- Authenticate with Azure:
az login --scope https://cognitiveservices.azure.com/.default
python batch_test.py
Generates responses for all three models:
- model_responses_gpt-4o.jsonl
- model_responses_gpt_5_1.jsonl
- model_responses_model_router.jsonl
The script automatically loops through all configured models.
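The loop over configured models could look roughly like the sketch below. The model-to-filename mapping comes from the file list in this README; `generate_response` is a hypothetical stand-in for the real Azure OpenAI call, and the JSONL field names (`query`, `response`) are assumptions rather than the script's confirmed schema.

```python
import json
from pathlib import Path

# Deployment name -> output file, as listed in this README.
MODEL_OUTPUT_FILES = {
    "gpt-4o": "model_responses_gpt-4o.jsonl",
    "gpt-5.1": "model_responses_gpt_5_1.jsonl",
    "model-router": "model_responses_model_router.jsonl",
}


def generate_response(model: str, question: str) -> str:
    """Hypothetical placeholder; the real script would call the deployment."""
    return f"[{model}] answer to: {question}"


def write_responses(questions, out_dir="."):
    """Write one JSONL file per configured model and return the file paths."""
    written = []
    for model, filename in MODEL_OUTPUT_FILES.items():
        path = Path(out_dir) / filename
        with path.open("w", encoding="utf-8") as f:
            for question in questions:
                # Field names are assumed for illustration only.
                record = {
                    "query": question,
                    "response": generate_response(model, question),
                }
                f.write(json.dumps(record) + "\n")
        written.append(path)
    return written
```

Keeping the mapping in one dictionary means adding a model is a one-line change, and the output names stay in step with the files the evaluation scripts look for.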
Option 1: Local Evaluation (Recommended - Currently Working)
python run_local_evaluation.py
Evaluates all available models locally on:
- Relevance - How well responses address the question
- Coherence - Logical flow and structure
- Groundedness - Alignment with provided context
- Similarity - Comparison to ground truth answers
The script automatically detects which model response files exist and compares all available models (minimum 2 required for comparison).
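A minimal sketch of the detection step described above, assuming the response files sit in a single directory. The function names are hypothetical and not taken from run_local_evaluation.py.

```python
from pathlib import Path

# Display name -> response file, as listed in this README.
RESPONSE_FILES = {
    "GPT-4o": "model_responses_gpt-4o.jsonl",
    "GPT-5.1": "model_responses_gpt_5_1.jsonl",
    "Model-Router": "model_responses_model_router.jsonl",
}


def find_available_models(data_dir="."):
    """Return {model name: path} for every response file that exists."""
    base = Path(data_dir)
    return {
        name: base / filename
        for name, filename in RESPONSE_FILES.items()
        if (base / filename).is_file()
    }


def require_comparable(available):
    """Raise if fewer than two models are available for comparison."""
    if len(available) < 2:
        raise ValueError(
            f"Need at least 2 model response files, found {len(available)}"
        )
    return available
```

Calling `require_comparable(find_available_models())` before any evaluator runs lets a missing file fail fast instead of surfacing later as a partial table.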
Option 2: Cloud Evaluation via Azure AI Foundry (Currently Unavailable)
python run_foundry_evaluation.py
Uses the Azure AI Foundry cloud evaluation service, which currently returns 500 Internal Server Error responses. Keep this script for when the service is restored.
Required additional environment variable:
AZURE_AI_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com/api/projects/your-project
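Before starting a cloud run, it can help to confirm that both endpoints are set. The sketch below assumes the values from .env are already in the process environment (for example via python-dotenv's `load_dotenv()`); `missing_settings` is a hypothetical helper, not part of the scripts.

```python
import os

# Variables named in this README for the cloud evaluation path.
REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_AI_PROJECT_ENDPOINT")


def missing_settings(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit(f"Missing settings in .env: {', '.join(missing)}")
    print("All required endpoints are configured.")
```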
The evaluation produces tabular output comparing all models:
- Score Comparison Table - Side-by-side scores (1-5 scale) for each metric across all models
- Summary Statistics - Wins per model, ties, and overall winner
- Pass Rate Table - Percentage of responses meeting quality thresholds for each model
The tables dynamically adjust to show all available models (GPT-4o, GPT-5.1, Model-Router).
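One way to derive the summary statistics and pass rates from per-response scores is sketched below. The input shape, the pass threshold of 3, and the rule that a metric is "won" by the highest average are all assumptions for illustration; the actual script may compute these differently.

```python
from collections import Counter
from statistics import mean


def summarize(scores, threshold=3):
    """Summarize scores shaped as {metric: {model: [1-5 scores]}}."""
    models = sorted({m for per_model in scores.values() for m in per_model})
    averages, wins, ties = {}, Counter(), 0

    for metric, per_model in scores.items():
        avg = {m: mean(values) for m, values in per_model.items()}
        averages[metric] = avg
        best = max(avg.values())
        leaders = [m for m, a in avg.items() if a == best]
        if len(leaders) == 1:
            wins[leaders[0]] += 1  # a single model leads this metric
        else:
            ties += 1  # two or more models share the top average

    # Pass rate: share of a model's scores at or above the threshold.
    pass_rate = {}
    for m in models:
        all_scores = [s for pm in scores.values() for s in pm.get(m, [])]
        passed = sum(s >= threshold for s in all_scores)
        pass_rate[m] = 100 * passed / len(all_scores)

    # Overall winner only if one model has strictly the most wins.
    ranked = wins.most_common(2)
    clear = ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1])
    overall = ranked[0][0] if clear else None

    return {
        "averages": averages,
        "wins": dict(wins),
        "ties": ties,
        "pass_rate": pass_rate,
        "overall_winner": overall,
    }
```

Because the model list is derived from the input, the same function handles two or three models without changes, which matches the dynamically sized tables described above.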
The evaluation uses GPT-4o as the evaluator model (GPT-5.1 doesn't support the max_tokens parameter used by the evaluation library).
Authentication uses Entra ID via DefaultAzureCredential, so no API keys are required. Ensure your Azure account has appropriate permissions on the Azure OpenAI resource.
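For reference, a keyless client can be built along the following lines. This is a sketch rather than the scripts' actual code: it assumes the `openai` and `azure-identity` packages from the install step, reads `AZURE_OPENAI_ENDPOINT` from the environment, and uses an `api_version` value chosen for illustration. `build_client` is a hypothetical name.

```python
import os

# Scope matching the `az login --scope` command in this README.
COGNITIVE_SERVICES_SCOPE = "https://cognitiveservices.azure.com/.default"


def build_client(api_version="2024-10-21"):
    """Create an AzureOpenAI client authenticated with Entra ID (no API key)."""
    # Imported lazily so this module loads even without the SDKs installed.
    from azure.identity import DefaultAzureCredential, get_bearer_token_provider
    from openai import AzureOpenAI

    # The provider fetches and refreshes bearer tokens on demand.
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), COGNITIVE_SERVICES_SCOPE
    )
    return AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        azure_ad_token_provider=token_provider,
        api_version=api_version,  # assumed value; adjust to your resource
    )
```

`DefaultAzureCredential` tries several sources in turn, including the Azure CLI login from the authentication step, so the same code can run locally and in hosted environments with a managed identity.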