almudenaftourne/ai-evaluation-pipeline

Azure OpenAI Evaluation Pipeline

Compares multiple models (GPT-4o, GPT-5.1, Model-Router) using the Azure AI Evaluation SDK with Entra ID authentication.

Files

Core Scripts

  • run_local_evaluation.py - Local evaluation using the azure-ai-evaluation SDK (currently working)
  • run_foundry_evaluation.py - Cloud evaluation using Azure AI Foundry (the service currently returns HTTP 500 errors)
  • batch_test.py - Generates a response from each model for every question in the evaluation dataset

Data Files

  • data.txt - Business report context for testing
  • evaluate_test_data.jsonl - Ground truth questions and expected answers
  • model_responses_gpt-4o.jsonl - GPT-4o model responses
  • model_responses_gpt_5_1.jsonl - GPT-5.1 model responses
  • model_responses_model_router.jsonl - Model-Router responses
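The exact schema of these JSONL files is not documented here. As a minimal sketch, assuming only that each non-empty line is one JSON object, a file could be loaded as follows. The helper name `load_jsonl` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the file and line so a malformed record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc})") from exc
    return records
```

For example, `load_jsonl("evaluate_test_data.jsonl")` would return one dict per question.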

Configuration

  • .env - Azure OpenAI endpoint configuration
  • .gitignore - Git ignore patterns

Setup

  1. Create and activate a virtual environment:
python -m venv .venv
.venv\Scripts\activate  # Windows
source .venv/bin/activate  # macOS/Linux
  2. Install dependencies:
pip install azure-ai-evaluation azure-identity openai python-dotenv
  3. Configure .env:
AZURE_OPENAI_ENDPOINT=https://your-endpoint.cognitiveservices.azure.com/

Note: The script tests all three models (gpt-4o, gpt-5.1, model-router) automatically, so make sure a deployment for each one exists in your Azure OpenAI resource.

  4. Authenticate with Azure:
az login --scope https://cognitiveservices.azure.com/.default

Usage

Generate Model Responses

python batch_test.py

Generates responses for all three models:

  • model_responses_gpt-4o.jsonl
  • model_responses_gpt_5_1.jsonl
  • model_responses_model_router.jsonl

The script automatically loops through all configured models.
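The loop over models might look roughly like the sketch below. This is not the code in batch_test.py: the function `generate_all`, the `ask` callable, and the record fields `model`, `query`, and `response` are all hypothetical, and only the deployment names and output file names come from this README.

```python
import json

# Deployment name -> output file, as listed above.
OUTPUT_FILES = {
    "gpt-4o": "model_responses_gpt-4o.jsonl",
    "gpt-5.1": "model_responses_gpt_5_1.jsonl",
    "model-router": "model_responses_model_router.jsonl",
}


def generate_all(questions, ask, output_files=OUTPUT_FILES):
    """Write one JSONL file per model, with one record per question.

    `ask(model, question)` is a hypothetical callable returning the answer text.
    """
    for model, filename in output_files.items():
        with open(filename, "w", encoding="utf-8") as out:
            for question in questions:
                record = {
                    "model": model,
                    "query": question,
                    "response": ask(model, question),
                }
                out.write(json.dumps(record) + "\n")
```

Passing the model call in as `ask` keeps the loop independent of any particular client, so it can be exercised with a stub before any Azure deployment is involved.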

Run Evaluation

Option 1: Local Evaluation (Recommended - Currently Working)

python run_local_evaluation.py

Evaluates all available models locally on four metrics:

  • Relevance - How well responses address the question
  • Coherence - Logical flow and structure
  • Groundedness - Alignment with provided context
  • Similarity - Comparison to ground truth answers

The script automatically detects which model response files exist and compares all available models (minimum 2 required for comparison).
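A minimal sketch of that detection step, assuming the response files sit in the working directory. The function name `find_available_models` and the error message are hypothetical; the file names are the ones listed in this README.

```python
from pathlib import Path

# Display name -> response file, as listed in this README.
RESPONSE_FILES = {
    "GPT-4o": "model_responses_gpt-4o.jsonl",
    "GPT-5.1": "model_responses_gpt_5_1.jsonl",
    "Model-Router": "model_responses_model_router.jsonl",
}


def find_available_models(directory=".", response_files=RESPONSE_FILES):
    """Return {model name: path} for every response file that exists."""
    base = Path(directory)
    available = {
        name: base / filename
        for name, filename in response_files.items()
        if (base / filename).is_file()
    }
    # A comparison needs at least two models.
    if len(available) < 2:
        raise RuntimeError(
            f"Found {len(available)} response file(s); at least 2 are required"
        )
    return available
```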

Option 2: Cloud Evaluation via Azure AI Foundry (Currently Unavailable)

python run_foundry_evaluation.py

Uses the Azure AI Foundry cloud evaluation service, which currently returns HTTP 500 (Internal Server Error) responses. Keep this script for when the service is restored.

Required additional environment variable:

AZURE_AI_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com/api/projects/your-project

Output

The evaluation produces tabular output comparing all models:

  1. Score Comparison Table - Side-by-side scores (1-5 scale) for each metric across all models
  2. Summary Statistics - Wins per model, ties, and overall winner
  3. Pass Rate Table - Percentage of responses meeting quality thresholds for each model

The tables dynamically adjust to show all available models (GPT-4o, GPT-5.1, Model-Router).
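One plausible way to derive the summary statistics and pass rates is sketched below. It assumes a "win" means the single highest score on a metric and uses a pass threshold of 3, which is an assumed value; the scripts may compute these differently. The function names are hypothetical.

```python
from collections import Counter


def count_wins(scores):
    """Count metric wins per model.

    `scores` maps metric -> {model: score}. Returns (wins per model, tied metrics).
    """
    wins = Counter()
    ties = 0
    for per_model in scores.values():
        best = max(per_model.values())
        leaders = [model for model, score in per_model.items() if score == best]
        if len(leaders) == 1:
            wins[leaders[0]] += 1
        else:
            ties += 1  # two or more models share the top score
    return dict(wins), ties


def pass_rate(values, threshold=3):
    """Percentage of scores at or above `threshold` (assumed value, 1-5 scale)."""
    if not values:
        return 0.0
    return 100.0 * sum(value >= threshold for value in values) / len(values)
```

The overall winner could then be taken as the model with the most wins, with a separate rule for the case where the win counts are equal.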

Evaluator Configuration

The evaluation uses GPT-4o as the evaluator (judge) model, because GPT-5.1 doesn't support the max_tokens parameter used by the evaluation library.
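As a hedged sketch, the evaluator setup could look like this with the azure-ai-evaluation SDK. The helper names and the `api_version` value are assumptions, and the deployment name `gpt-4o` is taken from the setup note. With no `api_key` in the configuration the SDK is expected to fall back to Entra ID credentials; check this against your installed version.

```python
import os


def build_evaluator_config(deployment="gpt-4o", api_version="2024-06-01"):
    """Model configuration for the judge model (api_version is an assumed value)."""
    return {
        "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
        "azure_deployment": deployment,
        "api_version": api_version,
    }


def build_evaluators(model_config):
    """Create the four evaluators named in this README."""
    # Imported lazily so build_evaluator_config works without the SDK installed.
    from azure.ai.evaluation import (
        CoherenceEvaluator,
        GroundednessEvaluator,
        RelevanceEvaluator,
        SimilarityEvaluator,
    )

    return {
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
        "groundedness": GroundednessEvaluator(model_config),
        "similarity": SimilarityEvaluator(model_config),
    }
```

Each evaluator is then called with keyword arguments, for example `evaluators["similarity"](query=..., response=..., ground_truth=...)`.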

Authentication

Uses Entra ID (DefaultAzureCredential) - no API keys required. Ensure your Azure account has appropriate permissions on the Azure OpenAI resource.
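A keyless client can be built roughly as follows. This is a sketch under assumptions, not the repository's code: the helper names and the `api_version` value are hypothetical, while the scope string matches the one used with az login above.

```python
import os

COGNITIVE_SERVICES_SCOPE = "https://cognitiveservices.azure.com/.default"


def require_env(name):
    """Return a required environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or the environment")
    return value


def build_client(api_version="2024-06-01"):
    """Create an AzureOpenAI client that authenticates with Entra ID tokens."""
    # Third-party imports are deferred so require_env can be used on its own.
    from azure.identity import DefaultAzureCredential, get_bearer_token_provider
    from dotenv import load_dotenv
    from openai import AzureOpenAI

    load_dotenv()  # picks up AZURE_OPENAI_ENDPOINT from .env if present
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), COGNITIVE_SERVICES_SCOPE
    )
    return AzureOpenAI(
        azure_endpoint=require_env("AZURE_OPENAI_ENDPOINT"),
        azure_ad_token_provider=token_provider,
        api_version=api_version,  # assumed value; use one your resource supports
    )
```

`DefaultAzureCredential` tries several credential sources in turn, including the Azure CLI login, so the same code can run locally after az login and in hosted environments with a managed identity.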
