From Fundamentals to Production-Ready Evaluation Frameworks
This hands-on workshop series builds from basic LLM evaluation concepts up to sophisticated, production-ready evaluation frameworks like those used in enterprise solutions such as the AutoAuth Solution Accelerator.
- Technical professionals with basic programming experience
- Developers new to AI/LLM development
- Solution architects exploring LLM integration patterns
- Data scientists looking to implement systematic LLM evaluation
- Basic Python programming knowledge
- Azure subscription with access to create resources
- Familiarity with Jupyter notebooks
- Understanding of basic software development concepts
The workshop is organized into 3 progressive labs, each building upon the previous one. Each lab combines theory, hands-on coding, and real-world application patterns.
| Lab | Duration | Focus Area | Key Skills |
|---|---|---|---|
| Lab 1 | 60 min | Fundamentals | Basic evaluation concepts, dataset creation, simple evaluators |
| Lab 2 | 60–90 min | Azure AI Foundry Evaluations | Using AI Foundry evaluation tools, batch evaluation, integration patterns |
| Lab 3 | 60–90 min | Red Teaming & Adversarial Evaluation | Red teaming practices, automated adversarial testing, safety metrics |
Notebook: lab1_evaluation_fundamentals/lab1_evaluation_fundamentals.ipynb
Duration: 60 minutes
- Understand why LLM evaluation is critical for production systems
- Learn core evaluation metrics (relevance, coherence, groundedness)
- Perform your first evaluation using basic evaluation tools and utilities
- Recognize the non-deterministic nature of LLM outputs
- Introduction (15 min)
  - Why traditional software testing doesn't work for LLMs
  - Demo: Same prompt, different outputs
  - Business impact of poor LLM performance
- Core Concepts (15 min)
  - Quality metrics: Relevance, Coherence, Fluency, Groundedness
  - Safety metrics: Harmful content, Bias detection
  - Performance metrics: Latency, Token usage, Cost analysis
- Hands-On: Basic Evaluation (25 min)
  - Create your first evaluation dataset (Q&A pairs)
  - Run simple evaluators programmatically using helper utilities in `lab1_evaluation_fundamentals/utils/lab1_helpers.py`
  - Interpret evaluation results
- Wrap-up (5 min): Key takeaways and preview
```python
# Key code patterns you'll implement
from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Basic evaluation setup (the evaluator needs an Azure OpenAI judge-model config)
evaluator = RelevanceEvaluator(model_config=model_config)
results = evaluate(data=test_data, evaluators={"relevance": evaluator})
```
- Successfully run first evaluation
- Understand metric interpretation
- Can explain why evaluation is necessary
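Before opening the notebook, it can help to see how those pieces fit together end to end. The sketch below is a minimal, illustrative version, assuming the environment variables from the setup section; the `gpt-4o` deployment name, the sample query/response pair, and the dataset path and column names are placeholders you may need to adjust (for example via the SDK's column mapping if your JSONL uses different field names).

```python
import os
from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Judge-model configuration used by the AI-assisted quality evaluators.
# The deployment name is a placeholder; use any Azure OpenAI chat deployment you have.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}

relevance = RelevanceEvaluator(model_config=model_config)

# Score a single query/response pair (the values here are illustrative)...
single_result = relevance(
    query="What does a relevance evaluator measure?",
    response="It scores how well a response addresses the user's question.",
)
print(single_result)

# ...or run the same evaluator over a JSONL dataset of Q&A pairs.
# Rows are expected to expose query/response fields (or be mapped via evaluator_config).
results = evaluate(
    data="lab1_evaluation_fundamentals/data/sample_qa_pairs.jsonl",
    evaluators={"relevance": relevance},
)
print(results["metrics"])
```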
Notebook: lab2_aifoundry_evals/Evaluate_Azure_AI_Agent_Quality.ipynb
Duration: 60–90 minutes
- Integrate evaluations with Azure AI Foundry and the Azure AI evaluation SDK
- Run batch evaluations and compare models using Foundry tooling
- Build reproducible evaluation workflows that log results to Azure
- Understand Foundry-specific metrics and telemetry
- Foundry Overview (10–15 min)
  - What Azure AI Foundry provides for evaluation and monitoring
  - Differences between local evaluators and Foundry-managed evaluation
- Hands-On: Foundry Setup (15–20 min)
  - Configure credentials and project settings (environment variables or `.env`)
  - Connect to Foundry clients in `shared_utils/azure_clients.py`
- Hands-On: Batch & Multi-Model Evaluation (25–35 min)
  - Create batch evaluation jobs using dataset files and programmatic APIs
  - Compare model variants/configurations and collect standardized metrics
  - Persist evaluation results and telemetry to Foundry for downstream analysis
- Best Practices & Observability (10 min)
  - Logging, monitoring, and cost-aware evaluation strategies
```python
# Example patterns
from shared_utils.azure_clients import create_foundry_client
from shared_utils.evaluation_helpers import run_batch_evaluation

foundry = create_foundry_client()
results = run_batch_evaluation(
    foundry,
    dataset_path="lab2_aifoundry_evals/data/test_dataset.jsonl",
)
```
- Successfully connect to Azure AI Foundry for evaluation runs
- Run batch evaluations and compare at least two model configurations
- Log and inspect evaluation telemetry in Foundry
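If you prefer to call the Azure AI evaluation SDK directly instead of the workshop helpers, a batch run that logs results to a Foundry project might look roughly like the sketch below. The deployment name is a placeholder, the extra `AZURE_SUBSCRIPTION_ID` and `AZURE_RESOURCE_GROUP` variables are assumptions beyond the setup section, and newer SDK versions may accept the project endpoint URL in place of this dictionary.

```python
import os
from azure.ai.evaluation import evaluate, CoherenceEvaluator, RelevanceEvaluator

# Judge-model configuration (deployment name is a placeholder).
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",
}

# Target Foundry project for logging the run; the first two environment
# variables here are assumed for this sketch and are not part of the setup section.
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_FOUNDRY_PROJECT_NAME"],
}

results = evaluate(
    data="lab2_aifoundry_evals/data/foundry_test_dataset.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
    azure_ai_project=azure_ai_project,
    evaluation_name="lab2-batch-run",
)

print(results["metrics"])          # aggregate scores per evaluator
print(results.get("studio_url"))   # portal link when the run is logged to the project
```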
Notebook: lab3_redteaming/AI_RedTeaming.ipynb (and AI_Red_Teaming_Agent_Part2.ipynb)
Duration: 60–90 minutes
- Design and run red-team style, adversarial tests for generative models
- Implement automated red teaming pipelines and synthetic adversarial case generation
- Measure safety, jailbreak resistance, and robustness using structured metrics
- Combine automated red teaming with manual review and triage workflows
- Red Teaming Concepts (15 min)
  - Threat modeling for generative systems
  - Adversarial patterns: prompt injection, jailbreaks, content steering
- Hands-On: Creating Red Team Cases (20–25 min)
  - Generate adversarial prompts programmatically (synthetic generation + curated cases)
  - Store red-team cases in `lab3_redteaming/red_team_output.json` and related datasets
- Hands-On: Automated Red Team Pipeline (20–30 min)
  - Run automated red-team tests against model endpoints
  - Capture safety metrics, severity scores, and rationale logging
  - Integrate results into evaluation dashboards and alerting
- Triage & Remediation (10 min)
  - Prioritize issues and recommend hardened prompt/response filters
  - Create regression tests to track fixes
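One lightweight way to track fixes is to pin each remediated case as a regression test. The sketch below is illustrative only: the case IDs, the `call_model` stub, and the refusal heuristic are placeholders, not part of the lab code.

```python
# test_redteam_regressions.py -- illustrative only; adapt paths, IDs, and the model call
import json
import pytest

CASES_PATH = "lab3_redteaming/red_team_output.json"
FIXED_CASE_IDS = ["example-case-id"]  # placeholder: IDs of issues you have remediated

def call_model(prompt: str) -> str:
    """Placeholder: replace with a real call to your model endpoint."""
    raise NotImplementedError

def load_case(case_id: str) -> dict:
    with open(CASES_PATH) as f:
        return next(c for c in json.load(f) if c.get("id") == case_id)

@pytest.mark.parametrize("case_id", FIXED_CASE_IDS)
def test_previously_fixed_case_still_refused(case_id):
    case = load_case(case_id)
    response = call_model(case["prompt"])  # assumes each stored case has a "prompt" field
    # Crude refusal heuristic; swap in your safety evaluator of choice.
    assert any(marker in response.lower() for marker in ("can't", "cannot", "won't"))
```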
```python
# Example red teaming pattern
from lab3_redteaming import red_team_runner

cases = red_team_runner.load_cases("lab3_redteaming/red_team_output.json")
results = red_team_runner.run_against_model(cases, model_config)
```
- Produce a catalog of adversarial tests
- Automate at least one red-team run and capture safety-related metrics
- Produce remediation steps and regression checks for discovered issues
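If you want to prototype an automated run outside the notebooks, the sketch below shows one minimal shape such a pipeline can take; `call_model`, the expected JSON fields, and the refusal heuristic are placeholder assumptions rather than the lab's actual `red_team_runner` implementation.

```python
import json
from typing import Callable

def run_red_team_cases(
    cases_path: str,
    call_model: Callable[[str], str],
    refusal_markers: tuple = ("i can't", "i cannot", "i won't"),
) -> list:
    """Replay stored adversarial prompts against a model and record simple outcomes."""
    with open(cases_path) as f:
        cases = json.load(f)

    results = []
    for case in cases:
        prompt = case["prompt"]                # assumed field name
        response = call_model(prompt)
        refused = any(m in response.lower() for m in refusal_markers)
        results.append({
            "id": case.get("id"),
            "category": case.get("category"),  # e.g. jailbreak, prompt injection
            "prompt": prompt,
            "response": response,
            "refused": refused,                # crude proxy for a safety outcome
        })
    return results

# Example: attack success rate = share of cases where the model did not refuse.
# results = run_red_team_cases("lab3_redteaming/red_team_output.json", my_model_fn)
# attack_success_rate = sum(not r["refused"] for r in results) / len(results)
```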
```bash
git clone <workshop-repository-url>
cd llm-evaluations-workshop

python -m venv venv
source venv/bin/activate  # On macOS/Linux
pip install -r requirements.txt

# Set up environment variables
export AZURE_OPENAI_ENDPOINT="your-endpoint"
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_AI_FOUNDRY_PROJECT_NAME="your-project"
# Or create a .env file with these values
```
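If you go the `.env` route, a small check like the sketch below (assuming `python-dotenv` is available in your environment) loads the file and fails fast when something is missing:

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed in your environment

# Read AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, etc. from a local .env file.
load_dotenv()

required = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_AI_FOUNDRY_PROJECT_NAME",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```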
Run the setup verification notebook:
```bash
jupyter notebook shared_utils/setup_verification.ipynb
```
Then open the first lab:
```bash
jupyter notebook lab1_evaluation_fundamentals/lab1_evaluation_fundamentals.ipynb
```
This README reflects the repository structure used in the workshop:
```
llm-evaluations-workshop/
├── README.md
├── requirements.txt
├── temp_evaluation_data.jsonl
│
├── lab1_evaluation_fundamentals/
│   ├── lab1_evaluation_fundamentals.ipynb
│   ├── data/
│   │   ├── lab1_basic_evaluation.json
│   │   └── sample_qa_pairs.jsonl
│   └── utils/
│       └── lab1_helpers.py
│
├── lab2_aifoundry_evals/
│   ├── Evaluate_Azure_AI_Agent_Quality.ipynb
│   ├── user_functions.py
│   └── data/
│       └── foundry_test_dataset.jsonl
│
├── lab3_redteaming/
│   ├── AI_RedTeaming.ipynb
│   ├── red_team_output.json
│   └── redteam.log
│
├── shared_utils/
│   ├── azure_clients.py
│   ├── evaluation_helpers.py
│   └── data_utils.py
└── docs/
    ├── evaluation_metrics_guide.md
    ├── azure_setup_guide.md
    └── troubleshooting.md
```
By the end of this workshop series, you will be able to:
- ✅ Implement comprehensive LLM evaluation pipelines
- ✅ Create custom evaluators for domain-specific requirements
- ✅ Integrate evaluations with Azure AI Foundry for production monitoring
- ✅ Run automated red-team tests and build remediation workflows
- ✅ Evaluate the quality vs. cost trade-offs of different LLM approaches
- ✅ Design evaluation strategies that scale with your application
- ✅ Implement continuous evaluation in production environments
- ✅ Build evaluation frameworks that support regulatory compliance
- ✅ Adapt AutoAuth evaluation patterns to other domains
- ✅ Integrate evaluation into CI/CD pipelines
- ✅ Create evaluation dashboards and monitoring systems
- ✅ Establish evaluation best practices for your team
- Azure AI Evaluation SDK Documentation
- AutoAuth Solution Accelerator
- Azure AI Foundry Evaluation Guide
- LLM Evaluation Best Practices
- Workshop Q&A: Use GitHub Issues in this repository
- Azure AI Community: Microsoft Tech Community
- Evaluation Patterns: Check out `lab2_aifoundry_evals/examples/` and `lab3_redteaming/` for additional use cases
After completing this workshop, consider exploring:
- Advanced RAG Evaluation: Specialized patterns for retrieval-augmented generation
- Multi-modal Evaluation: Evaluating vision and audio capabilities
- A/B Testing for LLMs: Statistical approaches to model comparison
- Production Monitoring: Real-time evaluation and alerting systems
We welcome contributions to improve this workshop! Please see our Contributing Guidelines for details on:
- Reporting issues or bugs
- Suggesting new lab exercises
- Adding evaluation patterns
- Improving documentation
This workshop is licensed under the MIT License. See LICENSE for details.
- Technical Issues: Open an issue in this repository
- Azure Setup Problems: Check our Azure Setup Guide
- Evaluation Questions: Review the Troubleshooting Guide
- Advanced Topics: Explore Advanced Topics Documentation