Skip to content

JinLee794/llm-evals-sandbox

Repository files navigation

LLM Evaluations Workshop

From Fundamentals to Production-Ready Evaluation Frameworks

🎯 Overview

This hands-on workshop series takes you through a progressive journey of LLM evaluation techniques, starting from basic concepts and building up to sophisticated, production-ready evaluation frameworks like the ones used in enterprise solutions such as the AutoAuth Solution Accelerator.

Target Audience

  • Technical professionals with basic programming experience
  • Developers new to AI/LLM development
  • Solution architects exploring LLM integration patterns
  • Data scientists looking to implement systematic LLM evaluation

Prerequisites

  • Basic Python programming knowledge
  • Azure subscription with access to create resources
  • Familiarity with Jupyter notebooks
  • Understanding of basic software development concepts

πŸ—οΈ Workshop Structure

The workshop is organized into 3 progressive labs, each building upon the previous one. Each lab combines theory, hands-on coding, and real-world application patterns.

Lab Duration Focus Area Key Skills
Lab 1 60 min Fundamentals Basic evaluation concepts, dataset creation, simple evaluators
Lab 2 60–90 min Azure AI Foundry Evaluations Using AI Foundry evaluation tools, batch evaluation, integration patterns
Lab 3 60–90 min Red Teaming & Adversarial Evaluation Red teaming practices, automated adversarial testing, safety metrics

πŸ“š Lab Details

Lab 1: LLM Evaluation Fundamentals

Notebook: lab1_evaluation_fundamentals/lab1_evaluation_fundamentals.ipynb Duration: 60 minutes

🎯 Learning Objectives

  • Understand why LLM evaluation is critical for production systems
  • Learn core evaluation metrics (relevance, coherence, groundedness)
  • Perform your first evaluation using basic evaluation tools and utilities
  • Recognize the non-deterministic nature of LLM outputs

πŸ“‹ Lab Structure

  1. Introduction (15 min)

    • Why traditional software testing doesn't work for LLMs
    • Demo: Same prompt, different outputs
    • Business impact of poor LLM performance
  2. Core Concepts (15 min)

    • Quality metrics: Relevance, Coherence, Fluency, Groundedness
    • Safety metrics: Harmful content, Bias detection
    • Performance metrics: Latency, Token usage, Cost analysis
  3. Hands-On: Basic Evaluation (25 min)

  4. Wrap-up (5 min): Key takeaways and preview

πŸ”§ Technical Components

# Key code patterns you'll implement
from azure.ai.evaluation import evaluate, RelevanceEvaluator

# Basic evaluation setup
evaluator = RelevanceEvaluator()
results = evaluate(data=test_data, evaluators={"relevance": evaluator})

πŸ“ˆ Success Metrics

  • Successfully run first evaluation
  • Understand metric interpretation
  • Can explain why evaluation is necessary

Lab 2: Azure AI Foundry Evaluations

Notebook: lab2_aifoundry_evals/Evaluate_Azure_AI_Agent_Quality.ipynb Duration: 60–90 minutes

🎯 Learning Objectives

  • Integrate evaluations with Azure AI Foundry and the Azure AI evaluation SDK
  • Run batch evaluations and compare models using Foundry tooling
  • Build reproducible evaluation workflows that log results to Azure
  • Understand Foundry-specific metrics and telemetry

πŸ“‹ Lab Structure

  1. Foundry Overview (10–15 min)

    • What Azure AI Foundry provides for evaluation and monitoring
    • Differences between local evaluators and Foundry-managed evaluation
  2. Hands-On: Foundry Setup (15–20 min)

  3. Hands-On: Batch & Multi-Model Evaluation (25–35 min)

    • Create batch evaluation jobs using dataset files and programmatic APIs
    • Compare model variants/configurations and collect standardized metrics
    • Persist evaluation results and telemetry to Foundry for downstream analysis
  4. Best Practices & Observability (10 min)

    • Logging, monitoring and cost-aware evaluation strategies

πŸ”§ Technical Components

# Example patterns
from shared_utils.azure_clients import create_foundry_client
from shared_utils.evaluation_helpers import run_batch_evaluation

foundry = create_foundry_client()
results = run_batch_evaluation(foundry, dataset_path="lab2_aifoundry_evals/data/test_dataset.jsonl")

πŸ“ˆ Success Metrics

  • Successfully connect to Azure AI Foundry for evaluation runs
  • Run batch evaluations and compare at least two model configurations
  • Log and inspect evaluation telemetry in Foundry

Lab 3: Red Teaming & Adversarial Evaluation

Notebook: lab3_redteaming/AI_RedTeaming.ipynb (and AI_Red_Teaming_Agent_Part2.ipynb) Duration: 60–90 minutes

🎯 Learning Objectives

  • Design and run red-team style, adversarial tests for generative models
  • Implement automated red teaming pipelines and synthetic adversarial case generation
  • Measure safety, jailbreak resistance, and robustness using structured metrics
  • Combine automated red teaming with manual review and triage workflows

πŸ“‹ Lab Structure

  1. Red Teaming Concepts (15 min)

    • Threat modeling for generative systems
    • Adversarial patterns: prompt injection, jailbreaks, content steering
  2. Hands-On: Creating Red Team Cases (20–25 min)

  3. Hands-On: Automated Red Team Pipeline (20–30 min)

    • Run automated red-team tests against model endpoints
    • Capture safety metrics, severity scores, and rationale logging
    • Integrate results into evaluation dashboards and alerting
  4. Triage & Remediation (10 min)

    • Prioritize issues and recommend hardened prompt/response filters
    • Create regression tests to track fixes

πŸ”§ Technical Components

# Example red teaming pattern
from lab3_redteaming import red_team_runner
cases = red_team_runner.load_cases("lab3_redteaming/red_team_output.json")
results = red_team_runner.run_against_model(cases, model_config)

πŸ“ˆ Success Metrics

  • Produce a catalog of adversarial tests
  • Automate at least one red-team run and capture safety-related metrics
  • Produce remediation steps and regression checks for discovered issues

πŸš€ Getting Started

1. Environment Setup

Clone the Workshop Repository

git clone <workshop-repository-url>
cd llm-evaluations-workshop

Create Python Environment

python -m venv venv
source venv/bin/activate  # On macOS/Linux
pip install -r requirements.txt

Configure Azure Resources

# Set up environment variables
export AZURE_OPENAI_ENDPOINT="your-endpoint"
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_AI_FOUNDRY_PROJECT_NAME="your-project"

# Or create .env file these values

2. Verify Setup

Run the setup verification notebook:

jupyter notebook shared_utils/setup_verification.ipynb

3. Start with Lab 1

jupyter notebook lab1_evaluation_fundamentals/lab1_evaluation_fundamentals.ipynb

πŸ“ Repository Structure

This README reflects the repository structure used in the workshop:

llm-evaluations-workshop/
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ temp_evaluation_data.jsonl
β”‚
β”œβ”€β”€ lab1_evaluation_fundamentals/
β”‚   β”œβ”€β”€ lab1_evaluation_fundamentals.ipynb
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   β”œβ”€β”€ lab1_basic_evaluation.json
β”‚   β”‚   └── sample_qa_pairs.jsonl
β”‚   └── utils/
β”‚       └── lab1_helpers.py
β”‚
β”œβ”€β”€ lab2_aifoundry_evals/
β”‚   β”œβ”€β”€ Evaluate_Azure_AI_Agent_Quality.ipynb
β”‚   β”œβ”€β”€ user_functions.py
β”‚   └── data/
β”‚       └── foundry_test_dataset.jsonl
β”‚
β”œβ”€β”€ lab3_redteaming/
β”‚   β”œβ”€β”€ AI_RedTeaming.ipynb
β”‚   β”œβ”€β”€ red_team_output.json
β”‚   └── redteam.log
β”‚
β”œβ”€β”€ shared_utils/
β”‚   β”œβ”€β”€ azure_clients.py
β”‚   β”œβ”€β”€ evaluation_helpers.py
β”‚   └── data_utils.py
└── docs/
    β”œβ”€β”€ evaluation_metrics_guide.md
    β”œβ”€β”€ azure_setup_guide.md
    └── troubleshooting.md

🎯 Workshop Outcomes

By the end of this workshop series, you will be able to:

Technical Skills

  • βœ… Implement comprehensive LLM evaluation pipelines
  • βœ… Create custom evaluators for domain-specific requirements
  • βœ… Integrate evaluations with Azure AI Foundry for production monitoring
  • βœ… Run automated red-team tests and build remediation workflows

Strategic Understanding

  • βœ… Evaluate the quality vs. cost trade-offs of different LLM approaches
  • βœ… Design evaluation strategies that scale with your application
  • βœ… Implement continuous evaluation in production environments
  • βœ… Build evaluation frameworks that support regulatory compliance

Real-World Application

  • βœ… Adapt AutoAuth evaluation patterns to other domains
  • βœ… Integrate evaluation into CI/CD pipelines
  • βœ… Create evaluation dashboards and monitoring systems
  • βœ… Establish evaluation best practices for your team

πŸ“– Additional Resources

Documentation & References

Community & Support

  • Workshop Q&A: Use GitHub Issues in this repository
  • Azure AI Community: Microsoft Tech Community
  • Evaluation Patterns: Check out the lab2_aifoundry_evals/examples/ and lab3_redteaming for additional use cases

Next Steps

After completing this workshop, consider exploring:

  • Advanced RAG Evaluation: Specialized patterns for retrieval-augmented generation
  • Multi-modal Evaluation: Evaluating vision and audio capabilities
  • A/B Testing for LLMs: Statistical approaches to model comparison
  • Production Monitoring: Real-time evaluation and alerting systems

🀝 Contributing

We welcome contributions to improve this workshop! Please see our Contributing Guidelines for details on:

  • Reporting issues or bugs
  • Suggesting new lab exercises
  • Adding evaluation patterns
  • Improving documentation

πŸ“„ License

This workshop is licensed under the MIT License. See LICENSE for details.


πŸ†˜ Need Help?

  • Technical Issues: Open an issue in this repository
  • Azure Setup Problems: Check our Azure Setup Guide
  • Evaluation Questions: Review the Troubleshooting Guide
  • Advanced Topics: Explore Advanced Topics Documentation

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published