
DeepEval Tutorial - Unit Testing LLM Applications

Disclaimer: This is a personal summary and interpretation based on a YouTube video. It is not official material and not endorsed by the original creator. All rights remain with the respective creators.

This document summarizes the key takeaways from the video. I highly recommend watching the full video for visual context and coding demonstrations.

Before You Get Started

  • I summarize key points to help you learn and review quickly.
  • Simply click on Ask AI links to dive into any topic you want.


Introduction to DeepEval

  • Summary: DeepEval is an open-source framework for testing and evaluating large language models, helping ensure accuracy, safety, and reliability in applications like chatbots, RAG pipelines, or agentic systems. It offers metrics for issues like hallucinations and toxicity, with over half a million monthly downloads.
  • Key Takeaway/Example: Think of it as pytest for LLMs—plug-and-play tools to measure performance before deployment.
  • Link for More Details: Ask AI: Introduction to DeepEval

Setup and Installation

  • Summary: Start by creating a project folder and initializing it with uv, which manages the virtual environment and dependencies. Install DeepEval, pandas, numpy, pytest, and Google Generative AI, then verify the installation by importing DeepEval in Python.
  • Key Takeaway/Example: Use commands like uv init, uv add deepeval, and uv add google-generativeai. In VS Code, check pyproject.toml for the dependency list and run a simple import test.
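The setup steps above can be sketched as a shell session (the folder name deepeval-tutorial is illustrative; the dependency list follows the video):

```shell
# Create and enter a project folder, then initialize it with uv
mkdir deepeval-tutorial && cd deepeval-tutorial
uv init

# Add the dependencies used in the tutorial
uv add deepeval pandas numpy pytest google-generativeai
```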
# Verification example
uv run python -c "import deepeval; print('DeepEval installed successfully')"

Configuring the LLM Model

  • Summary: Configure Google's Gemini 2.0 Flash model using the CLI with your API key. This setup ensures the model is used for evaluations without needing to specify it in code every time.
  • Key Takeaway/Example: Run deepeval set gemini --model-name=gemini-2.0-flash --gemini-api-key=YOUR_API_KEY. In the tutorial, configuring via the CLI worked reliably, whereas setting the key through environment variables did not.
  • Link for More Details: Ask AI: Configuring LLM in DeepEval

Basic Test: Answer Relevancy

  • Summary: Create a simple test using the answer relevancy metric to check if outputs match inputs accurately. Structure tests with setup (input/output), execution (measure metric), and assertion (check score and success).
  • Key Takeaway/Example: For input "What is the capital of France?" and output "The capital of France is Paris...", set a threshold of 0.7. A score of 1.0 means perfect relevance.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # Setup: define the input and the model output to evaluate
    input_text = "What is the capital of France?"
    output_text = "The capital of France is Paris, which is located in the northern part of the country."
    test_case = LLMTestCase(input=input_text, actual_output=output_text)
    # Execution: score the output against the input
    metric = AnswerRelevancyMetric(threshold=0.7)
    metric.measure(test_case)
    # Assertion: the metric passes when the score meets the threshold
    assert metric.score >= 0.7
    assert metric.success is True
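To make the pass/fail mechanics concrete without calling an LLM, here is a toy sketch of the same setup/execution/assertion pattern. Note that `mock_relevancy_score` is a made-up keyword heuristic, not DeepEval's actual LLM-judged metric:

```python
# Toy stand-in for an LLM-judged relevancy score (NOT DeepEval's method):
# fraction of the question's content words that appear in the answer.
QUESTION_WORDS = {"what", "which", "who", "when", "where", "why", "how", "is", "the"}

def mock_relevancy_score(question: str, answer: str) -> float:
    keywords = {w.lower().strip("?.,") for w in question.split()} - QUESTION_WORDS
    keywords = {w for w in keywords if w}
    if not keywords:
        return 0.0
    hits = sum(1 for w in keywords if w in answer.lower())
    return hits / len(keywords)

def passes(question: str, answer: str, threshold: float = 0.7) -> bool:
    # Mirrors metric.success: True when the score meets the threshold
    return mock_relevancy_score(question, answer) >= threshold

print(passes("What is the capital of France?",
             "The capital of France is Paris."))  # → True
print(passes("What is the capital of France?",
             "I enjoy long walks."))              # → False
```

The real metric replaces the keyword heuristic with an LLM judge, but the threshold comparison works the same way.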

Advanced Test: Hallucination with Fictional Data

  • Summary: Test for hallucinations by providing context that contradicts the output, like describing a fictional city as real. Use the hallucination metric with a lower threshold to detect inconsistencies.
  • Key Takeaway/Example: Input: "What is the population of Atlantic City?" Output claims it's real with details, but context states it's fictional (Atlantis). Score of 1.0 indicates full hallucination, success False.
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_hallucination():
    input_text = "What is the population of Atlantic City?"
    output_text = "Atlantic City has a population of 2.5 million people as of 2023. It is located in the Mediterranean Sea..."
    # The context contradicts the output: the city is fictional
    context = ["Atlantis is a fictional city... no real place of that name exists."]
    test_case = LLMTestCase(input=input_text, actual_output=output_text, context=context)
    metric = HallucinationMetric(threshold=0.5)
    metric.measure(test_case)
    # A high score means the output contradicts the context (hallucination)
    assert metric.score > 0.5
    assert metric.success is False

Advanced Test: Hallucination with Factual Data

  • Summary: Validate factual accuracy by aligning output and context, resulting in low hallucination scores. This confirms the model sticks to provided information without inventing details.
  • Key Takeaway/Example: Input: "What is the population of Tokyo?" Output and context both state accurate facts (e.g., 14 million residents). Score of 0.0 means no hallucination, success True.
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_hallucination_factual():
    input_text = "What is the population of Tokyo?"
    output_text = "Tokyo has a population of approximately 14 million..."
    # The context agrees with the output, so the hallucination score stays low
    context = ["Tokyo is the capital of Japan... metropolitan area having 14 million residents..."]
    test_case = LLMTestCase(input=input_text, actual_output=output_text, context=context)
    metric = HallucinationMetric(threshold=0.5)
    metric.measure(test_case)
    assert metric.score == 0.0
    assert metric.success is True
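The two hallucination cases differ only in how the score compares to the threshold. A minimal sketch of that pass/fail rule (the contradiction count is supplied by hand here, whereas DeepEval derives it with an LLM judge):

```python
def hallucination_success(contradicted: int, total_context: int,
                          threshold: float = 0.5) -> bool:
    # Score = fraction of context statements the output contradicts;
    # unlike relevancy, LOWER is better, so success means score <= threshold.
    score = contradicted / total_context
    return score <= threshold

print(hallucination_success(1, 1))  # fictional-city case: score 1.0 → False
print(hallucination_success(0, 1))  # factual Tokyo case: score 0.0 → True
```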

Available Metrics and Conclusion

  • Summary: DeepEval offers over 14 metrics across categories like RAG, agentic, and conversational. The tutorial covers setup to running tests, emphasizing its role in unit testing LLMs.
  • Key Takeaway/Example: Explore metrics like toxicity or bias in the documentation for broader evaluations.
  • Link for More Details: Ask AI: DeepEval Metrics Overview

About the summarizer

I'm Ali Sol, a Backend Developer. Learn more: