DeepEval MultiRun

A plug-and-play Python library for running multi-run evaluations with DeepEval to achieve more consistent and reliable LLM evaluation results.

Requires Python 3.8+. License: MIT.

Why Multi-Run Evaluations?

LLM-based evaluations can be non-deterministic, which leads to inconsistent results, especially for edge cases. DeepEval MultiRun addresses this by:

  • Running evaluations multiple times per test case
  • Applying aggregate scoring to determine final pass/fail
  • Providing confidence levels to identify cases needing human review
  • Rate limiting to manage API costs and limits
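The aggregate-scoring idea above can be illustrated with a minimal, library-independent sketch. The names multirun_verdict and make_flaky_judge are hypothetical and are not part of deepeval-multirun; the flaky judge is a stand-in for a real LLM-based metric, and the defaults mirror the 5-run, 3-pass configuration described in this README.

```python
import random
from typing import Callable, Tuple


def multirun_verdict(
    evaluate_once: Callable[[], bool],
    num_runs: int = 5,
    pass_threshold: int = 3,
) -> Tuple[bool, int]:
    """Call evaluate_once num_runs times; pass if enough individual runs pass."""
    if not 1 <= pass_threshold <= num_runs:
        raise ValueError("pass_threshold must be between 1 and num_runs")
    pass_count = sum(1 for _ in range(num_runs) if evaluate_once())
    return pass_count >= pass_threshold, pass_count


def make_flaky_judge(pass_probability: float, seed: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a non-deterministic LLM judge."""
    rng = random.Random(seed)
    return lambda: rng.random() < pass_probability


if __name__ == "__main__":
    judge = make_flaky_judge(pass_probability=0.7, seed=0)
    final_pass, pass_count = multirun_verdict(judge)
    print(f"final_pass={final_pass}, pass_count={pass_count}/5")
```

A single run of a judge that passes 70% of the time fails roughly three times in ten; requiring 3 of 5 passes makes the final verdict less sensitive to any one unlucky run.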

Installation

pip install deepeval-multirun

Or install from source:

git clone https://github.com/MRLab12/deepeval-multirun.git
cd deepeval-multirun
pip install -e .

Quick Start

Basic Usage

Replace your standard DeepEval assert_test with multirun_assert_test:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval_multirun import multirun_assert_test

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris"
    )

    metric = AnswerRelevancyMetric(threshold=0.7)

    # With the default settings, this runs the evaluation 5 times and requires 3+ passes
    multirun_assert_test(test_case, [metric])

Advanced Usage with Custom Configuration

from deepeval_multirun import MultiRunEvaluator, ConfidenceLevel

# Create evaluator with custom settings
evaluator = MultiRunEvaluator(
    num_runs=7,           # Run 7 times
    pass_threshold=5,     # Require 5+ passes
    rate_limit_delay=2.0  # 2 second delay between runs
)

# Run evaluation and get detailed results
result = evaluator.evaluate_test_case(test_case, [metric], "test_001")

print(f"Final Pass: {result.final_pass}")
print(f"Confidence: {result.confidence_level.value}")
print(f"Pass Rate: {result.pass_count}/{result.total_runs}")

# Access per-metric breakdown
for metric_name, breakdown in result.metrics_breakdown.items():
    print(f"{metric_name}: {breakdown['pass_count']}/{breakdown['total_runs']} passes")

Environment-Based Configuration

Enable multi-run evaluation conditionally using environment variables:

from deepeval_multirun import (
    should_use_multirun_evaluation,
    get_multirun_config,
    multirun_assert_test,
)
from deepeval import assert_test

# Check if multi-run is enabled
if should_use_multirun_evaluation():
    multirun_assert_test(test_case, metrics)
else:
    # Fall back to standard single-run evaluation
    assert_test(test_case, metrics)

Environment variables:

# Enable/disable multi-run
export ENABLE_MULTIRUN=true

# Configure multi-run behavior
export MULTIRUN_NUM_RUNS=5
export MULTIRUN_PASS_THRESHOLD=3
export MULTIRUN_RATE_LIMIT_DELAY=1.0
export MULTIRUN_ENABLE_LOGGING=true
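As a rough illustration of how these variables could be turned into typed settings, here is a minimal sketch. The function read_multirun_env is hypothetical and is not the library's get_multirun_config; the accepted truthy spellings ("1", "true", "yes", "on") are also an assumption. The defaults are the ones listed in this README.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Assumption: these spellings count as "true"; the library may parse differently.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def read_multirun_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Parse the documented variables into typed values, using the documented defaults."""
    source = os.environ if env is None else env

    def as_bool(name: str, default: str) -> bool:
        return source.get(name, default).strip().lower() in _TRUE_VALUES

    return {
        "enabled": as_bool("ENABLE_MULTIRUN", "false"),
        "num_runs": int(source.get("MULTIRUN_NUM_RUNS", "5")),
        "pass_threshold": int(source.get("MULTIRUN_PASS_THRESHOLD", "3")),
        "rate_limit_delay": float(source.get("MULTIRUN_RATE_LIMIT_DELAY", "1.0")),
        "enable_logging": as_bool("MULTIRUN_ENABLE_LOGGING", "true"),
    }


if __name__ == "__main__":
    print(read_multirun_env())
```

Passing a plain dict instead of reading os.environ directly keeps the parsing easy to exercise in unit tests.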

Understanding Results

Confidence Levels

  • HIGH: Clear consensus (0-1 passes or 4-5 passes out of 5)
  • LOW: Unclear consensus (2-3 passes out of 5) - recommended for human review
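The rule above is stated only for 5 runs. One way to express it in code, offered as a sketch rather than the library's actual logic, is to treat a result as high confidence when at most 20% or at least 80% of runs pass. That generalization to other run counts is an assumption, and the names Confidence and classify_confidence are hypothetical.

```python
from enum import Enum


class Confidence(Enum):
    """Hypothetical mirror of the HIGH/LOW levels described above."""
    HIGH = "high"
    LOW = "low"


def classify_confidence(pass_count: int, total_runs: int) -> Confidence:
    """HIGH when runs mostly agree (<=20% or >=80% passing), LOW otherwise."""
    if total_runs <= 0 or not 0 <= pass_count <= total_runs:
        raise ValueError("need 0 <= pass_count <= total_runs and total_runs > 0")
    # Integer arithmetic avoids floating-point edge cases at the 20% / 80% cutoffs.
    mostly_fail = 5 * pass_count <= total_runs
    mostly_pass = 5 * pass_count >= 4 * total_runs
    return Confidence.HIGH if (mostly_fail or mostly_pass) else Confidence.LOW


if __name__ == "__main__":
    for passes in range(6):
        print(f"{passes}/5 -> {classify_confidence(passes, 5).value}")
```

For 5 runs this reproduces the table above: 0, 1, 4, and 5 passes are HIGH, while 2 and 3 passes are LOW.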

Result Analysis

from deepeval_multirun import ResultsAnalyzer

# Collect results from multiple test cases
results = [result1, result2, result3, ...]

# Get overall statistics
stats = ResultsAnalyzer.calculate_overall_stats(results)
print(f"Overall pass rate: {stats['overall_pass_rate']:.2%}")
print(f"High confidence rate: {stats['confidence_rate']:.2%}")

# Get cases needing review
low_confidence = ResultsAnalyzer.get_low_confidence_cases(results)
for result in low_confidence:
    print(f"Review needed: {result.test_case_id}")

Configuration Options

MultiRunEvaluator Parameters

Parameter          Type    Default   Description
num_runs           int     5         Number of evaluation runs per test case
pass_threshold     int     3         Minimum passes required for overall pass
enable_logging     bool    True      Enable detailed logging
rate_limit_delay   float   1.0       Delay (seconds) between runs to manage API limits

Environment Variables

Variable                    Default   Description
ENABLE_MULTIRUN             false     Enable multi-run evaluation
MULTIRUN_NUM_RUNS           5         Number of runs
MULTIRUN_PASS_THRESHOLD     3         Pass threshold
MULTIRUN_RATE_LIMIT_DELAY   1.0       Delay (seconds) between runs
MULTIRUN_ENABLE_LOGGING     true      Enable logging

Best Practices

  1. Start with defaults: The default configuration (5 runs, 3 passes) works well for most cases
  2. Use environment variables for CI/CD: Enable multi-run only in staging/production
  3. Review low-confidence cases: These often indicate genuine edge cases or unclear requirements
  4. Monitor API costs: Multi-run increases API calls - adjust num_runs and rate_limit_delay accordingly
  5. Combine with standard tests: Use multi-run for critical test cases, standard evaluation for others

Cost Considerations

Multi-run evaluation increases API calls proportionally:

  • 5 runs = 5x API costs (default)
  • 7 runs = 7x API costs (recommended for production)
  • 10 runs = 10x API costs (maximum reliability)

Mitigation strategies:

  • Use rate_limit_delay to spread API load over time
  • Enable multi-run only for critical test cases
  • Adjust num_runs based on budget constraints
  • Reserve multi-run for staging/production environments
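To budget before enabling multi-run, a back-of-the-envelope estimate can help. The sketch below uses a hypothetical helper, estimate_multirun_cost, and rests on two assumptions: each metric is evaluated once per run (a single metric may itself make several LLM calls), and rate_limit_delay is applied only between consecutive runs of the same test case.

```python
from typing import Dict


def estimate_multirun_cost(
    num_test_cases: int,
    num_metrics: int,
    num_runs: int = 5,
    rate_limit_delay: float = 1.0,
) -> Dict[str, float]:
    """Estimate metric evaluations and added delay for a multi-run suite."""
    single_run_evaluations = num_test_cases * num_metrics
    metric_evaluations = single_run_evaluations * num_runs
    # Assumption: one pause between each pair of consecutive runs per test case.
    added_delay_seconds = num_test_cases * (num_runs - 1) * rate_limit_delay
    return {
        "metric_evaluations": metric_evaluations,
        "cost_multiplier": metric_evaluations / single_run_evaluations,
        "added_delay_seconds": added_delay_seconds,
    }


if __name__ == "__main__":
    estimate = estimate_multirun_cost(num_test_cases=100, num_metrics=2)
    print(estimate)
```

Under these assumptions, 100 test cases with 2 metrics at the default settings come to 1000 metric evaluations instead of 200, plus 400 seconds of deliberate delay.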

When to Use Multi-Run

Use Multi-Run For:

  • Production/staging environments where reliability is critical
  • Edge case testing with non-deterministic evaluation scenarios
  • Critical test suites requiring high confidence in results
  • Regression testing to ensure consistent behavior over time

Standard Evaluation is Fine For:

  • Development environments where fast iteration matters
  • Deterministic tests with clear pass/fail criteria
  • Large test suites where cost is a primary concern
  • Initial prototyping and rapid development

Examples

Check out the examples/ directory for complete working examples:

Example                 Description
basic_usage.py          Simple drop-in replacement for assert_test
advanced_usage.py       Custom configuration and detailed results
environment_config.py   Environment-based conditional setup
pytest_integration.py   Integration patterns for pytest
complete_demo.py        Comprehensive feature demonstration

Integration with pytest

# conftest.py
import pytest
from deepeval_multirun import should_use_multirun_evaluation, multirun_assert_test
from deepeval import assert_test

@pytest.fixture
def assert_func():
    """Return appropriate assert function based on environment."""
    if should_use_multirun_evaluation():
        return multirun_assert_test
    return assert_test

# test_my_llm.py
from deepeval.test_case import LLMTestCase

def test_my_case(assert_func):
    test_case = LLMTestCase(...)
    metrics = [...]
    assert_func(test_case, metrics)

Example Output

INFO - Starting multi-run evaluation for test_001
INFO - Running evaluation 1/5 for test_001
INFO - Running evaluation 2/5 for test_001
INFO - Running evaluation 3/5 for test_001
INFO - Running evaluation 4/5 for test_001
INFO - Running evaluation 5/5 for test_001
INFO - Completed evaluation for test_001: Pass=True, Confidence=high

MultiRunResult(
    test_case_id='test_001',
    final_pass=True,
    pass_count=1,
    total_runs=1,
    confidence_level=ConfidenceLevel.HIGH,
    metrics_breakdown={
        'AnswerRelevancyMetric': {
            'pass_count': 4,
            'total_runs': 5,
            'final_pass': True,
            'confidence_level': 'high',
            'pass_rate': 0.8,
            'average_raw_score': 0.82
        }
    }
)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details
