A plug-and-play Python library for running multi-run evaluations with DeepEval to achieve more consistent and reliable LLM evaluation results.
LLM-based evaluations can be non-deterministic, which leads to inconsistent results, especially on edge cases. DeepEval MultiRun addresses this by:
- Running evaluations multiple times per test case
- Applying aggregate scoring to determine final pass/fail
- Providing confidence levels to identify cases needing human review
- Rate limiting to manage API costs and limits
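
As a rough illustration of the aggregation step, the sketch below reduces a list of per-run pass/fail outcomes to a single verdict using a pass threshold. The name `aggregate_runs` and its return shape are hypothetical and are not part of the deepeval-multirun API; the default threshold mirrors the 5-run, 3-pass configuration described in this README.

```python
from typing import Sequence


def aggregate_runs(run_passed: Sequence[bool], pass_threshold: int = 3) -> dict:
    """Hypothetical helper: combine per-run outcomes into one verdict."""
    pass_count = sum(run_passed)  # each True counts as 1
    return {
        "pass_count": pass_count,
        "total_runs": len(run_passed),
        # The test case passes when enough individual runs passed.
        "final_pass": pass_count >= pass_threshold,
    }


# Five runs of the same test case, one of which failed.
outcome = aggregate_runs([True, True, False, True, True])
print(outcome)
```

A single failed run no longer fails the test case here, which is the behavior multi-run evaluation is meant to provide.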

Install with pip:

    pip install deepeval-multirun

Or install from source:

    git clone https://github.com/MRLab12/deepeval-multirun.git
    cd deepeval-multirun
    pip install -e .

Replace your standard DeepEval `assert_test` with `multirun_assert_test`:

    from deepeval.test_case import LLMTestCase
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval_multirun import multirun_assert_test

    def test_answer_relevancy():
        test_case = LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
            expected_output="Paris"
        )
        metric = AnswerRelevancyMetric(threshold=0.7)
        # This will run the evaluation 5 times and require 3+ passes
        multirun_assert_test(test_case, [metric])

For custom settings and detailed results, use `MultiRunEvaluator` directly:

    from deepeval_multirun import MultiRunEvaluator, ConfidenceLevel

    # Reuses test_case and metric from the previous example

    # Create evaluator with custom settings
    evaluator = MultiRunEvaluator(
        num_runs=7,            # Run 7 times
        pass_threshold=5,      # Require 5+ passes
        rate_limit_delay=2.0   # 2 second delay between runs
    )

    # Run evaluation and get detailed results
    result = evaluator.evaluate_test_case(test_case, [metric], "test_001")
    print(f"Final Pass: {result.final_pass}")
    print(f"Confidence: {result.confidence_level.value}")
    print(f"Pass Rate: {result.pass_count}/{result.total_runs}")

    # Access per-metric breakdown
    for metric_name, breakdown in result.metrics_breakdown.items():
        print(f"{metric_name}: {breakdown['pass_count']}/{breakdown['total_runs']} passes")

Enable multi-run evaluation conditionally using environment variables:

    from deepeval import assert_test
    from deepeval_multirun import (
        get_multirun_config,
        multirun_assert_test,  # required by the multi-run branch below
        should_use_multirun_evaluation,
    )

    # test_case and metrics are defined as in the earlier examples

    # Check if multi-run is enabled
    if should_use_multirun_evaluation():
        multirun_assert_test(test_case, metrics)
    else:
        # Fall back to standard single-run evaluation
        assert_test(test_case, metrics)

Environment variables:

    # Enable/disable multi-run
    export ENABLE_MULTIRUN=true

    # Configure multi-run behavior
    export MULTIRUN_NUM_RUNS=5
    export MULTIRUN_PASS_THRESHOLD=3
    export MULTIRUN_RATE_LIMIT_DELAY=1.0
    export MULTIRUN_ENABLE_LOGGING=true

Each result is assigned a confidence level:

- HIGH: Clear consensus (0-1 passes or 4-5 passes out of 5)
- LOW: Unclear consensus (2-3 passes out of 5); recommended for human review
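
The 5-run thresholds above can be expressed as a small classifier. This is only a sketch: `classify_confidence` is a hypothetical name, and the generalization to other run counts (treating a pass rate of at most 20% or at least 80% as clear consensus) is an assumption extrapolated from the 5-run example, not a rule documented by the library.

```python
def classify_confidence(pass_count: int, total_runs: int) -> str:
    """Hypothetical helper: label a result 'high' or 'low' confidence."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    # Integer arithmetic avoids floating-point comparisons:
    #   pass_count / total_runs <= 0.2  <=>  5 * pass_count <= total_runs
    #   pass_count / total_runs >= 0.8  <=>  5 * pass_count >= 4 * total_runs
    clear_fail = 5 * pass_count <= total_runs
    clear_pass = 5 * pass_count >= 4 * total_runs
    return "high" if clear_fail or clear_pass else "low"


# Reproduce the 5-run table: 0-1 and 4-5 passes are high, 2-3 are low.
for passes in range(6):
    print(passes, classify_confidence(passes, 5))
```

Note that confidence is independent of the verdict: a case with 0 of 5 passes fails with high confidence.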

Use `ResultsAnalyzer` to summarize results across test cases:

    from deepeval_multirun import ResultsAnalyzer

    # Collect results from multiple test cases
    # (each item is a result returned by evaluate_test_case)
    results = [result1, result2, result3, ...]

    # Get overall statistics
    stats = ResultsAnalyzer.calculate_overall_stats(results)
    print(f"Overall pass rate: {stats['overall_pass_rate']:.2%}")
    print(f"High confidence rate: {stats['confidence_rate']:.2%}")

    # Get cases needing review
    low_confidence = ResultsAnalyzer.get_low_confidence_cases(results)
    for result in low_confidence:
        print(f"Review needed: {result.test_case_id}")

Configuration parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_runs` | int | 5 | Number of evaluation runs per test case |
| `pass_threshold` | int | 3 | Minimum passes required for overall pass |
| `enable_logging` | bool | True | Enable detailed logging |
| `rate_limit_delay` | float | 1.0 | Delay (seconds) between runs to manage API limits |

Environment variables:

| Variable | Default | Description |
|---|---|---|
| `ENABLE_MULTIRUN` | false | Enable multi-run evaluation |
| `MULTIRUN_NUM_RUNS` | 5 | Number of runs |
| `MULTIRUN_PASS_THRESHOLD` | 3 | Pass threshold |
| `MULTIRUN_RATE_LIMIT_DELAY` | 1.0 | Delay (seconds) between runs |
| `MULTIRUN_ENABLE_LOGGING` | true | Enable logging |
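
To make the variable semantics concrete, here is a minimal sketch of how these settings could be read with the standard library. `read_multirun_env` is a hypothetical helper, not the package's `get_multirun_config`, whose return type is not shown in this README; the accepted spellings of boolean values are also an assumption.

```python
import os
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as "true"; anything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def read_multirun_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: parse the documented variables with their defaults."""
    env = os.environ if env is None else env
    return {
        "enabled": _as_bool(env.get("ENABLE_MULTIRUN", "false")),
        "num_runs": int(env.get("MULTIRUN_NUM_RUNS", "5")),
        "pass_threshold": int(env.get("MULTIRUN_PASS_THRESHOLD", "3")),
        "rate_limit_delay": float(env.get("MULTIRUN_RATE_LIMIT_DELAY", "1.0")),
        "enable_logging": _as_bool(env.get("MULTIRUN_ENABLE_LOGGING", "true")),
    }


# Pass an explicit mapping so the example does not depend on the real environment.
config = read_multirun_env({"ENABLE_MULTIRUN": "true", "MULTIRUN_NUM_RUNS": "7"})
print(config)
```

Unset variables fall back to the defaults listed in the table, so only the values that differ need to be exported.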
- Start with defaults: The default configuration (5 runs, 3 passes) works well for most cases
- Use environment variables for CI/CD: Enable multi-run only in staging/production
- Review low-confidence cases: These often indicate genuine edge cases or unclear requirements
- Monitor API costs: Multi-run increases API calls; adjust `num_runs` and `rate_limit_delay` accordingly
- Combine with standard tests: Use multi-run for critical test cases, standard evaluation for others

Multi-run evaluation increases API calls proportionally:
- 5 runs = 5x API costs (default)
- 7 runs = 7x API costs (recommended for production)
- 10 runs = 10x API costs (maximum reliability)
Mitigation strategies:
- Use `rate_limit_delay` to spread API load over time
- Enable multi-run only for critical test cases
- Adjust `num_runs` based on budget constraints
- Reserve multi-run for staging/production environments
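
A back-of-the-envelope estimate can help size this trade-off before enabling multi-run. The sketch below assumes one metric evaluation per metric per run and a delay inserted between consecutive runs of the same test case; both are simplifying assumptions (a single metric evaluation may itself make several LLM calls), and `estimate_cost` is a hypothetical helper, not part of the library.

```python
def estimate_cost(num_cases: int, num_metrics: int, num_runs: int = 5,
                  rate_limit_delay: float = 1.0) -> dict:
    """Hypothetical helper: rough evaluation count and added wait time."""
    single_run_evaluations = num_cases * num_metrics
    return {
        "metric_evaluations": single_run_evaluations * num_runs,
        "cost_multiplier": num_runs,
        # Assumes the delay applies between runs, not after the last one.
        "added_delay_seconds": num_cases * max(num_runs - 1, 0) * rate_limit_delay,
    }


estimate = estimate_cost(num_cases=40, num_metrics=2, num_runs=7, rate_limit_delay=2.0)
print(estimate)
```

Under these assumptions, a 40-case suite with two metrics goes from 80 to 560 metric evaluations at 7 runs, and the rate-limit delay alone adds 480 seconds.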

Multi-run evaluation is a good fit for:
- Production/staging environments where reliability is critical
- Edge case testing with non-deterministic evaluation scenarios
- Critical test suites requiring high confidence in results
- Regression testing to ensure consistent behavior over time

It is a poor fit for:
- Development environments where fast iteration matters
- Deterministic tests with clear pass/fail criteria
- Large test suites where cost is a primary concern
- Initial prototyping and rapid development

Check out the examples/ directory for complete working examples:

| Example | Description |
|---|---|
| `basic_usage.py` | Simple drop-in replacement for `assert_test` |
| `advanced_usage.py` | Custom configuration and detailed results |
| `environment_config.py` | Environment-based conditional setup |
| `pytest_integration.py` | Integration patterns for pytest |
| `complete_demo.py` | Comprehensive feature demonstration |

With pytest, a fixture can select the assert function based on the environment:

    # conftest.py
    import pytest
    from deepeval import assert_test
    from deepeval_multirun import should_use_multirun_evaluation, multirun_assert_test

    @pytest.fixture
    def assert_func():
        """Return appropriate assert function based on environment."""
        if should_use_multirun_evaluation():
            return multirun_assert_test
        return assert_test

    # test_my_llm.py
    from deepeval.test_case import LLMTestCase

    def test_my_case(assert_func):
        test_case = LLMTestCase(...)
        metrics = [...]
        assert_func(test_case, metrics)

With logging enabled, a 5-run evaluation logs the following:

    INFO - Starting multi-run evaluation for test_001
    INFO - Running evaluation 1/5 for test_001
    INFO - Running evaluation 2/5 for test_001
    INFO - Running evaluation 3/5 for test_001
    INFO - Running evaluation 4/5 for test_001
    INFO - Running evaluation 5/5 for test_001
    INFO - Completed evaluation for test_001: Pass=True, Confidence=high

Example result object:

    MultiRunResult(
        test_case_id='test_001',
        final_pass=True,
        pass_count=1,
        total_runs=1,
        confidence_level=ConfidenceLevel.HIGH,
        metrics_breakdown={
            'AnswerRelevancyMetric': {
                'pass_count': 4,
                'total_runs': 5,
                'final_pass': True,
                'confidence_level': 'high',
                'pass_rate': 0.8,
                'average_raw_score': 0.82
            }
        }
    )

Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details