Background
I'm building a RAG evaluation pipeline for financial documents using DeepEval (related to my finrag-eval project). Financial document corpora are inherently heterogeneous — a single test run will include:
- 10-K filings (dense narrative + structured tables)
- Earnings call transcripts (dialogue, forward-looking statements, no tables)
- Balance sheets (pure tabular, structured financial data)
- Analyst reports (opinion + quantitative mixed)
Problem
Current DeepEval metrics (faithfulness, answer relevancy, contextual precision) use a single threshold across all document types. This creates two issues:
- False negatives on structured docs: A faithfulness score of 0.7 on a balance sheet question (where the answer is a specific number) should be treated as a failure, yet it passes under a 0.7 threshold. The same score on an earnings call narrative might be acceptable.
- Threshold calibration is impossible: Any single threshold across document types either flags too many acceptable narrative answers as failures (false positives) or lets too many wrong structured answers pass (false negatives).
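The calibration problem can be shown without running any metric. The sketch below is plain Python, not DeepEval code: the `Case` type and the `count_misjudged` helper are hypothetical names, the two scores are the 0.7 example described above, and it assumes a case passes when its score is at or above the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    document_type: str
    score: float        # metric score in [0, 1]
    should_pass: bool   # verdict a human reviewer would give


# The 0.7 example: unacceptable for a balance sheet figure,
# acceptable for an earnings call narrative.
CASES = [
    Case("balance_sheet", 0.7, should_pass=False),
    Case("earnings_call", 0.7, should_pass=True),
]


def count_misjudged(cases, threshold):
    """Count cases whose verdict at `threshold` disagrees with should_pass."""
    return sum((c.score >= threshold) != c.should_pass for c in cases)


# Both cases share one score, so any single threshold gives them the same
# verdict, and exactly one of the two is misjudged every time.
for threshold in (0.5, 0.7, 0.75, 0.95):
    print(threshold, count_misjudged(CASES, threshold))
```

Lowering the threshold only moves the error from the narrative case to the structured one; it never removes it.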
Minimal Repro:
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# These two test cases need DIFFERENT thresholds
balance_sheet_case = LLMTestCase(
    input="What are Apple's total current assets?",
    actual_output="$135.4B",  # Either exactly right or completely wrong
    retrieval_context=[...],
)
earnings_call_case = LLMTestCase(
    input="What is management's outlook on AI investment?",
    actual_output="Management expressed cautious optimism...",  # Graded on a spectrum
    retrieval_context=[...],
)

# One threshold is applied to every test case, so it cannot suit both
metric = FaithfulnessMetric(threshold=0.7)
Proposed Solution
Let each LLMTestCase declare a document type, either through a new document_type field or through its existing metadata, and let metrics accept per-type thresholds:
# Option 1: Per-test-case document type
balance_sheet_case = LLMTestCase(
    input="What are Apple's total current assets?",
    actual_output="$135.4B",
    retrieval_context=[...],
    # Reuses the existing metadata field, which LLMTestCase names additional_metadata
    additional_metadata={"document_type": "balance_sheet"},
)
# Option 2: MetricConfig per document type
# (threshold_by_type is the proposed argument; it does not exist today)
metric = FaithfulnessMetric(
    threshold_by_type={
        "balance_sheet": 0.95,  # Structured: must be exact
        "earnings_call": 0.70,  # Narrative: some latitude
        "10k_filing": 0.85,     # Mixed: middle ground
        "default": 0.75,        # Fallback for any other type
    }
)
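One way the lookup behind a `threshold_by_type` argument could behave is sketched below. It is plain Python and not part of DeepEval: `resolve_threshold` and `passes` are hypothetical helpers, and falling back to the `"default"` entry for unknown or missing types is an assumed design choice rather than anything DeepEval specifies.

```python
from typing import Mapping, Optional

# Same values as the proposal above.
THRESHOLD_BY_TYPE = {
    "balance_sheet": 0.95,
    "earnings_call": 0.70,
    "10k_filing": 0.85,
    "default": 0.75,
}


def resolve_threshold(
    document_type: Optional[str],
    thresholds: Mapping[str, float] = THRESHOLD_BY_TYPE,
) -> float:
    """Return the threshold for document_type, else the 'default' entry."""
    if document_type is not None and document_type in thresholds:
        return thresholds[document_type]
    return thresholds["default"]


def passes(score: float, document_type: Optional[str]) -> bool:
    """A case passes when its score meets the threshold for its type."""
    return score >= resolve_threshold(document_type)


print(passes(0.7, "balance_sheet"))   # 0.7 against the 0.95 threshold
print(passes(0.7, "earnings_call"))   # 0.7 against the 0.70 threshold
print(passes(0.8, "analyst_report"))  # unknown type, uses "default"
```

With this behavior the same 0.7 score fails for a balance sheet question and passes for an earnings call question, which is the distinction a single threshold cannot express.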
Why This Matters
For production RAG systems over heterogeneous corpora, a single eval threshold creates misleading pass/fail signals. This is especially critical in financial/legal domains, where structured data retrieval quality is binary (right number or wrong) but narrative retrieval is graded on a spectrum.
I explored this problem in depth in finrag-eval and had to write custom threshold logic outside DeepEval. Native support would make this significantly cleaner.
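Until something like this exists natively, one possible workaround is to split the test cases by document type and evaluate each group with its own metric instance. The sketch below covers only the grouping step, in plain Python: `group_by_document_type` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins for real test cases, and how the document type is read from a case is left to the caller through `type_of`.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_document_type(test_cases, type_of, default="default"):
    """Bucket test cases by the document type that type_of(case) returns."""
    groups = defaultdict(list)
    for case in test_cases:
        # Cases with no declared type land in the `default` bucket.
        groups[type_of(case) or default].append(case)
    return dict(groups)


# Stand-ins for test cases; real ones would be LLMTestCase objects.
cases = [
    SimpleNamespace(name="q1", document_type="balance_sheet"),
    SimpleNamespace(name="q2", document_type="earnings_call"),
    SimpleNamespace(name="q3", document_type="balance_sheet"),
    SimpleNamespace(name="q4", document_type=None),
]

groups = group_by_document_type(cases, type_of=lambda c: c.document_type)
for document_type, members in groups.items():
    print(document_type, [c.name for c in members])
```

Each group could then be evaluated with a metric built for that type, for example `FaithfulnessMetric(threshold=0.95)` for the `balance_sheet` group, at the cost of one evaluation call per document type.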
Related
finrag-eval project: https://github.com/Ruthwik-Data/finrag-eval