[Feature] Eval metrics for heterogeneous financial document chunks — handling mixed document types in a single test run #2775

Description

@Ruthwik-Data

Background

I'm building a RAG evaluation pipeline for financial documents using DeepEval (related to my finrag-eval project). Financial document corpora are inherently heterogeneous — a single test run will include:

  • 10-K filings (dense narrative + structured tables)
  • Earnings call transcripts (dialogue, forward-looking statements, no tables)
  • Balance sheets (pure tabular, structured financial data)
  • Analyst reports (opinion + quantitative mixed)

Problem

Current DeepEval metrics (faithfulness, answer relevancy, contextual precision) use a single threshold across all document types. This creates two issues:

  1. Missed failures (false negatives) on structured docs: With a threshold of 0.7, a faithfulness score of 0.7 on a balance sheet question (where the answer is a specific number) passes, although it should be treated as a failure. The same score on an earnings call narrative might be acceptable.
  2. No single threshold calibrates well: A strict threshold flags acceptable narrative answers as failures (false positives), while a lenient one lets wrong structured answers pass (false negatives).
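To make the calibration problem concrete, here is a small sketch with made-up numbers. The scores and the `acceptable` labels are illustrative assumptions, not measured DeepEval output. It sweeps a single threshold over all cases, counts how many it misjudges, and compares that with per-type thresholds:

```python
# Hypothetical (score, document_type, acceptable) triples; all values are illustrative.
cases = [
    (0.70, "balance_sheet", False),  # wrong number, only partly supported by context
    (1.00, "balance_sheet", True),
    (0.70, "earnings_call", True),   # reasonable paraphrase of the narrative
    (0.50, "earnings_call", False),
]


def misjudged(threshold):
    """Count cases where pass/fail (score >= threshold) disagrees with the label."""
    return sum((score >= threshold) != acceptable for score, _, acceptable in cases)


# Sweep a single global threshold from 0.00 to 1.00 in steps of 0.05.
errors_by_threshold = {t / 100: misjudged(t / 100) for t in range(0, 101, 5)}
best_single = min(errors_by_threshold.values())

# Per-type thresholds (assumed values) applied to the same cases.
per_type = {"balance_sheet": 0.95, "earnings_call": 0.70}
per_type_errors = sum(
    (score >= per_type[doc_type]) != acceptable
    for score, doc_type, acceptable in cases
)

print(best_single, per_type_errors)
```

Because the two 0.70 cases carry opposite labels, every single threshold gets at least one of them wrong, while the per-type thresholds separate them.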

Minimal Repro:

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# These two test cases need DIFFERENT thresholds
balance_sheet_case = LLMTestCase(
    input="What is Apple's total current assets?",
    actual_output="$135.4B",  # Either exactly right or completely wrong
    retrieval_context=[...]
)

earnings_call_case = LLMTestCase(
    input="What is management's outlook on AI investment?",
    actual_output="Management expressed cautious optimism...",  # Graded on a spectrum
    retrieval_context=[...]
)

metric = FaithfulnessMetric(threshold=0.7)  # Same threshold doesn't work for both

Proposed Solution

Let each test case carry a document type (either a dedicated document_type field on LLMTestCase or a key in its metadata), and let metrics accept per-type thresholds:

# Option 1: Per-test-case document type
balance_sheet_case = LLMTestCase(
    input="What is Apple's total current assets?",
    actual_output="$135.4B",
    retrieval_context=[...],
    metadata={"document_type": "balance_sheet"}  # Using existing metadata field
)

# Option 2: Per-type thresholds on the metric (threshold_by_type is a proposed parameter)
metric = FaithfulnessMetric(
    threshold_by_type={
        "balance_sheet": 0.95,      # Structured — must be exact
        "earnings_call": 0.70,      # Narrative — some latitude
        "10k_filing": 0.85,         # Mixed — middle ground
        "default": 0.75
    }
)
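A minimal sketch of how the lookup behind `threshold_by_type` could behave, assuming a test case exposes its document type as a plain string. `resolve_threshold`, `passes`, and the `fallback` value of 0.5 are hypothetical names and choices for illustration, not part of DeepEval:

```python
from typing import Mapping, Optional


def resolve_threshold(
    threshold_by_type: Mapping[str, float],
    document_type: Optional[str],
    fallback: float = 0.5,  # assumed value, used only when no "default" key is given
) -> float:
    """Return the threshold for document_type, else the "default" entry, else fallback."""
    if document_type is not None and document_type in threshold_by_type:
        return threshold_by_type[document_type]
    return threshold_by_type.get("default", fallback)


def passes(
    score: float,
    threshold_by_type: Mapping[str, float],
    document_type: Optional[str] = None,
) -> bool:
    """A score passes when it meets or exceeds the threshold resolved for its type."""
    return score >= resolve_threshold(threshold_by_type, document_type)


thresholds = {
    "balance_sheet": 0.95,
    "earnings_call": 0.70,
    "10k_filing": 0.85,
    "default": 0.75,
}

print(passes(0.7, thresholds, "balance_sheet"))  # same score, strict type
print(passes(0.7, thresholds, "earnings_call"))  # same score, lenient type
```

Unknown or missing document types fall through to the "default" entry, so existing test cases without a type would keep working under this scheme.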

Why This Matters

For production RAG systems over heterogeneous corpora, a single eval threshold creates misleading pass/fail signals. This is especially critical in financial/legal domains, where structured data retrieval is effectively binary (right number or wrong number) while narrative retrieval quality is graded.

I explored this problem in depth in finrag-eval and had to write custom threshold logic outside DeepEval. Native support would make this significantly cleaner.
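For readers who need something today, one possible shape of such a workaround is to split the test cases by document type and evaluate each group with its own metric instance. This is a sketch only and not necessarily what finrag-eval does. `Case` is a stand-in dataclass, and `group_by_document_type` is a hypothetical helper that assumes the type is stored under a "document_type" metadata key:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Case:
    """Stand-in for a test case; only the metadata dict matters for grouping."""
    input: str
    metadata: dict = field(default_factory=dict)


def group_by_document_type(cases, default="default"):
    """Bucket cases by metadata["document_type"], using `default` when the key is absent."""
    groups = defaultdict(list)
    for case in cases:
        doc_type = (case.metadata or {}).get("document_type", default)
        groups[doc_type].append(case)
    return dict(groups)


cases = [
    Case("What is Apple's total current assets?", {"document_type": "balance_sheet"}),
    Case("What is management's outlook on AI investment?", {"document_type": "earnings_call"}),
    Case("Untyped question"),
]

groups = group_by_document_type(cases)
print(sorted(groups))
```

Each group can then be evaluated separately with a metric constructed for that type, for example `FaithfulnessMetric(threshold=0.95)` for the balance sheet group. The cost is one metric instance and one evaluation call per document type, which is the bookkeeping that native support would remove.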
