[Feature] Eval metrics for heterogeneous financial document chunks — handling mixed document types in a single test run

## Background

I'm building a RAG evaluation pipeline for financial documents using DeepEval (related to my [`finrag-eval`](https://github.com/Ruthwik-Data/finrag-eval) project). Financial document corpora are inherently **heterogeneous** — a single test run will include:

- **10-K filings** (dense narrative + structured tables)
- **Earnings call transcripts** (dialogue, forward-looking statements, no tables)
- **Balance sheets** (pure tabular, structured financial data)
- **Analyst reports** (opinion + quantitative mixed)

## Problem

DeepEval metrics (faithfulness, answer relevancy, contextual precision) each take one `threshold`, which is applied to every test case regardless of document type. This creates two issues:

1. **False passes on structured docs**: A faithfulness score of 0.7 on a balance sheet question (where the answer is a specific number) should be treated as a failure, but the same score on an earnings call narrative might be acceptable.
2. **Threshold calibration is impossible**: Any single threshold is either too strict for narrative docs (acceptable answers fail) or too lenient for structured docs (wrong numbers pass).

**Minimal Repro:**

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# These two test cases need DIFFERENT thresholds
balance_sheet_case = LLMTestCase(
    input="What is Apple's total current assets?",
    actual_output="$135.4B",  # Either exactly right or completely wrong
    retrieval_context=[...]
)

earnings_call_case = LLMTestCase(
    input="What is management's outlook on AI investment?",
    actual_output="Management expressed cautious optimism...",  # Graded on a spectrum
    retrieval_context=[...]
)

metric = FaithfulnessMetric(threshold=0.7)  # Same threshold doesn't work for both
```

## Proposed Solution

Let each test case carry a `document_type` (as a new `LLMTestCase` field or through test-case metadata), and let metrics look up their threshold by that type:

```python
# Option 1: Per-test-case document type
balance_sheet_case = LLMTestCase(
    input="What is Apple's total current assets?",
    actual_output="$135.4B",
    retrieval_context=[...],
    metadata={"document_type": "balance_sheet"}  # Using existing metadata field
)

# Option 2: Per-document-type thresholds on the metric (proposed `threshold_by_type` parameter)
metric = FaithfulnessMetric(
    threshold_by_type={
        "balance_sheet": 0.95,      # Structured — must be exact
        "earnings_call": 0.70,      # Narrative — some latitude
        "10k_filing": 0.85,         # Mixed — middle ground
        "default": 0.75
    }
)
```
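
The lookup behind Option 2 could be as small as the sketch below. `resolve_threshold` and `passes` are hypothetical helpers, not DeepEval API, and the sketch assumes that a missing or unknown type falls back to the `"default"` entry.

```python
from typing import Mapping, Optional

# Thresholds copied from the proposal above.
THRESHOLD_BY_TYPE = {
    "balance_sheet": 0.95,
    "earnings_call": 0.70,
    "10k_filing": 0.85,
    "default": 0.75,
}


def resolve_threshold(
    threshold_by_type: Mapping[str, float],
    document_type: Optional[str],
) -> float:
    """Return the threshold for document_type, falling back to "default"."""
    if document_type is not None and document_type in threshold_by_type:
        return threshold_by_type[document_type]
    if "default" not in threshold_by_type:
        raise KeyError(f"no threshold for {document_type!r} and no 'default' entry")
    return threshold_by_type["default"]


def passes(score: float, document_type: Optional[str]) -> bool:
    """Apply the per-type threshold to an already computed metric score."""
    return score >= resolve_threshold(THRESHOLD_BY_TYPE, document_type)
```

With the 0.7 score from the Problem section, `passes(0.7, "balance_sheet")` is `False` while `passes(0.7, "earnings_call")` is `True`, which is the distinction a single threshold cannot express.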

## Why This Matters

For production RAG systems over heterogeneous corpora, a single eval threshold produces misleading pass/fail signals. This is especially critical in financial/legal domains, where the quality of an answer drawn from structured data is binary (right number or wrong) but the quality of a narrative answer falls on a spectrum.

I explored this problem in depth in `finrag-eval` and had to write custom threshold logic outside DeepEval. Native support would make this significantly cleaner.
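
Until something like this exists natively, one workaround is to bucket test cases by document type and run each bucket against its own metric instance. The sketch below shows only the bucketing step in plain Python. `TaggedCase` and `group_by_document_type` are hypothetical names, and this is not necessarily how `finrag-eval` implements it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass
class TaggedCase:
    """Pairs any test-case object with its document type (hypothetical wrapper)."""
    document_type: str
    case: Any  # e.g. a deepeval LLMTestCase


def group_by_document_type(tagged: Iterable[TaggedCase]) -> Dict[str, List[Any]]:
    """Bucket test cases so each bucket can be evaluated with its own threshold."""
    buckets: Dict[str, List[Any]] = defaultdict(list)
    for item in tagged:
        buckets[item.document_type].append(item.case)
    return dict(buckets)


buckets = group_by_document_type([
    TaggedCase("balance_sheet", "case-1"),  # strings stand in for real test cases
    TaggedCase("earnings_call", "case-2"),
    TaggedCase("balance_sheet", "case-3"),
])
```

Each bucket can then be passed to DeepEval's `evaluate(test_cases=..., metrics=[...])` with a `FaithfulnessMetric` whose threshold is chosen for that type. The cost is one evaluation run per document type instead of a single combined report, which is the overhead native support would remove.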

## Related

- My `finrag-eval` project: https://github.com/Ruthwik-Data/finrag-eval
- Builds on the heterogeneous chunking patterns discussed in the RAG evaluation literature
