[Bug] ContextualRecallMetric over-penalises overlapping chunks — parallel issue to #2594 (ContextualPrecision) #2788

Description

@Ruthwik-Data

Bug Summary

ContextualRecallMetric has the same overlapping-chunk over-penalisation problem that was fixed for ContextualPrecisionMetric in PR #2743, but the fix was not applied symmetrically to the recall metric.

When a RAG pipeline uses sliding-window chunking (10–20% overlap, standard for dense financial documents like 10-Ks), adjacent chunks share content. ContextualRecallMetric evaluates whether every statement in the expected_output is covered by the retrieval_context. If the same expected statement spans two overlapping chunks, the outcome depends on the per-chunk verdicts:

  1. If the LLM judge marks both chunks as "covering" the statement, the score is correct.
  2. But if the LLM judge returns "yes" for chunk A and "no" for chunk B (because chunk B is partially redundant), the recall score is penalised even though the information is present in chunk A.

Redundancy ≠ Missing coverage.
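
To make the arithmetic concrete, here is a toy sketch of the two ways of counting verdicts. It is not deepeval's scoring code: the verdict table and both helper functions are hypothetical, and the judge's answers are hard-coded to the "yes for chunk A, no for chunk B" case described above.

```python
from typing import Dict, List

# Hypothetical judge verdicts: one expected statement, one yes/no per chunk.
# True = "yes, this chunk covers the statement"; order is [chunk A, chunk B].
verdicts: Dict[str, List[bool]] = {
    "statement-1": [True, False],
}


def per_chunk_recall(table: Dict[str, List[bool]]) -> float:
    """Count every (statement, chunk) verdict separately."""
    flat = [v for per_chunk in table.values() for v in per_chunk]
    return sum(flat) / len(flat)


def per_statement_recall(table: Dict[str, List[bool]]) -> float:
    """A statement counts as covered if any chunk covers it."""
    covered = [any(per_chunk) for per_chunk in table.values()]
    return sum(covered) / len(covered)


print(per_chunk_recall(verdicts))      # 0.5
print(per_statement_recall(verdicts))  # 1.0
```

Under these assumptions, the redundant "no" halves the per-chunk score, while the per-statement view reports full coverage.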

Minimal Reproducer

from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# 10-K with 20% overlap across section boundary
chunk_a = "Revenue for FY2023 was $4.2B, up 12% YoY, driven by cloud growth."
chunk_b = "Driven by cloud growth, revenue for FY2023 reached $4.2B (12% increase)."

test_case = LLMTestCase(
    input="What was revenue for FY2023?",
    actual_output="Revenue was $4.2 billion in FY2023, up 12% year-over-year.",
    expected_output="Revenue for FY2023 was $4.2 billion, up 12% year-over-year.",
    retrieval_context=[chunk_a, chunk_b],
)
metric = ContextualRecallMetric(threshold=0.8)
metric.measure(test_case)
print(metric.score)  # Expected: ~1.0  Actual: ~0.5 (penalises chunk_b redundancy)

Root Cause

ContextualRecallMetric does not have a source-grouping step equivalent to the _group_retrieval_contexts() step added to ContextualPrecisionMetric in PR #2743. Each chunk is scored independently, so a "no" verdict on a redundant, overlapping chunk counts against recall even when another chunk already covers the same statement.

Proposed Fix

Apply the same RetrievedContextData.source-based grouping to ContextualRecallMetric:

  1. Accept RetrievedContextData objects in retrieval_context (already partially supported).
  2. Before scoring, merge chunks sharing the same source string into a single evaluation unit.
  3. Score the merged unit once per source, not once per chunk.

This mirrors exactly what was done for ContextualPrecisionMetric and should be a small, symmetric change.
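
The merge in steps 2 and 3 could look roughly like the sketch below. It is illustrative only: `Chunk` is a hypothetical stand-in for RetrievedContextData (this issue only names its `source` attribute; the `text` field is an assumption), `group_by_source` is a made-up helper rather than the existing _group_retrieval_contexts(), and the source labels are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for RetrievedContextData (fields are assumptions)."""
    source: str
    text: str


def group_by_source(chunks: List[Chunk]) -> List[str]:
    """Merge chunks that share a source into one evaluation unit per source."""
    grouped: Dict[str, List[str]] = {}
    for chunk in chunks:
        grouped.setdefault(chunk.source, []).append(chunk.text)
    # dicts preserve insertion order, so units follow first appearance of each source
    return ["\n".join(texts) for texts in grouped.values()]


chunks = [
    Chunk("10-K/revenue", "Revenue for FY2023 was $4.2B, up 12% YoY, driven by cloud growth."),
    Chunk("10-K/revenue", "Driven by cloud growth, revenue for FY2023 reached $4.2B (12% increase)."),
    Chunk("10-K/other", "Placeholder text from an unrelated section."),
]

units = group_by_source(chunks)
print(len(units))  # 2
```

The judge would then see each merged unit once, so two overlapping chunks from the same source can no longer produce conflicting verdicts for the same statement.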

Impact

Happy to draft a PR

I can draft the fix + regression tests (parallel to PR #2787) if this aligns with the team's direction for ContextualRecallMetric.
