Bug Summary
ContextualRecallMetric has the same overlapping-chunk over-penalisation problem that was fixed for ContextualPrecisionMetric in PR #2743, but the fix was not applied symmetrically to the recall metric.
When a RAG pipeline uses sliding-window chunking (10–20% overlap — standard for dense financial documents like 10-Ks), adjacent chunks share content. ContextualRecallMetric evaluates whether every statement in the expected_output is covered by the retrieval_context. If the same expected statement spans two overlapping chunks, the metric:
- Gives the correct score when the LLM judge marks both chunks as "covering" the statement.
- Penalises the recall score when the judge returns "yes" for chunk A and "no" for chunk B (because chunk B is partially redundant), even though the information is present.
Redundancy ≠ Missing coverage.
Minimal Reproducer
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# 10-K with 20% overlap across section boundary
chunk_a = "Revenue for FY2023 was $4.2B, up 12% YoY, driven by cloud growth."
chunk_b = "Driven by cloud growth, revenue for FY2023 reached $4.2B (12% increase)."

test_case = LLMTestCase(
    input="What was revenue for FY2023?",
    actual_output="Revenue was $4.2 billion in FY2023, up 12% year-over-year.",
    expected_output="Revenue for FY2023 was $4.2 billion, up 12% year-over-year.",
    retrieval_context=[chunk_a, chunk_b],
)

metric = ContextualRecallMetric(threshold=0.8)
metric.measure(test_case)
print(metric.score)  # Expected: ~1.0, Actual: ~0.5 (penalises chunk_b redundancy)
Root Cause
ContextualRecallMetric does not have a source-grouping step equivalent to _group_retrieval_contexts() added to ContextualPrecisionMetric in PR #2743. Each chunk is scored independently, so overlapping content gets double-counted on the "wrong" side of the verdict.
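To make the arithmetic concrete, here is a minimal sketch of how per-chunk verdicts would produce the ~0.5 score from the reproducer. It assumes the score is simply the fraction of "yes" verdicts; `recall_score` is a hypothetical helper written for this issue, not a deepeval function, and the real scoring internals may differ.

```python
def recall_score(verdicts):
    """Fraction of "yes" verdicts (an assumed simplification of the scoring step)."""
    if not verdicts:
        return 0.0
    return sum(v == "yes" for v in verdicts) / len(verdicts)


# Hypothetical verdicts for the reproducer above.
per_chunk = ["yes", "no"]  # chunk_a and chunk_b judged as separate units
per_source = ["yes"]       # both chunks merged into one unit before judging

print(recall_score(per_chunk))   # 0.5
print(recall_score(per_source))  # 1.0
```

Under this assumption the redundant chunk adds a "no" to the denominator without removing any coverage, which is the over-penalisation described above.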
Proposed Fix
Apply the same RetrievedContextData.source-based grouping to ContextualRecallMetric:
- Accept RetrievedContextData objects in retrieval_context (already partially supported).
- Before scoring, merge chunks sharing the same source string into a single evaluation unit.
- Score the merged unit once per source, not once per chunk.
This mirrors exactly what was done for ContextualPrecisionMetric and should be a small, symmetric change.
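The grouping step could look roughly like the sketch below. It is standalone and does not import deepeval: `Chunk` is a hypothetical stand-in for RetrievedContextData, `group_by_source` is an illustrative name rather than the existing helper, and the `doc-1` / `doc-2` source labels and the third chunk are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a retrieved chunk that carries a source label."""
    source: str
    content: str


def group_by_source(chunks):
    """Merge chunks that share a source into one text unit, in first-seen order."""
    grouped = {}
    for chunk in chunks:
        grouped.setdefault(chunk.source, []).append(chunk.content)
    # One merged string per source; dicts preserve insertion order.
    return ["\n".join(parts) for parts in grouped.values()]


chunk_a = "Revenue for FY2023 was $4.2B, up 12% YoY, driven by cloud growth."
chunk_b = "Driven by cloud growth, revenue for FY2023 reached $4.2B (12% increase)."

chunks = [
    Chunk("doc-1", chunk_a),
    Chunk("doc-1", chunk_b),
    Chunk("doc-2", "Unrelated text from another document."),
]

merged = group_by_source(chunks)
print(len(merged))  # 2 evaluation units instead of 3 chunks
```

The merged strings would then be passed to the judge in place of the raw chunks, so overlapping chunks from one source yield a single verdict. Joining with a newline is one possible choice; deduplicating the overlapping text before joining would be another.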
Impact
Happy to draft a PR
I can draft the fix + regression tests (parallel to PR #2787) if this aligns with the team's direction for ContextualRecallMetric.