[Bug] ContextualRecallMetric over-penalises overlapping chunks: parallel issue to #2594 (ContextualPrecision) #2788

## Bug Summary

`ContextualRecallMetric` has the same overlapping-chunk over-penalisation problem that was fixed for `ContextualPrecisionMetric` in PR #2743, but the fix was **not applied symmetrically** to the recall metric.

When a RAG pipeline uses sliding-window chunking (10–20% overlap is standard for dense financial documents such as 10-Ks), adjacent chunks share content. `ContextualRecallMetric` evaluates whether every statement in the `expected_output` is covered by the `retrieval_context`. If the same expected statement appears in two overlapping chunks, the metric:

1. Scores both chunks as "covering" the statement when the judge accepts both, which is correct.
2. Penalises the recall score when the LLM judge returns "yes" for chunk A and "no" for chunk B (because chunk B is partially redundant), even though the information is present.

**Redundancy ≠ Missing coverage.**
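For readers less familiar with sliding-window chunking, here is a minimal sketch of how adjacent chunks end up sharing content. The function name `sliding_window_chunks` and the token-level splitting are assumptions made for illustration; real pipelines usually chunk by model tokens or characters.

```python
def sliding_window_chunks(tokens, size, overlap):
    """Split `tokens` into windows of `size`; adjacent windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = "Revenue for FY2023 was $4.2B up 12% YoY driven by cloud growth".split()
chunks = sliding_window_chunks(tokens, size=10, overlap=2)  # 2 / 10 = 20% overlap

# Tokens that appear at the end of chunk 0 and again at the start of chunk 1.
shared = chunks[0][-2:]
print(shared)            # ['driven', 'by']
print(chunks[1][:2])     # ['driven', 'by']
```

Any expected statement that touches the shared tokens can be matched against either chunk, which is the situation this issue is about.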

## Minimal Reproducer

```python
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# 10-K with 20% overlap across section boundary
chunk_a = "Revenue for FY2023 was $4.2B, up 12% YoY, driven by cloud growth."
chunk_b = "Driven by cloud growth, revenue for FY2023 reached $4.2B (12% increase)."

test_case = LLMTestCase(
    input="What was revenue for FY2023?",
    actual_output="Revenue was $4.2 billion in FY2023, up 12% year-over-year.",
    expected_output="Revenue for FY2023 was $4.2 billion, up 12% year-over-year.",
    retrieval_context=[chunk_a, chunk_b],
)
metric = ContextualRecallMetric(threshold=0.8)
# measure() calls the LLM judge, so a model / API key must be configured.
metric.measure(test_case)
print(metric.score)  # Expected: ~1.0  Actual: ~0.5 (penalises chunk_b redundancy)
```

## Root Cause

`ContextualRecallMetric` does not have a source-grouping step equivalent to the `_group_retrieval_contexts()` step added to `ContextualPrecisionMetric` in PR #2743. Each chunk is scored independently, so a "no" verdict on a redundant overlapping chunk counts against recall even when a neighbouring chunk already covers the statement.
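The arithmetic behind the reported ~0.5 versus the expected ~1.0 can be shown with a toy model. This is only a sketch of the behaviour as described in this issue, not deepeval's actual scoring code; the `verdicts` structure and both function names are hypothetical.

```python
# Hypothetical judge verdicts: statement -> {chunk name -> does the chunk support it?}
verdicts = {
    "Revenue for FY2023 was $4.2 billion, up 12% year-over-year.": {
        "chunk_a": True,   # judge says "yes"
        "chunk_b": False,  # judge says "no" because the chunk looks redundant
    },
}


def per_chunk_score(verdicts):
    """Average over every (statement, chunk) verdict: a redundant 'no' lowers the score."""
    flat = [v for per_chunk in verdicts.values() for v in per_chunk.values()]
    return sum(flat) / len(flat)


def per_statement_score(verdicts):
    """A statement counts as recalled if at least one chunk supports it."""
    covered = [any(per_chunk.values()) for per_chunk in verdicts.values()]
    return sum(covered) / len(covered)


print(per_chunk_score(verdicts))      # 0.5
print(per_statement_score(verdicts))  # 1.0
```

Under this model the information is fully present in the retrieval context, yet independent per-chunk scoring reports half the recall.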

## Proposed Fix

Apply the same `RetrievedContextData.source`-based grouping to `ContextualRecallMetric`:
1. Accept `RetrievedContextData` objects in `retrieval_context` (already partially supported).
2. Before scoring, merge chunks sharing the same `source` string into a single evaluation unit.
3. Score the merged unit once per source, not once per chunk.

This mirrors exactly what was done for `ContextualPrecisionMetric` and should be a small, symmetric change.
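A rough sketch of the grouping step (steps 2 and 3 above), to make the proposal concrete. `Chunk` is a hypothetical stand-in for `RetrievedContextData`, and `group_by_source` is an illustrative name; neither is taken from the deepeval codebase, and the real helper in PR #2743 may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a retrieved chunk that carries a source identifier."""
    source: str
    text: str


def group_by_source(chunks):
    """Merge chunks sharing a `source` into one evaluation unit per source.

    Returns a dict mapping source -> merged text, in first-seen source order.
    """
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk.source].append(chunk.text)
    return {source: "\n".join(texts) for source, texts in grouped.items()}


chunks = [
    Chunk("10-K/item-7", "Revenue for FY2023 was $4.2B, up 12% YoY, driven by cloud growth."),
    Chunk("10-K/item-7", "Driven by cloud growth, revenue for FY2023 reached $4.2B (12% increase)."),
    Chunk("10-K/item-8", "Operating expenses are discussed in the notes."),
]
units = group_by_source(chunks)
print(len(units))   # 2 evaluation units instead of 3 chunks
print(list(units))  # ['10-K/item-7', '10-K/item-8']
```

With this in place the judge would be asked once per source, so two overlapping chunks from the same section cannot produce conflicting verdicts for the same statement. Simple concatenation keeps the overlapping text twice; deduplicating it is a possible refinement but is not needed for the scoring fix.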

## Impact

- Financial RAG pipelines with standard 10–20% chunk overlap systematically under-report recall.
- Users who increase overlap to improve answer quality paradoxically see recall scores drop.
- Same failure mode as #2594, different metric.

## Happy to draft a PR

I can draft the fix + regression tests (parallel to PR #2787) if this aligns with the team's direction for `ContextualRecallMetric`.
