test(metrics): add overlapping-chunk regression fixtures for ContextualRecallMetric (closes #2788) #2789

Draft
Ruthwik-Data wants to merge 8 commits into confident-ai:main from Ruthwik-Data:test/contextual-recall-overlapping-chunks

Conversation

@Ruthwik-Data

Contributor

Summary

Adds tests/test_metrics/test_contextual_recall_overlapping_chunks.py, which contains regression fixtures and tests for ContextualRecallMetric's overlapping-chunk behaviour.

Closes #2788.

This is the symmetric complement to PR #2787 (which addresses the same issue for ContextualPrecisionMetric). The tests document the failure mode and will serve as regression targets once the source-grouping fix is applied to ContextualRecallMetric.

Motivation

PR #2743 added _group_retrieval_contexts() (source-grouping deduplication) to ContextualPrecisionMetric to fix issue #2594. The same fix was not applied symmetrically to ContextualRecallMetric, leaving an identical failure mode:

  • RAG pipelines with 10–20% sliding-window chunk overlap (standard for dense financial documents) produce chunks where the same information appears in adjacent chunks.
  • ContextualRecallMetric scores each chunk independently. If the LLM judge returns yes for the first of two overlapping chunks and no for the second because it is partially redundant, the recall score is halved, even though the expected output is fully covered.
  • Redundancy ≠ Missing coverage.
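
The arithmetic behind the second bullet can be sketched with a toy model. This is not DeepEval's implementation: the `Verdict` type, its `source` field, and both scoring functions are hypothetical names introduced here, and the grouping only mimics the source-grouping idea described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical judge verdict for one retrieved chunk."""
    source: str      # assumed identifier of the originating document
    supports: bool   # True if the judge answered "yes" for this chunk


def per_chunk_score(verdicts):
    """Fraction of chunks judged 'yes', each chunk counted independently."""
    return sum(v.supports for v in verdicts) / len(verdicts)


def per_source_score(verdicts):
    """Fraction of sources with at least one 'yes' chunk (toy source grouping)."""
    by_source = defaultdict(list)
    for v in verdicts:
        by_source[v.source].append(v.supports)
    return sum(any(flags) for flags in by_source.values()) / len(by_source)


# Two overlapping chunks from the same source: yes for the first, no for the second.
verdicts = [
    Verdict(source="10-K MD&A", supports=True),
    Verdict(source="10-K MD&A", supports=False),
]
print(per_chunk_score(verdicts))   # 0.5
print(per_source_score(verdicts))  # 1.0
```

Under these assumptions the redundant chunk halves the per-chunk score, while grouping by source leaves it at 1.0.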

What this PR adds

Fixtures

| Fixture | Scenario |
| --- | --- |
| overlapping_revenue_chunks | Two same-source overlapping chunks from 10-K MD&A |
| multi_statement_expected_output | Three-statement expected output spanning two sources |
| retrieval_context_multi_source | Two overlapping 10-K chunks + one earnings-call chunk |

Tests

| Test | Guards against |
| --- | --- |
| test_same_source_overlap_does_not_lower_recall | Recall penalised for same-source chunk redundancy |
| test_multi_source_recall_not_inflated_by_overlap | Overlapping chunks distorting multi-statement recall |
| test_increasing_overlap_does_not_decrease_recall | Monotonicity: more context from same source → recall stays stable |
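
The monotonicity property in the last row can be illustrated offline with a minimal sketch. It is not the PR's test code: `toy_recall` is a hypothetical stand-in that uses substring matching in place of the LLM judge, and the chunk strings are invented for illustration rather than taken from the fixtures.

```python
def toy_recall(expected_statements, retrieval_context):
    """Share of expected statements found verbatim in at least one chunk.

    Substring matching stands in for the LLM judge; this is an assumption
    made so the sketch runs without any model calls.
    """
    covered = [
        any(statement in chunk for chunk in retrieval_context)
        for statement in expected_statements
    ]
    return sum(covered) / len(covered)


# Invented example text: three chunks cut from one source with overlap.
expected = ["Revenue grew 12%", "Operating margin was 31%"]
chunk_a = "Revenue grew 12% year over year."
chunk_b = "grew 12% year over year. Operating margin was 31%."
chunk_c = "year over year. Operating margin was 31%. Cash flow improved."

contexts = [[chunk_a], [chunk_a, chunk_b], [chunk_a, chunk_b, chunk_c]]
scores = [toy_recall(expected, context) for context in contexts]

# Adding overlapping chunks from the same source must never lower recall.
assert all(later >= earlier for earlier, later in zip(scores, scores[1:]))
print(scores)  # [0.5, 1.0, 1.0]
```

Because a statement counts as covered when any chunk supports it, adding redundant chunks can only keep the score level or raise it, which is the behaviour the test is meant to pin down.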

Design notes

Type of change

  • Test (adding missing tests or correcting existing tests)

…alRecallMetric (issue confident-ai#2788)

This test suite verifies the behavior of the ContextualRecallMetric in scenarios with overlapping chunks, ensuring that recall scores remain accurate and do not penalize redundancy. It includes tests for same-source overlaps, multi-source retrieval, and the impact of increasing overlap on recall.
@vercel

vercel Bot commented Jun 20, 2026


@Ruthwik-Data is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

Development

Successfully merging this pull request may close these issues.

[Bug] ContextualRecallMetric over-penalises overlapping chunks — parallel issue to #2594 (ContextualPrecision)
