test(metrics): add overlapping-chunk regression fixtures for ContextualRecallMetric (closes #2788) #2789
Draft
Ruthwik-Data wants to merge 8 commits into
…alRecallMetric (issue confident-ai#2788) This test suite verifies the behavior of the ContextualRecallMetric in scenarios with overlapping chunks, ensuring that recall scores remain accurate and do not penalize redundancy. It includes tests for same-source overlaps, multi-source retrieval, and the impact of increasing overlap on recall.
@Ruthwik-Data is attempting to deploy a commit to the Confident AI Team on Vercel. A member of the Team first needs to authorize it.
Summary
Adds tests/test_metrics/test_contextual_recall_overlapping_chunks.py: regression test fixtures for ContextualRecallMetric overlapping-chunk behaviour.
Closes #2788.
This is the symmetric complement to PR #2787 (which addresses the same issue for ContextualPrecisionMetric). The tests document the failure mode and will serve as regression targets once the source-grouping fix is applied to ContextualRecallMetric.
Motivation
PR #2743 added _group_retrieval_contexts() (source-grouping deduplication) to ContextualPrecisionMetric to fix issue #2594. The same fix was not applied symmetrically to ContextualRecallMetric, leaving an identical failure mode: ContextualRecallMetric scores each chunk independently. If the LLM judge returns yes for the first chunk and no for the second (partial redundancy), the recall score is halved, even though the expected output is fully covered.
What this PR adds
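To make the halving described under Motivation concrete, here is a minimal sketch in plain Python. It is a toy model only: the function names, the verdict tuples, and the source name report.pdf are assumptions for illustration, and it does not reproduce deepeval's actual scoring code or the behaviour of _group_retrieval_contexts().

```python
from collections import defaultdict

# Hypothetical toy model of the failure mode; not deepeval's implementation.

def per_chunk_recall(verdicts):
    """Score every chunk separately: fraction of chunks judged 'yes'."""
    return sum(verdict == "yes" for _, verdict in verdicts) / len(verdicts)


def source_grouped_recall(verdicts):
    """Collapse chunks that share a source; a source counts if any chunk is 'yes'."""
    by_source = defaultdict(list)
    for source, verdict in verdicts:
        by_source[source].append(verdict)
    hits = sum("yes" in source_verdicts for source_verdicts in by_source.values())
    return hits / len(by_source)


# Two overlapping chunks from the same made-up source document.
verdicts = [("report.pdf", "yes"), ("report.pdf", "no")]

print(per_chunk_recall(verdicts))       # 0.5
print(source_grouped_recall(verdicts))  # 1.0
```

Under these assumptions the redundant second chunk drags the per-chunk score to 0.5, while grouping by source keeps it at 1.0.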
Fixtures
overlapping_revenue_chunks
multi_statement_expected_output
retrieval_context_multi_source
Tests
test_same_source_overlap_does_not_lower_recall
test_multi_source_recall_not_inflated_by_overlap
test_increasing_overlap_does_not_decrease_recall
Design notes
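As a rough illustration of the property that test_increasing_overlap_does_not_decrease_recall targets, the sketch below uses a stand-in scorer in plain Python. The scorer toy_recall, the sample statements, and the chunk texts are all made up for this example; the real test presumably calls ContextualRecallMetric with an LLM judge, which is not reproduced here. Assuming pass/fail is decided by comparing the score to the threshold, a threshold of 0.0 can never cause a failure for a non-negative score, so only the score comparison matters.

```python
# Hypothetical stand-in scorer; not the metric's real logic.
def toy_recall(expected_statements, chunks):
    """Share of expected statements found verbatim in at least one chunk."""
    covered = sum(
        any(statement in chunk for chunk in chunks)
        for statement in expected_statements
    )
    return covered / len(expected_statements)


# Made-up data for illustration only.
expected = ["Revenue grew 12%", "Margins held steady"]
base = ["Revenue grew 12% in Q3.", "Margins held steady year over year."]
overlap = "Revenue grew 12% in Q3. Costs were flat."

threshold = 0.0  # never fails a non-negative score
scores = []
for extra in range(4):
    # Append 0, 1, 2, then 3 copies of the overlapping chunk.
    chunks = base + [overlap] * extra
    score = toy_recall(expected, chunks)
    assert score >= threshold
    scores.append(score)

# Monotonicity: adding overlap must not lower the score.
assert all(later >= earlier for earlier, later in zip(scores, scores[1:]))
print(scores)  # [1.0, 1.0, 1.0, 1.0]
```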
RetrievedContextData(content=..., source=...): post-#2743 API (feat(contextual-precision): add RetrievedContextData source grouping and fix weighted precision score).
threshold=0.0 in the monotonicity test isolates scoring logic from pass/fail cutoffs.
Type of change