Description
- [ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug
The ContextPrecision evaluator in RAGAS produces scores that are strongly influenced by the position of the relevant context in the input array. When the relevant context appears as the first element of retrieved_contexts, the score is artificially inflated (often >0.9) even when all the other contexts are irrelevant, and the score gradually decreases as the relevant context moves to later positions.
Ragas version: 0.2.14
Python version: 3.11
Code to Reproduce
The test below fails, with a score of 0.9999999999:
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ContextPrecision


def test_context_precision():
    sample = SingleTurnSample(
        reference="The capital of France is Paris.",
        retrieved_contexts=[
            "The capital of France is Paris",
            "Bahmni is a comprehensive, easy-to-use, and fully open-source Hospital Information System (HIS)",
            "it suitable for a wide range of healthcare facilities, from small clinics to large hospitals.",
            "A resilient EMR and hospital management system built on reliable open-source components.",
        ],
        user_input="What is the capital of France?",
    )
    context_precision = ContextPrecision(llm=evaluator_llm)
    score = context_precision.single_turn_score(sample)
    print("Context Precision score: ", score)
    assert score < 0.3
```
Whereas the test below passes. The only change is that the relevant context has been moved to the last position:
```python
def test_context_precision():
    sample = SingleTurnSample(
        reference="The capital of France is Paris.",
        retrieved_contexts=[
            "Bahmni is a comprehensive, easy-to-use, and fully open-source Hospital Information System (HIS)",
            "it suitable for a wide range of healthcare facilities, from small clinics to large hospitals.",
            "A resilient EMR and hospital management system built on reliable open-source components.",
            "The capital of France is Paris",
        ],
        user_input="What is the capital of France?",
    )
    context_precision = ContextPrecision(llm=evaluator_llm)
    score = context_precision.single_turn_score(sample)
    print("Context Precision score: ", score)
    assert score < 0.3
```
Also:
- When the relevant context is in the second position, the score is 0.49999999995.
- When it is in the third position, the score is 0.3333333333.
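For what it's worth, the reported scores (≈1.0, 0.5, 0.333…, and a passing <0.3 at position four) exactly match a rank-weighted mean precision@k with a single relevant context, which suggests the behavior comes from the metric's formula rather than from LLM noise. A minimal sketch (a hypothetical helper, not RAGAS code; the real metric derives the relevance verdicts from the LLM):

```python
def mean_precision_at_k(relevance):
    """Average of precision@k over the positions k that hold a relevant context."""
    hits = 0
    precisions = []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # precision@k at each relevant position
    return sum(precisions) / max(len(precisions), 1)


# One relevant context among four: the score is simply 1/position.
for relevance in ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]):
    print(relevance, mean_precision_at_k(relevance))
# → 1.0, 0.5, 0.333..., 0.25 — matching the observed scores above
```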
Error trace
Not applicable.
Expected behavior
The score should not be affected by the position of relevant context.
Additional context
The issue is consistent across models: OpenAI GPT-4o, Claude 3.7 Sonnet on AWS Bedrock, and Azure OpenAI GPT-4o. This position bias severely impacts the reliability of the Context Precision metric for evaluating RAG systems. In real-world applications, the ordering of retrieved contexts should not affect the precision score, as the goal is to measure how many of the retrieved contexts are actually relevant to the question.
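If the position sensitivity is indeed baked into the formula, one hypothetical workaround (a sketch, not a RAGAS API) is to average the metric over every ordering of the contexts, which yields an order-insensitive score:

```python
from itertools import permutations


def mean_precision_at_k(relevance):
    """Average of precision@k over the positions k that hold a relevant context."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(len(precisions), 1)


def order_insensitive_precision(relevance):
    """Average mean_precision_at_k over all permutations of the contexts.

    Only feasible for small context lists (factorial cost); a random sample
    of shuffles would be the practical variant.
    """
    perms = list(permutations(relevance))
    return sum(mean_precision_at_k(p) for p in perms) / len(perms)


# The relevant context's position no longer matters:
print(order_insensitive_precision([1, 0, 0, 0]))  # same value...
print(order_insensitive_precision([0, 0, 0, 1]))  # ...as this one
```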