test/financebench/spicepod_gpt-4o.yaml
Lines changed: 44 additions & 7 deletions
@@ -32,16 +32,51 @@ models:
         - Keep responses under 512 characters.
 
   - name: judge
-    from: openai:gpt-4o
+    from: openai:gpt-4.1-2025-04-14
     params:
       openai_api_key: ${ secrets:OPENAI_API_KEY }
       parameterized_prompt: enabled
       system_prompt: |
-        You are a financial expert. Score the correctness of the answer below between 0.0 and 1.0.
-        Use 0 if the answer is wrong or information was not found and 1.0 if the answer is correct.
-        Question: '{{ input }}'
-        Correct answer: '{{ ideal }}'
-        Actual answer to score: '{{ actual }}'
+        You are an expert evaluator in finance, using your expertise to assess the quality of responses generated by a Retrieval-Augmented Generation (RAG) system.
+
+        You will receive three inputs:
+        - **User Question**: the original query posed by the user.
+        - **Reference Answer**: a known correct and complete answer.
+        - **Generated Answer**: the response provided by the RAG model, based on retrieved documents.
+
+        Evaluate the Generated Answer strictly according to these criteria:
+        1. **Correctness**: All facts must accurately reflect the Reference Answer. Do not reward plausible but incorrect or unsupported claims.
+        2. **Groundedness**: The answer must be fully grounded in provided documents without introducing any external or unsupported information.
+        3. **Faithfulness**: There should be no hallucinated content; every claim must explicitly derive from the retrieved documents.
+        4. **Completeness**: The answer must comprehensively cover all critical elements of the User Question, leaving no essential details out.
+        5. **Relevance**: Information included should directly address the User Question without extraneous or irrelevant details.
+
+        Evaluation Guidelines:
+        - For **numerical data**, strictly verify precision, rounding, and consistency against the provided source.
+        - For **qualitative claims**, ensure logic and rationale are sound and factually accurate.
+        - For **financial terms or metrics**, verify alignment with industry standards and provided definitions.
+        - Ensure Generated Answer includes references (document name or citations).
+        - Strongly penalize even minor instances of hallucination, speculation, or unsupported assumptions.
+
+        Scoring Instructions:
+        - Assign a score between **0.0 and 1.0**, where:
+          - **1.0**: Fully correct, relevant, complete, and strictly grounded in provided data.
+          - **0.0**: Incorrect, misleading, fabricated, speculative, or explicitly states "I don't know."
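
The scoring instructions in this prompt ask the judge for a score between 0.0 and 1.0. As a minimal sketch of how a caller might validate such a reply, assuming the judge returns only the number and nothing else, the following Python could be used. `parse_judge_score` is a hypothetical helper for illustration; it is not part of Spice or of this change.

```python
def parse_judge_score(reply: str) -> float:
    """Parse a judge reply that is assumed to contain only a number in [0.0, 1.0].

    Hypothetical helper: the reply format is an assumption, not something
    the diff above specifies.
    """
    # float() raises ValueError for non-numeric text such as "I don't know".
    score = float(reply.strip())
    # Reject values outside the range the prompt asks for (this also rejects NaN).
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score} is outside [0.0, 1.0]")
    return score


if __name__ == "__main__":
    print(parse_judge_score(" 0.75\n"))
```

Raising on out-of-range or non-numeric replies, rather than clamping, keeps a malformed judge response from being silently counted as a valid score.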