Commit 1a71538

FinanceBench: update scorer instructions and switch scoring model to gpt-4.1 (spiceai#5395)
* FinanceBench: update scorer instructions and switch scoring model to `gpt-4.1`
* Include evals response message when running evals benchmark
1 parent 226e383 commit 1a71538

2 files changed

Lines changed: 52 additions & 13 deletions

test/financebench/spicepod_gpt-4o.yaml

Lines changed: 44 additions & 7 deletions
@@ -32,16 +32,51 @@ models:
         - Keep responses under 512 characters.

   - name: judge
-    from: openai:gpt-4o
+    from: openai:gpt-4.1-2025-04-14
     params:
       openai_api_key: ${ secrets:OPENAI_API_KEY }
       parameterized_prompt: enabled
       system_prompt: |
-        You are a financial expert. Score the correctness of the answer below between 0.0 and 1.0.
-        Use 0 if the answer is wrong or information was not found and 1.0 if the answer is correct.
-        Question: '{{ input }}'
-        Correct answer: '{{ ideal }}'
-        Actual answer to score: '{{ actual }}'
+        You are an expert evaluator in finance, using your expertise to assess the quality of responses generated by a Retrieval-Augmented Generation (RAG) system.
+
+        You will receive three inputs:
+        - **User Question**: the original query posed by the user.
+        - **Reference Answer**: a known correct and complete answer.
+        - **Generated Answer**: the response provided by the RAG model, based on retrieved documents.
+
+        Evaluate the Generated Answer strictly according to these criteria:
+        1. **Correctness**: All facts must accurately reflect the Reference Answer. Do not reward plausible but incorrect or unsupported claims.
+        2. **Groundedness**: The answer must be fully grounded in provided documents without introducing any external or unsupported information.
+        3. **Faithfulness**: There should be no hallucinated content; every claim must explicitly derive from the retrieved documents.
+        4. **Completeness**: The answer must comprehensively cover all critical elements of the User Question, leaving no essential details out.
+        5. **Relevance**: Information included should directly address the User Question without extraneous or irrelevant details.
+
+        Evaluation Guidelines:
+        - For **numerical data**, strictly verify precision, rounding, and consistency against the provided source.
+        - For **qualitative claims**, ensure logic and rationale are sound and factually accurate.
+        - For **financial terms or metrics**, verify alignment with industry standards and provided definitions.
+        - Ensure Generated Answer includes references (document name or citations).
+        - Strongly penalize even minor instances of hallucination, speculation, or unsupported assumptions.
+
+        Scoring Instructions:
+        - Assign a score between **0.0 and 1.0**, where:
+        - **1.0**: Fully correct, relevant, complete, and strictly grounded in provided data.
+        - **0.0**: Incorrect, misleading, fabricated, speculative, or explicitly states "I don't know."
+        - Intermediate scores (e.g., 0.6, 0.8) indicate partially correct, incomplete, or partially grounded responses.
+        - Do NOT reward answers that sound correct but lack explicit grounding in provided documents.
+        - Assign **0.0** immediately if required information from retrieved documents is missing entirely or if the answer explicitly admits ignorance.
+
+        You must ONLY return final score, no commentary or explanation
+
+        # User Question:
+        {{ input }}
+
+        # Reference Answer:
+        {{ ideal }}
+
+        # Generated Answer:
+        {{ actual }}
+
       openai_response_format:
         type: json_schema
         json_schema:
@@ -50,12 +85,14 @@ models:
             type: object
             properties:
               score:
+                description: >
+                  The score assigned to the actual answer based on the evaluation criteria.
                 type: number
                 format: float
             additionalProperties: true
             required:
               - score
-          strict: false
+          strict: true

 views:
   - name: financebench.evals
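
The response schema in this hunk constrains `score` to a number, while the 0.0 to 1.0 range is stated only in the prompt text. A consumer of the judge output may therefore want to check the range itself. The following is a minimal sketch of such a check, not code from this commit; the function names `validate_score` and `mean_score` are hypothetical, and it assumes the score has already been extracted from the JSON response as an `f64`.

```rust
/// Hypothetical helper: accept a judge score only if it is a finite
/// number inside the 0.0..=1.0 range described by the scorer prompt.
pub fn validate_score(score: f64) -> Result<f64, String> {
    if score.is_finite() && (0.0..=1.0).contains(&score) {
        Ok(score)
    } else {
        Err(format!("score {score} is outside the expected range 0.0..=1.0"))
    }
}

/// Hypothetical helper: arithmetic mean of already-validated scores.
/// Returns `None` for an empty slice instead of dividing by zero.
pub fn mean_score(scores: &[f64]) -> Option<f64> {
    if scores.is_empty() {
        return None;
    }
    Some(scores.iter().sum::<f64>() / scores.len() as f64)
}
```

Rejecting out-of-range values instead of clamping them keeps a misbehaving judge visible in the benchmark output; whether that is the right trade-off for FinanceBench is an assumption of this sketch.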

tools/testoperator/src/commands/evals/mod.rs

Lines changed: 8 additions & 6 deletions
@@ -62,14 +62,16 @@ pub(crate) async fn run(args: &EvalsTestArgs) -> anyhow::Result<()> {
         .send()
         .await?;

-    if !response.status().is_success() {
-        return Err(anyhow::anyhow!(
-            "Failed to execute evals: {}",
-            response.text().await?
-        ));
+    let response_status = response.status();
+    let response_msq = response.text().await?;
+
+    if !response_status.is_success() {
+        return Err(anyhow::anyhow!("Failed to execute evals: {response_msq}"));
     }

-    println!("Execution completed, retrieving results...");
+    println!("Evals completed:\n{response_msq}");
+
+    println!("Retrieving results...");

     let mut flight_client = spiced_instance.flight_client(None).await?;

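
The hunk above reads the status before the body. With a response type whose `text` method takes `self` by value (as `reqwest::Response::text` does), the status has to be copied out first if the body is needed on both the success and the error path. The sketch below illustrates that ordering with a synchronous, standard-library-only stand-in. `MockResponse` and `summarize` are hypothetical names, a plain `u16` replaces the real status type, and none of this is the testoperator's actual code.

```rust
/// Hypothetical stand-in for an HTTP response whose body can be read only once.
pub struct MockResponse {
    status: u16,
    body: String,
}

impl MockResponse {
    pub fn new(status: u16, body: &str) -> Self {
        Self { status, body: body.to_string() }
    }

    /// Borrows the response, so it can be called before the body is read.
    pub fn status(&self) -> u16 {
        self.status
    }

    /// Consumes the response; `status()` is no longer callable afterwards.
    pub fn text(self) -> Result<String, String> {
        Ok(self.body)
    }
}

/// Mirrors the control flow of the hunk: capture the status, read the body
/// once, then use the same body text for either the error or the success message.
pub fn summarize(response: MockResponse) -> Result<String, String> {
    let status = response.status(); // copy the status out before consuming
    let body = response.text()?; // `response` is moved here

    if !(200..300).contains(&status) {
        return Err(format!("Failed to execute evals: {body}"));
    }
    Ok(format!("Evals completed:\n{body}"))
}
```

For example, `summarize(MockResponse::new(500, "boom"))` returns the error string with the body included, and a 200 response returns the body under the "Evals completed" heading.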
