Add Formal Math Eval Docs #729
base: main
Conversation
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: Igor Gitman <[email protected]>
Did some style fixes so that the docs display properly.
I think our default evaluation setup here might actually need to be updated. The default prompt seems to be for non-reasoning models (asking for Lean code right away), so we should probably change that. Also, the FINAL ANSWER marker isn't even mentioned in that prompt, so we should probably either not use it by default or make it consistent with the prompt.
We probably need to change the default setup and update the docs accordingly. It would be good to add an example evaluation command that can match the results of DeepSeek-Prover or Goedel-Prover.
Signed-off-by: Stephen Ge <[email protected]>
Signed-off-by: Stephen Ge <[email protected]>
Added a few comments. Please also make sure to run `mkdocs serve` and check that the rendering on the website looks good.
docs/evaluation/formal-math.md
Outdated
> - If the line already includes a complete proof artifact, it will be used directly; otherwise the proof is assembled from the model's generated text and dataset metadata.
> - `restate_formal_statement` (default: True)
>   - Controls whether the dataset's `formal_statement` is inserted before the proof. Keeping this enabled enforces the canonical theorem; disabling it relies on the model's emitted statement and is generally not recommended for benchmarking.
> - `timeout` (default: 30.0 seconds)
Should we increase the default?
30s is a good default (still a low single-digit % timeout rate on most benchmarks). Flagging it in this section is good for users who want to increase it for finer control of the evaluation environment.
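The proof-assembly and restatement behavior described in the doc excerpt above could be sketched roughly as follows. This is a minimal illustration of the documented logic; the function and parameter names here are hypothetical, not the actual NeMo-Skills implementation:

```python
# Hypothetical sketch of the documented proof-assembly behavior.
# assemble_proof is an illustrative name, not a real NeMo-Skills function.

def assemble_proof(generation: str, formal_statement: str,
                   restate_formal_statement: bool = True) -> str:
    """Build the Lean proof text that gets sent to the checker."""
    proof_body = generation.strip()
    if restate_formal_statement:
        # Enforce the canonical theorem statement from the dataset,
        # ignoring whatever statement the model may have emitted.
        return formal_statement.rstrip() + "\n" + proof_body
    # Otherwise trust the model's own emitted statement.
    return proof_body

statement = "theorem add_comm_ex (a b : Nat) : a + b = b + a := by"
proof = assemble_proof("  omega", statement)
```

With the default `restate_formal_statement=True`, the canonical statement always prefixes the proof body, which is why disabling it is discouraged for benchmarking.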
docs/evaluation/formal-math.md
Outdated
> ++inference.top_p=0.95 \
> ++inference.tokens_to_generate=38912 \
> --extra_eval_args="++eval_config.timeout=400" \
> ++prompt_config=lean4/formal-proof-deepseek-prover-v2
Is this the default? If not, we should probably make it the default.
Yes, it looks like that is the default `prompt_config` in the minif2f `__init__.py`.
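For context, a full evaluation command wrapping the flags quoted above might look roughly like the following. Only the `++inference.*`, `--extra_eval_args`, and `++prompt_config` lines come from the diff; the `ns eval` entrypoint and the `--benchmarks`, `--model`, and `--output_dir` flag names are assumptions and should be checked against the final docs:

```shell
# Hypothetical full invocation; only the last four flags are taken
# from the PR diff above, the rest are illustrative assumptions.
ns eval \
    --benchmarks=minif2f \
    --model=/path/to/model \
    --output_dir=/path/to/results \
    ++inference.top_p=0.95 \
    ++inference.tokens_to_generate=38912 \
    --extra_eval_args="++eval_config.timeout=400" \
    ++prompt_config=lean4/formal-proof-deepseek-prover-v2
```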
Signed-off-by: Stephen Ge <[email protected]>
Signed-off-by: Stephen Ge <[email protected]>
LGTM. Can't approve because I opened the PR, but I can help merge when ready.
Adds docs for formal math evaluation