
Conversation

@gwarmstrong (Collaborator) commented Aug 27, 2025

Adds docs for formal math evaluation

Signed-off-by: George Armstrong <[email protected]>
@gwarmstrong gwarmstrong requested review from Kipok and fchen97 August 27, 2025 17:00
@Kipok (Collaborator) left a comment


Did some style fixes so that the docs are displayed properly.

I think our default evaluation setup here might actually need to be updated. The default prompt seems to be intended for non-reasoning models (it asks for Lean code right away), so we should probably change that. Also, the FINAL ANSWER marker isn't even mentioned in that prompt, so we should probably either not use it by default or make it consistent with the prompt.

We probably need to change the default setup and update the docs accordingly. It would also be good to add an example evaluation command that can match the results of DS-prover or Goedel-Prover.

@stephencge stephencge requested review from Kipok and removed request for fchen97 November 3, 2025 13:28
@Kipok (Collaborator) left a comment


Added a few comments. Please also make sure to run `mkdocs serve` and check that the rendering on the website looks good.

Excerpt from the docs under review:

> - If the line already includes a complete proof artifact, it will be used directly; otherwise the proof is assembled from the model's generated text and dataset metadata.
> - `restate_formal_statement` (default: `True`)
>   - Controls whether the dataset's `formal_statement` is inserted before the proof. Keeping this enabled enforces the canonical theorem; disabling it relies on the model's emitted statement and is generally not recommended for benchmarking.
> - `timeout` (default: 30.0 seconds)
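The assembly rule described above can be sketched as a small Python function. This is illustrative only: `assemble_proof` and its heuristic for detecting a complete artifact (a leading `theorem` keyword) are hypothetical, not the actual implementation.

```python
def assemble_proof(generation: str, formal_statement: str,
                   restate_formal_statement: bool = True) -> str:
    """Hypothetical sketch of the proof-assembly rule from the docs."""
    # If the generation already looks like a complete proof artifact
    # (approximated here by a leading `theorem`), use it directly.
    if generation.lstrip().startswith("theorem"):
        return generation
    # Otherwise, optionally restate the dataset's canonical statement
    # before the model-generated proof body.
    if restate_formal_statement:
        return formal_statement.rstrip() + "\n" + generation
    return generation
```

With `restate_formal_statement` disabled, the model's own emitted statement would have to stand in for the canonical one, which is why the docs recommend keeping it enabled for benchmarking.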
**Collaborator:** Should we increase the default?

**Collaborator:** 30s is a good default (timeout rates are still in the low single-digit percentages on most benchmarks). Flagging it in this section is useful for users who want to increase it for finer control of the evaluation environment.
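The `timeout` behavior being discussed can be sketched as a wrapper that treats a timed-out checker invocation as a failed proof. This is a hypothetical sketch: `run_checker` is not the real function, and in a real setup `cmd` would invoke the Lean checker, while here any command that reads stdin works for demonstration.

```python
import subprocess

def run_checker(cmd: list[str], proof: str, timeout: float = 30.0) -> bool:
    """Run an external proof checker on `proof`, treating a timeout as failure."""
    try:
        result = subprocess.run(
            cmd,
            input=proof.encode(),
            capture_output=True,
            timeout=timeout,  # mirrors the `timeout` eval parameter
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # A proof that cannot be checked within the budget counts as failed.
        return False
```

Raising the timeout trades throughput for fewer spurious failures on hard proofs, which is the tradeoff behind the comment above.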

```bash
++inference.top_p=0.95 \
++inference.tokens_to_generate=38912 \
--extra_eval_args="++eval_config.timeout=400" \
++prompt_config=lean4/formal-proof-deepseek-prover-v2
```
**Collaborator:** Is this the default? If not, we should probably make it the default.

**Collaborator:** Yes, it looks like that is the default `prompt_config` in the minif2f `__init__.py`.

@gwarmstrong (Collaborator, Author) left a comment


LGTM. I can't approve because I opened the PR, but I can help merge when ready.
