
Conversation

@gwarmstrong (Collaborator) commented Aug 27, 2025

Adds docs for formal math evaluation

Signed-off-by: George Armstrong <[email protected]>
@gwarmstrong gwarmstrong requested review from Kipok and fchen97 August 27, 2025 17:00
@Kipok (Collaborator) left a comment


Did some style fixes so that the docs are displayed properly.

I think our default evaluation setup here might actually need to be updated. The default prompt seems to be intended for non-reasoning models (it asks for Lean code right away), so we should probably change that. Also, the FINAL ANSWER marker isn't even mentioned in that prompt, so we should probably either not use it by default or make it consistent with the prompt.

We probably need to change the default setup and update the docs accordingly. It would also be good to add an example evaluation command that can match the results of DS-prover or Goedel-Prover.

@stephencge stephencge requested review from Kipok and removed request for fchen97 November 3, 2025 13:28
@Kipok (Collaborator) left a comment


Added a few comments. Please also make sure to run `mkdocs serve` and check that the rendering on the website looks good.

Excerpt from the docs under review:

> - If the line already includes a complete proof artifact, it will be used directly; otherwise the proof is assembled from the model's generated text and dataset metadata.
> - `restate_formal_statement` (default: `True`)
>   - Controls whether the dataset's `formal_statement` is inserted before the proof. Keeping this enabled enforces the canonical theorem; disabling it relies on the model's emitted statement and is generally not recommended for benchmarking.
> - `timeout` (default: 30.0 seconds)
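The assembly rule described above can be sketched as a small Python function. This is illustrative only: `assemble_proof` and its heuristic for detecting a complete artifact (a leading `theorem` keyword) are hypothetical, not the actual implementation.

```python
def assemble_proof(generation: str, formal_statement: str,
                   restate_formal_statement: bool = True) -> str:
    """Hypothetical sketch of the proof-assembly rule from the docs."""
    # If the generation already looks like a complete proof artifact
    # (approximated here by a leading `theorem`), use it directly.
    if generation.lstrip().startswith("theorem"):
        return generation
    # Otherwise, optionally restate the dataset's canonical statement
    # before the model-generated proof body.
    if restate_formal_statement:
        return formal_statement.rstrip() + "\n" + generation
    return generation
```

With `restate_formal_statement` disabled, the model's own emitted statement would have to stand in for the canonical one, which is why the docs recommend keeping it enabled for benchmarking.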
**Collaborator:** Should we increase the default?

**Collaborator:** 30s is a good default (timeout rates are still in the low single-digit percentages on most benchmarks). Flagging it in this section is useful for users who want to increase it for finer control of the evaluation environment.
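The `timeout` behavior being discussed can be sketched as a wrapper that treats a timed-out checker invocation as a failed proof. This is a hypothetical sketch: `run_checker` is not the real function, and in a real setup `cmd` would invoke the Lean checker, while here any command that reads stdin works for demonstration.

```python
import subprocess

def run_checker(cmd: list[str], proof: str, timeout: float = 30.0) -> bool:
    """Run an external proof checker on `proof`, treating a timeout as failure."""
    try:
        result = subprocess.run(
            cmd,
            input=proof.encode(),
            capture_output=True,
            timeout=timeout,  # mirrors the `timeout` eval parameter
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # A proof that cannot be checked within the budget counts as failed.
        return False
```

Raising the timeout trades throughput for fewer spurious failures on hard proofs, which is the tradeoff behind the comment above.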

```bash
++inference.top_p=0.95 \
++inference.tokens_to_generate=38912 \
--extra_eval_args="++eval_config.timeout=400" \
++prompt_config=lean4/formal-proof-deepseek-prover-v2
```
**Collaborator:** Is this the default? If not, we should probably make it the default.

**Collaborator:** Yes, it looks like that is the default `prompt_config` in the minif2f `__init__.py`.

@gwarmstrong (Collaborator, Author) left a comment


LGTM. I can't approve because I opened the PR, but I can help merge when ready.
