A retrieval system for Lean 4 theorem proving. The corpus is extracted from Mathlib with jixia and encoded with Qwen3-Embedding-8B; two modes sit on top of the same corpus:
- Standard mode — single-query embedding + Qwen3-Reranker-8B pipeline.
- Reasoning mode — iterative decompose / retrieve / filter / judge loop driven by an LLM, targeting global premise retrieval.
This repository contains the code, data, and reproduction scripts for the paper LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving (arXiv:2605.13137).
Standard mode is also hosted publicly at https://leansearch.net/:
curl -X POST https://leansearch.net/search -H 'Content-Type: application/json' \
-d '{"query": ["the order of a group element divides the order of the group"], "num_results": 10}'Full API reference at https://leansearch.net/docs. Note that the public
deployment uses Qwen3-Reranker-4B (a cost-driven variant); the paper's
Table 1 numbers come from the 8B reranker that ./scripts/serve.sh loads
locally.
config.yaml paths, models, retriever URL, LLM, reasoning, prove
config.example.yaml template for `config.local.yaml` overrides
benchmark/ MathlibQR.json, MathlibMPR.json, MathlibMPR_Prop_ids.txt,
FATE-H.jsonl + FATE-H.LICENSE, MathlibQR_shared171.json
src/leansearchv2/ library code (lightweight to import; GPU deps lazy)
├── config.py, llm.py, standard_client.py
├── pipeline.py, server.py standard mode (GPU)
├── corpus/ jixia → JSONL → cuVS pipeline
├── reasoning/ decompose → search → filter → judge loop
├── prove/ prover + verifier + simple reflection loop
└── eval/ metric implementations
scripts/ entry points
external/jixia/ git submodule (pinned to v4.28.0-rc1)
data/ gitignored: cuVS index, corpus workdir, etc.
git clone --recurse-submodules https://github.com/frenzymath/LeanSearch-v2
cd LeanSearch-v2
# Editable install (looser bounds from pyproject.toml):
pip install -e .
pip install -e ".[cuvs]" --extra-index-url https://pypi.nvidia.com # serving / build_cuvs
pip install -e ".[lean]" # prove task
# Or, for an exact reproduce of the environment used to produce the paper:
pip install -r requirements.txt --extra-index-url https://pypi.nvidia.comEdit config.yaml to point at the cuVS index, model checkpoints, retriever
URL, and LLM endpoint. Local overrides go in config.local.yaml (gitignored).
| Artifact | Source |
|---|---|
| Pre-embedded cuVS corpus (Mathlib v4.28.0-rc1, Qwen3-Embedding-8B) | https://huggingface.co/datasets/FrenzyMath/lsv2-mathlib-v4.28.0-rc1-cuvs |
| Same corpus as JSONL (one record per declaration) | https://huggingface.co/datasets/FrenzyMath/lsv2-mathlib-v4.28.0-rc1-jsonl |
benchmark/MathlibQR.json (200 declarations × 6 query styles = 946 rows) |
this repo |
benchmark/MathlibQR_shared171.json (171-decl shared subset → 810 fair query rows) |
this repo |
benchmark/MathlibMPR.json (69 theorems, premise-group ground truth) |
this repo |
benchmark/MathlibMPR_Prop_ids.txt (50-id Prop subset, ⊆ MathlibMPR) |
this repo |
benchmark/FATE-H.jsonl (100 problems) |
redistributed under CC BY 4.0; see benchmark/FATE-H.LICENSE and cite the FATE paper |
After downloading the cuVS dataset, extract it into data/cuvs/mathlib-v4.28.0-rc1/
(or set paths.cuvs_db elsewhere).
If you want to rebuild from Mathlib source instead of using the pre-built cuVS dataset, run the four-step pipeline:
# Prerequisites: Lean 4 toolchain (elan/lake) installed; a Mathlib checkout
# at `paths.mathlib4` with `lake exe cache get` already done; jixia built
# via `cd external/jixia && lake build`; LLM key exported (for step 1c).
./scripts/build_corpus.shSteps individually:
python -m leansearchv2.corpus.jixia_extract # Mathlib *.lean → raw JSON
python -m leansearchv2.corpus.merge_to_jsonl # raw → merged JSONL (carries refs)
python -m leansearchv2.corpus.informalize # dep-aware bottom-up informalizer
python -m leansearchv2.corpus.build_cuvs # JSONL → cuVS index (GPU)The informalizer mirrors §3.1 of the paper: declarations are filtered with
the is_internal heuristic (drops compiler-generated _proof_1 / match_*
style names), typeReferences (and valueReferences for non-Prop kinds)
are resolved against the corpus, the resulting DAG is topologically sorted
with Kahn's algorithm, and informalization runs level-by-level so each
prompt sees its dependencies' already-informalized descriptions. Kind-
specific Jinja templates (templates/theorem.md.j2, definition.md.j2,
instance.md.j2, technical_entry.md.j2) drive the prompts. Structural
kinds (inductive, classInductive, structure, class, constructor,
recursor) are informalized too so downstream prompts have context for
them, but only the user-facing kinds (theorem, definition, instance,
abbrev, opaque, axiom) end up in the output JSONL.
The released HuggingFace JSONL was generated by an internal PostgreSQL- backed variant of this pipeline with stricter resume / cost-tracking machinery; the algorithm is the same.
The embedding model and reranker each require a dedicated GPU, so at least 2 GPUs are needed. With paths.cuvs_db and model paths set in config.yaml:
./scripts/serve.shThis starts leansearchv2.server:app on 0.0.0.0:8000 (override with
HOST / PORT). Endpoints: POST /search, POST /search_with_profile,
GET /health. Body params for /search:
query: list[str], num_results: int = 10, rerank: bool = True,
retrieve_k: int | None = None.
Start ./scripts/serve.sh on a GPU pod (the canonical 8B reranker
configuration matches the paper's Table 1). Then:
python scripts/reproduce_search.py --url http://localhost:8000Outputs reports/search/{per_query.jsonl, summary.json}. The script runs
both retriever-only (rerank=False) and the full rerank pipeline against
the 946 rows of MathlibQR and prints the fair-subset (810 rows / 171
declarations) and full-subset (946 rows) summary on stdout, reproducing
the LSv2 (retriever-only) and LSv2 (rerank) rows of Table 1.
Reasoning mode talks to the standard-mode endpoint over HTTP. Start
./scripts/serve.sh, define one or more LLM profiles under llm.<name>
in config.yaml (or config.local.yaml) with the relevant API key env
var, route the sketch / filter / judge roles via reasoning.{sketch,filter,judge}_llm
(paper used sonnet for sketch/judge and kimi for filter), then:
python scripts/reproduce_premise.py --url http://localhost:8000Outputs reports/premise/{per_query.jsonl, summary.json} and prints
Recall@k(group) and Covered@k (k ∈ {5, 10, 20, 30, 50}) on stdout, matching
the LSv2 (reasoning) row of Table 2.
Same LLM-profile setup as reasoning mode; route the prover and reflect-time
query generator via prove.{prover,query}_llm in config.yaml (paper used
sonnet for both). Then:
python scripts/reproduce_prove.py --url http://localhost:8000 \
--retriever-modes standard,reasoning \
--benchmark allOutputs reports/prove/<benchmark>_<mode>/{per_problem.jsonl, summary.json}
and prints a #solved table for FATE-H (100) × {standard, reasoning} and
MathlibMPR-Prop (50) × {standard, reasoning}, matching the LSv2 standard
and LSv2 reasoning rows of Table 3.
Requires pip install -e ".[lean]" plus a Lean 4 toolchain (elan/lake)
reachable on PATH. The first verification call lazily downloads
the Mathlib cache used by lean-interact; expect a one-time delay.
@article{leansearchv2,
title = {LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving},
author = {Gao, Guoxiong and Sun, Zeming and Jiang, Jiedong and Wang, Yutong and
Xu, Jingda and Wu, Peihao and Dai, Bryan and Dong, Bin},
year = {2026},
eprint = {2605.13137},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}If you use FATE-H please also cite
@article{jiang2025fate,
title={Fate: A formal benchmark series for frontier algebra of multiple difficulty levels},
author={Jiang, Jiedong and He, Wanyi and Wang, Yuefeng and Gao, Guoxiong and Hu, Yongle and Wang, Jingting and Guan, Nailin and Wu, Peihao and Dai, Chunbo and Xiao, Liang and Dong, Bin},
journal={arXiv preprint arXiv:2511.02872},
year={2025}
}Apache 2.0. See LICENSE. benchmark/FATE-H.jsonl is redistributed under
CC BY 4.0; see benchmark/FATE-H.LICENSE. external/jixia is licensed
separately (Apache 2.0).