
Commit 0ad1fa8: Integrate LeanProgress
2 parents d4a9e1a + afe3a01

31 files changed: +789, -1,500 lines

README.md (267 additions, 23 deletions)

# LeanDojo-v2

LeanDojo-v2 is an end-to-end framework for training, evaluating, and deploying AI-assisted theorem provers for Lean 4. It combines repository tracing, lifelong dataset management, retrieval-augmented agents, Hugging Face fine-tuning, and external inference APIs into one toolkit.

---

## Table of Contents

1. [Overview](#overview)
2. [Key Features](#key-features)
3. [Repository Layout](#repository-layout)
4. [Requirements](#requirements)
5. [Installation](#installation)
6. [Environment Setup](#environment-setup)
7. [Quick Start](#quick-start)
8. [Working with Agents and Trainers](#working-with-agents-and-trainers)
9. [Tracing and Dataset Generation](#tracing-and-dataset-generation)
10. [External APIs and LeanCopilot](#external-apis-and-leancopilot)
11. [Testing](#testing)
12. [Troubleshooting & Tips](#troubleshooting--tips)
13. [Contributing](#contributing)
14. [License](#license)
---

## Overview

LeanDojo-v2 extends the original LeanDojo stack with the LeanAgent lifelong learning pipeline. It automates the entire loop of:

1. Cloning Lean repositories (GitHub or local) and tracing them with Lean instrumentation.
2. Storing structured theorem information in a dynamic database.
3. Training agent policies with supervised fine-tuning (SFT), GRPO-style RL, or retrieval objectives.
4. Driving Pantograph-based provers to fill in sorrys or verify solutions.
5. Using the Hugging Face API for large-model inference.

The codebase is modular: you can reuse the tracing pipeline without the agents, swap in custom trainers, or stand up your own inference service via the external API layer.

---

## Key Features

- **Unified Agent Abstractions**: `BaseAgent` orchestrates repository setup, training, and proving. Concrete implementations (`HFAgent`, `LeanAgent`, and `ExternalAgent`) tailor the workflow to Hugging Face models, retrieval-based provers, or REST-backed models.
- **Powerful Trainers**: `SFTTrainer`, `GRPOTrainer`, and `RetrievalTrainer` cover LoRA-enabled supervised fine-tuning, group-relative policy optimization, and retriever-only curriculum learning.
- **Multi-Modal Provers**: `HFProver`, `RetrievalProver`, and `ExternalProver` run on top of Pantograph’s Lean RPC server to search for tactics, generate whole proofs, or delegate to custom models.
- **Lean Tracing Pipeline**: `lean_dojo` includes the Lean 4 instrumentation (`ExtractData.lean`) and Python utilities to trace commits, normalize ASTs, and cache proof states.
- **Dynamic Repository Database**: `database` tracks repositories, theorems, curriculum difficulty, and sorry status, enabling lifelong training schedules.
- **External API**: The `external_api` folder exposes HTTP endpoints (FastAPI + uvicorn) and Lean frontend snippets so you can query LLMs from Lean editors.
---

## Repository Layout

| Path | Description |
|------|-------------|
| `lean_dojo_v2/agent/` | Base class plus `HFAgent`, `LeanAgent`, and helpers to manage repositories and provers. |
| `lean_dojo_v2/trainer/` | SFT, GRPO, and retrieval trainers with Hugging Face + DeepSpeed integration. |
| `lean_dojo_v2/prover/` | Pantograph-based prover implementations (HF, retrieval, external). |
| `lean_dojo_v2/lean_dojo/` | Lean tracing, dataset generation, caching, and AST utilities. |
| `lean_dojo_v2/lean_agent/` | Lifelong learning pipeline (configs, database, retrieval stack, generator). |
| `lean_dojo_v2/external_api/` | LeanCopilot code (Lean + Python server) to query external models. |
| `lean_dojo_v2/utils/` | Shared helpers for Git, filesystem operations, and constants. |
| `lean_dojo_v2/tests/` | Pytest regression suite. |

For deeper documentation on the lifelong learning component, see `lean_dojo_v2/lean_agent/README.md`.

---
## Requirements

- Python ≥ 3.11.
- CUDA-capable GPU for training and inference (tested with CUDA 12.6).
- Git ≥ 2.25 and `wget`.
- [elan](https://github.com/leanprover/elan) Lean toolchain to trace repositories locally.
- Adequate disk space for the `raid/` working directory (datasets, checkpoints, traces).

Python dependencies are declared in `pyproject.toml` and include PyTorch, PyTorch Lightning, Transformers, DeepSpeed, TRL, PEFT, and more.

---
## Installation

### Option 1: From PyPI

```sh
# Install the core package
pip install lean-dojo-v2

# Pantograph is required for Lean RPC
pip install git+https://github.com/stanford-centaur/PyPantograph

# Install a CUDA-enabled torch build (adjust the index URL for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
```

### Option 2: From Source (development)

```sh
git clone https://github.com/lean-dojo/LeanDojo-v2.git
cd LeanDojo-v2
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .[dev]
pip install git+https://github.com/stanford-centaur/PyPantograph
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
```

> Tip: You can use [uv](https://github.com/astral-sh/uv) (`uv pip install lean-dojo-v2`) as an alternative Python package manager.

---
## Environment Setup

1. **GitHub Access Token (required)**
   The tracing pipeline calls the GitHub API extensively. Create a personal access token and export it before running any agent:
   ```sh
   export GITHUB_ACCESS_TOKEN=<your-token>
   ```

2. **Hugging Face Token (optional but needed for gated models)**
   ```sh
   export HF_TOKEN=<your-hf-token>
   ```

3. **Working directories**
   By default all datasets, caches, and checkpoints live under `<repo>/raid`. Change the layout by editing `lean_dojo_v2/utils/constants.py` or by pointing `RAID_DIR` to faster storage; see the sketch after this list.

4. **Lean toolchains**
   Ensure `elan` is configured and Lean 4 (e.g., `leanprover/lean4:nightly`) is available on your `$PATH`. The tracing scripts look under `~/.elan/toolchains/`.
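
To double-check where artifacts will land, you can print the configured paths. `RAID_DIR` and `DATA_DIR` below are the same constants that `BaseAgent` imports from `lean_dojo_v2.utils.constants`; this is just a sanity-check sketch, not part of the normal workflow.

```python
# Quick sanity check of the working-directory layout used by the agents.
from lean_dojo_v2.utils.constants import DATA_DIR, RAID_DIR

print(f"RAID_DIR: {RAID_DIR}")  # datasets, caches, and checkpoints live here
print(f"DATA_DIR: {DATA_DIR}")
```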
---

## Quick Start

```python
from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer

url = "https://github.com/durant42040/lean4-example"
commit = "005de00d03f1aaa32cb2923d5e3cbaf0b954a192"

trainer = SFTTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    # ... additional trainer arguments not shown in this diff ...
)

agent = HFAgent(trainer=trainer)
agent.setup_github_repository(url=url, commit=commit)
agent.train()
agent.prove()
```

This example:

1. Downloads and traces the target Lean repository + commit.
2. Builds a supervised dataset from sorry theorems.
3. Fine-tunes the specified Hugging Face model (optionally with LoRA).
4. Launches an `HFProver` backed by Pantograph to search for proofs.

---
## Working with Agents and Trainers

### Supervised Fine-Tuning (`SFTTrainer`)

- Accepts any Hugging Face causal LM identifier.
- Supports LoRA by passing a `peft.LoraConfig`.
- Key arguments: `epochs_per_repo`, `batch_size`, `max_seq_len`, `lr`, `warmup_steps`, `gradient_checkpointing`.
- Produces checkpoints under `output_dir` that the `HFProver` consumes.
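
For example, a LoRA-enabled run could be configured roughly as in the sketch below. The keyword names mirror the bullets above and the Quick Start snippet, but the exact `SFTTrainer` signature (in particular how the `peft.LoraConfig` is passed, and the LoRA target modules for this model) is an assumption; check `lean_dojo_v2/trainer/sft_trainer.py` for the authoritative list.

```python
# Hedged sketch: keyword names follow the bullets above; the LoRA keyword and
# target_modules are assumptions, not the documented SFTTrainer API.
from peft import LoraConfig

from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

trainer = SFTTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    output_dir="raid/checkpoints/sft",  # consumed later by HFProver
    epochs_per_repo=1,
    batch_size=2,
    max_seq_len=2048,
    lr=2e-5,
    warmup_steps=100,
    gradient_checkpointing=True,
    lora_config=lora_config,  # assumed keyword for the peft.LoraConfig
)

agent = HFAgent(trainer=trainer)
agent.setup_github_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="005de00d03f1aaa32cb2923d5e3cbaf0b954a192",
)
agent.train()
agent.prove()
```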
### GRPO Trainer (`GRPOTrainer`)

- Implements Group Relative Policy Optimization for reinforcement-style refinement.
- Accepts `reference_model`, `reward_weights`, and `kl_beta` settings.
- Useful for improving search policies on curated theorem batches.
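
A minimal sketch, loosely following `examples/grpo.py`: the `reward_func` and the `GRPOTrainer(...)` call come from that example, while the keyword arguments (`model_name`, `reward_funcs`, `reference_model`, `reward_weights`, `kl_beta`) and the module path are assumptions based on the bullets above; verify them against the trainer source in `lean_dojo_v2/trainer/`.

```python
# Sketch only: keyword names and the module path are assumptions to verify
# against the actual GRPOTrainer implementation.
import torch

from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.grpo_trainer import GRPOTrainer  # assumed module path


def reward_func(completions, **kwargs):
    # Constant placeholder reward, as in examples/grpo.py; replace with a real scorer.
    return torch.tensor([1.0] * len(completions))


trainer = GRPOTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    reward_funcs=[reward_func],  # assumed keyword for registering reward functions
    reference_model="deepseek-ai/DeepSeek-Prover-V2-7B",  # frozen policy for the KL term
    reward_weights=[1.0],  # one weight per reward function
    kl_beta=0.04,  # strength of the KL penalty
)

agent = HFAgent(trainer=trainer)
agent.setup_github_repository(
    url="https://github.com/durant42040/lean4-example",
    commit="b14fef0ceca29a65bc3122bf730406b33c7effe5",
)
agent.train()
```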
### Retrieval Trainer & LeanAgent

- `RetrievalTrainer` trains the dense retriever that scores prior proofs.
- `LeanAgent` wraps the trainer, maintains repository curricula, and couples it with `RetrievalProver`.

Each agent inherits from `BaseAgent`, so you can implement your own by overriding `_get_build_deps()` and `_setup_prover()` to register new trainer/prover pairs.
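
For illustration, a custom agent might look like the sketch below. The two hook names come from the sentence above; their return conventions and the `ExternalProver` constructor arguments are assumptions to verify against `lean_dojo_v2/agent/base_agent.py` and `lean_dojo_v2/prover/`.

```python
# Hypothetical BaseAgent subclass; the hook contracts (bool return, returning a
# prover instance) and the ExternalProver arguments are assumptions, not the
# documented API.
from lean_dojo_v2.agent.base_agent import BaseAgent
from lean_dojo_v2.prover import ExternalProver


class RestBackedAgent(BaseAgent):
    def _get_build_deps(self) -> bool:
        # Whether traced repositories should also build their Lean dependencies.
        return False

    def _setup_prover(self) -> ExternalProver:
        # Pair this agent with a prover that delegates to a REST-backed model.
        return ExternalProver(base_url="http://localhost:23337")
```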
---

## Tracing and Dataset Generation

The `lean_dojo_v2/lean_dojo/data_extraction` package powers repository tracing:

- `lean.py` clones repositories (GitHub, remote, or local), validates Lean versions, and normalizes URLs.
- `trace.py` drives Lean with the custom `ExtractData.lean` instrumented module to capture theorem states.
- `dataset.py` converts traced files to JSONL datasets ready for trainers.
- `cache.py` memoizes repository metadata to avoid redundant downloads.
- `traced_data.py` exposes typed wrappers for traced AST nodes and sorrys.

Typical usage:

```python
from lean_dojo_v2.database import DynamicDatabase

url = "https://github.com/durant42040/lean4-example"
commit = "005de00d03f1aaa32cb2923d5e3cbaf0b954a192"

database = DynamicDatabase()

database.setup_github_repository(
    url=url,
    commit=commit,
    build_deps=False,
)
```

The generated artifacts flow into the `DynamicDatabase`, which keeps repositories sorted by difficulty and appends new sorrys without retracing everything.

---
## External APIs and LeanCopilot

`lean_dojo_v2/external_api` contains Lean and Python code to expose models through LeanCopilot:

- `LeanCopilot.lean` registers RPC endpoints inside Lean.
- `python/server.py` hosts a FastAPI service with adapters for Anthropic, OpenAI, Google Generative AI, vLLM, and custom HF models.
- Start the service with:
  ```sh
  cd lean_dojo_v2/external_api/python
  pip install -r requirements.txt
  uvicorn server:app --port 23337
  ```
- Point your Lean client to the running server to interactively request tactics, proofs, or completions from external models.

### LeanProgress Step-Prediction Workflow

- Generate a JSONL dataset with remaining-step targets (or replace it with your own LeanProgress export):
  ```sh
  python -m lean_dojo_v2.lean_progress.create_sample_dataset --output raid/data/sample_leanprogress_dataset.jsonl
  ```
- Fine-tune a regression head that predicts `steps_remaining`:
  ```sh
  python -m lean_dojo_v2.lean_progress.train_steps_model \
    --dataset raid/data/sample_leanprogress_dataset.jsonl \
    --output-dir raid/checkpoints/leanprogress_steps \
    --model-name bert-base-uncased
  ```
- Tell the LeanCopilot server where to find the checkpoint by exporting:
  ```sh
  export LEANPROGRESS_MODEL=raid/checkpoints/leanprogress_steps
  uvicorn server:app --port 23337
  ```
- Add `use_reward=true` when calling `/generate`. Each output then includes `steps_remaining` and a reward value (currently `-steps_remaining`) so agents can minimize proof length; a client-side sketch follows this list.
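
A hedged client-side sketch is shown below. The `/generate` path, the port, `use_reward`, and `steps_remaining` come from the text above; everything else (sending `use_reward` as a JSON field rather than a query parameter, and the `prompt`, `outputs`, `reward`, and `text` field names) is an assumption about `server.py` to confirm before relying on it.

```python
# Hedged sketch of querying the LeanCopilot FastAPI server with reward scoring
# enabled. Field names other than use_reward and steps_remaining are assumptions.
import requests

payload = {
    "prompt": "theorem add_comm (a b : Nat) : a + b = b + a := by",  # assumed field name
    "use_reward": True,  # may instead be a query parameter, depending on server.py
}

response = requests.post("http://localhost:23337/generate", json=payload, timeout=60)
response.raise_for_status()

for output in response.json().get("outputs", []):  # "outputs" is an assumed key
    # Reward is currently -steps_remaining, so higher means fewer steps left.
    print(output.get("steps_remaining"), output.get("reward"), output.get("text"))
```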
---

## Testing

We use `pytest` for regression coverage.

```sh
pip install -e .[dev]       # make sure dev extras like pytest/trl are present
export GITHUB_ACCESS_TOKEN=<token>
export HF_TOKEN=<hf-token>  # only required for tests touching HF APIs
pytest -v
```

---

## Troubleshooting & Tips

- **401 Bad Credentials / rate limits**: Ensure `GITHUB_ACCESS_TOKEN` is exported and has `repo` + `read:org` scopes.
- **Lean tracing failures**: Confirm that the repo’s Lean version exists locally (`elan toolchain install <version>`).
- **Missing CUDA libraries**: Install the PyTorch wheel that matches your driver and CUDA version.
- **Dataset location**: The default `raid/` directory can grow large. Point it to high-throughput storage or use symlinks.
- **Pantograph errors**: Reinstall Pantograph from source (`pip install git+https://github.com/stanford-centaur/PyPantograph`) whenever Lean upstream changes.

---

## Contributing

Issues and pull requests are welcome! Please:

1. Open an issue describing the bug or feature.
2. Run formatters (`black`, `isort`) and `pytest` before submitting.
3. Mention if your change touches Lean tracing files so reviewers can re-generate artifacts.

---

## License

LeanDojo-v2 is released under the MIT License. See `LICENSE` for details.

examples/grpo.py (1 addition, 1 deletion)

```diff
@@ -12,7 +12,7 @@ def reward_func(completions, **kwargs):
     return torch.tensor([1.0] * len(completions))


-url = "https://github.com/durant42040/lean4-example",
+url = ("https://github.com/durant42040/lean4-example",)
 commit = "b14fef0ceca29a65bc3122bf730406b33c7effe5"

 trainer = GRPOTrainer(
```

lean_dojo_v2/__init__.py (0 additions, 15 deletions)

```diff
@@ -1,17 +1,2 @@
-
 __version__ = "1.0.0"
 __author__ = "LeanDojo-v2 Contributors"
-
-# Import main components for easy access
-from .agent import BaseAgent, HFAgent, LeanAgent
-from .prover import BaseProver, ExternalProver, HFProver, RetrievalProver
-
-__all__ = [
-    "BaseAgent",
-    "HFAgent",
-    "LeanAgent",
-    "BaseProver",
-    "HFProver",
-    "RetrievalProver",
-    "ExternalProver",
-]
```

lean_dojo_v2/agent/base_agent.py (1 addition, 1 deletion)

```diff
@@ -5,7 +5,7 @@
 from loguru import logger
 from pantograph import Server

-from lean_dojo_v2.lean_agent.database.dynamic_database import DynamicDatabase
+from lean_dojo_v2.database.dynamic_database import DynamicDatabase
 from lean_dojo_v2.lean_dojo.data_extraction.trace import get_traced_repo_path
 from lean_dojo_v2.utils.constants import DATA_DIR, RAID_DIR
```
File renamed without changes.

lean_dojo_v2/lean_agent/database/dynamic_database.py renamed to lean_dojo_v2/database/dynamic_database.py (1 addition, 2 deletions)

```diff
@@ -31,7 +31,6 @@
     search_github_repositories,
 )
 from lean_dojo_v2.utils.lean import get_lean4_version_from_config
-from lean_dojo_v2.utils.repository import save_sorted_repos

 from .models import Repository, Theorem

@@ -482,7 +481,7 @@ def trace_repository(
         if (
             total_theorems < 3 * BATCH_SIZE
         ): # Should be enough theorems for train/val/test
-            logger.info(f"Not enough theorems found in {url}")
+            logger.info(f"Not enough theorems found in {repo.url}")
             return None

         config = repo.get_config("lean-toolchain")
```
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
