You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
LeanDojo-v2 is an end-to-end framework for training, evaluating, and deploying AI-assisted theorem provers for Lean 4. It combines repository tracing, lifelong dataset management, retrieval-augmented agents, Hugging Face fine-tuning, and external inference APIs into one toolkit.
4
+
5
+
---
6
+
7
+
## Table of Contents
8
+
9
+
1.[Overview](#overview)
10
+
2.[Key Features](#key-features)
11
+
3.[Repository Layout](#repository-layout)
12
+
4.[Requirements](#requirements)
13
+
5.[Installation](#installation)
14
+
6.[Environment Setup](#environment-setup)
15
+
7.[Quick Start](#quickstart)
16
+
8.[Working with Agents and Trainers](#working-with-agents-and-trainers)
17
+
9.[Tracing and Dataset Generation](#tracing-and-dataset-generation)
18
+
10.[External APIs and LeanCopilot](#external-apis-and-leancopilot)
LeanDojo-v2 extends the original LeanDojo stack with the LeanAgent lifelong learning pipeline. It automates the entire loop of:
29
+
30
+
1. Cloning Lean repositories (GitHub or local) and tracing them with Lean instrumentation.
31
+
2. Storing structured theorem information in a dynamic database.
32
+
3. Training agent policies with supervised fine-tuning (SFT), GRPO-style RL, or retrieval objectives.
33
+
4. Driving Pantograph-based provers to fill in sorrys or verify solutions.
34
+
5. Using HuggingFace API for large model inference.
35
+
36
+
The codebase is modular: you can reuse the tracing pipeline without the agents, swap in custom trainers, or stand up your own inference service via the external API layer.
37
+
38
+
---
39
+
40
+
## Key Features
41
+
42
+
-**Unified Agent Abstractions**: `BaseAgent` orchestrates repository setup, training, and proving. Concrete implementations (`HFAgent`, `LeanAgent`, and `ExternalAgent`) tailor the workflow to Hugging Face models, retrieval-based provers, or REST-backed models.
43
+
-**Powerful Trainers**: `SFTTrainer`, `GRPOTrainer`, and `RetrievalTrainer` cover LoRA-enabled supervised fine-tuning, group-relative policy optimization, and retriever-only curriculum learning.
44
+
-**Multi-Modal Provers**: `HFProver`, `RetrievalProver`, and `ExternalProver` run on top of Pantograph’s Lean RPC server to search for tactics, generate whole proofs, or delegate to custom models.
45
+
-**Lean Tracing Pipeline**: `lean_dojo` includes the Lean 4 instrumentation (`ExtractData.lean`) and Python utilities to trace commits, normalize ASTs, and cache proof states.
46
+
-**Dynamic Repository Database**: `database` tracks repositories, theorems, curriculum difficulty, and sorry status, enabling lifelong training schedules.
47
+
-**External API**: The `external_api` folder exposes HTTP endpoints (FastAPI + uvicorn) and Lean frontend snippets so you can query LLMs from Lean editors.
48
+
49
+
---
50
+
51
+
## Repository Layout
52
+
53
+
| Path | Description |
54
+
|------|-------------|
55
+
|`lean_dojo_v2/agent/`| Base class plus `HFAgent`, `LeanAgent`, and helpers to manage repositories and provers. |
56
+
|`lean_dojo_v2/trainer/`| SFT, GRPO, and retrieval trainers with Hugging Face + DeepSpeed integration. |
> Tip: You can use [uv](https://github.com/astral-sh/uv) (`uv pip install lean-dojo-v2`) as an alternative Python package manager.
109
+
110
+
---
111
+
112
+
## Environment Setup
113
+
114
+
1.**GitHub Access Token (required)**
115
+
The tracing pipeline calls the GitHub API extensively. Create a personal access token and export it before running any agent:
116
+
```sh
117
+
export GITHUB_ACCESS_TOKEN=<your-token>
118
+
```
119
+
120
+
2.**Hugging Face Token (optional but needed for gated models)**
121
+
```sh
122
+
export HF_TOKEN=<your-hf-token>
123
+
```
124
+
125
+
3.**Working directories**
126
+
By default all datasets, caches, and checkpoints live under `<repo>/raid`. Change the layout by editing `lean_dojo_v2/utils/constants.py` or by pointing `RAID_DIR` to faster storage.
127
+
128
+
4.**Lean toolchains**
129
+
Ensure `elan` is configured and Lean 4 (e.g., `leanprover/lean4:nightly`) is available on your `$PATH`. The tracing scripts look under `~/.elan/toolchains/`.
130
+
131
+
---
132
+
133
+
## Quick Start
134
+
135
+
```python
28
136
from lean_dojo_v2.agent.hf_agent import HFAgent
29
137
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer
- Produces checkpoints under `output_dir` that the `HFProver` consumes.
173
+
174
+
### GRPO Trainer (`GRPOTrainer`)
175
+
176
+
- Implements Group Relative Policy Optimization for reinforcement-style refinement.
177
+
- Accepts `reference_model`, `reward_weights`, and `kl_beta` settings.
178
+
- Useful for improving search policies on curated theorem batches.
179
+
180
+
### Retrieval Trainer & LeanAgent
181
+
182
+
-`RetrievalTrainer` trains the dense retriever that scores prior proofs.
183
+
-`LeanAgent` wraps the trainer, maintains repository curricula, and couples it with `RetrievalProver`.
184
+
185
+
Each agent inherits `BaseAgent`, so you can implement your own by overriding `_get_build_deps()` and `_setup_prover()` to register new trainer/prover pairs.
186
+
187
+
---
188
+
189
+
## Tracing and Dataset Generation
190
+
191
+
The `lean_dojo_v2/lean_dojo/data_extraction` package powers repository tracing:
192
+
193
+
-`lean.py` clones repositories (GitHub, remote, or local), validates Lean versions, and normalizes URLs.
194
+
-`trace.py` drives Lean with the custom `ExtractData.lean` instrumented module to capture theorem states.
195
+
-`dataset.py` converts traced files to JSONL datasets ready for trainers.
196
+
-`cache.py` memoizes repository metadata to avoid redundant downloads.
197
+
-`traced_data.py` exposes typed wrappers for traced AST nodes and sorrys.
The generated artifacts flow into the `DynamicDatabase`, which keeps repositories sorted by difficulty and appends new sorrys without retracing everything.
217
+
218
+
---
219
+
220
+
## External APIs and LeanCopilot
221
+
222
+
`lean_dojo_v2/external_api` contains Lean and Python code to expose models through LeanCopilot:
- Add `use_reward=true` when calling `/generate`. Each output now includes `steps_remaining` and a reward value (currently `-steps_remaining`) so agents can minimize proof length.
253
+
254
+
---
255
+
256
+
## Testing
257
+
258
+
We use `pytest` for regression coverage.
259
+
260
+
```sh
261
+
pip install -e .[dev] # make sure dev extras like pytest/trl are present
262
+
export GITHUB_ACCESS_TOKEN=<token>
263
+
export HF_TOKEN=<hf-token># only required for tests touching HF APIs
264
+
pytest -v
47
265
```
266
+
267
+
---
268
+
269
+
## Troubleshooting & Tips
270
+
271
+
-**401 Bad Credentials / rate limits**: Ensure `GITHUB_ACCESS_TOKEN` is exported and has `repo` + `read:org` scopes.
272
+
-**Lean tracing failures**: Confirm that the repo’s Lean version exists locally (`elan toolchain install <version>`).
273
+
-**Missing CUDA libraries**: Install the PyTorch wheel that matches your driver and CUDA version.
274
+
-**Dataset location**: The default `raid/` directory can grow large. Point it to high-throughput storage or use symlinks.
0 commit comments