Commit a10b541

committed
Add tabpfn
1 parent 4ee3d58 commit a10b541

15 files changed

Lines changed: 3623 additions & 1017 deletions

File tree

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -162,4 +162,7 @@ cython_debug/
162162
#.idea/
163163

164164
outputs
165-
wandb
165+
wandb
166+
__pycache__/
167+
*.pth
168+
*.h5

README.md

Lines changed: 92 additions & 284 deletions
@@ -1,297 +1,105 @@
1-
# 🚀 ML Project Template
1+
# nanoTabPFN
22

3-
A modern template for machine learning experimentation using **wandb**, **hydra-zen**, and **submitit** on a Slurm cluster with Docker/Apptainer containerization.
3+
Train your own small TabPFN in less than 500 LOC and a few minutes.
44

5-
> **Note**: This template is optimized for the ML Group cluster setup but can be easily adapted to similar environments.
5+
The purpose of this repository is to be a good starting point for students and researchers that are interested in learning about how TabPFN works under the hood.
66

7-
<div align="center">
8-
9-
[![Python 3.12](https://img.shields.io/badge/Python-3.12-blue.svg)](https://www.python.org/downloads/release/python-3120/)
10-
[![Docker](https://img.shields.io/badge/Docker-Container-blue.svg)](https://www.docker.com/)
11-
[![WandB](https://img.shields.io/badge/WandB-Logging-yellow.svg)](https://wandb.ai)
12-
[![Hydra Zen](https://img.shields.io/badge/Hydra%20Zen-Config-green.svg)](https://github.com/mit-ll-responsible-ai/hydra-zen)
13-
[![Submitit](https://img.shields.io/badge/Submitit-Jobs-orange.svg)](https://github.com/facebookincubator/submitit)
14-
15-
</div>
16-
17-
## ✨ Key Features
18-
19-
- 📦 Python environment in Docker via [uv](https://docs.astral.sh/uv/)
20-
- 📊 Logging and visualizations via [Weights and Biases](https://wandb.com)
21-
- 🧩 Reproducibility and modular type-checked configs via [hydra-zen](https://github.com/mit-ll-responsible-ai/hydra-zen)
22-
- 🖥️ Submit Slurm jobs and parameter sweeps directly from Python via [submitit](https://github.com/facebookincubator/submitit)
23-
- 🔄 No `.def` or `.sh` files needed for Apptainer/Slurm
24-
25-
## 📋 Table of Contents
26-
27-
- [🔑 Container Registry Authentication](#-container-registry-authentication)
28-
- [🐳 Container Setup](#-container-setup)
29-
- [Option 1: Apptainer (Cluster)](#option-1-apptainer-cluster)
30-
- [Option 2: Docker (Local Machine)](#option-2-docker-local-machine)
31-
- [📦 Package Management](#-package-management)
32-
- [🛠️ Development Notes](#️-development-notes)
33-
- [🧪 Running Experiments](#-running-experiments)
34-
- [WandB Logging](#wandb-logging)
35-
- [Example Project](#example-project)
36-
- [Single Job](#single-job)
37-
- [Distributed Sweep](#distributed-sweep)
38-
- [👥 Contributions](#-contributions)
39-
- [🙏 Acknowledgements](#-acknowledgements)
40-
41-
## 🔑 Container Registry Authentication
42-
43-
### Generate Token
44-
45-
1. Create a new GitHub token at [Settings → Developer settings → Personal access tokens](https://github.com/settings/tokens) with:
46-
- `read:packages` permission
47-
- `write:packages` permission
48-
49-
### Log In
50-
51-
With Apptainer:
52-
```bash
53-
apptainer remote login --username <your GitHub username> docker://ghcr.io
7+
Clone the repository, then install the dependencies:
548
```
55-
56-
With Docker:
57-
```bash
58-
docker login ghcr.io -u <your GitHub username>
9+
pip install numpy torch schedulefree h5py scikit-learn openml seaborn
5910
```
6011

61-
When prompted, enter your token as the password.
62-
63-
## 🐳 Container Setup
64-
65-
Choose one of the following methods to set up your environment:
66-
67-
### Option 1: Apptainer (Cluster)
68-
69-
1. **Install VSCode Remote Tunnels Extension**
70-
71-
First, install the [Remote Tunnels](https://marketplace.visualstudio.com/items?itemName=ms-vscode.remote-server) extension in VSCode.
72-
73-
2. **Connect to compute resources**
74-
75-
For CPU resources:
76-
```bash
77-
srun --partition=cpu-2h --pty bash
78-
```
79-
80-
For GPU resources:
81-
```bash
82-
srun --partition=gpu-2h --gpus-per-task=1 --pty bash
83-
```
84-
85-
3. **Launch container**
86-
87-
To open a tunnel to connect your local VSCode to the container on the cluster:
88-
```bash
89-
apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:latest-sif code tunnel
90-
```
91-
92-
> 💡 You can specify a version tag (e.g., `v0.0.1`) instead of `latest`. Available versions are listed at [GitHub Container Registry](https://github.com/marvinsxtr/ml-project-template/pkgs/container/ml-project-template).
93-
94-
In VSCode press `Shift+Alt+P` (Windows/Linux) or `Shift+Cmd+P` (Mac), type "connect to tunnel", select GitHub and select your named node on the cluster. Your IDE is now connected to the cluster.
95-
96-
To open a shell in the container on the cluster:
97-
```bash
98-
apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:latest-sif /bin/bash
99-
```
100-
101-
> 💡 This may take a few minutes on the first run as the container image is downloaded.
102-
103-
### Option 2: Docker (Local Machine)
104-
105-
1. **Install VSCode Dev Containers Extension**
106-
107-
First, install the [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension in VSCode.
108-
109-
2. **Open the Repository in the Dev Container**
110-
111-
Click the `Reopen in Container` button in the pop-up that appears once you open the repository in VSCode.
112-
113-
Alternatively, open the command palette in VSCode by pressing `Shift+Alt+P` (Windows/Linux) or `Shift+Cmd+P` (Mac), and type `Dev Containers: Reopen in Container`.
114-
115-
### Using Slurm within Apptainer
116-
117-
In order to access Slurm with submitit from within the container, you first need to set up passwordless SSH to the login node.
118-
119-
On the cluster, create a new SSH key pair in case you don't have one yet
120-
121-
```bash
122-
ssh-keygen -t ed25519 -C "your_email@example.com"
123-
```
124-
125-
and add your public key to the `authorized_keys`:
126-
127-
```bash
128-
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
129-
```
130-
131-
You can verify that this works by running
132-
133-
```bash
134-
ssh $USER@$HOST exit
135-
```
136-
137-
which should return without any prompt.
138-
139-
## 📦 Package Management
140-
141-
1. **Update dependencies**
142-
143-
This project uses [uv](https://docs.astral.sh/uv/) for Python dependency management.
144-
145-
Inside the container (!):
146-
```bash
147-
# Add a specific package
148-
uv add <package-name>
149-
150-
# Update all dependencies from pyproject.toml
151-
uv sync
152-
```
153-
154-
2. **Commit changes** to the repository:
155-
156-
Use tags for versioning:
157-
158-
```bash
159-
git add pyproject.toml uv.lock
160-
git commit -m "Updated dependencies"
161-
git tag v0.0.1
162-
git push && git push --tags
163-
```
164-
165-
3. **Use the updated image**:
166-
167-
The GitHub Actions workflow automatically builds a new image when changes are pushed.
168-
169-
With Apptainer:
170-
```bash
171-
apptainer run --nv --writable-tmpfs oras://ghcr.io/marvinsxtr/ml-project-template:v0.0.1-sif /bin/bash
172-
```
12+
### Our Code
17313

174-
With Docker:
175-
```bash
176-
docker run -it --rm --platform=linux/amd64 ghcr.io/marvinsxtr/ml-project-template:v0.0.1 /bin/bash
177-
```
14+
- `model.py` contains the implementation of the architecture and a sklearn-like interface in less than 200 lines of code.
15+
- `train.py` implements a simple training loop and a prior-dump data loader in under 200 lines of code.
16+
- `experiment.ipynb` reproduces the experiment from the paper.
17817

179-
## 🛠️ Development Notes
18018

181-
### Building Locally for Testing
19+
### Pretrain your own nanoTabPFN
18220

183-
Test your Dockerfile locally before pushing:
21+
To pretrain your own nanoTabPFN, first download the [prior data dump](http://ml.informatik.uni-freiburg.de/research-artifacts/nanoTabPFN/300k_150x5_2.h5), then run `train.py`:
18422

18523
```bash
186-
docker buildx build -t ml-project-template .
24+
cd nanoTabPFN
25+
26+
# download data dump
27+
curl http://ml.informatik.uni-freiburg.de/research-artifacts/nanoTabPFN/300k_150x5_2.h5 --output 300k_150x5_2.h5
28+
29+
python train.py
30+
```
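For readers who prefer to stay in Python, the download step above can be sketched with the standard library. The helper name `fetch_prior_dump` is hypothetical and not part of the repository; it simply skips the download when the target path already exists.

```python
import os
import urllib.request

# URL copied from the curl command above.
PRIOR_URL = (
    "http://ml.informatik.uni-freiburg.de/research-artifacts/"
    "nanoTabPFN/300k_150x5_2.h5"
)


def fetch_prior_dump(url: str = PRIOR_URL, path: str = "300k_150x5_2.h5") -> str:
    """Download the prior dump to `path` unless something already exists there.

    Hypothetical helper for illustration only; not part of nanoTabPFN.
    """
    if not os.path.exists(path):
        # urlretrieve writes the response body to `path`.
        urllib.request.urlretrieve(url, path)
    return path
```

Calling `fetch_prior_dump()` once before `python train.py` would then have the same effect as the `curl` line.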
31+
32+
#### Step-by-step explanation
33+
34+
First, we import our code from `model.py` and `train.py`:
35+
```py
36+
from model import NanoTabPFNModel
37+
from model import NanoTabPFNClassifier
38+
from train import PriorDumpDataLoader
39+
from train import train, get_default_device
40+
```
41+
Then we instantiate our model:
42+
```py
43+
model = NanoTabPFNModel(
44+
    embedding_size=96,
45+
    num_attention_heads=4,
46+
    mlp_hidden_size=192,
47+
    num_layers=3,
48+
    num_outputs=2,
49+
)
50+
```
51+
and our data loader:
52+
```py
53+
prior = PriorDumpDataLoader(
54+
    "300k_150x5_2.h5",
55+
    num_steps=2500,
56+
    batch_size=32,
57+
)
58+
```
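As a rough sanity check on these settings: if each training step consumes one batch of synthetic datasets, the run sees `num_steps * batch_size` datasets in total. Both that reading of a "step" and the interpretation of `300k` in the file name as the number of datasets in the dump are assumptions here, not something the README states.

```python
# Values from the PriorDumpDataLoader call above.
num_steps = 2500
batch_size = 32

# Assumption: "300k" in the file name is the number of datasets in the dump.
assumed_dump_size = 300_000

# Assumption: every step draws one batch of `batch_size` datasets.
datasets_seen = num_steps * batch_size

print(datasets_seen)                       # 80000
print(datasets_seen <= assumed_dump_size)  # True
```

Under these assumptions the default run would use well under a third of the dump, so raising `num_steps` would not immediately require more data.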
59+
Now we can train our model:
60+
```py
61+
device = get_default_device()
62+
model, _ = train(
63+
    model,
64+
    prior,
65+
    lr=4e-3,
66+
    device=device,
67+
)
68+
```
69+
and finally we can instantiate our classifier:
70+
```py
71+
clf = NanoTabPFNClassifier(model, device)
72+
```
73+
and use its `.fit`, `.predict` and `.predict_proba`:
74+
```py
75+
from sklearn.datasets import load_breast_cancer
76+
from sklearn.metrics import roc_auc_score, accuracy_score
77+
from sklearn.model_selection import train_test_split
78+
79+
X, y = load_breast_cancer(return_X_y=True)
80+
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
81+
82+
clf.fit(X_train, y_train)
83+
prob = clf.predict_proba(X_test)
84+
pred = clf.predict(X_test)
85+
print('ROC AUC', roc_auc_score(y_test, prob[:, 1]))  # assumes predict_proba returns two columns; pass the positive-class one
86+
print('Accuracy', accuracy_score(y_test, pred))
87+
```
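For binary targets, scikit-learn's `roc_auc_score` expects one score per sample (the score of the class with the greater label). The sketch below illustrates what that number means using plain Python lists. It assumes a two-column `predict_proba` output in the usual sklearn layout; the helper names and the toy probabilities are made up for illustration and are not nanoTabPFN outputs.

```python
def positive_class_scores(proba):
    """Return the second column of a two-column probability table."""
    return [row[1] for row in proba]


def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy probabilities, invented for this example.
proba = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
y_true = [0, 1, 1, 0]

scores = positive_class_scores(proba)
print(pairwise_auc(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, which gives 0.75.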
88+
89+
### TFM-Playground
90+
91+
The nanoTabPFN repository is meant to stay small and simple. For a broader feature set, we created a second repository,
92+
the [TFM-Playground](https://github.com/automl/TFM-Playground/), which we are extending with many more features,
93+
such as regression, multiple prior interfaces, multiple architectures, ensembling over different preprocessing pipelines, and more.
94+
Check it out if you are interested!
95+
96+
### BibTeX Citation
97+
98+
```
99+
@article{pfefferle2025nanotabpfn,
100+
title={nanoTabPFN: A Lightweight and Educational Reimplementation of TabPFN},
101+
author={Pfefferle, Alexander and Hog, Johannes and Purucker, Lennart and Hutter, Frank},
102+
journal={arXiv preprint arXiv:2511.03634},
103+
year={2025}
104+
}
187105
```
188-
189-
Run the container directly with:
190-
191-
```bash
192-
docker run -it --rm --platform=linux/amd64 ml-project-template /bin/bash
193-
```
194-
195-
## 🧪 Running Experiments
196-
197-
### WandB Logging
198-
199-
Logging to WandB is optional for local jobs but mandatory for jobs submitted to the cluster.
200-
201-
Create a `.env` file in the root of the repository with:
202-
203-
```bash
204-
WANDB_API_KEY=your_api_key
205-
WANDB_ENTITY=your_entity
206-
WANDB_PROJECT=your_project_name
207-
```
208-
209-
### Example Project
210-
211-
The folder `example` contains an example project which can serve as a starting point for ML experimentation. Configuring a function
212-
```python
213-
from ml_project_template.utils import logger
214-
215-
def main(foo: int = 42, bar: int = 3) -> None:
216-
    """Run a main function from a config."""
217-
    logger.info(f"Hello World! cfg={cfg}, bar={bar}, foo={foo}")
218-
219-
if __name__ == "__main__":
220-
    main()
221-
```
222-
223-
is as easy as adding (1) a `Run` as the first argument, (2) importing the config stores and (3) wrapping the `main` function with `run`:
224-
225-
```python
226-
from ml_project_template.config import run
227-
from ml_project_template.runs import Run
228-
from ml_project_template.utils import logger
229-
230-
def main(cfg: Run, foo: int = 42, bar: int = 3) -> None:
231-
    """Run a main function from a config."""
232-
    logger.info(f"Hello World! cfg={cfg}, bar={bar}, foo={foo}")
233-
234-
if __name__ == "__main__":
235-
    from example import stores  # noqa: F401
236-
    run(main)
237-
```
238-
239-
You can try running this example with:
240-
241-
```bash
242-
python example/main.py
243-
```
244-
245-
Hydra will automatically generate a `config.yaml` in the `outputs/<date>/<time>/.hydra` folder which you can use to reproduce the same run later.
246-
247-
Try overriding the values passed to the `main` function and see how it changes the output (config):
248-
249-
```bash
250-
python example/main.py foo=123
251-
```
252-
253-
Reproduce the results of a previous run/config:
254-
255-
```bash
256-
python example/main.py -cp outputs/<date>/<time>/.hydra -cn config.yaml
257-
```
258-
259-
Enabling WandB logging:
260-
261-
```bash
262-
python example/main.py cfg/wandb=base
263-
```
264-
265-
Run WandB in offline mode:
266-
267-
```bash
268-
python example/main.py cfg/wandb=base cfg.wandb.mode=offline
269-
```
270-
271-
### Single Job
272-
273-
Run a job on the cluster:
274-
275-
```bash
276-
python example/main.py cfg/job=base
277-
```
278-
279-
This will automatically enable WandB logging. See `example/configs.py` to configure the job settings.
280-
281-
### Distributed Sweep
282-
283-
Run a parameter sweep over multiple seeds using multiple nodes:
284-
285-
```bash
286-
python example/main.py cfg/job=sweep
287-
```
288-
289-
This will automatically enable WandB logging. See `example/configs.py` to configure sweep parameters.
290-
291-
## 👥 Contributions
292-
293-
Contributions to this documentation and template are very welcome! Feel free to open a PR or reach out with suggestions.
294-
295-
## 🙏 Acknowledgements
296-
297-
This template is based on a [previous example project](https://github.com/mx-e/example_project_ml_cluster).
