marvinsxtr
diff --git a/README.md b/README.md
Lines changed: 113 additions & 68 deletions
diff --git a/assets/overview.png b/assets/overview.png
-4.76 KB
diff --git a/baselines/condot/condot_module.py b/baselines/condot/condot_module.py
Lines changed: 5 additions & 1 deletion
diff --git a/baselines/metafm/gnn.py b/baselines/metafm/gnn.py
Lines changed: 10 additions & 8 deletions
diff --git a/baselines/metafm/metafm_module.py b/baselines/metafm/metafm_module.py
Lines changed: 6 additions & 2 deletions
@@ -1,109 +1,154 @@
-# MapPFN: Learning Causal Perturbation Maps in Context
+<h1 align="center">MapPFN: Learning Causal Perturbation Maps in Context</h1>
 
-This repository contains the code, configurations, and data processing scripts to reproduce the experiments for **MapPFN**, a prior-data fitted network (PFN) that uses in-context learning to predict perturbation effects in unseen biological contexts.
+<p align="center">
+  <a href="https://arxiv.org/abs/2601.21092"><img src="https://img.shields.io/badge/arXiv-b31b1b?style=for-the-badge&logo=arxiv" alt="arXiv"/></a>
+  <a href="https://marvinsxtr.github.io/MapPFN"><img src="https://img.shields.io/badge/Project_Page-007ec6?style=for-the-badge&logo=htmx&logoColor=white" alt="Project Page"/></a>
+  <a href="https://huggingface.co/marvinsxtr/MapPFN"><img src="https://img.shields.io/badge/Models-f5a623?style=for-the-badge&logo=huggingface&logoColor=white" alt="Models"/></a>
+  <a href="https://huggingface.co/datasets/marvinsxtr/MapPFN"><img src="https://img.shields.io/badge/Datasets-f5a623?style=for-the-badge&logo=huggingface&logoColor=white" alt="Datasets"/></a>
+</p>
 
-![MapPFN Overview](assets/overview.png)
+**MapPFN** is a prior-data fitted network (PFN) that uses in-context learning to predict perturbation effects in unseen biological contexts.
+
+<div align="center">
+  <img src="assets/overview.png" width="80%">
+  <p><em><strong>MapPFN overview.</strong> During pre-training, synthetic causal models are drawn to generate observational and interventional distributions. MapPFN meta-learns to map between pre- and post-perturbation distributions across many causal structures. At inference, it predicts cell-level post-perturbation distributions in one forward pass through amortized inference.</em></p>
+</div>
 
 ## Abstract
 
-Planning effective interventions in biological systems requires treatment-effect models that adapt to unseen biological contexts by identifying their specific underlying mechanisms. Yet single-cell perturbation datasets span only a handful of biological contexts, and existing methods cannot leverage new interventional evidence at inference time to adapt beyond their training data. To meta-learn a perturbation effect estimator, we present MapPFN, a prior-data fitted network (PFN) pretrained on synthetic data generated from a prior over causal perturbations. Given a set of experiments, MapPFN uses in-context learning to predict post-perturbation distributions, without gradient-based optimization. Despite being pretrained on *in silico* gene knockouts alone, MapPFN identifies differentially expressed genes, matching the performance of models trained on real single-cell data.
+Planning effective interventions in biological systems requires treatment-effect models that adapt to unseen biological contexts by identifying their specific underlying mechanisms. Yet single-cell perturbation datasets span only a handful of biological contexts, and existing methods cannot leverage new interventional evidence at inference time to adapt beyond their training data. To meta-learn a perturbation effect estimator, we present MapPFN, a prior-data fitted network (PFN) pre-trained on synthetic data generated from a prior over causal perturbations. Given a set of experiments, MapPFN uses in-context learning to predict post-perturbation distributions. Pre-trained on *in silico* gene knockouts alone, MapPFN identifies differentially expressed genes on par with models trained on real single-cell data. Fine-tuned, it consistently outperforms all baselines across downstream datasets.
 
-## Table of Contents
+## Setup
 
-- [Setup](#setup)
-- [Repository Structure](#repository-structure)
-- [Usage](#usage)
-  - [Data Generation](#data-generation)
-  - [Training](#training)
-- [Dependencies](#dependencies)
+A Docker image and devcontainer configuration are provided with all dependencies:
 
-## Setup
+```bash
+docker run --rm -it --gpus all -v .:/srv/repo ghcr.io/marvinsxtr/mappfn:latest bash
+```
+
+<details>
+<summary>VSCode & Slurm</summary>
 
-A `Dockerfile` is provided for containerized environments. The image includes all dependencies and can be used with Docker or Apptainer on HPC clusters.
+Use the [Remote - Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension to open the devcontainer locally, or connect to a remote tunnel by replacing `bash` with `code tunnel`.
 
-Logging to WandB is optional for local jobs but mandatory for jobs submitted to the cluster. Create a `.env` file in the root of the repository with:
+This setup also works with Apptainer on Slurm clusters. See the [ml-project-template](https://github.com/marvinsxtr/ml-project-template) for instructions.
+
+</details>
+
+<details>
+<summary>WandB logging (optional)</summary>
+
+Create a `.env` file in the repository root:
 
 ```bash
 WANDB_API_KEY=your_api_key
 WANDB_ENTITY=your_entity
 WANDB_PROJECT=your_project_name
 ```
 
-## Repository Structure
+</details>
 
-```
-MapPFN/
-├── map_pfn/
-│   ├── configs/         # Hydra-zen configuration files
-│   ├── data/            # Dataset classes and data generation
-│   │   ├── linear_scm.py        # Linear SCM data generation
-│   │   ├── sergio_dataset.py    # SERGIO GRN simulation
-│   │   └── perturbation_dataset.py
-│   ├── models/          # Model architectures
-│   │   ├── map_pfn.py           # MapPFN model
-│   │   └── mmdit.py             # MMDiT architecture
-│   ├── eval/            # Evaluation metrics
-│   ├── loss/            # Loss functions (CFM)
-│   ├── scripts/         # Training and data generation scripts
-│   │   ├── train.py
-│   │   └── generate_data.py
-│   ├── train/           # Training utilities
-│   └── utils/           # Helper functions
-├── baselines/
-│   ├── condot/          # Conditional Optimal Transport baseline
-│   └── metafm/          # Meta Flow Matching baseline
-└── datasets/            # Generated datasets (gitignored)
-```
+## Data
 
-## Usage
+Download pre-trained [weights](https://huggingface.co/marvinsxtr/MapPFN) and [datasets](https://huggingface.co/datasets/marvinsxtr/MapPFN) from Hugging Face:
 
-### Data Generation
+```python
+from huggingface_hub import hf_hub_download
+
+hf_hub_download("marvinsxtr/MapPFN", "model.ckpt", local_dir="checkpoints", repo_type="model")
+hf_hub_download("marvinsxtr/MapPFN", "frangieh.h5ad", local_dir="datasets/single_cell", repo_type="dataset")
+hf_hub_download("marvinsxtr/MapPFN", "papalexi.h5ad", local_dir="datasets/single_cell", repo_type="dataset")
+hf_hub_download("marvinsxtr/MapPFN", "sergio.h5ad", local_dir="datasets/synthetic", repo_type="dataset")
+```
 
-Generate synthetic datasets from linear SCMs or biological priors:
+<details>
+<summary>Preprocessing & generation</summary>
 
+Preprocess single-cell datasets:
 ```bash
-# Generate linear SCM data
-python map_pfn/scripts/generate_data.py cfg=linear_scm
+python map_pfn/scripts/process_sc_data.py
+```
 
-# Generate SERGIO GRN data
-python map_pfn/scripts/generate_data.py cfg=sergio_grn
+Generate synthetic datasets:
+```bash
+python map_pfn/scripts/generate_data.py cfg=linear   # Linear SCMs
+python map_pfn/scripts/generate_data.py cfg=sergio    # Biological prior
 ```
 
-### Training
+</details>
+
+## Inference
+
+```python
+from map_pfn.eval.evaluate import load_model
+
+trainer, module, datamodule = load_model(
+    method="map_pfn",
+    checkpoint_path="checkpoints/model.ckpt",
+    dataset_path="datasets/single_cell/frangieh.h5ad",
+)
+preds = trainer.predict(module, datamodule=datamodule)
+```
 
-Train MapPFN or baselines using the provided configurations:
+## Fine-tuning
 
+Fine-tune from a pre-trained checkpoint:
 ```bash
-# Train MapPFN on linear SCMs
-python map_pfn/scripts/train.py cfg=map_pfn_scm
-````
+python map_pfn/scripts/train.py \
+    cfg=map_pfn_rna \
+    cfg/datamodule=frangieh_finetune \
+    cfg.load_checkpoint=checkpoints/model.ckpt \
+    cfg.trainer.val_check_interval=500 \
+    cfg.trainer.callbacks.2.max_steps=3000 \
+    cfg/wandb=base
+```
 
-Available model configs: `map_pfn_scm`, `map_pfn_rna`, `condot_scm`, `condot_rna`, `metafm_scm`, `metafm_rna`
+## Pre-training
 
+Train MapPFN from scratch:
 ```bash
-# Run distributed sweep on Slurm
-python map_pfn/scripts/train.py cfg/job=methods_scm
+python map_pfn/scripts/train.py cfg=map_pfn_rna
 ```
 
-Available sweep configs: `methods_scm`, `methods_sergio`, `map_pfn_scm`, `map_pfn_sergio`
+## Configuration
 
-See [map_pfn/configs/train/config_stores.py](map_pfn/configs/train/config_stores.py) for all available configurations. This project uses [hydra-zen](https://github.com/mit-ll-responsible-ai/hydra-zen) for configuration management. Override parameters via command line:
+This project uses [hydra-zen](https://github.com/mit-ll-responsible-ai/hydra-zen) for configuration. Display all available options:
 
 ```bash
-python map_pfn/scripts/train.py cfg=map_pfn_scm cfg.datamodule.batch_size=64
+python map_pfn/scripts/train.py --help
+python map_pfn/scripts/generate_data.py --help
 ```
 
-## Dependencies
+## Repository Structure
+
+```
+MapPFN/
+├── map_pfn/
+│   ├── configs/         # Hydra-zen configuration
+│   ├── data/            # Datasets and data generation
+│   ├── models/          # MapPFN and MMDiT architecture
+│   ├── eval/            # Evaluation metrics
+│   ├── loss/            # Loss functions (CFM)
+│   ├── scripts/         # Training and data generation
+│   ├── train/           # Training utilities
+│   └── utils/           # Helpers
+├── baselines/
+│   ├── condot/          # Conditional Optimal Transport
+│   └── metafm/          # Meta Flow Matching
+└── datasets/            # Generated datasets (gitignored)
+```
+
+## Citation
+
+```bibtex
+@article{sextro2026mappfn,
+  title   = {{MapPFN}: Learning Causal Perturbation Maps in Context},
+  author  = {Sextro, Marvin and K\l{}os, Weronika and Dernbach, Gabriel},
+  journal = {arXiv preprint arXiv:2601.21092},
+  year    = {2026}
+}
+```
 
-This project builds on the following open-source libraries:
+## Contributing
 
-- [JAX](https://github.com/google/jax) - High-performance numerical computing
-- [Equinox](https://github.com/patrick-kidger/equinox) - Neural networks in JAX
-- [Hydra-zen](https://github.com/mit-ll-responsible-ai/hydra-zen) - Configuration management
-- [Diffrax](https://github.com/patrick-kidger/diffrax) - Differential equation solvers in JAX
-- [OTT-JAX](https://github.com/ott-jax/ott) - Optimal transport tools
-- [AnnData](https://github.com/scverse/anndata) - Annotated data for single-cell analysis
-- [Scanpy](https://github.com/scverse/scanpy) - Single-cell analysis in Python
-- [Pertpy](https://github.com/theislab/pertpy) - Perturbation analysis tools
-- [sergio_rs](https://github.com/rainx0r/sergio_rs) - Single-cell expression simulator
-- [grn-paper](https://github.com/maguirre1/grn-paper) - Gene regulatory network sampling
+If you have any feedback, questions, or ideas, please [open an issue](https://github.com/marvinsxtr/MapPFN/issues) or reach out via [email](mailto:m.kleine.sextro@tu-berlin.de).
@@ -327,4 +327,8 @@ def test_step(self, batch, batch_idx) -> dict[str, np.ndarray]:
             BatchKeys.TREATMENT: batch[BatchKeys.TREATMENT].detach().cpu().numpy().squeeze(1),
             BatchKeys.CONTEXT_ID: np.asarray(batch[BatchKeys.CONTEXT_ID]),
             BatchKeys.TREATMENT_ID: np.asarray(batch[BatchKeys.TREATMENT_ID]),
-        }
+        }
+
+    def predict_step(self, batch, batch_idx):
+        """Transport samples and return predictions with metadata."""
+        return self.test_step(batch, batch_idx)
@@ -56,8 +56,9 @@ def __post_init__(self):
                 input_size, self.D, self.num_hidden_decoder, self.num_layers_decoder
             )
 
-            self.temporal_freqs = (
-                torch.arange(1, self.num_temporal_freqs + 1, device="cuda") * torch.pi
+            self.register_buffer(
+                "temporal_freqs",
+                torch.arange(1, self.num_temporal_freqs + 1) * torch.pi,
             )
         else:
             input_size = (
@@ -77,13 +78,14 @@ def __post_init__(self):
                     input_size, self.D, self.num_hidden_decoder, self.num_layers_decoder
                 )
 
-            self.temporal_freqs = (
-                torch.arange(1, self.num_temporal_freqs + 1, device="cuda") * torch.pi
+            self.register_buffer(
+                "temporal_freqs",
+                torch.arange(1, self.num_temporal_freqs + 1) * torch.pi,
             )
-            
-        self.B = (
-            torch.randn((self.D, self.num_spatial_samples), device="cuda")
-            * self.spatial_feat_scale
+
+        self.register_buffer(
+            "B",
+            torch.randn((self.D, self.num_spatial_samples)) * self.spatial_feat_scale,
         )
 
     def embed_source(self, source_samples, cond=None):        
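
Note on the hunk above: the removed lines created `temporal_freqs` and `B` as plain tensor attributes pinned to `device="cuda"`, which fails on machines without CUDA and leaves the tensors out of `state_dict()`. The replacement registers them as buffers. A minimal sketch of that pattern, assuming a hypothetical `FourierFeatures` module (the class and the `freqs` name are illustrative and do not appear in the repository):

```python
import math

import torch
from torch import nn


class FourierFeatures(nn.Module):
    """Hypothetical stand-in for the frequency embedding in gnn.py."""

    def __init__(self, num_freqs: int = 4):
        super().__init__()
        # A registered buffer follows the module through .to()/.cpu()/.cuda()
        # and is saved in state_dict(), but it is not a trainable parameter.
        self.register_buffer("freqs", torch.arange(1, num_freqs + 1) * math.pi)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t has shape (batch, 1); broadcasting yields (batch, num_freqs).
        return torch.cos(t * self.freqs)


module = FourierFeatures()
out = module(torch.zeros(3, 1))  # runs on CPU, no device argument needed
```

One consequence worth checking: checkpoints saved before this change would not contain the new buffer keys, so a strict `load_state_dict` on an old checkpoint may report them as missing.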
 
@@ -31,7 +31,7 @@ def __init__(
         num_treat_conditions: int = None,
         num_cell_conditions: int = None,
         base: str = "source",
-        integrate_time_steps: int = 500,
+        integrate_time_steps: int = 100,
     ):
         super().__init__()
         self.save_hyperparameters()
@@ -383,4 +383,8 @@ def test_step(self, batch, batch_idx) -> dict[str, np.ndarray]:
             BatchKeys.TREATMENT: batch[BatchKeys.TREATMENT].detach().cpu().numpy().squeeze(1),
             BatchKeys.CONTEXT_ID: np.asarray(batch[BatchKeys.CONTEXT_ID]),
             BatchKeys.TREATMENT_ID: np.asarray(batch[BatchKeys.TREATMENT_ID]),
-        }
+        }
+
+    def predict_step(self, batch, batch_idx):
+        """Transport samples and return predictions with metadata."""
+        return self.test_step(batch, batch_idx)
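
Note on the `predict_step` additions: both baseline modules forward `predict_step` to `test_step`, so `trainer.predict(...)` in the README's inference snippet returns the same dictionaries that testing produces. A dependency-free sketch of the delegation, assuming a hypothetical `TransportModule` and a toy `predict` loop that only imitates how a trainer collects one result per batch (neither is the repository's code nor Lightning's API):

```python
class TransportModule:
    """Hypothetical stand-in for the baseline modules in the diff."""

    def test_step(self, batch: dict, batch_idx: int) -> dict:
        # Pretend "transport" shifts every source sample by the treatment value.
        preds = [x + batch["treatment"] for x in batch["source"]]
        return {"pred": preds, "context_id": batch["context_id"]}

    def predict_step(self, batch: dict, batch_idx: int) -> dict:
        # Reuse the test-time logic so both entry points return the same dict.
        return self.test_step(batch, batch_idx)


def predict(module: TransportModule, batches: list) -> list:
    """Toy loop: call predict_step once per batch and collect the results."""
    return [module.predict_step(batch, i) for i, batch in enumerate(batches)]


batches = [{"source": [0.0, 1.0], "treatment": 2.0, "context_id": "ctx0"}]
outputs = predict(TransportModule(), batches)
```

Keeping a single implementation means predictions and test outputs cannot drift apart, at the cost of `predict_step` also returning the metadata fields that testing needs.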