A typical experiment goes through these steps:
- Prepare data — serialize graphs and train BPE (
prepare_data_new.py) - Pre-train — MLM on the token sequences (
run_pretrain.py) - Fine-tune — downstream property prediction (
run_finetune.py) - Aggregate — collect results across runs (
aggregate_results.py)
For batch experiments across multiple datasets/methods, use batch_pretrain_simple.py and batch_finetune_simple.py.
# Step 1: data preparation
python prepare_data_new.py --datasets qm9 --methods feuler --bpe_merges 2000
# Step 2: pre-training
python run_pretrain.py \
--dataset qm9 --method feuler \
--experiment_group my_ablation \
--epochs 100 --batch_size 256
# Step 3: fine-tuning
python run_finetune.py \
--dataset qm9 --method feuler \
--experiment_group my_ablation \
--target_property homo \
--epochs 200 --batch_size 64The --experiment_group flag keeps related runs organized under the same directory in model/ and log/.
By default, each script supports --repeat_runs N to run the same experiment N times with different seeds. Aggregated statistics (mean, std, best) are computed automatically and saved as *_aggregated_stats.json.
To sweep over multiple datasets and serialization methods:
python batch_pretrain_simple.py \
--datasets qm9,zinc,mutagenicity \
--methods feuler,eulerian,cpp \
--bpe_scenarios all,raw \
--gpus 0,1
python batch_finetune_simple.py \
--datasets qm9,zinc,mutagenicity \
--methods feuler,eulerian,cpp \
--bpe_scenarios all,raw \
--gpus 0,1These scripts distribute jobs across GPUs and run them in parallel.
Metrics are task-dependent:
- Regression: MAE, RMSE, R², Pearson correlation
- Classification: accuracy, F1, ROC-AUC
- Multi-label: AP (average precision)
All metrics are logged per epoch and saved in log/{group}/{dataset}-{method}/run_{i}/.
- All random seeds are fixed (default: 42). The config system sets
torch,numpy, andrandomseeds. torch.backends.cudnn.deterministic = Trueis set by default.- A full config snapshot is saved with each run.
- Dataset splits are fixed and loaded from pre-generated indices — no random splitting at training time.
When comparing methods, make sure:
- Same train/val/test split for all methods
- Same data preprocessing (serialization + BPE settings)
- Same evaluation protocol and metrics
- Multiple runs (at least 3) to report mean ± std
- Same model architecture and training hyperparameters (unless those are the variable being tested)
- Data leakage: using test set statistics during preprocessing
- Mismatched BPE: fine-tuning with different BPE settings than pre-training
- Cherry-picking: reporting only the best run instead of aggregated statistics
- Unfixed seeds: making results non-reproducible