Scripts for inspecting raw graph datasets and converting them into the unified format expected by TokenizerGraph.
TokenizerGraph expects each dataset to be stored as:
```
data/<dataset>/
├── data.pkl           # List of (dgl_graph, label) tuples or List[Dict]
├── train_index.json   # Train split indices
├── val_index.json     # Validation split indices
└── test_index.json    # Test split indices
```
All graphs must have:
- `g.ndata['feat']` — node token IDs (LongTensor, shape `[N, 1]` or `[N, 2]`)
- `g.edata['feat']` — edge token IDs (LongTensor, shape `[E, 1]`)
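As a quick sanity check before packaging a dataset, the shape contract above can be verified with a small helper. This is an illustrative sketch that treats feature matrices as plain nested lists of ints; in the real pipeline they are torch `LongTensor`s attached to DGL graphs.

```python
def check_feat_contract(node_feat, edge_feat):
    """Validate the token-ID contract: node features must be integer
    matrices of shape [N, 1] or [N, 2]; edge features of shape [E, 1].

    Illustrative sketch using nested lists; the real pipeline stores
    these as LongTensors in g.ndata['feat'] / g.edata['feat'].
    """
    def _check(mat, allowed_widths, name):
        widths = {len(row) for row in mat}
        if not widths <= allowed_widths:
            raise ValueError(f"{name}: expected widths {allowed_widths}, got {widths}")
        if len(widths) > 1:
            raise ValueError(f"{name}: inconsistent row widths {widths}")
        if any(not isinstance(v, int) for row in mat for v in row):
            raise ValueError(f"{name}: token IDs must be integers")

    _check(node_feat, {1, 2}, "ndata['feat']")
    _check(edge_feat, {1}, "edata['feat']")
    return True

# Example: 3 nodes with 1-channel tokens, 2 edges with 1-channel tokens
check_feat_contract([[6], [6], [8]], [[1], [2]])
```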
This directory layout is only the raw-data prerequisite for the main training pipeline. A dataset should be treated as training-ready only when all of the following are true:
- `data/<dataset>/data.pkl` exists
- `data/<dataset>/train_index.json`, `val_index.json`, `test_index.json` exist
- the dataset name is registered in `src/data/unified_data_factory.py`
- `python prepare_data_new.py --datasets <dataset> --methods feuler` finishes successfully
- `data/processed/<dataset>/...` artifacts are created
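The file-existence portion of that checklist can be automated with a short helper (registration and the prepare run still need to be checked against the repository). A minimal sketch using only the standard library:

```python
from pathlib import Path

REQUIRED_FILES = (
    "data.pkl",
    "train_index.json",
    "val_index.json",
    "test_index.json",
)

def missing_raw_files(data_root, dataset):
    """Return the required raw files that are absent under
    data_root/<dataset>/. An empty list means the raw-data
    prerequisite (but not registration or preprocessing) holds."""
    base = Path(data_root) / dataset
    return [name for name in REQUIRED_FILES if not (base / name).exists()]
```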
The audited status below reflects the checked-in repository state on 2026-03-15. The classification is intentionally strict:
- **Loader OK** means the dataset is registered in `src/data/unified_data_factory.py` and `loader.load_data()` succeeds with the local files currently present in `data/<dataset>/`.
- **feuler training cache readable** means `UnifiedDataInterface` can read the existing `feuler` serialized cache, vocab, and BPE codebook from the current repository state.
- **End-to-end verified** is reserved for datasets that were actually executed through `prepare_data_new.py -> run_pretrain.py -> run_finetune.py`.
- **feuler training cache readable** does not imply that the dataset has been reproduced from raw data in the current audit window.
| Status | Datasets | Meaning |
|---|---|---|
| End-to-end verified | `qm9test` | `prepare_data_new.py -> run_pretrain.py -> run_finetune.py` has been executed successfully in the current repository state. |
| Loader OK + feuler training cache readable | `aqsol`, `coildel`, `colors3`, `dblp`, `dd`, `molhiv`, `mutagenicity`, `peptides_func`, `peptides_struct`, `proteins`, `qm9`, `qm9test`, `synthetic`, `twitter`, `zinc` | Raw data files load successfully, and the current feuler cache can be consumed by the training read path. |
| Loader OK, prepare required before training | `mnist`, `mnist_raw` | Loader smoke tests succeed, but the current repository state does not contain the required feuler serialized cache, vocab, and BPE codebook. |
| Blocked in current repository state | `code2` | The raw loader path is blocked because `data/code2/data.pkl` is missing. Existing partial cache files are not sufficient to claim that the dataset is runnable. |
`release` keeps only the paper-scope reproducibility entrypoints and their minimum verification assets. `dev` keeps the broader audit surface:

- experimental scripts such as `aqsol`, `zinc`, and `code2`
- incomplete local-flow scripts such as `mnist`
- secondary helpers such as `data/qm9/process_qm9_dataset.py`

If a script is not a canonical paper-scope cold-start entrypoint, assume it belongs to `dev` unless the release docs say otherwise.
Pre-processed datasets (in the format above) can be downloaded from:
- Google Drive bundle: https://drive.google.com/file/d/10etZF9OnV569_Fp7tpdMUVEH9eZECKdW/view?usp=sharing
Extract into the data/ directory at the project root.
This bundle is the fastest path for reproduction. If you want to integrate a new dataset instead, keep reading: the repository also includes raw-to-unified conversion scripts under dataset directories plus the inventory in scripts/dataset_conversion/. Those scripts are intended to serve as concrete references for new dataset integration, not only as one-off preprocessing utilities.
1. Install dependencies:

   ```bash
   pip install dgl ogb torch_geometric rdkit
   ```

2. Inspect raw datasets (optional, to verify fields):

   ```bash
   # Inspect DGL/TU datasets
   python scripts/dataset_conversion/check_dgl_graphpred.py --datasets PROTEINS COLORS-3

   # Inspect OGB datasets
   python scripts/dataset_conversion/check_ogbg.py --datasets ogbg-molhiv
   ```
3. Run conversion: each dataset loader in `src/data/loader/` reads from `data/<dataset>/data.pkl`. To generate these files from raw sources, use the appropriate conversion approach below. For a new dataset, start by copying the closest existing conversion script and matching the same output contract:

   ```
   data/<dataset>/
   ├── data.pkl
   ├── train_index.json
   ├── val_index.json
   └── test_index.json
   ```
4. Verify the final directory layout before running any training command:

   ```
   data/<dataset>/
   ├── data.pkl
   ├── train_index.json
   ├── val_index.json
   └── test_index.json
   ```
5. Run preprocessing to build serialized sequences and vocabularies:

   ```bash
   python prepare_data_new.py --datasets <dataset> --methods feuler
   ```

   If multi-sampling is used during preparation, the exact K value must be preserved: training must use the same `serialization.multiple_sampling.num_realizations=K`, otherwise it will read from a different cache directory.
6. Run smoke tests:

   ```bash
   python run_pretrain.py --dataset <dataset> --method feuler --epochs 1 --batch_size 8
   python run_finetune.py --dataset <dataset> --method feuler --epochs 1 --batch_size 8
   ```

   For fine-tuning, CUDA must be available and the command should point to a valid pre-trained checkpoint directory.
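The conversion output contract above (`data.pkl` plus three split-index JSON files) can be satisfied by a minimal skeleton like the following. The graph objects here are placeholder dicts, and the simple 80/10/10 contiguous index split is an assumption for illustration, not the project's canonical splitter; a real script would emit `(dgl_graph, label)` tuples carrying the `ndata['feat']`/`edata['feat']` tensors described earlier.

```python
import json
import pickle
from pathlib import Path

def write_unified_dataset(out_dir, graphs, train_frac=0.8, val_frac=0.1):
    """Write data.pkl plus the three split-index JSON files.

    `graphs` is a list of (graph, label) pairs. Placeholder objects are
    fine for this sketch; the real pipeline stores DGL graphs with
    LongTensor 'feat' fields on nodes and edges.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "data.pkl", "wb") as f:
        pickle.dump(graphs, f)

    # Hypothetical contiguous 80/10/10 split, for illustration only.
    n = len(graphs)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    splits = {
        "train_index.json": list(range(n_train)),
        "val_index.json": list(range(n_train, n_train + n_val)),
        "test_index.json": list(range(n_train + n_val, n)),
    }
    for name, idx in splits.items():
        with open(out / name, "w") as f:
            json.dump(idx, f)
```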
| Dataset | Source | Task | Raw Format |
|---|---|---|---|
| `qm9` | MoleculeNet / DGL built-in | Regression (16 properties) | DGL graphs with atom/bond features |
| `zinc` | ZINC-12K | Regression (logP) | DGL graphs |
| `aqsol` | AqSolDB | Regression (solubility) | DGL graphs |
Node tokens = atomic number; edge tokens = bond type (SINGLE=1, DOUBLE=2, TRIPLE=3, AROMATIC=4).

Practical notes:

- `qm9`/`qm9test` loaders expect `data.pkl` and the split JSON files
- they may also read optional SMILES side files if present
- if SMILES files are absent, the core graph pipeline can still run as long as the required files exist
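The molecular token mapping quoted above can be written down directly. The token values come from this section; the comment about RDKit's `bond.GetBondType()` is an assumption about how a conversion script would obtain bond types, not a verified detail of the repository's scripts.

```python
# Edge token IDs for molecular datasets, as documented above.
# In a conversion script these would typically be looked up from the
# bond type reported by RDKit (assumed here, not verified).
BOND_TOKENS = {
    "SINGLE": 1,
    "DOUBLE": 2,
    "TRIPLE": 3,
    "AROMATIC": 4,
}

def bond_token(bond_type_name):
    """Map a bond-type name to its edge token ID."""
    return BOND_TOKENS[bond_type_name]

def atom_token(atomic_number):
    """Node tokens are simply the atomic number (e.g. C -> 6, O -> 8)."""
    return atomic_number
```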
| Dataset | Source | Task | Notes |
|---|---|---|---|
| `molhiv` | OGB `ogbg-molhiv` | Binary classification | Official train/val/test splits |
| `peptides_func` | LRGB | Multi-label classification (10 classes) | Official splits |
| `peptides_struct` | LRGB | Multi-target regression | Official splits |
| Dataset | Source | Task | Node Token Source |
|---|---|---|---|
| `proteins` | TU PROTEINS | Binary classification | `node_labels` |
| `colors3` | TU COLORS-3 | 11-class classification | `node_attr` (one-hot → discrete ID) |
| `synthetic` | TU SYNTHETIC | Binary classification | `node_labels` |
| `mutagenicity` | TU Mutagenicity | Binary classification | `node_labels` + `edge_labels` |
| `dd` | TU DD | Binary classification | `node_labels` |
| `coildel` | TU COIL-DEL | 100-class classification | 2-channel `node_attr` |
| `dblp` | TU DBLP_v1 | Binary classification | `node_labels` |
| `twitter` | TU TWITTER-Real-Graph-Partial | Binary classification | `node_labels` |
Token domain separation: Node tokens are mapped to odd integers (2x+1), edge tokens to even integers (2x). See DGL_tokenization_prep_plan.md for details.
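The odd/even separation described above can be sketched as two tiny mapping functions; because `2x+1` is always odd and `2x` always even, the node and edge token domains can never collide (see the referenced plan document for the authoritative convention).

```python
def node_token(raw_id):
    """Map a raw node label/ID into the odd token domain (2x + 1)."""
    return 2 * raw_id + 1

def edge_token(raw_id):
    """Map a raw edge label/ID into the even token domain (2x)."""
    return 2 * raw_id

# The two domains are disjoint for any raw ID range:
node_ids = {node_token(x) for x in range(100)}
edge_ids = {edge_token(x) for x in range(100)}
assert node_ids.isdisjoint(edge_ids)
```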
| Dataset | Source | Task |
|---|---|---|
| `mnist` | MNIST superpixel graphs | 10-class classification |
| `code2` | OGB `ogbg-code2` | Code AST classification |
| File | Purpose |
|---|---|
| `check_dgl_graphpred.py` | Inspect DGL/TU graph datasets: node/edge features, label distribution, graph statistics |
| `check_ogbg.py` | Inspect OGB graph property prediction datasets |
| `DGL_tokenization_prep_plan.md` | Detailed tokenization specification for each dataset (Chinese, internal reference) |
| `DGL_graph_pred_tokenizable_nodes.md` | Survey of DGL datasets with discrete node features suitable for tokenization |
The scripts in this directory, together with the dataset-specific preprocessing scripts under data/<dataset>/, are the recommended references when adapting a new raw dataset to the TokenizerGraph format.
Once data/<dataset>/ contains the required files, run the full pipeline:
```bash
# Step 1: Serialize + BPE + build vocab
python prepare_data_new.py --datasets qm9test --methods feuler

# Step 2: Pre-train
python run_pretrain.py --dataset qm9test --method feuler

# Step 3: Fine-tune
python run_finetune.py --dataset qm9test --method feuler
```

Note the CLI difference:

- `prepare_data_new.py` uses plural arguments: `--datasets`, `--methods`
- `run_pretrain.py` / `run_finetune.py` use singular arguments: `--dataset`, `--method`
If preparation uses `--multiple_samples K`, the training scripts must be launched with matching `serialization.multiple_sampling.enabled=true` and `serialization.multiple_sampling.num_realizations=K`. Otherwise `UnifiedDataInterface` reads from `single/` instead of `multi_<K>/`.
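The cache-directory behavior described above can be illustrated with a small helper. The `single`/`multi_<K>` naming follows the artifact paths documented in this file; treat the function itself as an explanatory sketch, not the actual resolution code inside `UnifiedDataInterface`.

```python
def cache_subdir(multiple_sampling_enabled, num_realizations=None):
    """Return the serialized-cache subdirectory a training run will read.

    Sketch of the behavior described above: with multi-sampling enabled
    and K realizations, the run reads multi_<K>/; otherwise single/.
    """
    if multiple_sampling_enabled:
        if not num_realizations or num_realizations < 1:
            raise ValueError("num_realizations (K) must be a positive integer")
        return f"multi_{num_realizations}"
    return "single"

# A prepare run with --multiple_samples 3 and a training run configured
# with num_realizations=3 agree on the same cache directory:
assert cache_subdir(True, 3) == "multi_3"
assert cache_subdir(False) == "single"
```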
Skipping Step 1 leaves the serialized cache and vocabulary absent, so later training commands will fail.
The following qm9test sequence was executed successfully on the current repository state. It is the only dataset/configuration pair that has been fully verified through prepare -> pretrain -> finetune in this audit.
1. Prepare a fresh `multi_3` cache to avoid overwriting the existing `single/`, `multi_10/`, or `multi_100/` artifacts:

   ```bash
   python prepare_data_new.py \
     --datasets qm9test \
     --methods feuler \
     --multiple_samples 3 \
     --workers 1 \
     --bpe_merges 64 \
     --bpe_min_freq 2 \
     --out prepare_results/e2e_qm9test_feuler_multi3
   ```

   Expected new artifacts:

   ```
   data/processed/qm9test/serialized_data/feuler/multi_3/serialized_data.pickle
   data/processed/qm9test/vocab/feuler/bpe/multi_3/vocab.json
   model/bpe/qm9test/feuler/multi_3/bpe_codebook.pkl
   ```
2. Run a one-epoch pre-training smoke test against the same `multi_3` cache:

   ```bash
   CUDA_VISIBLE_DEVICES=0 python run_pretrain.py \
     --dataset qm9test \
     --method feuler \
     --experiment_group e2e_audit \
     --experiment_name e2e_qm9test_feuler_multi3_pretrain \
     --epochs 1 \
     --batch_size 256 \
     --learning_rate 1e-4 \
     --bpe_encode_rank_mode all \
     --log_style offline \
     --plain_logs \
     --config_json '{"device":"cuda:0","system":{"device":"cuda:0","num_workers":1,"persistent_workers":true},"serialization":{"multiple_sampling":{"enabled":true,"num_realizations":3},"bpe":{"num_merges":64}}}'
   ```

   This creates a checkpoint at:

   ```
   model/e2e_audit/e2e_qm9test_feuler_multi3_pretrain/run_0/best/
   ```

   The directory must contain both `config.bin` and `pytorch_model.bin`.
3. Run a one-epoch fine-tuning smoke test using that explicit checkpoint directory:

   ```bash
   CUDA_VISIBLE_DEVICES=1 python run_finetune.py \
     --dataset qm9test \
     --method feuler \
     --target_property homo \
     --pretrained_dir model/e2e_audit/e2e_qm9test_feuler_multi3_pretrain/run_0/best \
     --experiment_group e2e_audit \
     --experiment_name e2e_qm9test_feuler_multi3_finetune \
     --epochs 1 \
     --batch_size 256 \
     --learning_rate 1e-5 \
     --bpe_encode_rank_mode all \
     --aggregation_mode avg \
     --log_style offline \
     --plain_logs \
     --config_json '{"device":"cuda:0","system":{"device":"cuda:0","num_workers":1,"persistent_workers":true},"serialization":{"multiple_sampling":{"enabled":true,"num_realizations":3},"bpe":{"num_merges":64}},"bert":{"finetuning":{"save_models":false}}}'
   ```
If all three steps pass, the prepare_data_new.py -> run_pretrain.py -> run_finetune.py path is confirmed for that dataset/configuration pair only.
- `dataset.limit` should be treated as legacy metadata rather than a reliable way to subset data for smoke tests. For minimal verification, use `qm9test` or prepare a dedicated smaller dataset.
- The checked-in default config uses `encoder.type: gte`, so a run will use the GTE encoder unless you explicitly switch to `bert`.
- `run_finetune.py` currently asserts that CUDA is available before any training starts.
- In the checked-in repository state audited here, `code2` does not pass the loader smoke test because `data/code2/data.pkl` is missing, so it should not be documented as immediately runnable without regenerating its processed data first.
- The current audited status is maintained directly in this file. If dataset availability changes, update the status table above in the same commit.
1. Write a conversion script that produces `data/<name>/data.pkl` + split indices
2. Create a loader class in `src/data/loader/<name>_loader.py` inheriting from `BaseDataLoader`
3. Register it in `src/data/unified_data_factory.py`
4. See `DGL_tokenization_prep_plan.md` for the token mapping convention