GraphTokenizer

Graph Tokenization for Bridging Graphs and Transformers

[Chinese README] · [Paper (ICLR 2026 / OpenReview)] · [arXiv]

Branches: release — paper-scope reproducibility only. dev — full development line, including experimental/blocked preprocessing scripts and audit materials that are intentionally excluded from release.

Overview

The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. GraphTokenizer extends this paradigm to graph-structured data by introducing a general graph tokenization framework. It converts arbitrary labeled graphs into discrete token sequences, enabling standard off-the-shelf Transformer models (e.g., BERT, GTE) to be applied directly to graph data without any architectural modifications.

The framework combines reversible graph serialization with Byte Pair Encoding (BPE), the de facto standard tokenizer in large language models. To better capture structural information, the serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear as adjacent symbols in the resulting sequence — an ideal input for BPE to discover a meaningful vocabulary of structural graph tokens. The entire process is reversible: the original graph can be faithfully reconstructed from its token sequence.

Framework overview. (A) Substructure frequencies (labeled-edge patterns) are collected from the training graphs. (B) Structure-guided reversible serialization via frequency-guided Eulerian circuit — at each node, the next edge is selected according to the frequency priority (e.g., at the red C node, the C–C pattern has the highest frequency, so that edge is traversed first). (C) A BPE vocabulary is trained on the serialized corpus; BPE iteratively merges the most frequent adjacent symbol pairs into new tokens, compressing sequences to ~10% of their original length while preserving common substructures.
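To make step (C) concrete, here is a minimal toy sketch of BPE training over serialized symbol sequences. It is not the repository's BPE engine: the function names `train_bpe` and `apply_merge`, the tie-breaking rule, and the example corpus are all assumptions made for illustration, and merged tokens are built by plain string concatenation rather than integer ids.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    token = pair[0] + pair[1]  # toy choice: merged token is the concatenated string
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(token)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Most frequent pair first; ties broken lexicographically (an assumption)
        # so that the learned vocabulary is deterministic.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        if pair_counts[best] < 2:
            break  # nothing left that repeats
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges, seqs


# Hypothetical serialized corpus: each list is one graph's symbol sequence.
corpus = [
    ["C", "C", "O", "C", "C", "N"],
    ["C", "C", "O", "C", "C"],
]
merges, compressed = train_bpe(corpus, num_merges=2)
print(merges)
print(compressed)
```

Because frequent substructures are placed next to each other by the serializer, the pairs that BPE merges first tend to correspond to common structural patterns, which is why the two stages are designed together.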

Labeled Graphs  →  Structure-Guided Serialization  →  BPE Tokenization  →  Transformer  →  Predictions

Key Contributions

General Graph Tokenization Framework. Combines reversible graph serialization with BPE to create a bidirectional interface between graphs and sequence models. By decoupling the encoding of graph structure from the model architecture, it enables standard off-the-shelf Transformers to process graph data without any architectural modifications.
Structure-Guided Serialization for BPE. A deterministic serialization mechanism guided by global substructure statistics. It addresses the ordering ambiguity inherent in graphs (permutation invariance) and systematically arranges frequent substructures into adjacent symbol patterns — precisely the input that BPE's greedy merging strategy is designed to exploit.
State-of-the-Art on 14 Benchmarks. Achieves SOTA results across diverse graph classification and regression benchmarks spanning molecular, biomedical, social, academic, and synthetic domains. Scaling from a compact BERT-small to a larger GTE backbone yields consistent gains, demonstrating that graph tokenization can leverage the proven scaling behavior of Transformers.

Main Results

Classification (↑ higher is better) and regression (↓ lower is better) results:

Model	molhiv (AUC↑)	p-func (AP↑)	mutag (Acc↑)	coildel (Acc↑)	dblp (Acc↑)	qm9 (MAE↓)	zinc (MAE↓)	aqsol (MAE↓)	p-struct (MAE↓)
GCN	74.0	53.2	79.7	74.6	76.6	0.134	0.399	1.345	0.342
GIN	76.1	61.4	80.4	72.0	73.8	0.176	0.379	2.053	0.338
GAT	72.1	51.2	80.1	74.4	76.3	0.114	0.445	1.388	0.316
GatedGCN	80.6	51.2	83.6	83.7	86.0	0.096	0.370	0.940	0.312
GraphGPS	78.5	53.5	84.3	80.5	71.6	0.084	0.310	1.587	0.251
Exphormer	82.3	64.5	82.7	91.5	84.9	0.080	0.281	0.749	0.251
GraphMamba	81.2	67.7	85.0	74.5	87.6	0.083	0.209	1.133	0.248
GCN+	80.1	72.6	88.7	88.9	89.6	0.077	0.116	0.712	0.244
GT+BERT	82.6	68.5	87.5	74.1	93.2	0.122	0.241	0.648	0.247
GT+GTE	87.4	73.1	90.1	89.6	93.6	0.071	0.131	0.609	0.242

Results are the mean over 5 independent runs. Bold = best. See the paper for full results on all 14 datasets including DD, Twitter, Proteins, Colors-3, and Synthetic.

Supported Serialization Methods

Method	Reversible	Deterministic	Applicable to
Freq-Guided Eulerian (Feuler)	✅	✅	Any labeled graph
Freq-Guided CPP (FCPP)	✅	✅	Any labeled graph
Eulerian circuit	✅	❌	Any labeled graph
Chinese Postman (CPP)	✅	❌	Any labeled graph
Canonical SMILES	✅	✅	Molecular graphs only
DFS / BFS / Topo	❌	❌	Any graph

The default method is Feuler (Frequency-Guided Eulerian circuit), which provides both reversibility and determinism with O(|E|) time complexity.
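The idea behind a frequency-guided Eulerian traversal can be sketched in a few lines. The code below is an illustrative approximation, not the implementation in src/algorithms/serializer/: it assumes a connected undirected graph, treats every undirected edge as two directed arcs (so an Eulerian circuit always exists), and uses Hierholzer's algorithm with adjacency lists ordered by labeled-edge pattern frequency. The names `count_patterns`, `feuler_circuit`, and `edges_from_circuit` are hypothetical, and the per-node sort makes this sketch O(|E| log |E|) rather than O(|E|).

```python
from collections import Counter


def count_patterns(graphs):
    """Count labeled-edge patterns (label_u, label_v) over a set of training graphs."""
    freq = Counter()
    for labels, edges in graphs:
        for u, v in edges:
            freq[(labels[u], labels[v])] += 1
            freq[(labels[v], labels[u])] += 1
    return freq


def feuler_circuit(labels, edges, freq, start=0):
    """Hierholzer's algorithm where each node tries its most frequent pattern first."""
    adj = {u: [] for u in labels}
    for u, v in edges:  # double each undirected edge into two arcs
        adj[u].append(v)
        adj[v].append(u)
    for u in adj:
        # Highest-frequency neighbour ends up last so that pop() returns it first;
        # ties are broken by the smaller node id, which keeps the result deterministic.
        adj[u].sort(key=lambda v: (-freq[(labels[u], labels[v])], v), reverse=True)
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())
        else:
            circuit.append(stack.pop())
    return circuit[::-1]


def edges_from_circuit(circuit):
    """Recover the undirected edge set from consecutive node pairs (reversibility)."""
    return {frozenset(pair) for pair in zip(circuit, circuit[1:])}


# Hypothetical toy graph: node id -> label, plus an undirected edge list.
labels = {0: "C", 1: "C", 2: "O", 3: "N"}
edges = [(0, 1), (1, 2), (0, 3)]
freq = count_patterns([(labels, edges)])
circuit = feuler_circuit(labels, edges, freq, start=0)
print(circuit)
print([labels[n] for n in circuit])
```

In this toy version the node-id sequence together with the label table is enough to rebuild the graph. How the actual serializers encode node identity and edge labels in the token stream is not described in this README.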

Project Structure

GraphTokenizer/
├── prepare_data_new.py         # Data preprocessing: serialization + BPE training + vocab
├── run_pretrain.py             # Pre-training entry point (MLM)
├── run_finetune.py             # Fine-tuning entry point (regression/classification)
├── batch_pretrain_simple.py    # Batch pre-training across datasets/methods/GPUs
├── batch_finetune_simple.py    # Batch fine-tuning
├── aggregate_results.py        # Collect and tabulate experiment results
├── config.py                   # Centralized configuration management
├── config/default_config.yml   # Default config values
├── src/
│   ├── algorithms/
│   │   ├── serializer/         # Graph serialization (Freq-Euler, Euler, DFS, BFS, Topo, SMILES, CPP, ...)
│   │   └── compression/        # BPE engine (C++ / Numba / Python backends)
│   ├── data/                   # Unified data interface and per-dataset loaders
│   │   └── loader/             # Per-dataset loaders (QM9, ZINC, AQSOL, MNIST, Peptides, ...)
│   ├── models/                 # Model definitions
│   │   ├── bert/               # BERT encoder, vocab manager, data pipeline
│   │   ├── gte/                # GTE encoder (Alibaba-NLP/gte-multilingual-base)
│   │   └── unified_encoder.py  # Unified encoder interface
│   ├── training/               # Training pipelines (pretrain, finetune, evaluation)
│   └── utils/                  # Logging, metrics, visualization
├── gte_model/                  # Local GTE model config (for offline use)
├── final/                      # Paper experiment scripts and plotting code
└── docs/                       # Documentation

Installation

git clone https://github.com/BUPT-GAMMA/GraphTokenizer.git
cd GraphTokenizer

# Install in development mode.
# The checked-in pyproject.toml declares the build dependency on pybind11,
# so a network-enabled pip can bootstrap the C++ extension build automatically.
pip install -e .

# Build the C++ BPE backend (optional but recommended for speed)
python setup.py build_ext --inplace

Key dependencies: torch, dgl, networkx, rdkit, transformers, pybind11, pandas.

Notes:

If you are installing in an offline environment, preinstall pybind11 into the target environment before running the editable install (pip install -e .).
The editable install (pip install -e .) only installs the local package metadata and build requirements. Runtime libraries such as torch, dgl, rdkit, and transformers must already be present in the environment you use for experiments.

Quick Start

1. Data Preparation

Before running prepare_data_new.py, make sure the raw/preprocessed dataset files already exist under data/<dataset>/.

The loaders in src/data/loader/ assume the following files are present:

data/<dataset>/
├── data.pkl
├── train_index.json
├── val_index.json
└── test_index.json

For molecular datasets such as qm9 and zinc, some loaders will also look for optional SMILES files such as smiles_1_direct.txt.
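Before launching the preparation script, it can help to confirm that the required files listed above are in place. The following is a small standalone check, not part of the repository: the helper name `missing_dataset_files` and its `data_dir` parameter are assumptions, and it only tests for the four mandatory files, not the optional SMILES files.

```python
from pathlib import Path

# The four files the loaders are documented to expect under data/<dataset>/.
REQUIRED_FILES = ("data.pkl", "train_index.json", "val_index.json", "test_index.json")


def missing_dataset_files(dataset, data_dir="data"):
    """Return the required file names that are absent from data_dir/<dataset>/."""
    root = Path(data_dir) / dataset
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_dataset_files("qm9test")
if missing:
    print("qm9test is missing:", ", ".join(missing))
else:
    print("qm9test looks complete")
```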

For a clean clone, qm9test should be treated as a smoke-test dataset derived from qm9, not as a checked-in example dataset. Build qm9 first from its public source, then generate qm9test with data/qm9test/create_qm9test_dataset.py.

Serialize graphs and train a BPE tokenizer:

python prepare_data_new.py \
    --datasets qm9test \
    --methods feuler \
    --bpe_merges 2000

This script:

loads data/qm9test/data.pkl together with the fixed split files
serializes every graph with the selected method
trains a BPE model on the serialized corpus
builds the vocabulary used by downstream Transformer runs
writes cached artifacts under data/processed/<dataset>/...

After this step, you should expect processed artifacts in locations similar to:

data/processed/qm9test/
├── serialized_data/feuler/single/serialized_data.pickle
└── vocab/feuler/bpe/single/vocab.json

Refer to the following resources for detailed data preparation and execution instructions:

scripts/dataset_conversion/README.md — dataset-by-dataset conversion notes
src/data/README.md — data layer contract and expected directory layout

Pre-packaged datasets are available here:

Google Drive bundle: https://drive.google.com/file/d/10etZF9OnV569_Fp7tpdMUVEH9eZECKdW/view?usp=sharing

The repository also keeps raw-to-unified conversion scripts under dataset folders and scripts/dataset_conversion/README.md. Those scripts are not only for rebuilding the released datasets; they are also intended as concrete references when integrating a new dataset into the same directory contract.

Current audited status:

qm9test is the only dataset that has been fully verified through prepare_data_new.py -> run_pretrain.py -> run_finetune.py
mnist and mnist_raw currently pass loader-level checks only; prepare_data_new.py must be executed before training
code2 is blocked in the checked-in repository state because data/code2/data.pkl is missing
The complete audited status table is maintained in scripts/dataset_conversion/README.md

Important Notes:

prepare_data_new.py uses plural CLI arguments (--datasets, --methods), while run_pretrain.py and run_finetune.py use singular ones (--dataset, --method)
If preparing data with --multiple_samples K, the training scripts must be launched with matching serialization.multiple_sampling.enabled=true and serialization.multiple_sampling.num_realizations=K; otherwise, they will read from single/ instead of multi_K/
The checked-in default config currently sets encoder.type: gte, so runs will use the GTE encoder unless explicitly switched to bert
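The single/ versus multi_K/ distinction is easy to get wrong, so a quick way to see which paths a run is expected to read is sketched below. This helper is hypothetical and not part of the repository. It assumes that the multi_K/ layout mirrors the single/ layout shown earlier, with the same file names, which this README does not state explicitly.

```python
from pathlib import Path


def expected_artifacts(dataset, method, num_realizations=None, processed_dir="data/processed"):
    """Build the artifact paths a training run would look for (assumed layout)."""
    # No multiple sampling -> "single"; K realizations -> "multi_K".
    sub = "single" if num_realizations is None else f"multi_{num_realizations}"
    root = Path(processed_dir) / dataset
    return {
        "serialized": root / "serialized_data" / method / sub / "serialized_data.pickle",
        "vocab": root / "vocab" / method / "bpe" / sub / "vocab.json",
    }


for name, path in expected_artifacts("qm9test", "feuler", num_realizations=3).items():
    print(name, path, "exists" if path.is_file() else "missing")
```

If the paths printed for your training configuration do not match what the preparation step wrote, the sampling settings on the two sides are probably out of sync.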

2. Pre-training

Pre-train a Transformer encoder with Masked Language Modeling (MLM):

python run_pretrain.py \
    --dataset qm9test \
    --method feuler \
    --experiment_group my_experiment \
    --epochs 100 \
    --batch_size 256

Important Notes:

--dataset and --method are required
The script reads the processed artifacts produced by prepare_data_new.py
The default config uses the paths in config/default_config.yml, where data_dir resolves to data/
A verified one-epoch qm9test smoke test with multi_3 serialization is documented in scripts/dataset_conversion/README.md
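For readers new to MLM, the objective can be illustrated on a tokenized graph sequence as follows. This is a simplified sketch only: the 15% masking rate is the common BERT convention rather than a value taken from this repository, the token names are made up, and the function `mask_tokens` is hypothetical. The actual masking strategy used by run_pretrain.py is defined in the training code and config.

```python
import random

MASK = "[MASK]"
IGNORE = None  # positions that do not contribute to the loss


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; the model is trained to predict the hidden ones."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # the model sees the mask symbol
            labels.append(tok)    # and must recover the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


# Hypothetical BPE tokens for one serialized graph.
tokens = ["CCO", "CC", "N", "CC", "O", "CCO", "C", "N"]
inputs, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(inputs)
print(labels)
```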

3. Fine-tuning

Fine-tune the pre-trained model on downstream graph prediction tasks:

python run_finetune.py \
    --dataset qm9test \
    --method feuler \
    --experiment_group my_experiment \
    --target_property homo \
    --epochs 200 \
    --batch_size 64

For regression datasets such as qm9, set --target_property explicitly. For classification datasets such as mutagenicity or molhiv, the loader metadata is usually sufficient and no regression target is needed.

Fine-tuning Notes:

run_finetune.py currently requires CUDA (torch.cuda.is_available() is asserted at startup)
For smoke tests, --pretrained_dir should point directly to model/<group>/<exp_name>/run_0/best
The pre-trained checkpoint directory must contain both config.bin and pytorch_model.bin

4. Batch Experiments

Run experiments across multiple datasets, serialization methods, and GPUs in parallel:

python batch_pretrain_simple.py \
    --datasets qm9,zinc,mutagenicity \
    --methods feuler,eulerian,cpp \
    --bpe_scenarios all,raw \
    --gpus 0,1

python batch_finetune_simple.py \
    --datasets qm9,zinc,mutagenicity \
    --methods feuler,eulerian,cpp \
    --bpe_scenarios all,raw \
    --gpus 0,1
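To get a feel for how many runs a batch command expands to, the grid can be enumerated by hand. The sketch below is an assumption about the shape of the job grid, not a description of how batch_pretrain_simple.py schedules work: it takes the Cartesian product of the comma-separated arguments and assigns GPUs round-robin. The function `plan_jobs` is hypothetical.

```python
from itertools import product


def plan_jobs(datasets, methods, bpe_scenarios, gpus):
    """Enumerate every (dataset, method, bpe_scenario) combination with a round-robin GPU."""
    jobs = []
    for i, (dataset, method, scenario) in enumerate(product(datasets, methods, bpe_scenarios)):
        jobs.append({
            "dataset": dataset,
            "method": method,
            "bpe_scenario": scenario,
            "gpu": gpus[i % len(gpus)],  # assumed assignment policy
        })
    return jobs


jobs = plan_jobs(
    datasets=["qm9", "zinc", "mutagenicity"],
    methods=["feuler", "eulerian", "cpp"],
    bpe_scenarios=["all", "raw"],
    gpus=[0, 1],
)
print(len(jobs), "runs")
for job in jobs[:3]:
    print(job)
```

With the arguments from the commands above, the grid contains 3 × 3 × 2 = 18 combinations per stage, which is worth knowing before committing GPU time.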

Reproducing Paper Experiments

Scripts for all paper experiments are in the final/ directory:

Main experiments — final/exp1_main/run/: pre-training and fine-tuning commands for all 14 datasets
Efficiency analysis — final/exp1_speed/: serialization speed, token length stats, training throughput
Multi-sampling comparison — final/exp2_mult_seralize_comp/: effect of multiple serialization samples
BPE vocabulary visualization — final/exp4_bpe_vocab_visual/: codebook inspection and visualization

For the maintained Optuna-based hyperparameter search workflow, see:

hyperopt/README.md

Dataset Preparation Checklist

Use the following checklist to verify that a dataset is runnable end-to-end:

Put the dataset under data/<dataset>/
Ensure data.pkl, train_index.json, val_index.json, and test_index.json all exist
Confirm the dataset name is registered in src/data/unified_data_factory.py
Run:

python prepare_data_new.py --datasets <dataset> --methods feuler

Verify that data/processed/<dataset>/serialized_data/... and data/processed/<dataset>/vocab/... were created
If you prepared with multiple sampling, also verify whether the artifacts were written to single/ or multi_<K>/
Run a small pre-training smoke test:

python run_pretrain.py --dataset <dataset> --method feuler --epochs 1 --batch_size 8

Run a small fine-tuning smoke test:

python run_finetune.py --dataset <dataset> --method feuler --epochs 1 --batch_size 8

Fine-tuning also requires a CUDA-capable device and a valid pre-trained checkpoint.

Documentation

Configuration Guide — config file structure and parameters
Experiment Guide — how to design and run experiments
BPE Usage Guide — BPE engine API and usage
Dataset Conversion Guide — how to prepare data/<dataset>/ so the loaders can run directly

Citation

If you find this work useful, please cite our paper:

@inproceedings{guo2026graphtokenizer,
  title={Graph Tokenization for Bridging Graphs and Transformers},
  author={Guo, Zeyuan and Diao, Enmao and Yang, Cheng and Shi, Chuan},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

Branches

release — Clean version with only the code needed to reproduce paper experiments.
dev — Full development version with all utility scripts, benchmarks, and internal documentation.
