
Sequential Data Augmentation for Generative Recommendation

This repository contains the code for our work "Sequential Data Augmentation for Generative Recommendation" (arXiv), which will be presented at WSDM 2026. It provides training scripts for generative recommendation models (e.g., SASRec [1], TIGER [2]) using GenPAS, as well as data analysis code for selecting appropriate hyperparameters.


📦 Environment Setup

We recommend using Python ≥ 3.8 with a virtual environment.

conda create -n genpas python=3.9
conda activate genpas
pip install -r requirements.txt

📁 Datasets

For the Amazon datasets (Beauty, Toys, and Sports), please use this Google Drive link.

For the MovieLens datasets (ML1M and ML20M), we provide preprocessing code.

All datasets should be stored under the data folder. Please prepare the data in the following format:

data/
├── beauty/
│   ├── training/              # training set
│   ├── evaluation/            # validation set
│   ├── testing/               # test set
│   └── sids                   # semantic IDs for TIGER
├── toys/
├── sports/
├── ml1m/
└── ml20m/
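Before training, it can help to verify that every dataset folder follows the layout above. This is a hypothetical sanity check (not part of the repository); the dataset and split names are taken directly from the tree:

```python
import os

# Split directories expected under each dataset folder (see the tree above).
EXPECTED_SPLITS = ('training', 'evaluation', 'testing')

def check_layout(data_root='data',
                 datasets=('beauty', 'toys', 'sports', 'ml1m', 'ml20m')):
    """Return the list of expected split directories that are missing."""
    missing = []
    for ds in datasets:
        for split in EXPECTED_SPLITS:
            path = os.path.join(data_root, ds, split)
            if not os.path.isdir(path):
                missing.append(path)
    return missing
```

An empty return value means all expected split directories are present.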

To load sequences for a given split (training / evaluation / testing), refer to the Jupyter notebook read_data.ipynb.

# Dataset and split
dataset = 'beauty'
data_type = 'training'  # or 'evaluation' / 'testing'

# Load the sequences for the chosen split
sequences = read_data(dataset, data_type)

# `sequences` is the list of user interaction sequences for the split
print(len(sequences), 'sequences loaded')
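For reference, a minimal stand-in for `read_data` might look like the sketch below. This is an assumption about the on-disk format (plain-text files under `data/<dataset>/<split>/` with one space-separated item sequence per line); the actual implementation in read_data.ipynb may differ:

```python
import os

def read_data(dataset, data_type, data_root='data'):
    """Illustrative sketch of the notebook's read_data helper.

    Assumes each split directory holds text files with one
    space-separated item-ID sequence per line.
    """
    split_dir = os.path.join(data_root, dataset, data_type)
    sequences = []
    for fname in sorted(os.listdir(split_dir)):
        with open(os.path.join(split_dir, fname)) as f:
            for line in f:
                items = line.split()
                if items:
                    sequences.append([int(i) for i in items])
    return sequences
```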

To load the MovieLens-20M dataset to use for the TIGER experiments, please run data/ml20m/text_data_preparation.ipynb.


🚀 Training Models

We provide training scripts for common generative recommendation baselines. Please find the full script in run.sh.

Run SASRec

python src/train.py trainer=ddp experiment=sasrec_train_10_00_00_beauty.yaml logger=csv

Run TIGER

python src/train.py trainer=ddp experiment=tiger_train_10_00_10_beauty.yaml logger=csv

📊 Analysis Tools

We provide scripts that measure how augmentation impacts the target and input–target distributions in data_analysis.

KL Divergence of Target Distribution

Computes KL(p_valid || q_train) between the validation/test target distribution and the training target distribution.

python get_kl.py --dataset beauty --alpha 1.0 --beta 0.0
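Conceptually, this metric compares the empirical distributions of target items in the two splits. A minimal sketch of such a computation is shown below; the hypothetical `kl_divergence` helper is for illustration only, and get_kl.py may smooth items unseen in training differently:

```python
from collections import Counter
import math

def kl_divergence(valid_targets, train_targets, eps=1e-12):
    """KL(p_valid || q_train) between empirical target-item distributions.

    Items absent from the training targets are floored at `eps` to keep
    the divergence finite (an illustrative choice).
    """
    p = Counter(valid_targets)
    q = Counter(train_targets)
    n_p, n_q = sum(p.values()), sum(q.values())
    kl = 0.0
    for item, count in p.items():
        p_i = count / n_p
        q_i = q.get(item, 0) / n_q
        kl += p_i * math.log(p_i / max(q_i, eps))
    return kl
```

A smaller value indicates that augmentation has kept the training target distribution close to the validation/test target distribution.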

Alignment & Discrimination of Input–Target Distribution

Computes alignment and discrimination between the validation/test input–target distribution and the training input–target distribution.

python get_align_disc.py --dataset beauty --alpha 1.0 --beta 0.0 --gamma 0.0

For large datasets with longer sequences (e.g., ML1M and ML20M), please use the sampling-based variant:

python get_align_disc_sample.py --dataset ml1m  --alpha 1.0 --beta 0.0 --gamma 0.0

🤝 Acknowledgments

Bibliography

[1] Kang, Wang-Cheng, and Julian McAuley. "Self-Attentive Sequential Recommendation." IEEE International Conference on Data Mining (ICDM), 2018.

[2] Rajput, Shashank, et al. "Recommender systems with generative retrieval." Advances in Neural Information Processing Systems 36 (2023): 10299-10315.