GitHub - UnicornJin/BT4222LLM4Rec: Example code of using LLM in Recommendation System, for BT4222

BT4222 LLM-for-Recommendation System

This repo contains the code for the NUS BT4222 module topic: LLM for Recommendation System.

The code is modified from the LLM4REC project (Code Repo, Paper). For a more detailed understanding of the topic, take a look at their paper.

(For the research direction of "LLM for Recomm Sys", there is also an awesome summary of the papers that you can browse if you are interested: LLM4Rec Awesome Papers)

The system design is based on GPT2. There are three major scripts in this repo, covering the pre-training, fine-tuning and evaluation stages:

llm4rec_training.py The model learns the patterns of user/item interactions, item descriptions, and user reviews. This stage is called pre-training because we are only asking the model to get familiar with the scenario.
llm4rec_finetuning.py The model is required to make predictions in this stage. The correctness of the predictions is judged, and the model is tuned to make better predictions.
llm4rec_evaluation.py The evaluation step.

Please note that the core idea of this project is not limited to training a model; it is the more fundamental concept of how to apply LLMs to recommendation systems. Please go through the notebook BT4222-LLM4Rec.ipynb for a detailed explanation.

Training and fine-tuning take one 20GB GPU (RTX4000Ada) and 30+ hours, so we provide checkpoints for each stage. If you only have a laptop or a small machine, you can still run the evaluation, which takes only 1GB of memory and 5 minutes of running time.

Prepare running environment

Conda Environment

(The code is tested on Ubuntu 24.04, but should also work on other Linux distributions.)

Make sure you have conda installed. (If not, you can install it from Anaconda.)

Run these steps to create a conda-env to run our code:

conda create -n bt4222llm4rec python=3.11 pip -y
conda activate bt4222llm4rec
pip install -r requirements.txt

Prepare GPT2 repo

The GPT2 model weights are hosted on HuggingFace. This year (2025), HuggingFace is moving large-file downloads from git-lfs to huggingface-cli.

To download the GPT2 weights, first, make sure you have huggingface-cli installed:

# 1) Install tools
pip install -U "huggingface_hub[cli]" hf_transfer

# 2) (Optional but faster) enable accelerated transfers
export HF_HUB_ENABLE_HF_TRANSFER=1

Then you can download the GPT2 repo. (Since GPT2 is public, you should be able to download it without logging in.) Since the download is large, you may want to run it in a screen or tmux session.

# 3) Download the GPT-2 model repo into ./gpt2  (resumable, versioned cache)
huggingface-cli download openai-community/gpt2 \
  --local-dir ./gpt2 --repo-type model

Note: This will download 11GB of data into the gpt2/ folder. Make sure you have a stable network connection; you may want to take a walk or a break while waiting for the download.

Prepare the dataset

The pre-processed dataset is prepared in advance. You can download it from: Google Drive

Then put the files in the current folder, e.g.:

BT4222LLM4REC:
    - dataset:
        - luxury:
            - item_texts\
            - user_item_texts\
            - meta.pkl
            - ...

We also provide the data pre-processing scripts under data_processing_scripts\. Feel free to read them if you are interested.

The data processing extracts the graph-based relationships from the raw dataset. As the paper describes, this is necessary to adapt LLMs to recommendation tasks. We also need to extract textual information, such as item descriptions and user reviews, to give the model rich context about users and items.
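To make the two kinds of extracted information concrete, here is a minimal sketch. It assumes each raw review record carries a user ID, an item ID and a review text; the field names (user_id, item_id, text) and the helper build_interactions are hypothetical and are not taken from data_processing_scripts\, which may organize the data differently.

```python
from collections import defaultdict


def build_interactions(reviews):
    """Group review records into graph edges and per-interaction text.

    `reviews` is an iterable of dicts with the hypothetical keys
    'user_id', 'item_id' and 'text'.
    """
    user_items = defaultdict(set)  # graph side: user -> set of interacted items
    review_texts = {}              # text side: (user, item) -> review text
    for record in reviews:
        user = record["user_id"]
        item = record["item_id"]
        user_items[user].add(item)
        review_texts[(user, item)] = record["text"]
    return dict(user_items), review_texts


if __name__ == "__main__":
    sample = [
        {"user_id": "u1", "item_id": "i1", "text": "Lovely scent."},
        {"user_id": "u1", "item_id": "i2", "text": "Too expensive."},
        {"user_id": "u2", "item_id": "i1", "text": "Would buy again."},
    ]
    graph, texts = build_interactions(sample)
    print(graph)
    print(texts[("u2", "i1")])
```

The user-to-items map is the interaction graph in its simplest form, and the text map keeps the review attached to the edge it came from.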

Prepare the tokenizer

As the paper describes, the tokenizer should be modified to handle the user/item IDs.
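As a rough illustration of what "handling the user/item IDs" can mean, the sketch below keeps ID strings as single tokens instead of letting them be broken into sub-words. The ID format (user_<n>, item_<n>) and the function tokenize_with_ids are assumptions made for this example, and plain whitespace splitting stands in for the BPE step of a real GPT2 tokenizer. It is not the implementation used by the provided tokenizer.

```python
import re

# Hypothetical ID token format; the provided tokenizer may use a different one.
ID_PATTERN = re.compile(r"(user_\d+|item_\d+)")


def tokenize_with_ids(text):
    """Split text so that user/item ID tokens stay atomic.

    Non-ID spans are split on whitespace as a placeholder for the
    sub-word tokenization that a GPT2 tokenizer would apply.
    """
    tokens = []
    for piece in ID_PATTERN.split(text):
        if not piece:
            continue
        if ID_PATTERN.fullmatch(piece):
            tokens.append(piece)          # keep the whole ID as one token
        else:
            tokens.extend(piece.split())  # stand-in for BPE on ordinary text
    return tokens


if __name__ == "__main__":
    print(tokenize_with_ids("user_12 has interacted with item_7 and item_30"))
```

Keeping each ID atomic lets the model learn one embedding per user and per item, instead of composing an ID from unrelated digit fragments.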

We have prepared the pre-trained tokenizer for you. You can download it from: Google Drive

And put them under:

BT4222LLM4REC:
    - provided_tokenizer:
        - merges.txt
        - vocab_file.json

Prepare the checkpoints

We have prepared the pre-training and fine-tuning checkpoints. You can download them from: Google Drive

And put them under:

BT4222LLM4REC:
    - checkpoints:
        - finetune:
            - luxury:
                - collaborative-based\
                - content-based\
        - pretrain:
            - luxury:
                - collaborative-based\
                - content-based\
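Before moving on, it can help to confirm that everything landed in the right place. The following is a small optional sketch that checks for the files and folders named in the layouts of this README. The helper missing_paths is hypothetical and not part of the repo, and it only covers the paths listed here, not the entries hidden behind "...".

```python
from pathlib import Path

# Paths copied from the folder layouts shown in this README.
EXPECTED_PATHS = [
    "gpt2",
    "dataset/luxury/item_texts",
    "dataset/luxury/user_item_texts",
    "dataset/luxury/meta.pkl",
    "provided_tokenizer/merges.txt",
    "provided_tokenizer/vocab_file.json",
    "checkpoints/finetune/luxury/collaborative-based",
    "checkpoints/finetune/luxury/content-based",
    "checkpoints/pretrain/luxury/collaborative-based",
    "checkpoints/pretrain/luxury/content-based",
]


def missing_paths(root="."):
    """Return the expected files/folders that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:")
        for path in missing:
            print("  -", path)
    else:
        print("All expected files and folders are in place.")
```

Run it from the repo root. An empty result only means the listed paths exist; it does not verify that the downloads are complete.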

Then we are ready to run the evaluation script. (You can also try running the training and fine-tuning scripts yourself; see below for details.)

Run Evaluation

Simply run:

python llm4rec_evaluation.py

You should then see the Recall and NDCG metrics.
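For reference, here is a minimal sketch of how Recall@K and NDCG@K are commonly computed for one user with binary relevance. The function names are chosen for this example, and llm4rec_evaluation.py may differ in details such as how recall is normalized or how training items are masked.

```python
import math


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k ranking."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so the top position uses log2(2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]   # model's ranking, best first
    relevant = {"a", "c", "z"}      # held-out items the user interacted with
    print(recall_at_k(ranked, relevant, 3))
    print(ndcg_at_k(ranked, relevant, 3))
```

Recall only counts how many held-out items made it into the top K, while NDCG also rewards placing them near the top of the list.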

Read the code

Simply running the code will not help you understand how things work.

Although you don't need to run the training and fine-tuning, you are still required to read through the code. Please follow the notebook and refer to the corresponding code files to understand how the project works.

Try Pre-training and Fine-tuning by yourself (optional)

If you happen to have the resources, time and interest to run the pre-training and fine-tuning, you can download another dataset from Amazon Reviews and follow the data pre-processing steps in data_processing_scripts\.

Remember to change the dataset name in the code, then run:

python llm4rec_training.py 2>&1 | tee self_run_training.log

Note that this will run for a long time. (You can use the screen command to keep it running in the background.)

This step may produce 50GB of intermediate data on disk, mainly checkpoints saved in the middle of training.

You can find all the saved checkpoints in checkpoints\pretrain

Find the best model, i.e. the one with the smallest loss, among the pretrain checkpoints and rename it to xxxxxx_best.xxx, following the naming of the provided checkpoints.
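Picking the checkpoint with the smallest loss can be done by hand, or with a few lines like the sketch below. It assumes you have already collected a mapping from checkpoint file name to loss (for example by reading your training log); the file names and loss values shown are made up, and pick_best_checkpoint is a hypothetical helper that is not part of the repo.

```python
def pick_best_checkpoint(losses):
    """Return the checkpoint name with the smallest recorded loss.

    `losses` maps a checkpoint file name to its loss value.
    """
    if not losses:
        raise ValueError("no checkpoints given")
    return min(losses, key=losses.get)


if __name__ == "__main__":
    # Made-up names and values, for illustration only.
    example = {"ckpt_a.pt": 2.10, "ckpt_b.pt": 1.70, "ckpt_c.pt": 1.90}
    print(pick_best_checkpoint(example))
```

The renaming itself is left manual here, since the exact file naming should be copied from the provided checkpoints.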

Then change the checkpoint-loading part of llm4rec_finetuning.py so that it loads your own checkpoint, and run fine-tuning with:

python llm4rec_finetuning.py 2>&1 | tee self_run_finetuning.log

This will also run for a long time and produce about 40GB of intermediate data. You can find the models in \checkpoints\self-running\finetuning\
