Paper implementation and adaptation of Self-Rewarding Language Models.
This project explores Self-Rewarding Language Models (Yuan et al., 2024), using LLM-as-a-Judge to let a model iteratively improve itself. It integrates Low-Rank Adaptation (LoRA; Hu et al., 2021) to enable efficient adaptation without full fine-tuning.
```shell
./setup.sh
```

Note: This will create a virtual environment, install the required packages, and download the data.
In the `config.yaml` file, you can set the following parameters:
- `cuda_visible_devices`: The GPU(s) to use (`0` for the first GPU, `1` for the second GPU, or `0,1` for both)
- `model_name`: The name of the model to use. Choose from the Hugging Face Hub
- `tokenizer_name`: The name of the tokenizer to use. Choose from the Hugging Face Hub
- `wandb_enable`: `True` or `False`. If `True`, logs will be sent to wandb
- `wandb_project`: The name of the wandb project
- `peft_config`: The PEFT configuration; adapt it to your needs
- `iterations`: The number of self-improvement iterations
- `sft_training`: SFT training hyperparameters
- `dpo_training`: DPO training hyperparameters
- `generate_prompts`: The number of prompts to generate in each iteration
- `generate_responses`: The number of responses per prompt to generate in each iteration
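As an illustration, a minimal `config.yaml` might look like the sketch below. The key names mirror the parameter list above, but the concrete values (the model id, the LoRA settings, and the hyperparameter names inside the training sections) are placeholders, not the project's actual defaults:

```yaml
cuda_visible_devices: "0"          # first GPU only
model_name: "meta-llama/Llama-2-7b-hf"       # any Hugging Face Hub model id
tokenizer_name: "meta-llama/Llama-2-7b-hf"
wandb_enable: False
wandb_project: "self-rewarding-lm"
peft_config:                       # illustrative LoRA settings
  r: 16
  lora_alpha: 32
  lora_dropout: 0.05
iterations: 3                      # self-improvement rounds
sft_training:
  learning_rate: 2.0e-5
  num_train_epochs: 1
dpo_training:
  learning_rate: 5.0e-7
  beta: 0.1
generate_prompts: 256              # prompts generated per iteration
generate_responses: 4              # responses sampled per prompt
```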
To run the training, execute the following command:
```shell
python -m src.train.train
```
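Conceptually, each training iteration follows the self-rewarding loop from the paper: sample responses for generated prompts, score them with the model itself acting as judge, and turn the best and worst responses into preference pairs for DPO. The sketch below is a hypothetical illustration of that loop, not the actual `src.train.train` code; `judge_score` is stubbed out, since in the paper it is the model scoring its own responses via an LLM-as-a-Judge prompt.

```python
# Hypothetical sketch of one self-rewarding iteration (Yuan et al., 2024).
# All function names here are illustrative stand-ins, not the project's API.

def judge_score(prompt: str, response: str) -> float:
    # In the paper, the model scores each response 0-5 using an
    # LLM-as-a-Judge prompt; stubbed with response length for illustration.
    return float(len(response))

def build_dpo_pair(prompt: str, responses: list[str]) -> dict:
    # Highest-scored response becomes "chosen", lowest becomes "rejected".
    ranked = sorted(responses, key=lambda r: judge_score(prompt, r))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}

def self_reward_iteration(prompts: list[str], responses_per_prompt: int) -> list[dict]:
    pairs = []
    for p in prompts:
        # Stand-in for sampling `generate_responses` completions from the model.
        responses = [f"{p} answer {i} " * (i + 1) for i in range(responses_per_prompt)]
        pairs.append(build_dpo_pair(p, responses))
    return pairs  # these preference pairs would then feed DPO training
```

In the real pipeline, `generate_prompts` and `generate_responses` from `config.yaml` control how many prompts and completions each iteration produces, and the resulting pairs are what the DPO step trains on.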