- To check the available versions of any module and how to load them, use this template: `module spider <module_name>`
- Create a `venv` in Python. If you face difficulties installing libraries that have a built-in dependency on, say, `arrow`, use `pip install --no-index <pkg_name>`.
- First, force-purge all existing modules with `module --force purge`.
- Load `StdEnv/2023`, since it comes bundled with the latest `gcc` (`gcc/12.3`) and Python 3.11. Load it using `module load StdEnv/2023`.
- Load `arrow` using `module load arrow/14.0.1`.
Alternatively, just execute the setup script: `source src/utils/setup_cc.sh`

If you want to collect data, follow these steps:
- Download the `json` from the SEART tool.
- Place the `json` file in the input folder (`data/input`).
- Execute the `data-collector.sh` script like `source data-collector.sh`, or if using HPC, use `sbatch data-collector.sh`.
- Alternatively, to run the data collection script independently, execute: `python -u src/dataprocessing/data.py $input_file_name`
- After the data pre-processing stage runs, it creates a `jsonl` file with the before- and after-refactoring methods for each repository.
- If running via HPC, the output will be a `zip` file in the `data/output` folder. Extract it, and the `jsonl` files will be in the `localstorage` folder.
- Next, execute the `src/deep_learning/dataset_creation.py` script as described below.
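Before collating, you may want to sanity-check a per-repository `jsonl` file produced by the data collection step. Below is a minimal sketch; the field names `before` and `after` are assumptions for illustration, so check a real record for the actual keys:

```python
import json
from pathlib import Path

def load_records(path):
    """Parse a jsonl file: one JSON object per non-blank line."""
    records = []
    for line in Path(path).read_text().splitlines():
        if line.strip():  # skip blank lines
            records.append(json.loads(line))
    return records

# Synthetic example record; real field names may differ.
sample = Path("example.jsonl")
sample.write_text(json.dumps({"before": "def f(): pass",
                              "after": "def f():\n    return None"}) + "\n")
for rec in load_records(sample):
    print(rec["before"], "->", rec["after"])
```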
First, we need to collate the data from the per-repository `jsonl` files into a single `jsonl` file:

```
python dataset_creation.py generate <input folder with all repository jsonl files> <output jsonl file path>
```

After generation, if you want to split the collated data, execute:

```
python dataset_creation.py split <jsonl file created in the last step> <output folder path>
```

This takes the collated input data and creates `train.jsonl`, `test.jsonl`, and `val.jsonl` in the output folder.
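The `split` step above can be approximated with stdlib Python. This is a sketch under assumptions, not the actual script: it uses a seeded shuffle and an 80/10/10 ratio, whereas the real `dataset_creation.py` may use different ratios or sampling:

```python
import json
import random
from pathlib import Path

def split_jsonl(in_path, out_dir, seed=42):
    """Shuffle a collated jsonl and write train/val/test files (80/10/10)."""
    lines = [ln for ln in Path(in_path).read_text().splitlines() if ln.strip()]
    random.Random(seed).shuffle(lines)          # deterministic shuffle
    n_train = int(0.8 * len(lines))
    n_val = int(0.1 * len(lines))
    parts = {"train": lines[:n_train],
             "val": lines[n_train:n_train + n_val],
             "test": lines[n_train + n_val:]}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in parts.items():
        (out / f"{name}.jsonl").write_text("\n".join(rows) + ("\n" if rows else ""))
    return {name: len(rows) for name, rows in parts.items()}

# Build a tiny synthetic collated file, then split it.
src = Path("collated.jsonl")
src.write_text("".join(json.dumps({"id": i}) + "\n" for i in range(10)))
print(split_jsonl(src, "splits"))  # → {'train': 8, 'val': 1, 'test': 1}
```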
- To execute the fine-tuning script, run `src/refactoring-finetune/ft-scripts/supervised_fine_tune.py` as follows:

```
python code-t5.py --model_save_path ./output/codet5-test --run_name code_t5_test --train_data_file_path data/dl-no-context-len/train.jsonl --eval_data_file_path data/dl-no-context-len/val.jsonl --num_epochs 1
```

- This is just an example. Check out the `ScriptArguments` class in the file for more information on the arguments, or run `python code-t5.py --help`.

Note: Make sure to set up WandB in the environment variables if you want to use W&B.

- If you want to run a batch job using HPC, just execute: `sbatch fine-tune.sh`

- Move to the `src/reinforcement-learning` directory.
- Run the `ppo_trl.py` script with the necessary arguments. An example is given below:

```
python ppo_trl.py \
    --model_name src/refactoring-finetune/ft-scripts/output/code-t5-fine-tuned \
    --tokenizer_name src/refactoring-finetune/ft-scripts/output/code-t5-fine-tuned \
    --log_with wandb \
    --train_data_file_path data/dl-large/preprocessed/train.jsonl \
    --eval_data_file_path data/dl-large/preprocessed/val.jsonl
```

- Check the `ScriptArguments` class in the file for more information.
- If you want to run a batch job using HPC, just execute: `sbatch rl-fine-tune.sh`