```
conda env create -f environment.yaml
conda activate ccf
```

Our repository also requires FlashAttention and PyTorch, whose builds depend on your local environment (CUDA version, driver, hardware), so they must be installed manually with versions suitable for your setup.
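For example, on a machine with CUDA 12.1 the pair can typically be installed as follows; the wheel index and CUDA version here are illustrative, so substitute ones that match your toolkit and driver:

```
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn --no-build-isolation
```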
```
cd ccf
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/activation-beacon.tar.gz?download=true -O ./raw_data/activation-beacon.tar.gz
cd raw_data
tar -xzvf activation-beacon.tar.gz
cp activation-beacon/pretrain/redpajama-sample.json .
cp activation-beacon/finetune/longalpaca.json .
```
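Optionally, a quick sanity check (run from inside `raw_data/`, where the block above ends) confirms both JSON files were copied and peeks at the data:

```
ls -lh redpajama-sample.json longalpaca.json
head -c 300 redpajama-sample.json
```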
```
cd ccf
python train.py --env_conf 32x.json
```

**Illustration on `32x.json` Config**
- You can modify the `device_map` field in the `32x.json` file to change the GPUs used for model loading. By assigning different GPUs to different modules, you can achieve pipeline parallelism.
- You can adjust the `config/32x.json` configuration file to change the parameters of LoRA fine-tuning, such as the chunk size and compression ratio.
- You can modify the `corpus` field to configure your desired datasets. All supported datasets and their specific configuration methods can be found in the `src/data.py` file. Use the `truncation` field to set the maximum token count for each dataset, and `partition` to set the instance ratio for the dataset.
- You can modify the `save_ckp` field to set the save path for checkpoints. A hedged sketch of the overall config shape follows this list.
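For orientation only, here is a sketch of the shape such a config might take. Only the field names `device_map`, `corpus`, `truncation`, `partition`, `save_ckp`, and `save` are taken from this README; the nesting, the value types, and the LoRA-related keys (`chunk_size`, `compression_ratio`) are assumptions, so treat the shipped `config/32x.json` as authoritative:

```json
{
    "device_map": {"model.layers.0-15": 0, "model.layers.16-31": 1},
    "chunk_size": 512,
    "compression_ratio": 32,
    "corpus": [
        {"name": "redpajama-sample", "truncation": 4096, "partition": 0.9},
        {"name": "longalpaca", "truncation": 4096, "partition": 0.1}
    ],
    "save_ckp": "ckp/32x/",
    "save": 1000
}
```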
```
cd ccf
python test.py --env_conf 32x.json
```

As long as a checkpoint has been generated during training (controlled by the `save_ckp` and `save` fields), you can use the above command to run evaluation.
The `test.py` script automatically searches for a `test.json` file in the working directory, which configures the datasets to be evaluated. Each instance in the file has the following format:
```json
{
    "task_type": "perplexity",
    "task_name": "pg19.test.128k",
    "num_instance": 100,
    "truncation": 99382
}
```
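For instance, a minimal `test.json` evaluating two corpora could be created as shown below. The top-level array container and the second `task_name` are assumptions made for illustration, not confirmed by this README; check `test.py` and `src/data.py` for the exact schema:

```
# Hypothetical two-task test.json. The array container and the second
# task_name are assumptions -- confirm the schema in test.py.
cat > test.json <<'EOF'
[
    {
        "task_type": "perplexity",
        "task_name": "pg19.test.128k",
        "num_instance": 100,
        "truncation": 99382
    },
    {
        "task_type": "perplexity",
        "task_name": "proof-pile.test.128k",
        "num_instance": 100,
        "truncation": 99382
    }
]
EOF
```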