This repository contains the implementation of two differentiable low-rank compression methods:
- Adaptive Rank Selections for Low-Rank Approximation of Language Models: paper
- Learning to Low-Rank Compress: paper
Adaptive Rank Selections for Low-Rank Approximation of Language Models:
- The method introduces learnable neural networks to predict optimal decomposition ranks. This repository only implements the rank selection step—i.e., it only implements Algorithm 1: Adaptive Rank Selection. Fine-tuning after rank selection is not implemented.
- Caveats:
  - In the main branch, the rank selection layer differs from the original work, assigning one GRU to each layer (see the sketch after this list). Refer to the fix_hypernet branch for the exact implementation, where one GRU is used overall, and only linear projection layers are assigned per layer for mask prediction.
  - Implementation of SVD: ASVD and Fisher SVD are implemented here, while IWSVD is not. IWSVD is used in the final paper.
- Rank Selection Layer: Module
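A minimal sketch of the main-branch layout, assuming one GRU per decomposed weight that emits a soft mask over rank slots; the class name, shapes, and the sigmoid/temperature parameterization are illustrative assumptions, not the repository's exact module:

```python
import torch
import torch.nn as nn

class AdaptiveRankSelector(nn.Module):
    """Sketch: one GRU per layer, predicting a soft mask over rank slots."""

    def __init__(self, max_rank: int, hidden: int = 32):
        super().__init__()
        # One learned input embedding per candidate singular value.
        self.inputs = nn.Parameter(torch.randn(1, max_rank, hidden))
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, tau: float = 0.4) -> torch.Tensor:
        h, _ = self.gru(self.inputs)                  # (1, max_rank, hidden)
        logits = self.proj(h).squeeze(-1).squeeze(0)  # (max_rank,)
        return torch.sigmoid(logits / tau)            # soft mask in [0, 1]

# The mask gates the singular values of the decomposition,
# W ~ U diag(mask * S) V^T, so slots driven toward 0 can be pruned later.
mask = AdaptiveRankSelector(max_rank=256)()
```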
Learning to Low-Rank Compress:
- This method introduces a simpler rank selection layer, parameterized as a linear layer, through which optimal ranks for low-rank decomposition are learned per layer (a minimal sketch follows this list).
- There are some simplifications to make the codebase more uniform across both implementations—for example, the distillation objective and total variation loss from the original work are not included. However, as noted in the Appendix, using a pre-training loss provides similar performance (albeit slightly lower).
- Once the rank selection training is complete, we use the heuristic described in the paper to convert the model to its final form: code
- Rank Selection Layer: Module
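A minimal sketch of both pieces, assuming a logit vector produced by a linear layer plus a simple thresholding conversion; SimpleRankSelector, truncate_to_rank, and the 0.5 threshold are illustrative assumptions, not the repository's exact heuristic:

```python
import torch
import torch.nn as nn

class SimpleRankSelector(nn.Module):
    """Sketch: per-layer rank mask parameterized by a single linear layer."""

    def __init__(self, max_rank: int, hidden: int = 16):
        super().__init__()
        self.inp = nn.Parameter(torch.randn(hidden))  # learned fixed input
        self.proj = nn.Linear(hidden, max_rank)       # the linear layer

    def forward(self, tau: float = 0.4) -> torch.Tensor:
        return torch.sigmoid(self.proj(self.inp) / tau)  # soft mask, (max_rank,)

def truncate_to_rank(U, S, Vt, mask, threshold: float = 0.5):
    """Drop rank slots whose learned mask falls below the threshold."""
    keep = mask > threshold
    return U[:, keep], S[keep], Vt[keep, :]
```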
Setup:

```bash
conda create --name svd python=3.9
conda activate svd
pip install -r requirements.txt
```

Installation of the eval harness may fail. In that case, install it from source as described in their README.
To run adaptive rank selection training across several compression ratios:

```bash
# constants
NUM_TRAIN_SAMPLES=50000
MAX_LEN=256
BETA=1.
ACT_AWARE=activation
COMP_VALUES=(0.90 0.85 0.80)
EVAL_BS=8
BATCH_SIZE=4
LTYPE=adaptive
R_LOSS=default
LR=1e-3
MODEL=meta-llama/Llama-2-7b-hf
CACHE_DIR=cache_train_llama2
LAMBDA=16.
GAMMA=1.
#MODEL=meta-llama/Meta-Llama-3-8B
#CACHE_DIR=cache_train_llama
#LAMBDA=8.
#GAMMA=2.
#MODEL=google/gemma-7b
#CACHE_DIR=cache_train_gemma
#LAMBDA=8.
#GAMMA=2.
# Loop over the COMP values
for i in ${!COMP_VALUES[@]}; do
COMP=${COMP_VALUES[$i]}
EXP_NAME="${MODEL#*/}_${LTYPE}_${COMP}_fixmse_${GAMMA}_${LAMBDA}"
p_param=0.4
# Check if it's the first iteration
if [ $i -eq 0 ]; then
# First iteration: run without --load_act_cache so activations are computed
python train_adaptive.py --model=$MODEL --target_param_ratio=$COMP --eval_full --batch_size=$BATCH_SIZE --lr=$LR --num_train_samples=$NUM_TRAIN_SAMPLES --exp_name=$EXP_NAME --max_length=$MAX_LEN --cache_dir=$CACHE_DIR --eval_freq_steps=500 --eval_batch_size=$EVAL_BS --alpha=0.5 --lambda=$LAMBDA --gamma=$GAMMA --act_aware=$ACT_AWARE --layer_type=$LTYPE --beta_scale=$BETA --r_loss=$R_LOSS --tau=0.4 --p_param=$p_param
else
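# Later iterations reuse the activation statistics cached by the first run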
python train_adaptive.py --model=$MODEL --target_param_ratio=$COMP --eval_full --batch_size=$BATCH_SIZE --lr=$LR --num_train_samples=$NUM_TRAIN_SAMPLES --exp_name=$EXP_NAME --max_length=$MAX_LEN --cache_dir=$CACHE_DIR --eval_freq_steps=500 --eval_batch_size=$EVAL_BS --alpha=0.5 --lambda=$LAMBDA --gamma=$GAMMA --act_aware=$ACT_AWARE --layer_type=$LTYPE --beta_scale=$BETA --r_loss=$R_LOSS --tau=0.4 --p_param=$p_param --load_act_cache
fi
done
```
For Learning to Low-Rank Compress, we can use layer_type="simple":

```bash
LTYPE=simple
R_LOSS=default
LR=1e-2
gamma_scale=0. # there's no alignment loss, so set its scale to 0
lambda_scale=1. # compression scale
beta_scale=0.5 # pre-training scale
```
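A single run might then look like the following, reusing train_adaptive.py and the constants from the adaptive script above; the exact flag set for the simple variant may differ:

```bash
python train_adaptive.py --model=$MODEL --target_param_ratio=0.90 \
    --layer_type=$LTYPE --r_loss=$R_LOSS --lr=$LR \
    --gamma=$gamma_scale --lambda=$lambda_scale --beta_scale=$beta_scale \
    --batch_size=4 --num_train_samples=50000 --max_length=256 \
    --cache_dir=$CACHE_DIR --eval_batch_size=8 \
    --exp_name="${MODEL#*/}_simple_0.90"
```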
