Replies: 1 comment
Hi there, you raise a good point. I think what we define as "fair" depends on what we want to compare. I can see two possible scenarios:

a) Regular finetuning versus LoRA finetuning, where LoRA targets the same layers as the regular finetuning (all, last, or some). Then one can see whether LoRA, which updates fewer parameters per layer, helps or not.

b) Taking the configuration with the highest accuracy (all, last, or some layers) for each of the two approaches (regular versus LoRA) and asking which one is more efficient.

I think another important factor that we are not considering in this discussion thread is the memory savings. In practice, a common bottleneck is that one cannot do regular training of an 8B model on many single GPUs (depending on the available RAM), but LoRA is fine. All that being said, in the classification case (not in the supervised instruction-finetuning case), updating only the last layers like you suggest is reasonable. But again, I think with larger models this difference becomes negligible either way.
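To make the memory point concrete, here is a minimal sketch in plain PyTorch (not the exact code from the book; the `LoRALayer`/`LinearWithLoRA` classes and the toy two-layer block are just illustrative) comparing the number of trainable parameters for regular fine-tuning versus LoRA applied to the same layers:

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    # Trainable low-rank update: alpha * (x @ A @ B)
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank**0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    # Frozen base linear plus the trainable low-rank path
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

def count_trainable(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Toy stand-in for the layers one would fine-tune (e.g. an MLP sub-block)
block = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
full = count_trainable(block)  # regular fine-tuning: every weight is trainable

# Same layers, but LoRA-only: freeze the originals, train only A and B
for p in block.parameters():
    p.requires_grad = False
block[0] = LinearWithLoRA(block[0], rank=8, alpha=16)
block[2] = LinearWithLoRA(block[2], rank=8, alpha=16)
lora_only = count_trainable(block)

print(f"Regular fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r=8) on the same layers: {lora_only:,} trainable parameters")
```

On this toy block that is roughly 4.7M trainable parameters for regular fine-tuning versus about 61K for LoRA, and that smaller set of trainable parameters is what shrinks the gradient and optimizer-state memory.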
Hey, I have just implemented LoRA for classification fine-tuning. At the end, I noticed the comment about how LoRA is slower due to the added inference cost, but that this cost becomes negligible for larger models.

My question is whether it makes sense to compare this model against the last-layer fine-tuning, because in classification fine-tuning we only fine-tuned the layers from the last transformer block onwards and did not touch the earlier layers' weights. I think using LoRA only on the last transformer block and the out_head would be a fairer comparison here.
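Concretely, what I have in mind is something like the sketch below. It assumes the model exposes its blocks as `model.trf_blocks` and the classification head as `model.out_head` (adjust the attribute names to your implementation), and that `wrap_with_lora` is whatever LoRA wrapper you already use (e.g. a `LinearWithLoRA`-style module around a frozen `nn.Linear`):

```python
import torch.nn as nn

def lora_last_block_only(model, wrap_with_lora):
    # Freeze every weight, then add LoRA adapters only to the linear layers
    # of the last transformer block and to the classification head.
    # `wrap_with_lora(linear)` should return the linear layer wrapped with a
    # trainable low-rank adapter; `trf_blocks`/`out_head` are assumed names.
    for p in model.parameters():
        p.requires_grad = False

    def wrap_linears(module):
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                setattr(module, name, wrap_with_lora(child))
            else:
                wrap_linears(child)

    wrap_linears(model.trf_blocks[-1])               # last transformer block only
    model.out_head = wrap_with_lora(model.out_head)  # classification head
    return model
```

For example, calling `lora_last_block_only(model, lambda lin: LinearWithLoRA(lin, rank=8, alpha=16))` leaves all earlier transformer blocks completely frozen, so only the last block and the output head carry trainable LoRA parameters.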
Here are my results for different experiments (I have a slightly different dataset and training loop, so the numbers differ from the book):
I think we see here that the training-time cost is not that high when LoRA is applied to the same layers, and in my experiment with zero hyperparameter tuning, last-layer training performed better too (I assume this might be caused by the LoRA updates to the initial layers causing some "forgetting").