Replies: 1 comment
Hi there, you raise a good point. I think what we define as "fair" depends on what we want to compare. I can see two possible scenarios:

a) Regular finetuning versus LoRA finetuning, where LoRA targets the same layers as the regular finetuning (all, last, or some). Then one can see whether LoRA, which updates fewer parameters per layer, helps or not.

b) Taking the configuration with the highest accuracy (all, last, or some layers) for each of the two approaches (regular versus LoRA) and asking which one is more efficient.

I think another important factor that we are not considering in this discussion thread is the memory savings. In practice, a common bottleneck is that one cannot do regular training of an 8B model on many single GPUs (depending on the available RAM), but LoRA is fine. All that being said, in the classification case (not in the supervised instruction-finetuning case), updating only the last layers like you suggest is reasonable. But again, I think with larger models this difference becomes negligible either way.
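To make the memory point concrete, here is a minimal sketch in plain PyTorch (not the exact code from the book; the `LoRALayer`/`LinearWithLoRA` classes and the toy two-layer block are just illustrative) comparing the number of trainable parameters for regular fine-tuning versus LoRA applied to the same layers:

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    # Trainable low-rank update: alpha * (x @ A @ B)
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_dim, rank) / rank**0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    # Frozen base linear plus the trainable low-rank path
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

def count_trainable(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Toy stand-in for the layers one would fine-tune (e.g. an MLP sub-block)
block = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
full = count_trainable(block)  # regular fine-tuning: every weight is trainable

# Same layers, but LoRA-only: freeze the originals, train only A and B
for p in block.parameters():
    p.requires_grad = False
block[0] = LinearWithLoRA(block[0], rank=8, alpha=16)
block[2] = LinearWithLoRA(block[2], rank=8, alpha=16)
lora_only = count_trainable(block)

print(f"Regular fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r=8) on the same layers: {lora_only:,} trainable parameters")
```

On this toy block that is roughly 4.7M trainable parameters for regular fine-tuning versus about 61K for LoRA, and that smaller set of trainable parameters is what shrinks the gradient and optimizer-state memory.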
Hey, I have just implemented LoRA for classification fine-tuning. At the end, I noticed the comment about how LoRA is slower due to the added inference cost, but that this cost becomes negligible for larger models.

My question is whether it makes sense to compare this model against the last-layer fine-tuning, because in classification fine-tuning we only fine-tuned the layers from the last transformer block onwards and did not touch the earlier layers' weights. I think using LoRA only on the last transformer block and the out_head would be a fairer comparison here.
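Concretely, what I have in mind is something like the sketch below. It assumes the model exposes its blocks as `model.trf_blocks` and the classification head as `model.out_head` (adjust the attribute names to your implementation), and that `wrap_with_lora` is whatever LoRA wrapper you already use (e.g. a `LinearWithLoRA`-style module around a frozen `nn.Linear`):

```python
import torch.nn as nn

def lora_last_block_only(model, wrap_with_lora):
    # Freeze every weight, then add LoRA adapters only to the linear layers
    # of the last transformer block and to the classification head.
    # `wrap_with_lora(linear)` should return the linear layer wrapped with a
    # trainable low-rank adapter; `trf_blocks`/`out_head` are assumed names.
    for p in model.parameters():
        p.requires_grad = False

    def wrap_linears(module):
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                setattr(module, name, wrap_with_lora(child))
            else:
                wrap_linears(child)

    wrap_linears(model.trf_blocks[-1])               # last transformer block only
    model.out_head = wrap_with_lora(model.out_head)  # classification head
    return model
```

For example, calling `lora_last_block_only(model, lambda lin: LinearWithLoRA(lin, rank=8, alpha=16))` leaves all earlier transformer blocks completely frozen, so only the last block and the output head carry trainable LoRA parameters.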
Here are my results for different experiments (I have a slightly different dataset and training loop, so the numbers differ from the book):
I think we see here that the training-time cost is not that high when LoRA is applied to the same layers, and in my experiment with zero hyperparameter tuning, last-layer training performed better too (I assume this might be caused by the LoRA updates to the initial layers causing some "forgetting").