Release QAT example with NLS #3480
Conversation
Signed-off-by: J. Pablo Muñoz <[email protected]> Co-authored-by: Yuan0320 <[email protected]>
Signed-off-by: J. Pablo Muñoz <[email protected]> Co-authored-by: Yuan0320 <[email protected]>
Thank you for the contribution and the very extensive evaluation!
It's great to see an improvement on top of the baseline with a constant LoRA rank!
At a high level, it looks good to me. Most of the logic is implemented in the sample; changes in NNCF are minimized by extending FQ with LoRA.
I have a few remarks to make it better in terms of integration into NNCF.
One thing that is important for potential customers is the total time to get the best checkpoint.
Could you please specify in the README how long the tuning and search stages took in both cases?
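For readers unfamiliar with the approach, here is a rough sketch of the idea behind extending a fake-quantizer (FQ) with LoRA adapters; the class, tensor shapes, and INT4 grid below are illustrative assumptions, not the actual NNCF implementation.

```python
import torch
import torch.nn as nn


class FakeQuantizeWithLora(nn.Module):
    """Illustrative only: fake-quantize the weight, then add a trainable low-rank correction."""

    def __init__(self, out_features: int, in_features: int, rank: int, scale: torch.Tensor):
        super().__init__()
        self.scale = scale  # per-output-channel quantization scale, shape [out_features, 1]
        self.lora_a = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        # INT4-style quantize-dequantize: round to the grid and map back to float.
        q = torch.clamp(torch.round(weight / self.scale), -8, 7)
        w_dq = q * self.scale
        # The LoRA correction is learned during tuning and compensates for quantization error.
        return w_dq + self.lora_b @ self.lora_a
```

The point is that the quantization logic stays in one module and only the small `lora_a`/`lora_b` matrices are trained, which matches the note above that the NNCF changes are limited to extending FQ with LoRA.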
examples/llm_compression/torch/qat_with_lora/NLSDownstreamTasks.md
| Qwen/Qwen2.5-7B-Instruct | BF16 | 0.6401 |
| | INT4 (QAT + LoRA) | 0.7356 |
| | INT4 (QAT + NLS) | **0.7382** |
Why is the average score for the BF16 model lower for every model? Usually the INT4 model has similar or lower accuracy.
@ljaljushkin @andreyanufr BF16 is the reference result for the uncompressed model without tuning. We have removed these results to avoid confusion, and, as discussed in our meeting, in a future PR we will try BF16 and BF16 + LoRA for a better comparison. Thanks!
Created a follow-up ticket (CVS-166802) for that.
As a result, numbers for fine-tuned BF16 models will be added, am I right?
> As a result, numbers for fine-tuned BF16 models will be added, am I right?

We discussed this with Pablo and agreed to add numbers to this PR for the fine-tuned BF16 baseline and also for BF16 + NLS without quantization + PTWC (AWQ + Scale Estimation + GPTQ).
Does the NLS example support fine-tuning a model without quantization? As far as I know, it doesn't.
For that, we need to add a new operation using LoRA adapters, and nncf.compress_weights does not suit that. Pablo will probably do it using PEFT, right?
@ljaljushkin @alexsu52 Yes, we are planning to use PEFT to get the reference BF16 numbers. Thanks!
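As a rough illustration of that plan (not part of this PR), a BF16 + LoRA reference run could be set up with PEFT along these lines; the model name, rank, and target modules are placeholder assumptions, not the settings that will actually be used.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model and hyperparameters; the real baseline configuration is up to the follow-up work.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The adapters would then be fine-tuned with the same data and schedule as the INT4 runs
# so that the BF16 + LoRA numbers are directly comparable.
```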
Signed-off-by: J. Pablo Muñoz <[email protected]> Co-authored-by: Yuan0320 <[email protected]>
Signed-off-by: J. Pablo Muñoz <[email protected]>
Signed-off-by: J. Pablo Muñoz <[email protected]>
Signed-off-by: J. Pablo Muñoz <[email protected]> Co-authored-by: Yuan0320 <[email protected]>
Signed-off-by: J. Pablo Muñoz <[email protected]> Co-authored-by: Yuan0320 <[email protected]>
Signed-off-by: J. Pablo Muñoz <[email protected]>
minor remarks
Co-authored-by: Lyalyushkin Nikolay <[email protected]>
Co-authored-by: Lyalyushkin Nikolay <[email protected]>
Co-authored-by: Lyalyushkin Nikolay <[email protected]>
Force-pushed from 45eb700 to f8dd856
Signed-off-by: J. Pablo Muñoz <[email protected]>
Force-pushed from f8dd856 to 8a5b7db
@jpablomch thanks for the contribution!
# If Neural Low-rank Adapter Search (NLS) is enabled,
# configure the LoRA adapters with a random rank configuration from the specified rank space.
if not disable_nls and grad_steps == 0:
    current_config = configure_lora_adapters(
Providing a scheduler in NNCF for NLS would simplify the example and improve UX. This comment is not blocking, but I recommend thinking about it. cc @ljaljushkin
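For context, the per-step logic being discussed might look roughly like the following; `configure_lora_adapters`, `disable_nls`, and `grad_steps` come from the example's naming, while the rank space and sampling details here are assumptions.

```python
import random

# Hypothetical rank space; in the example it is specified by the user.
RANK_SPACE = [8, 16, 24, 32]


def sample_rank_config(num_adapters: int) -> list[int]:
    """Pick a random rank for every LoRA adapter; an NNCF-provided NLS scheduler could hide this step."""
    return [random.choice(RANK_SPACE) for _ in range(num_adapters)]


# Usage inside the training loop (mirrors the snippet quoted above):
# if not disable_nls and grad_steps == 0:
#     current_config = configure_lora_adapters(model, ranks=sample_rank_config(num_adapters))
```

A scheduler object in NNCF could own the rank space and the sampling policy, so the example's loop would only need to call something like `scheduler.step()` once per gradient-accumulation window.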
@@ -25,6 +25,10 @@ The most significant accuracy improvements are usually observed within the first



## Fine-tuning with NLS for Downstream Tasks
In fact, we have two examples:
- Distillation of a quantized model on wikitext2 to improve similarity metrics between the compressed model and the original model. It handles the case when the user already has a pretrained model, am I right?
- Fine-tuning for downstream tasks with quantization.

For more precise positioning, I would suggest having two headings for these two cases and clearly explaining which one the user should use in which scenario.
examples/llm_compression/torch/qat_with_lora/NLSDownstreamTasks.md
Signed-off-by: J. Pablo Muñoz <[email protected]> Co-authored-by: Yuan0320 <[email protected]>
Force-pushed from b77ba4a to 407fbb3
Changes
Adds an example that uses NLS fine-tuning with quantization-aware LoRA on downstream tasks.
Reason for changes
To support fine-tuning for downstream scenarios; NLS often boosts the performance of LoRA fine-tuning on downstream tasks.
Related tickets
https://jira.devtools.intel.com/browse/CVS-166802
Tests
See the results in NLSDownstreamTasks.md. We have conducted an extensive evaluation on 11 language models and 4 downstream tasks.
examples job: https://github.com/openvinotoolkit/nncf/actions/runs/14934370942