Adds validation loss to LoRA fine tune single device #2238

Open
wants to merge 3 commits into main

Conversation

@MaxFrax commented Jan 8, 2025

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.
#1042

Changelog

What are the changes made in this PR?
Adds support for a validation dataset and computes the loss on it after each epoch.

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings


pytorch-bot bot commented Jan 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2238

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit df8cd1e with merge base 27fd3a1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

Hi @MaxFrax!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@MaxFrax (Author) commented Jan 8, 2025

@felipemello1 I've finally been able to work on this. I'll make my way through the testing plan, but feel free to share any comments you might already have.

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot added the CLA Signed label Jan 8, 2025
@felipemello1 self-requested a review January 9, 2025 03:02
@felipemello1 (Contributor)

Hey @MaxFrax, thank you! I am on PTO this week. I will get to it next week if someone doesn't do it before me.

@ebsmothers (Contributor) left a comment

Hey @MaxFrax, thanks for this PR! The validation loop itself looks pretty reasonable, but I think we should figure out the right way to integrate it. E.g. right now it seems like we perform validation after every training epoch inside the train method. Personally I would be in favor of splitting it out into multiple methods to make things clearer. That will be a bit more work, but I want to make sure we expose this as clearly as possible. What about something like this?

def validate(self):
    # Should be roughly the code you added

def train(self):
    # Keep this mostly as it is, but add something like:
    if self.global_step % self.run_val_every_n_steps == 0:
        self.validate()

Then we can expose run_val_every_n_steps via config. A couple of other things to think about would be a maximum number of batches in the val loop and early stopping. I don't think we need to worry about the latter for this PR, but we should make sure it's something we're able to support later on.

Also cc @joecummings @felipemello1 @calvinpelletier for anything I've missed here.

for idx, batch in enumerate(self._dataloader_val):
    utils.batch_to_device(batch, self._device)

    current_loss = self._loss_step(batch)
@ebsmothers (Contributor):

I think we will need to toggle eval <-> train mode for the model, right? (Another reason why having a separate method will probably be cleanest.)

@MaxFrax (Author):
Thanks @ebsmothers! It makes sense. I will look into it. Do you have any pointers on how to do that?

@MaxFrax (Author) commented Jan 15, 2025

Hi @ebsmothers! I have updated the PR with the following edits, as per your recommendations:

  • Created a standalone method for the validation loop
  • Added a run_val_every_n_steps parameter to invoke validate at specific points in the training epoch
  • Added max_validation_batches to cap the number of batches run in each validation step

If there's any other feedback or comment, just let me know!

@felipemello1 (Contributor)

Thanks for making the changes! I will take a look at this PR later today.

@RdoubleA (Contributor) left a comment

This looks good to me, although I'm getting worried about the proliferation of cfg.get and cfg validation logic in the recipes. There's nothing inherently wrong with cfg.get, but it encourages the use of hidden parameters not exposed to the user. I don't have a good long-term solution for this, but since we are only modifying one recipe, maybe we could at least update the lora single device configs to expose these fields so users know they exist and we can check whether the cfg field is None directly?

dataset_validation: null
run_val_every_n_steps: null
max_validation_batches: -1

I know this will affect a lot of files, so open to thoughts. We could also do this in a follow-up.

step=(curr_epoch + 1) * idx + idx,
)

if self.run_validation:
Contributor:

You don't need this check, since you only call validate() if self.run_validation is True.

)

self.run_val_every_n_steps = cfg.get("run_val_every_n_steps", None)
if self.run_validation:
Contributor:

nit: could remove this if statement and keep all the logic below under the first if self.run_validation check

@@ -335,6 +335,29 @@ def setup(self, cfg: DictConfig) -> None:
last_epoch=self.global_step - 1,
)

# Setup the validation dataset
self.run_validation = "dataset_validation" in cfg
Contributor:

would prefer the name validation_dataset

In fact, what do you think about organizing the validation arguments in the configs like so:

validation:
  dataset:
    ...
  run_every_n_steps: null
  max_batches: -1

That way you can just set validation: null and query that for self.run_validation.
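For illustration, a minimal sketch of how setup() might consume such a nested block, assuming the hypothetical attribute names discussed in this PR (run_validation, run_val_every_n_steps, max_validation_batches); the dataloader construction itself is omitted:

# Hypothetical parsing of a nested `validation` config block in setup().
validation_cfg = cfg.get("validation", None)
self.run_validation = validation_cfg is not None
if self.run_validation:
    # null could mean "validate once per epoch"; -1 means "no cap on batches".
    self.run_val_every_n_steps = validation_cfg.get("run_every_n_steps", None)
    self.max_validation_batches = validation_cfg.get("max_batches", -1)
    # The validation dataloader would be built from validation_cfg.dataset,
    # mirroring however the training dataloader is set up in this recipe.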

@felipemello1 (Contributor)

> maybe we could at least update the lora single device configs to expose these fields so users know they exist and we can check whether the cfg field is None directly?

Let's do this as a follow-up. I can use my script to bulk update. But let's make sure that we all agree on how it should look in the config.

@codecov-commenter commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 0% with 28 lines in your changes missing coverage. Please review.

Project coverage is 23.93%. Comparing base (213f386) to head (df8cd1e).
Report is 18 commits behind head on main.

Files with missing lines: recipes/lora_finetune_single_device.py (Patch: 0.00%, Lines: 28 Missing ⚠️)

❗ There is a different number of reports uploaded between BASE (213f386) and HEAD (df8cd1e): HEAD has 6 fewer uploads than BASE (9 vs. 3).
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2238       +/-   ##
===========================================
- Coverage   65.41%   23.93%   -41.49%     
===========================================
  Files         344      357       +13     
  Lines       20658    21153      +495     
===========================================
- Hits        13514     5062     -8452     
- Misses       7144    16091     +8947     

☔ View full report in Codecov by Sentry.

@felipemello1 (Contributor) left a comment

Thanks for the PR! It looks simple and the functions make sense!

I added a few comments/ideas. Please push back on anything you disagree with.

IMO, to approve this, we would need two things:

  1. Testing, like I suggested in one of the comments. Let me know if you are comfortable running it, otherwise we can help you out.
  2. An example of how the config should look. The UI should be a big factor in this PR.

@@ -652,6 +675,43 @@ def _loss_step(self, batch: Dict[str, torch.Tensor]) -> torch.Tensor:

return loss

def validate(self, curr_epoch) -> None:
@felipemello1 (Contributor):

  1. Do we have model.eval() somewhere?

Usually we want to set the model to .eval() mode, because some layers, like dropout, have different behavior in eval mode.

By doing that, we then require less memory, because we only need the forward pass, which allows us to have a higher batch_size --> faster validation step.

I am not sure about the implications it may have for compile/FSDP. For example, compile will have to create a new graph that doesn't require grad, so compile time will increase. If the number of graph breaks increases, we may have to manually change the threshold for the maximum number of graph breaks. (There is an example of that in one of our RL recipes.)

  2. Not all recipes have self._loss_step. We would have to standardize and make sure that they all do, but this requires a different PR.

IMO, if you have access to more than one GPU, I would encourage you to implement it in lora_distributed with the QLoRA config, add .eval(), and run it:

  • with eval + compile + opt_in_bwd + activation ckpt + activation offloading
  • without eval + compile + opt_in_bwd + activation ckpt + activation offloading

If nothing breaks, I would feel more confident in approving it.

PS: we would also have to add model.train() in the training loop.
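For illustration only, a minimal sketch of how the eval/train toggling, the no-grad forward, and a batch cap could fit together; the attribute names (_model, _dataloader_val, _metric_logger, max_validation_batches) are modeled on this PR and the sketch is not a definitive implementation:

# Sketch only: assumes `torch` and torchtune's `utils` are already imported in the
# recipe, and that max_validation_batches < 0 means "no cap".
def validate(self) -> None:
    self._model.eval()  # disable dropout etc. for evaluation
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():  # forward-only, so activations are not kept for backward
        for idx, batch in enumerate(self._dataloader_val):
            if self.max_validation_batches >= 0 and idx >= self.max_validation_batches:
                break
            utils.batch_to_device(batch, self._device)
            total_loss += self._loss_step(batch).item()
            n_batches += 1
    self._metric_logger.log_dict(
        {"val_loss": total_loss / max(n_batches, 1)}, step=self.global_step
    )
    self._model.train()  # restore training mode before the train loop resumes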

@MaxFrax (Author):

@felipemello1 Thanks for the detailed breakdown and suggestions. Should we also unload the model being trained before loading the eval one? Having just one in memory would allow for bigger batch sizes.

That said, I’m currently constrained on time and not very familiar with the implementation details for this. If I were to take this on, it would likely take me a significant amount of time to get it done properly.

Would you be able to take the lead on this?

@felipemello1 (Contributor) commented Jan 17, 2025

Hey @MaxFrax, completely understandable. Thanks for sharing it.

I don't think that I will have bandwidth soon, but if I do, this PR is a good start.

@Ankur-singh, cc'ing you in case you are looking for more issues to contribute to! :D

Thank you guys!


# This bit allows seeing the loss for each batch. Not sure about step indexing.
log_dict = {
    "val_loss": current_loss.item(),
Contributor:

I wonder if we should be logging memory/TPS too. If memory is very low, this would show the user that they can increase bsz. What do you think?
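As an illustration, peak GPU memory could be surfaced in the same log dict shown in the snippet above using plain torch.cuda counters; the key name here is made up:

# Hypothetical addition next to "val_loss": peak GPU memory so far, in GiB.
if self._device.type == "cuda":
    log_dict["val_peak_memory_gib"] = (
        torch.cuda.max_memory_allocated(self._device) / 1024**3
    )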

@MaxFrax (Author):

It makes sense. I will definitely look into it.
By the way, when is the previous training batch deallocated from memory? Do I have to deallocate it manually? It would be handy to do so before starting the validation step to have more memory available.

@@ -779,6 +839,12 @@ def train(self) -> None:
)
)

if (
self.run_validation
and self.global_step % self.run_val_every_n_steps == 0
Contributor:

I think it makes a lot of sense to eval every N steps, but currently a lot of our training logic is based on epochs. I wonder if we should honor this and keep it based on epochs. Maybe users could pass a float, e.g. every 0.5 epochs.

Not 100% sure about this, just brainstorming.
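For illustration, a fractional-epoch frequency could be translated into a step interval once at setup time; the names below (run_val_every_n_epochs, _gradient_accumulation_steps) are hypothetical sketch names, not the recipe's actual fields:

# Hypothetical: convert run_val_every_n_epochs (e.g. 0.5) into a step interval.
steps_per_epoch = len(self._dataloader) // self._gradient_accumulation_steps
self.run_val_every_n_steps = max(1, int(run_val_every_n_epochs * steps_per_epoch))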

@MaxFrax (Author):

I'm happy to change it as you suggest; I like it better too. I just want to point out that the scheduler has the parameter num_warmup_steps, which contradicts your statement:

> our training logic is based on epochs

As a user, I'd love num_warmup_steps to be based on epochs as well.

@MaxFrax (Author) commented Jan 16, 2025

Thanks @felipemello1! Some help on the testing side would be much appreciated.
When you say:

> An example of how the config should look. The UI should be a big factor in this PR.

what exactly do you mean?

Should I provide a recipe using the validation dataset? Are we talking about the docs?
Let me know more precisely what I should do, and I'll be happy to look into it.

@felipemello1 (Contributor) commented Jan 16, 2025

@MaxFrax

> Should I provide a recipe using the validation dataset? Are we talking about the docs?
> Let me know more precisely what I should do, and I'll be happy to look into it.

Your PR only contains changes to the recipe. I would encourage you to:

  • Make changes to one of the configs too, to illustrate how users would use it
  • Put the command to launch this config in the description of the PR, under the testing section (a sketch of such a command follows this list)
  • Share an image of the logs generated in Weights & Biases under the testing section
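For illustration only, the launch command could look something like the following, assuming the updated config already defines the validation fields; the config name and field names are hypothetical and would need to match whatever the final config exposes:

tune run lora_finetune_single_device \
  --config llama3_2/1B_lora_single_device \
  run_val_every_n_steps=100 \
  max_validation_batches=10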

Labels: CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)

6 participants