
Conversation

@olipinski
Contributor

@olipinski olipinski commented Mar 5, 2025

What does this PR do?

Allows setting an FTS log_dir separate from the trainer's log_dir.

Fixes #16
Fixes Lightning-AI/pytorch-lightning#20615

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

olipinski and others added 11 commits March 5, 2025 13:18
Signed-off-by: Olaf Lipinski <[email protected]>
Signed-off-by: Olaf Lipinski <[email protected]>
…v helper scripts, separate test_utils.sh from infra_utils.sh
Signed-off-by: Olaf Lipinski <[email protected]>
Signed-off-by: Olaf Lipinski <[email protected]>
…erent trainer fall back modes

- updated schedule generation documentation to include new FTS `log_dir` param ref
- added mlflow to default test extra directive
- added tests for new FTS `log_dir` param
- disabled a flaky (~10% failure rate) 2D model parallel test with limited marginal utility
…e-based trainer refs

- updated autodoc_typehints config to be less verbose
- added yaml (multi)representer for PretrainedConfig object types
- bumped docker image to use test channel (2.7.0-rc2)
- remove duplicate multi-gpu test directive
@speediedan speediedan marked this pull request as ready for review March 28, 2025 16:37
@speediedan speediedan self-requested a review as a code owner March 28, 2025 16:37
@speediedan
Owner

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@speediedan
Owner

Thanks for your valuable contribution @olipinski! Apologies for the delayed response; I've had limited bandwidth to dedicate to Finetuning Scheduler in the last few weeks.

I've updated your PR as follows (along with a convenience rebase of some previously in-flight changes to accommodate PyTorch 2.7.0-rc2):

  1. fts.py:

    • Introduced a log_dir parameter to allow specifying a custom directory for artifacts, defaulting to trainer.log_dir or trainer.default_root_dir (see the sketch after this list).
    • Added a log_dir property to determine the directory dynamically.
    • Updated various methods to use the new log_dir property instead of directly accessing trainer.log_dir.
  2. fts_supporters.py:

    • Adjusted methods to utilize the log_dir property for saving schedules and validating configurations.
  3. test_finetuning_scheduler_callback.py:

    • Added a test to validate the behavior of FinetuningScheduler with and without a specified log_dir when using a logger that lacks a save_dir. I used MLFlowLogger as one instance of the more general pattern (see the usage sketch below).
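
For illustration, here's a minimal sketch of the resolution order item 1 describes; the class and attribute names are illustrative only, not the actual FTS internals:

```python
# Illustrative sketch only; names and structure differ from the real
# FTS implementation.
from types import SimpleNamespace
from typing import Optional


class FTSLogDirSketch:
    def __init__(self, log_dir: Optional[str] = None) -> None:
        # Optional user override for where FTS artifacts (e.g. generated
        # schedules) are persisted.
        self._user_log_dir = log_dir
        # Stand-in for the attached Lightning Trainer; trainer.log_dir can
        # be None when the configured logger has no ``save_dir``.
        self.trainer = SimpleNamespace(
            log_dir=None, default_root_dir="/tmp/lightning_logs"
        )

    @property
    def log_dir(self) -> str:
        # Resolution order: explicit callback arg, then trainer.log_dir,
        # then trainer.default_root_dir.
        if self._user_log_dir is not None:
            return self._user_log_dir
        return self.trainer.log_dir or self.trainer.default_root_dir


print(FTSLogDirSketch().log_dir)                           # /tmp/lightning_logs
print(FTSLogDirSketch(log_dir="./fts_artifacts").log_dir)  # ./fts_artifacts
```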

Arguably, Lightning should handle this scenario itself by updating its trainer.log_dir resolution logic to fall back to trainer.default_root_dir for loggers that do not set a save_dir, but your PR was an excellent opportunity to refactor and enhance FTS log_dir handling.
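
To make the scenario the new test exercises concrete, here is a hedged usage sketch; the tracking URI and directory paths are illustrative, and the imports assume the unified ``lightning`` package:

```python
# An MLFlowLogger pointed at a remote tracking server exposes no local
# ``save_dir``, so FTS artifacts need an explicit home.
from lightning.pytorch import Trainer
from lightning.pytorch.loggers import MLFlowLogger

from finetuning_scheduler import FinetuningScheduler

# With a remote tracking URI, MLFlowLogger.save_dir resolves to None.
logger = MLFlowLogger(tracking_uri="http://localhost:5000")
# The new ``log_dir`` parameter gives FTS artifacts a definite location.
fts = FinetuningScheduler(log_dir="./fts_artifacts")
trainer = Trainer(logger=logger, callbacks=[fts])
```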

Thanks again for your contribution. Feel free to reach out anytime if you have other issues or want to share more about your use case. Best of luck with your work and let me know if you have any questions or further suggestions!

@speediedan speediedan merged commit aa750cf into speediedan:main Mar 28, 2025
12 checks passed
speediedan added a commit that referenced this pull request Mar 29, 2025
…for PretrainedConfig object types, update docker image to use cuda 12.6.3
@olipinski
Contributor Author

Thanks for merging, @speediedan! No worries about the delay. I'm glad the PR was useful and led to some additional refactoring and enhancements. Thanks for your continued work on FTS!

Development

Successfully merging this pull request may close these issues.

  • Allow setting log_dir manually, trainer.log_dir is not found with MLFlowLogger
  • FinetuningScheduler does not work with MLFlowLogger
