feat: callback support #805 #875
base: main
Conversation
Add PyTorch Lightning-style callbacks with hooks for train start/end, batch end, validation end, checkpoint save, and exceptions. Includes @rank_zero_only decorator for distributed training. Callbacks are passed programmatically to recipe constructors and receive full training context (recipe object, MetricsSample, checkpoint_info, etc.). Signed-off-by: Yuhe Zhang <yuhe@polarr.co>
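For readers unfamiliar with the pattern, a `rank_zero_only`-style decorator for distributed training typically follows the shape below. This is a hedged sketch of the common idiom, not the code added by this PR; the `torch.distributed` checks are assumptions.

```python
# Hedged sketch of a rank-zero-only decorator for distributed training.
# This illustrates the typical pattern, not the PR's actual implementation.
import functools

import torch.distributed as dist


def rank_zero_only(fn):
    """Execute the wrapped hook only on the global rank-0 process."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # If torch.distributed is not initialized, treat the process as rank 0.
        is_dist = dist.is_available() and dist.is_initialized()
        rank = dist.get_rank() if is_dist else 0
        if rank == 0:
            return fn(*args, **kwargs)
        return None
    return wrapper
```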
nemo_automodel/_cli/app.py
```python
# Special case for VLM finetune which uses finetune.py instead of train_ft.py
if args.domain == "vlm" and args.command == "finetune":
    command = "finetune"
```
btw, I also fixed a bug here; previously the aliases would make the vlm finetune map to a non-existent file.
Hello! Quick check-in on this PR. I noticed I missed adding callback integration to the train_biencoder recipe. If this aligns with your plans, I'm happy to iterate or expand it. If not, no worries. I know this may not be a high priority. Just want to understand the direction so I can adjust accordingly. Thanks!
Thanks a lot @yuhezhang-ai, I will review this again today. I want to make sure that any callback only consumes information and that the callback code does not provide any way to modify the trainer's state.
Thanks for raising this. I agree callbacks should be observers only (consume info, not modify trainer state). Regarding what we pass into callbacks, I see two reasonable options:

Option A (current): Keep passing the recipe object, i.e., the full training context.

Option B: Pass a small immutable context object (e.g., is_main_rank, world_size, checkpoint_dir, step/epoch, etc.). This reduces surface area and makes the observer-only intent clearer (a rough sketch is below).

Happy to go with whichever direction you prefer. btw, I've merged main and added the callback hooks for the missing one (train_biencoder).
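For concreteness, Option B could look roughly like this; the class and field names are hypothetical and only mirror the fields listed above.

```python
# Rough illustration of Option B: an immutable context object handed to callbacks
# instead of the live recipe. The class and field names are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class CallbackContext:
    is_main_rank: bool
    world_size: int
    checkpoint_dir: str
    step: int
    epoch: int

# Because the dataclass is frozen, attribute assignment raises FrozenInstanceError,
# so a callback cannot mutate trainer state through this object.
```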
jgerh left a comment
Completed tech pubs review of the .md file. Only two minor edits for our style.
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Implements a PyTorch Lightning-style callback system for Automodel recipes to enable custom integrations (e.g., Customizer metrics reporting, custom monitoring).
Key Features
- Lifecycle hooks: `on_train_start`, `on_train_batch_end`, `on_validation_end`, `on_save_checkpoint`, `on_exception`, `on_train_end` (see the interface sketch after this list)
- `@rank_zero_only` decorator for distributed training convenience
- Hooks receive full training context: the recipe object, `MetricsSample` objects, checkpoint info, etc.
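A minimal sketch of such an interface, assuming hook signatures based only on the description above (the actual `nemo_automodel` signatures may differ):

```python
# Minimal sketch of a Lightning-style callback interface with the hooks listed
# above. Argument names are assumptions based on the PR description.
class Callback:
    def on_train_start(self, recipe): ...
    def on_train_batch_end(self, recipe, metrics_sample, batch_idx): ...
    def on_validation_end(self, recipe, metrics_sample): ...
    def on_save_checkpoint(self, recipe, checkpoint_info): ...
    def on_exception(self, recipe, exception): ...
    def on_train_end(self, recipe): ...
```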
Changes
- `Callback` base class and `CallbackRunner` in `nemo_automodel/components/callbacks/`
- Example script: `examples/llm_finetune/finetune_with_callback.py`
- Documentation: `docs/guides/callbacks.md`

Usage
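The original usage snippet is not reproduced here; the following is a hedged sketch. The import path follows the Changes list above, and the recipe class name and `callbacks=` argument are assumptions for illustration.

```python
# Hedged usage sketch. The import path follows the directory listed under
# Changes; the recipe class and its `callbacks=` argument are assumptions.
from nemo_automodel.components.callbacks import Callback


class PrefixLoggingCallback(Callback):
    """Logs a custom prefix at a few lifecycle points."""

    def on_train_start(self, recipe):
        print("[my-run] training started")

    def on_train_batch_end(self, recipe, metrics_sample, batch_idx):
        print(f"[my-run] batch {batch_idx} done: {metrics_sample}")

    def on_save_checkpoint(self, recipe, checkpoint_info):
        print(f"[my-run] checkpoint saved: {checkpoint_info}")


# Hypothetical wiring: callbacks are passed programmatically to a recipe constructor.
# recipe = SomeFinetuneRecipe(cfg, callbacks=[PrefixLoggingCallback()])
# recipe.run()
```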
Validation
Unit tests: All 9 tests pass ✅
```
python -m unittest tests.unit_tests.recipes.test_callbacks -v
Ran 9 tests in 0.006s

OK
```

Example run:
Verified callbacks execute at the correct lifecycle points with custom log prefixes.
Closes #805
Note: Happy to implement this callback feature! I designed it to be familiar (PyTorch Lightning-style) and practical for real integrations. Feel free to provide feedback on the design/API if you'd like any adjustments. (AI assisted with documentation generation, but I personally reviewed and refined everything to ensure quality and accuracy.)