Description
Is your feature request related to a problem? Please describe.
We are integrating Customizer with Automodel for fine-tuning and need support for reporting training metrics.
Previously, when we used NeMo for training, this was straightforward because NeMo is built on PyTorch Lightning, which has native callback support. We simply added a NeMoCustomizerCallback to report training progress to our API.
But with Automodel, my understanding is that it doesn't use PyTorch Lightning, so I can't simply hook in our callback. The simplest approach I can find is to subclass [TrainFinetuneRecipeForNextTokenPrediction](https://github.com/NVIDIA-NeMo/Automodel/blob/main/nemo_automodel/recipes/llm/train_ft.py#L853) and override the setup(), log_train_metrics(), and log_val_metrics() methods to call our callback, but this doesn't seem like an ideal solution.
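For context, here is roughly what our current Lightning-based integration looks like. The callback body and the report_to_customizer helper are illustrative, not the actual NeMoCustomizerCallback implementation:

```python
import pytorch_lightning as pl


def report_to_customizer(step: int, metrics: dict) -> None:
    """Hypothetical helper that posts training progress to the Customizer API."""
    ...


class NeMoCustomizerCallback(pl.Callback):
    """Sketch of our existing callback; Lightning invokes these hooks for us."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Report step-level training metrics after each optimizer step.
        report_to_customizer(step=trainer.global_step, metrics=dict(trainer.callback_metrics))

    def on_validation_end(self, trainer, pl_module):
        # Report aggregated validation metrics once validation completes.
        report_to_customizer(step=trainer.global_step, metrics=dict(trainer.callback_metrics))


# Registering the callback is a one-liner:
# trainer = pl.Trainer(callbacks=[NeMoCustomizerCallback()])
```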
Describe the solution you'd like
Could you add a callback mechanism similar to PyTorch Lightning's callbacks? Ideally with hooks for:
- on_train_start (after setup)
- on_train_batch_end (after each optimizer step)
- on_validation_end (after validation)
- on_save_checkpoint (when checkpoint is saved)
- on_exception (on training failure)
This would help us maintain a cleaner integration with Customizer; a rough sketch of the interface we have in mind follows.
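A minimal sketch of the hook interface, purely as a starting point for discussion; all names and signatures here are illustrative and not part of any existing Automodel API:

```python
from typing import Any, Protocol


class TrainerCallback(Protocol):
    """Hypothetical callback interface; signatures are illustrative only."""

    def on_train_start(self, recipe: Any) -> None: ...  # after setup()

    def on_train_batch_end(self, recipe: Any, metrics: dict, step: int) -> None: ...

    def on_validation_end(self, recipe: Any, metrics: dict, step: int) -> None: ...

    def on_save_checkpoint(self, recipe: Any, checkpoint_path: str) -> None: ...

    def on_exception(self, recipe: Any, exc: BaseException) -> None: ...
```

The recipe would accept a list of such callbacks and dispatch to them at each point, e.g. `for cb in self.callbacks: cb.on_train_batch_end(self, metrics, step)`.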
Describe alternatives you've considered
Subclass [TrainFinetuneRecipeForNextTokenPrediction](https://github.com/NVIDIA-NeMo/Automodel/blob/main/nemo_automodel/recipes/llm/train_ft.py#L853) and override the setup(), log_train_metrics(), and log_val_metrics() methods to call our callback, as sketched below.
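A rough sketch of that workaround; the actual method signatures in the recipe may differ, hence the *args/**kwargs pass-through:

```python
from nemo_automodel.recipes.llm.train_ft import TrainFinetuneRecipeForNextTokenPrediction


class CustomizerRecipe(TrainFinetuneRecipeForNextTokenPrediction):
    """Wraps the stock recipe and forwards progress to a Customizer callback."""

    def __init__(self, *args, customizer_callback=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._cb = customizer_callback

    def setup(self, *args, **kwargs):
        result = super().setup(*args, **kwargs)
        if self._cb is not None:
            self._cb.on_train_start(self)
        return result

    def log_train_metrics(self, *args, **kwargs):
        result = super().log_train_metrics(*args, **kwargs)
        if self._cb is not None:
            self._cb.on_train_batch_end(self, *args, **kwargs)
        return result

    def log_val_metrics(self, *args, **kwargs):
        result = super().log_val_metrics(*args, **kwargs)
        if self._cb is not None:
            self._cb.on_validation_end(self, *args, **kwargs)
        return result
```

This works, but it couples us to internal method names that could change between releases, which is why a supported callback mechanism would be preferable.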