Description
Hi, I am currently training a multimodal model using `LLaVAModel` from
`megatron/core/models/multimodal/llava_model.py`.
Additionally, sequence packing is enabled (`packed_seq_params` with `qkv_format="thd"`). The LLM backbone is a dense model using Grouped Query Attention (GQA). The vision encoder follows the SigLIP-2 backbone, and the vision-language projector is implemented as an MLP. The multimodal model is trained using only data parallelism (DP), without any 4D parallelism (TP, CP, EP, or PP).
While experimenting with different parameter-freezing strategies, I observed a significant difference in iteration time between the following two configurations.
1. Freeze-LM (w/ `--freeze-LM`)
- LLM: frozen
- Vision encoder (ViT): trainable
- Projector: trainable
2. All-unfreeze
- LLM: trainable
- Vision encoder (ViT): trainable
- Projector: trainable
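For context on how freezing interacts with the backward pass: freezing a module via `requires_grad=False` skips its weight-gradient computation but not the activation gradients flowing *through* it, since the trainable projector and vision encoder sit upstream of the frozen LLM. A minimal PyTorch sketch (toy `Linear` layers standing in for the actual Megatron modules; names are illustrative) shows this:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the LLaVA stack: a trainable "projector" feeding a frozen "LLM".
projector = nn.Linear(8, 8)
llm = nn.Linear(8, 1)

# Emulate --freeze-LM: disable weight gradients on the LLM.
for p in llm.parameters():
    p.requires_grad = False

x = torch.randn(4, 8)
loss = llm(projector(x)).sum()
loss.backward()

# Weight grads are skipped for the frozen LLM...
assert all(p.grad is None for p in llm.parameters())
# ...but activation gradients still propagate through it to the
# trainable projector, so the LLM's backward compute is not free.
assert all(p.grad is not None for p in projector.parameters())
```

So even in the freeze-LM setting, a full backward pass through the LLM is still required to reach the projector and vision encoder.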
Surprisingly, the all-unfreeze configuration is significantly faster than the freeze-LM configuration in terms of iteration time.
Iteration Time (TensorBoard)
In our experiments:
| Setting | Iteration Time |
|---|---|
| Freeze-LM (job id: 18854) | ~10s |
| All-unfreeze (job id: 18874) | ~4.3s |
This was somewhat unexpected, as I would have assumed the freeze-LM configuration would have similar or faster iteration times than the all-unfreeze configuration.
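In case it helps narrow down where the extra time goes, here is a sketch of how one might profile a single forward/backward iteration with `torch.profiler` to compare the two configurations (toy CPU model for illustration; an actual run would wrap one Megatron training step and add `ProfilerActivity.CUDA` on GPU):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy model standing in for one training step.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
x = torch.randn(32, 64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    loss = model(x).sum()
    loss.backward()

# Per-op breakdown, sorted by total time, to see which kernels dominate.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

Running this once per configuration and diffing the op tables would show whether the slowdown comes from extra backward kernels, gradient reductions, or something else.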
Question
Is there a particular reason why the all-unfreeze configuration yields significantly faster iteration times than the freeze-LM configuration?
Could you help clarify why this behavior occurs in the current implementation?
Thanks!