Description
Hi, I am currently training a multimodal model using `LLaVAModel` from
`megatron/core/models/multimodal/llava_model.py`.
Additionally, sequence packing is enabled (`packed_seq_params` with `qkv_format="thd"`). The LLM backbone is a dense model using Grouped Query Attention (GQA). The vision encoder follows the SigLIP-2 backbone, and the vision-language projector is implemented as an MLP. The multimodal model is trained using only data parallelism (DP), without any 4D parallelism (TP, CP, EP, or PP).
While experimenting with different parameter-freezing strategies, I observed a significant difference in iteration time between the following two configurations.
1. Freeze-LM (w/ `--freeze-LM`)
- LLM: frozen
- Vision encoder (ViT): trainable
- Projector: trainable
2. All-unfreeze
- LLM: trainable
- Vision encoder (ViT): trainable
- Projector: trainable
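For context on how freezing interacts with the backward pass: freezing a module via `requires_grad=False` skips its weight-gradient computation but not the activation gradients flowing *through* it, since the trainable projector and vision encoder sit upstream of the frozen LLM. A minimal PyTorch sketch (toy `Linear` layers standing in for the actual Megatron modules; names are illustrative) shows this:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the LLaVA stack: a trainable "projector" feeding a frozen "LLM".
projector = nn.Linear(8, 8)
llm = nn.Linear(8, 1)

# Emulate --freeze-LM: disable weight gradients on the LLM.
for p in llm.parameters():
    p.requires_grad = False

x = torch.randn(4, 8)
loss = llm(projector(x)).sum()
loss.backward()

# Weight grads are skipped for the frozen LLM...
assert all(p.grad is None for p in llm.parameters())
# ...but activation gradients still propagate through it to the
# trainable projector, so the LLM's backward compute is not free.
assert all(p.grad is not None for p in projector.parameters())
```

So even in the freeze-LM setting, a full backward pass through the LLM is still required to reach the projector and vision encoder.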
Surprisingly, the all-unfreeze configuration is significantly faster than the freeze-LM configuration in terms of iteration time.
Iteration Time (TensorBoard)
In our experiments:
| Setting | Iteration Time |
|---|---|
| Freeze-LM (job id: 18854) | ~10s |
| All-unfreeze (job id: 18874) | ~4.3s |
This was somewhat unexpected, as I would have assumed the freeze-LM configuration would have similar or faster iteration times than the all-unfreeze configuration.
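In case it helps narrow down where the extra time goes, here is a sketch of how one might profile a single forward/backward iteration with `torch.profiler` to compare the two configurations (toy CPU model for illustration; an actual run would wrap one Megatron training step and add `ProfilerActivity.CUDA` on GPU):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy model standing in for one training step.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
x = torch.randn(32, 64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    loss = model(x).sum()
    loss.backward()

# Per-op breakdown, sorted by total time, to see which kernels dominate.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

Running this once per configuration and diffing the op tables would show whether the slowdown comes from extra backward kernels, gradient reductions, or something else.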
Question
Is there a particular reason why the all-unfreeze configuration yields significantly faster iteration times than the freeze-LM configuration?
Could you help clarify why this behavior occurs in the current implementation?
Thanks!