
[QUESTION] Unexpected iteration-time difference between freeze-LM and full-unfreeze training in LLaVAModel #3767

@kminsoo

Description

Hi, I am currently training a multimodal model using LLaVAModel from
megatron/core/models/multimodal/llava_model.py.

Additionally, sequence packing is enabled (packed_seq_params with qkv_format="thd"). The LLM backbone is a dense model using Grouped Query Attention (GQA). The vision encoder follows the SigLIP-2 backbone, and the vision-language projector is implemented as an MLP. The multimodal model is trained using only data parallelism (DP), without any 4D parallelism (TP, CP, EP, or PP).
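For context, thd-format packing drives attention via cumulative sequence offsets (the cu_seqlens fields of Megatron's PackedSeqParams) rather than a padded batch. A minimal sketch of how those offsets are derived from per-sample lengths (the helper name build_cu_seqlens is illustrative, not Megatron API):

```python
from itertools import accumulate

def build_cu_seqlens(seq_lens):
    """Cumulative offsets delimiting each packed sequence in the token
    stream, as consumed by thd-format attention kernels (e.g. the
    cu_seqlens_q / cu_seqlens_kv fields of PackedSeqParams)."""
    return [0] + list(accumulate(seq_lens))

# Three samples of lengths 3, 5, and 2 packed into one stream of 10 tokens:
print(build_cu_seqlens([3, 5, 2]))  # -> [0, 3, 8, 10]
```

Each sequence i then occupies tokens cu_seqlens[i]:cu_seqlens[i+1] of the packed stream, so no padding tokens are computed.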

While experimenting with different parameter-freezing strategies, I observed a significant difference in iteration time between the following two configurations.

1. Freeze-LM (w/ --freeze-LM)

  • LLM: frozen
  • Vision encoder (ViT): trainable
  • Projector: trainable

2. All-unfreeze

  • LLM: trainable
  • Vision encoder (ViT): trainable
  • Projector: trainable
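For reference, the freeze-LM configuration boils down to turning off requires_grad on the backbone parameters; a minimal sketch, assuming a `language_model` submodule attribute like LLaVAModel's (the attribute name is an assumption, not a verified API):

```python
import torch.nn as nn

def freeze_language_model(model: nn.Module) -> None:
    # Rough equivalent of --freeze-LM: disable gradient updates for every
    # parameter of the LLM backbone, leaving the ViT and projector trainable.
    # (`language_model` is assumed to be the backbone's attribute name.)
    for param in model.language_model.parameters():
        param.requires_grad_(False)
```

Note that freezing does not skip the LM's backward pass: activation gradients must still flow through the frozen transformer layers to reach the projector and ViT beneath them; only the weight-gradient computation and optimizer updates for those layers are skipped.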

Surprisingly, the all-unfreeze configuration is significantly faster than the freeze-LM configuration in terms of iteration time.

Iteration Time (TensorBoard)

In our experiments:

| Setting | Job ID | Iteration time |
| --- | --- | --- |
| Freeze-LM | 18854 | ~10 s |
| All-unfreeze | 18874 | ~4.3 s |
This was unexpected: since a frozen LM skips weight-gradient computation and optimizer updates, I assumed the freeze-LM configuration would be at least as fast as the all-unfreeze one.

Question

Is there a particular reason why the all-unfreeze configuration results in significantly faster iteration times compared to the freeze-LM configuration?

Could you help clarify why this behavior occurs in the current implementation?

Thanks!
