[BUG] selective activation recompute only decrease little of GPU memory usage during training #1225

@bugm

Describe the bug
According to the paper https://arxiv.org/abs/2205.05198, the activation memory per layer of a transformer-based model without recomputation can be calculated as
Image
and with selective activation recomputation it decreases to
Image

With my training settings,
Image
with tp = 1 and pp = 1, I expected that with --recompute-activations the GPU memory used to store activations would be only about 34 / (34 + 80) ≈ 30% of that without activation recomputation.

Here is the GPU memory usage in both cases.
With --recompute-activations:
Image

Without --recompute-activations:
Image

The max memory allocated during training only decreased from 25.52 GB to 24.94 GB.

Expected behavior
The max memory allocated during training should decrease much more, in line with the roughly 30% activation-memory ratio estimated above.

Environment (please complete the following information):

  • Megatron-LM commit ID
  • PyTorch 2.4.1
  • CUDA version 12.5
  • NCCL version 2.20.5

Additional context
According to the formula above, with b = 12, s = 1024, h = 1024, L = 20, a = 16, t = 1, the original activation memory should be around 32 GB. Adding the memory for model states, which is about 7.3 GB for a 0.43B-parameter model, the total should be around 40 GB, even without taking temporary buffers and unusable fragmented memory into account. That is much larger than the max memory allocated without activation recomputation, so I wonder whether Megatron-LM already applies some optimization here.
Also, why does the max memory allocated change so little with and without --recompute-activations (which uses selective activation recomputation by default, according to the docs)?
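
For reference, the per-layer estimate can be reproduced with a short calculation. This is a minimal sketch that assumes the paper's formulas for the case without tensor or pipeline parallelism: s·b·h·(34 + 5·a·s/h) bytes per layer without recomputation, and 34·s·b·h bytes per layer with selective recomputation. The function names are hypothetical and not part of Megatron-LM. It counts only transformer-layer activations, so embeddings, the output layer, model states, and temporary buffers are ignored.

```python
def activation_bytes_per_layer(s, b, h, a, selective_recompute=False):
    """Estimated activation bytes for one transformer layer (t = 1).

    s: sequence length, b: micro-batch size, h: hidden size,
    a: number of attention heads.
    Assumed formula: s*b*h*(34 + 5*a*s/h) without recomputation,
    34*s*b*h with selective recomputation.
    """
    base = 34 * s * b * h
    if selective_recompute:
        return base
    # s*b*h * (5*a*s/h) simplifies to 5*a*s*s*b (attention score terms).
    return base + 5 * a * s * s * b


def total_activation_gib(s, b, h, a, num_layers, selective_recompute=False):
    """Sum the per-layer estimate over all layers and convert to GiB."""
    per_layer = activation_bytes_per_layer(s, b, h, a, selective_recompute)
    return per_layer * num_layers / 2**30


# Settings from this issue: b = 12, s = 1024, h = 1024, L = 20, a = 16.
cfg = dict(s=1024, b=12, h=1024, a=16)
full = activation_bytes_per_layer(**cfg)
selective = activation_bytes_per_layer(**cfg, selective_recompute=True)
ratio = selective / full  # equals 34 / (34 + 80) for these settings

print(f"full:      {total_activation_gib(**cfg, num_layers=20):.2f} GiB")
print(f"selective: {total_activation_gib(**cfg, num_layers=20, selective_recompute=True):.2f} GiB")
print(f"ratio:     {ratio:.3f}")
```

The ratio depends only on a·s/h, so it stays the same for any batch size or layer count.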
