
Commit 5798ce0

[megatron] add recompute_granularity none (modelscope#7842)

1 parent 9698fe2

3 files changed, +5 −3 lines

docs/source/Megatron-SWIFT/Command-line-parameters.md (1 addition, 1 deletion; translated from Chinese)

@@ -6,7 +6,7 @@
 - 🔥micro_batch_size: Batch size per device. Default: 1.
 - 🔥global_batch_size: Total batch size, equal to `micro_batch_size * data parallel size * gradient accumulation steps`. Default: 16.
   - Here, `data parallel size (DP) = total number of GPUs / (TP × PP × CP)`.
-- 🔥recompute_granularity: Granularity of activation recomputation; options are 'full' and 'selective'. 'full' recomputes the entire transformer layer, while 'selective' recomputes only the core attention part of the transformer layer. 'selective' is generally recommended. Default: 'selective'.
+- 🔥recompute_granularity: Granularity of activation recomputation; options are 'full', 'selective', and 'none' ('none' requires ms-swift>=3.12.3). 'full' recomputes the entire transformer layer, while 'selective' recomputes only the core attention part of the transformer layer. 'selective' is generally recommended. Default: 'selective'.
 - When set to 'selective', you can specify `--recompute_modules` to choose which parts to recompute.
 - 🔥recompute_method: Takes effect only when recompute_granularity is set to 'full'; options are 'uniform' and 'block'. Default: None.
 - 🔥recompute_num_layers: Takes effect only when recompute_granularity is set to 'full'. If `recompute_method` is set to 'uniform', this parameter is the number of transformer layers in each uniformly divided recomputation unit; for example, you can specify `--recompute_granularity full --recompute_method uniform --recompute_num_layers 4`. The larger recompute_num_layers is, the lower the memory usage and the higher the compute cost. Note: the number of model layers in the current process must be divisible by `recompute_num_layers`. Default: None.
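The batch-size relation described above can be sketched as follows (hypothetical helper functions, not part of the ms-swift API):

```python
def data_parallel_size(total_gpus: int, tp: int, pp: int, cp: int) -> int:
    # DP = total number of GPUs / (TP x PP x CP); the division must be exact.
    assert total_gpus % (tp * pp * cp) == 0, 'GPUs must be divisible by TP*PP*CP'
    return total_gpus // (tp * pp * cp)


def global_batch_size(micro_batch_size: int, dp: int, grad_accum_steps: int) -> int:
    # global_batch_size = micro_batch_size * data parallel size * gradient accumulation steps
    return micro_batch_size * dp * grad_accum_steps


# Example: 16 GPUs with TP=2, PP=2, CP=1 gives DP=4;
# micro_batch_size=1 and 4 accumulation steps reproduce the default global_batch_size of 16.
dp = data_parallel_size(16, tp=2, pp=2, cp=1)
print(dp, global_batch_size(1, dp, 4))  # prints: 4 16
```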

docs/source_en/Megatron-SWIFT/Command-line-parameters.md (1 addition, 1 deletion)

@@ -7,7 +7,7 @@
 - 🔥micro_batch_size: Batch size per device. Default is 1.
 - 🔥global_batch_size: Total batch size, equivalent to `micro_batch_size * data parallel size * gradient accumulation steps`. Default is 16.
   - Here, `Data Parallelism size (DP) = Total number of GPUs / (TP × PP × CP)`.
-- 🔥recompute_granularity: Granularity of activation recomputation; options are 'full' and 'selective'. 'full' means recomputing the entire transformer layer, while 'selective' means recomputing only the core attention part of the transformer layer. 'selective' is generally recommended. Default is 'selective'.
+- 🔥recompute_granularity: Granularity of activation recomputation; options are 'full', 'selective', and 'none' (the 'none' option requires ms-swift>=3.12.3). 'full' means recomputing the entire transformer layer, while 'selective' means recomputing only the core attention part of the transformer layer. 'selective' is generally recommended. Default is 'selective'.
 - When you set it to 'selective', you can specify `--recompute_modules` to choose which parts to recompute.
 - 🔥recompute_method: This parameter takes effect only when recompute_granularity is set to 'full'; options are 'uniform' and 'block'. Default is None.
 - 🔥recompute_num_layers: This parameter takes effect only when recompute_granularity is set to 'full'. If `recompute_method` is set to 'uniform', it specifies the number of transformer layers in each uniformly divided recomputation unit; for example, you can specify `--recompute_granularity full --recompute_method uniform --recompute_num_layers 4`. The larger recompute_num_layers is, the lower the memory usage and the higher the computation cost. Note: the number of model layers in the current process must be divisible by `recompute_num_layers`. Default is None.
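The constraints on these parameters (method/num_layers only apply with 'full'; layers per process must divide evenly by `recompute_num_layers`) can be sketched as a validator. This is a hypothetical illustration of the documented rules, not the actual ms-swift validation code:

```python
from typing import Optional


def validate_recompute(granularity: str,
                       method: Optional[str] = None,
                       num_layers: Optional[int] = None,
                       model_layers: Optional[int] = None) -> None:
    # Hypothetical check mirroring the documented constraints.
    assert granularity in ('full', 'selective', 'none')
    if granularity != 'full':
        # recompute_method / recompute_num_layers only take effect with 'full'.
        return
    assert method in ('uniform', 'block')
    if method == 'uniform' and num_layers is not None and model_layers is not None:
        # The layers handled by the current process must divide evenly
        # into recomputation units of num_layers each.
        assert model_layers % num_layers == 0, (
            f'{model_layers} layers not divisible by recompute_num_layers={num_layers}')


# Mirrors the example flags: --recompute_granularity full --recompute_method uniform
# --recompute_num_layers 4, on a 32-layer model.
validate_recompute('full', method='uniform', num_layers=4, model_layers=32)
```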

swift/megatron/arguments/megatron_args.py (3 additions, 1 deletion)

@@ -407,7 +407,7 @@ class MegatronArguments(ExtraMegatronArguments):
     # training
     micro_batch_size: int = 1
     global_batch_size: int = 16
-    recompute_granularity: Literal['selective', 'full'] = 'selective'
+    recompute_granularity: Literal['selective', 'full', 'none'] = 'selective'
     recompute_method: Literal['uniform', 'block'] = None
     recompute_num_layers: Optional[int] = None
     recompute_modules: List[str] = field(default_factory=lambda: ['core_attn'])

@@ -725,6 +725,8 @@ def __post_init__(self):
         RLHFMegatronArgumentsMixin.__post_init__(self)
         MegatronTunerMixin.__post_init__(self)
         os.environ.setdefault('CUDA_DEVICE_MAX_CONNECTIONS', '1')
+        if self.recompute_granularity == 'none':
+            self.recompute_granularity = None
         self._set_default()
         self.model_info, self.model_meta = get_model_info_meta(
             self.model, model_type=self.model_type, use_hf=self.use_hf, hub_token=self.hub_token)
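The pattern in this hunk — accepting the string 'none' on the CLI and normalizing it to Python `None` in `__post_init__` — can be reproduced in isolation. A minimal sketch with a simplified stand-in dataclass (not the real MegatronArguments, which carries many more fields):

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class Args:
    # Simplified stand-in illustrating the commit's mapping. The CLI parser
    # produces strings, so 'none' is a valid literal here; downstream code
    # (Megatron) expects None to disable activation recomputation entirely.
    recompute_granularity: Optional[Literal['selective', 'full', 'none']] = 'selective'

    def __post_init__(self):
        if self.recompute_granularity == 'none':
            self.recompute_granularity = None


print(Args(recompute_granularity='none').recompute_granularity)  # prints: None
```

Keeping the normalization in `__post_init__` means every construction path (CLI, config file, direct instantiation) sees the same canonical value before `_set_default()` runs.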
