Labels: documentation (Improvements or additions to documentation)
Description
1. Affected Chapter
Chapter 5.3.4
2. Problem Description
The range of bfloat16 should be able to cover the same range as float32.
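For reference, the claimed range equivalence can be checked from the formats' bit layouts alone (a minimal pure-Python sketch; the exponent/mantissa widths below are the standard parameters for each dtype, not taken from the project's code):

```python
# Largest finite value of a binary float format with the given exponent and
# mantissa widths: (2 - 2^-mantissa_bits) * 2^max_exponent.
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    bias = 2 ** (exp_bits - 1) - 1      # e.g. 127 for an 8-bit exponent
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias

print(max_finite(8, 23))  # float32  -> ~3.40e38
print(max_finite(8, 7))   # bfloat16 -> ~3.39e38 (same 8-bit exponent, same range)
print(max_finite(5, 10))  # float16  -> 65504.0 (much narrower range)
```

Because bfloat16 shares float32's 8-bit exponent, gradients that fit in float32 do not overflow or underflow in bfloat16, which is why loss scaling is normally considered unnecessary for it.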
3. Reproduction Materials
The corresponding code uses:

```python
# ==================== Optimizer and training component initialization ====================
# Initialize the gradient scaler for mixed-precision training
# Enabled only when using float16 or bfloat16
scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16', 'bfloat16']))
```

However, when I tried removing the GradScaler for bfloat16, using

```python
scaler = torch.cuda.amp.GradScaler(enabled=(args.dtype in ['float16']))
```

and the launch command

```shell
python ddp_pretrain.py --batch_size=16 --data_path="data path" --accumulation_steps=32 --gpus=0 --use_swanlab
```

with the default `log_interval=100`, I ran into serious training problems: the loss plateaued at around 6 to 7.5 and stopped decreasing, and the perplexity stayed very high, so the pretraining failed.
I changed the learning-rate schedule and retrained, but the problem persisted, now accompanied by some loss spikes.
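For context on why the scaler matters for float16 in the first place, underflow of tiny gradients can be demonstrated without PyTorch (a sketch using NumPy's float16; NumPy has no bfloat16, and the `2 ** 16` scale factor is an illustrative value, not the one GradScaler actually picks):

```python
import numpy as np

g = 1e-9                      # a tiny gradient value
print(np.float16(g))          # 0.0 -- underflows: below float16's subnormal range
scale = 2.0 ** 16             # loss scaling multiplies gradients up...
print(np.float16(g * scale))  # ...into float16's representable range
print(float(np.float16(g * scale)) / scale)  # ...and divides back before the step
```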
I'm confused by this and would appreciate an explanation. Thank you 🙏
Verification
- This issue has not been reported in a previous issue.