
Error loading FusedAdam on A100 #44

@giter000

Description


Hello, I am trying to run the code on 2 NVIDIA A100-PCIE-40GB GPUs, using the provided image environment directly. However, loading FusedAdam keeps failing with the error below. Reinstalling apex did not fix it, and I have not found a solution yet:

Total train epochs 10 | Total train iters 286497 |
building Enc-Dec model ...

number of parameters on model parallel rank 1: 5543798784
number of parameters on model parallel rank 0: 5543798784
Traceback (most recent call last):
  File "/mnt/finetune_cpm2.py", line 808, in <module>
    main()
  File "/mnt/finetune_cpm2.py", line 791, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size, ds_config, prompt_config)
  File "/mnt/utils.py", line 213, in setup_model_and_optimizer
    optimizer = get_optimizer(model, args, prompt_config)
  File "/mnt/utils.py", line 163, in get_optimizer
    optimizer = Adam(param_groups,
  File "/opt/conda/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 79, in __init__
    raise RuntimeError('apex.optimizers.FusedAdam requires cuda extensions')
RuntimeError: apex.optimizers.FusedAdam requires cuda extensions

Can this be run on 2 NVIDIA A100-PCIE-40GB GPUs? Does the apex environment in the image need any adjustment? Thanks.
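Not part of the original report, but a minimal diagnostic sketch: `apex.optimizers.FusedAdam` raises "requires cuda extensions" when apex's compiled extension module (`amp_C`) cannot be imported, which typically means apex was installed without its CUDA extensions (e.g. a plain `pip install apex` instead of a source build with `--cpp_ext --cuda_ext`), or was compiled for a CUDA/compute-capability combination that does not cover the A100 (sm_80). The helper function name below is my own:

```python
import importlib


def apex_cuda_ext_available() -> bool:
    """Check whether apex was built with its CUDA extensions.

    FusedAdam needs the compiled `amp_C` module; if it is missing,
    apex falls back to raising the RuntimeError seen in the traceback.
    """
    try:
        importlib.import_module("amp_C")
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    print("apex CUDA extensions present:", apex_cuda_ext_available())
```

If this prints `False` inside the container, a likely remedy (per apex's own install instructions) is rebuilding apex from source inside an environment whose CUDA toolkit matches the PyTorch build and supports sm_80, e.g. `pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./` from the apex checkout.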
