Model parameters size is zero when using fabric.sharded_model() context with deepspeed zero-3 strategy #18514

Open
@seraphzl

Description

Bug description

Hi,

I use Fabric with the DeepSpeed ZeRO-3 strategy to shard a model across 2 GPUs. When I create the model inside the fabric.sharded_model() context, the reported model size is Model params = 0.0 M.

import lightning as L

fabric = L.Fabric(accelerator="cuda", strategy="deepspeed_stage_3", precision="bf16-mixed")
fabric.launch()

# Instantiate the model inside the sharding context (mymodel is my own model class, not shown here)
with fabric.sharded_model():
    net = mymodel()

# Count the parameters as seen by this process
num_params = sum(param.numel() for param in net.parameters())
fabric.print("Model params = %2.1f M" % (num_params / 1000**2))

Without the fabric.sharded_model() context, I get the correct model size: Model params = 13.6 M.
How can I solve this issue? Thanks.
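
One possible explanation, offered as an assumption rather than a confirmed diagnosis: under ZeRO-3, DeepSpeed partitions each parameter at construction time, so the local tensor is an empty placeholder and numel() returns 0, while the full element count is recorded on the parameter as a ds_numel attribute. A minimal sketch of a count that prefers that attribute when it exists is below. The helper name count_parameters is hypothetical, and the reliance on ds_numel is an assumption about DeepSpeed's behavior, not something confirmed in this issue.

```python
from typing import Iterable


def count_parameters(params: Iterable) -> int:
    """Sum parameter counts, preferring DeepSpeed's `ds_numel` when present."""
    total = 0
    for p in params:
        # Assumption: under ZeRO-3, DeepSpeed stores the full (unpartitioned)
        # element count on each parameter as `ds_numel`, while the local
        # tensor is an empty placeholder whose numel() is 0.
        if hasattr(p, "ds_numel"):
            total += int(p.ds_numel)
        else:
            # Regular (non-partitioned) parameter: use the local element count.
            total += p.numel()
    return total
```

With the snippet above, this would be called as count_parameters(net.parameters()) in place of the sum over numel(). Outside the sharded context the helper falls back to the ordinary count, so the same line could be used in both cases.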

What version are you seeing the problem on?

v2.0


cc @carmocca @justusschock @awaelchli

Metadata

Assignees

No one assigned

    Labels

    fabric (lightning.fabric.Fabric); question (Further information is requested); ver: 2.0.x; waiting on author (Waiting on user action, correction, or update)
