Why only one GPU is getting used in the kaggle kernel #20424

Open
@KeesariVigneshwarReddy

Description

Bug description

[Screenshot 2024-11-16 201845: GPU utilization during training]

I initialized my trainer as follows:

import lightning as L
from lightning.pytorch.callbacks import (DeviceStatsMonitor,
                                         StochasticWeightAveraging,
                                         EarlyStopping)

trainer = L.Trainer(max_epochs=5,
                    devices=2,
                    strategy='ddp_notebook',
                    num_sanity_val_steps=0,
                    profiler='simple',
                    default_root_dir="/kaggle/working",
                    callbacks=[DeviceStatsMonitor(),
                               StochasticWeightAveraging(swa_lrs=1e-2),
                               #EarlyStopping(monitor='train_Loss', min_delta=0.001, patience=100, verbose=False, mode='min'),
                              ],
                    enable_progress_bar=True,
                    enable_model_summary=True,
                   )
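
Training is then launched in the same notebook session; model and dm below are placeholder names for my LightningModule and LightningDataModule:

# model / dm are hypothetical names for the LightningModule and DataModule
trainer.fit(model, datamodule=dm)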

Distributed training is initialized on both GPUs, but only one of them shows any utilization.

Also, during the validation loop neither GPU shows any usage.

[Screenshot 2024-11-16 202120: GPU utilization during validation]

How can I resolve this so that both GPUs are used and training speeds up?
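
For reference, a minimal diagnostic sketch I can attach to the Trainer to check which ranks and devices actually start training (RankCheck is a hypothetical name; the Callback hook and Trainer properties are from the Lightning 2.x API):

import torch
from lightning.pytorch.callbacks import Callback

class RankCheck(Callback):
    # Hypothetical diagnostic callback: each DDP process reports its rank and device.
    def on_train_start(self, trainer, pl_module):
        print(f"global_rank={trainer.global_rank} "
              f"world_size={trainer.world_size} "
              f"device={pl_module.device} "
              f"visible_gpus={torch.cuda.device_count()}")

With devices=2 and strategy='ddp_notebook', this should print two lines (ranks 0 and 1); if only rank 0 appears, the second worker process never actually entered the training loop.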

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response
