Bug description
I initialized my trainer as follows:
```python
import lightning as L
from lightning.pytorch.callbacks import (
    DeviceStatsMonitor,
    EarlyStopping,
    StochasticWeightAveraging,
)

trainer = L.Trainer(
    max_epochs=5,
    devices=2,
    strategy='ddp_notebook',
    num_sanity_val_steps=0,
    profiler='simple',
    default_root_dir="/kaggle/working",
    callbacks=[
        DeviceStatsMonitor(),
        StochasticWeightAveraging(swa_lrs=1e-2),
        # EarlyStopping(monitor='train_Loss', min_delta=0.001, patience=100, verbose=False, mode='min'),
    ],
    enable_progress_bar=True,
    enable_model_summary=True,
)
```
Distributed training is initialized for both GPUs, but only one of them is actually being hit.
Also, during the validation loop the GPUs are not in use.
How can I resolve this so that both GPUs are used and training runs faster?
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response
Activity
lantiga commented on Nov 18, 2024
It actually looks like both GPUs are being used.
The discrepancy between the two utilization indicators may be that one process is CPU-bound (e.g. rank 0 doing logging) while the other isn't. The model seems really small, so CPU operations essentially dominate.
I suggest you increase the size of the model, or the size of the batch, to bring the actual utilization up.
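For example, a minimal sketch of raising the per-device batch size (here `train_dataset` and `model` are placeholders for whatever you already have in your notebook, and the numbers are only illustrative):

```python
from torch.utils.data import DataLoader

# Larger per-device batches give each GPU more work per step, so the
# CPU-side overhead (logging, data loading, etc.) matters relatively less.
train_loader = DataLoader(
    train_dataset,      # placeholder: your existing dataset
    batch_size=256,     # try something larger than your current value
    num_workers=4,      # keep the GPUs fed from the CPU side
    pin_memory=True,
)

trainer.fit(model, train_dataloaders=train_loader)
```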
You can also verify this by passing `barebones=True` to the `Trainer`: this should minimize non-model-related operations, and the two GPUs will probably look more similar.
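A minimal sketch of that check, assuming the same 2-GPU notebook setup (`barebones=True` strips out logging, checkpointing, the progress bar and similar bookkeeping, so the callbacks and profiler from the original configuration are left out here; `model` stands in for your existing LightningModule):

```python
import lightning as L

# Bare-bones Trainer: same devices/strategy, but with most non-model
# bookkeeping disabled, so GPU utilization reflects the model itself.
trainer = L.Trainer(
    max_epochs=5,
    devices=2,
    strategy='ddp_notebook',
    barebones=True,
)
trainer.fit(model)  # `model` is your existing LightningModule
```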