Description
Describe the bug
For single-GPU training I am using train_yourtts.py. When I switch to multi-GPU, the program runs, but there is no acceleration. I checked the code in distribute.py and found that it only sets environment variables and starts the parallel processes; it doesn't seem to perform any collection or synchronization across processes. I am wondering whether this is by design or whether I am misusing trainer.distribute.
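To clarify what I mean by "collection and sync": in plain PyTorch, multi-GPU data-parallel training usually wraps the model in DistributedDataParallel and shards the data with a DistributedSampler, so gradients are all-reduced across ranks on every backward() and each GPU only sees its own slice of the dataset. Below is a minimal sketch of that pattern, just to show what I expected to happen under the hood; this is not the Trainer code, and the toy model/data are placeholders.

# launched e.g. with: torchrun --nproc_per_node=2 this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # the launcher is expected to set RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 128).cuda(local_rank)
    # DDP wrapping is what gives the gradient all-reduce ("sync") on backward()
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 128))
    # DistributedSampler shards the data so each rank processes 1/world_size of it;
    # without this, every rank iterates over the full dataset and there is no speedup
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optim = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optim.zero_grad()
        loss_fn(model(x), y).backward()  # gradients are all-reduced across ranks here
        optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()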
To Reproduce
CUDA_VISIBLE_DEVICES=0,1 python -m trainer.distribute --script recipes/vctk/yourtts/train_yourtts.py
Expected behavior
I expected roughly a 2x speedup with two GPUs, but training progress is the same as with single-GPU training.
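If it helps, this is the quick check I can drop into the training script to see whether a process group is actually initialized and how many ranks are participating (plain torch.distributed, nothing Trainer-specific):

import os
import torch.distributed as dist

# report per-process rank info; world_size stays 1 if no process group was set up
initialized = dist.is_available() and dist.is_initialized()
world_size = dist.get_world_size() if initialized else 1
print(f"pid={os.getpid()} RANK={os.environ.get('RANK')} "
      f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} world_size={world_size}")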
Logs
No response
Environment
{
    "CUDA": {
        "GPU": [
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "Trainer": "v0.0.34",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.18",
        "version": "#99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023"
    }
}
Additional context
No response