[Bug] trainer.distribute not working #134

@xyyimian

Describe the bug

For single-GPU training, I am using train_yourtts.py. When I switch to multi-GPU, the program runs, but there is no acceleration. I checked the code in distribute.py and found that it only sets environment variables and starts parallel processes; it does not perform any gradient collection or synchronization. I am wondering whether this is by design or whether I misused trainer.distribute.
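For context, a spawn-style launcher of the kind described above typically just sets per-process rank environment variables and re-executes the training script once per GPU; any gradient synchronization then has to happen inside the training loop itself (e.g. via torch.nn.parallel.DistributedDataParallel), not in the launcher. A minimal sketch of such a launcher follows; the function names are illustrative and not the actual distribute.py API:

```python
import copy
import os
import subprocess
import sys

def build_worker_commands(script, gpu_ids, master_port=54321):
    """Build one (command, env) pair per GPU, mirroring what a
    spawn-style launcher does: set rank env vars, re-run the script."""
    workers = []
    for rank, gpu in enumerate(gpu_ids):
        env = copy.deepcopy(dict(os.environ))
        env.update({
            "RANK": str(rank),                 # global rank of this worker
            "LOCAL_RANK": str(rank),           # rank on this node
            "WORLD_SIZE": str(len(gpu_ids)),   # total number of workers
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": str(master_port),
            "CUDA_VISIBLE_DEVICES": str(gpu),  # pin worker to one GPU
        })
        workers.append(([sys.executable, script], env))
    return workers

def launch(script, gpu_ids):
    # Start all workers and wait; note the launcher itself performs
    # no gradient collection or synchronization.
    procs = [subprocess.Popen(cmd, env=env)
             for cmd, env in build_worker_commands(script, gpu_ids)]
    for p in procs:
        p.wait()
```

The key point: if the training loop never wraps the model in DistributedDataParallel (or an equivalent synchronization mechanism), each spawned process trains independently, so no speedup over single-GPU training is observed.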

To Reproduce

CUDA_VISIBLE_DEVICES=0,1 python -m trainer.distribute --script recipes/vctk/yourtts/train_yourtts.py

Expected behavior

Expected roughly a 2x speedup, but training progress is the same as with single-GPU training.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3",
            "NVIDIA H100 80GB HBM3"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "Trainer": "v0.0.34",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.18",
        "version": "#99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023"
    }
}

Additional context

No response
