
Conversation

@shiyu1994 (Collaborator)

This is to integrate multi-GPU support for CUDA version, with NCCL.

@shiyu1994 shiyu1994 requested a review from StrikerRUS October 10, 2023 15:31
@shiyu1994 shiyu1994 self-assigned this Oct 10, 2023
@shiyu1994 shiyu1994 changed the title [CUDA] Multi-GPU for CUDA Version [WIP] [CUDA] Multi-GPU for CUDA Version Oct 10, 2023
@shiyu1994 shiyu1994 changed the title [WIP] [CUDA] Multi-GPU for CUDA Version [CUDA] Multi-GPU for CUDA Version Dec 15, 2023
@shiyu1994 shiyu1994 closed this Dec 15, 2023
@shiyu1994 shiyu1994 reopened this Dec 15, 2023
@shiyu1994 (Collaborator, Author)

@microsoft-github-policy-service agree

@shiyu1994 (Collaborator, Author)

@StrikerRUS Could you help review this again when you have time? Thanks.

@shiyu1994 (Collaborator, Author)

@StrikerRUS Gently pinging again.

@StrikerRUS (Collaborator) left a comment

@shiyu1994 Thanks a lot for pushing this PR forward!
I found only one leftover after resolving the merge conflict with #6086.

Also, I believe we should document somewhere that this PR brings not only multi-GPU support, but also multi-node, multi-GPU support.

@shiyu1994 (Collaborator, Author)

shiyu1994 commented Oct 6, 2025

> @shiyu1994 Thanks a lot for pushing this PR forward! I found only one leftover after resolving the merge conflict with #6086.
>
> Also, I believe we should document somewhere that this PR brings not only multi-GPU support, but also multi-node, multi-GPU support.

Documented in 8a0d60a. Could you please check when you have time? @StrikerRUS Thanks.

@StrikerRUS (Collaborator) left a comment

Thanks a lot! I don't have any other comments. Just please run `.ci/parameter-generator.py` to update the docs.
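For reference, the doc-regeneration step the reviewer asks for would look something like this. The script path is taken from the comment above; running it from the root of a LightGBM checkout is an assumption.

```shell
# Regenerate the parameter documentation from the parameter definitions.
# Assumes the current directory is the root of a LightGBM checkout and
# that a Python interpreter is available on PATH.
python .ci/parameter-generator.py
```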

@SkqLiao

SkqLiao commented Oct 8, 2025

@shiyu1994 Hi, and thanks for your contribution! I’d like to ask how to use the multi-GPU feature.

I built LightGBM by pulling PR #6138 and building it with CMake via ./build-python.sh --cuda, and GPU training works fine with a single GPU. However, when I add the multi-GPU parameters, it fails.

Here is my setup:

import lightgbm as lgb

# generate_data is the user's own helper producing features and labels
X, y = generate_data(num_samples, num_features)
lgb_train = lgb.Dataset(X, label=y)

# lgbm_cfg is read from a file; the multi-GPU parameters are added below
lgbm_cfg["num_gpu"] = 2                  # 2 × RTX 4090
lgbm_cfg["device_type"] = "cuda"

booster = lgb.train(
    lgbm_cfg,
    lgb_train,
)

The error I get is:

what():  [CUDA] an illegal memory access was encountered /cache/LightGBM/lightgbm-python/src/io/cuda/cuda_column_data.cpp 181
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /cache/LightGBM/lightgbm-python/include/LightGBM/cuda/cuda_utils.hu 145

If I comment out `lgbm_cfg["num_gpu"] = 2`, it runs fine on a single GPU.

Could you please help me understand:

  1. What is the correct way to enable multi-GPU in this PR?
  2. Are there any limitations or extra setup (e.g. environment variables, peer access, memory partitioning) needed?

Thank you very much for your time and for this PR!
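To make the question above concrete, here is a minimal, hedged sketch of the configuration being asked about. The parameter names `device_type` and `num_gpu` are copied from the snippet in this thread; whether `num_gpu` is the final parameter name in the merged PR is not confirmed here. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for restricting which devices a process can see, relevant to question 2.

```python
import os

# Restrict the process to two devices (standard CUDA environment variable,
# mentioned here as one of the "extra setup" possibilities in question 2).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Parameter names below are taken from the snippet earlier in this thread;
# "num_gpu" in particular is an assumption about the PR's final API.
params = {
    "device_type": "cuda",   # CUDA code path
    "num_gpu": 2,            # assumed parameter for the number of GPUs
    "objective": "regression",
}

# Actual training (lgb.train(params, dataset)) requires a CUDA build of
# LightGBM and two visible GPUs, so only the configuration is shown here.
print(params["num_gpu"])
```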

