
Conversation

@shiyu1994 (Collaborator)

This is to integrate multi-GPU support for CUDA version, with NCCL.

@shiyu1994 shiyu1994 requested a review from StrikerRUS October 10, 2023 15:31
@shiyu1994 shiyu1994 self-assigned this Oct 10, 2023
@shiyu1994 shiyu1994 changed the title [CUDA] Multi-GPU for CUDA Version [WIP] [CUDA] Multi-GPU for CUDA Version Oct 10, 2023
@shiyu1994 shiyu1994 changed the title [WIP] [CUDA] Multi-GPU for CUDA Version [CUDA] Multi-GPU for CUDA Version Dec 15, 2023
@shiyu1994 shiyu1994 closed this Dec 15, 2023
@shiyu1994 shiyu1994 reopened this Dec 15, 2023
@shiyu1994 (Collaborator, Author)

@microsoft-github-policy-service agree

@shiyu1994 (Collaborator, Author)

@StrikerRUS Could you help review this again when you have time? Thanks.

@shiyu1994 (Collaborator, Author)

@StrikerRUS Gently pinging again.

@StrikerRUS (Collaborator) left a comment

@shiyu1994 Thanks a lot for pushing this PR forward!
I found only one leftover after resolving the merge conflict with #6086.

Also, I believe we should document somewhere that this PR brings not only multi-GPU support, but also multi-node, multi-GPU support.

@shiyu1994 (Collaborator, Author)

shiyu1994 commented Oct 6, 2025

> @shiyu1994 Thanks a lot for pushing this PR forward! I found only one leftover after resolving the merge conflict with #6086.
>
> Also, I believe we should document somewhere that this PR brings not only multi-GPU support, but also multi-node, multi-GPU support.

Documented in 8a0d60a. Could you please check when you have time? @StrikerRUS Thanks.

@StrikerRUS (Collaborator) left a comment

Thanks a lot! I don't have any other comments. Just please run `.ci/parameter-generator.py` to update the docs.
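For reference, the doc-regeneration step the reviewer asks for would look something like this. The script path is taken from the comment above; running it from the root of a LightGBM checkout is an assumption.

```shell
# Regenerate the parameter documentation from the parameter definitions.
# Assumes the current directory is the root of a LightGBM checkout and
# that a Python interpreter is available on PATH.
python .ci/parameter-generator.py
```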

@SkqLiao

SkqLiao commented Oct 8, 2025

@shiyu1994 Hi, and thanks for your contribution! I’d like to ask how to use the multi-GPU feature.

I built LightGBM by pulling PR #6138 and building it with CMake via ./build-python.sh --cuda, and GPU training works fine with a single GPU. However, when I add the multi-GPU parameters, it fails.

Here is my setup:

import lightgbm as lgb

# generate_data is the user's own helper producing features and labels
X, y = generate_data(num_samples, num_features)
lgb_train = lgb.Dataset(X, label=y)

# lgbm_cfg is read from a file; the multi-GPU parameters are added below
lgbm_cfg["num_gpu"] = 2                  # 2 × RTX 4090
lgbm_cfg["device_type"] = "cuda"

booster = lgb.train(
    lgbm_cfg,
    lgb_train,
)

The error I get is:

what():  [CUDA] an illegal memory access was encountered /cache/LightGBM/lightgbm-python/src/io/cuda/cuda_column_data.cpp 181
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /cache/LightGBM/lightgbm-python/include/LightGBM/cuda/cuda_utils.hu 145

If I comment out `lgbm_cfg["num_gpu"] = 2`, it runs fine on a single GPU.

Could you please help me understand:

  1. What is the correct way to enable multi-GPU in this PR?
  2. Are there any limitations or extra setup (e.g. environment variables, peer access, memory partitioning) needed?

Thank you very much for your time and for this PR!
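To make the question above concrete, here is a minimal, hedged sketch of the configuration being asked about. The parameter names `device_type` and `num_gpu` are copied from the snippet in this thread; whether `num_gpu` is the final parameter name in the merged PR is not confirmed here. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for restricting which devices a process can see, relevant to question 2.

```python
import os

# Restrict the process to two devices (standard CUDA environment variable,
# mentioned here as one of the "extra setup" possibilities in question 2).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Parameter names below are taken from the snippet earlier in this thread;
# "num_gpu" in particular is an assumption about the PR's final API.
params = {
    "device_type": "cuda",   # CUDA code path
    "num_gpu": 2,            # assumed parameter for the number of GPUs
    "objective": "regression",
}

# Actual training (lgb.train(params, dataset)) requires a CUDA build of
# LightGBM and two visible GPUs, so only the configuration is shown here.
print(params["num_gpu"])
```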

