Conversation

@Labreo commented Sep 19, 2025

What this PR does / why we need it:

This PR introduces a new ClusterTrainingRuntime to support distributed XGBoost training on Kubeflow Trainer. This allows users to easily run distributed XGBoost jobs, expanding the frameworks supported by the project.

The implementation follows the existing MPI-based pattern used by the deepspeed and mlx runtimes. It reuses the mpi policy for a consistent and robust design that requires no changes to the core controller API.

This is a work-in-progress to get early feedback on the runtime definition and Dockerfile structure.

What is included in this PR?

  • New xgboost_distributed.yaml ClusterTrainingRuntime manifest.
  • New Dockerfile for the xgboost-runtime image.
  • requirements.txt with pinned versions for the XGBoost environment.

What is still to come?

  • An example notebook demonstrating how to use the new runtime.
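
For a rough sense of what that notebook would show, the intended shape is something like the snippet below. This is only a sketch: the runtime name xgboost-distributed, the SDK class and parameter names, and the resource values are assumptions based on how the existing runtimes are consumed, and may change.

```python
# Rough sketch of the planned example, not final code. The class and
# parameter names follow my understanding of the Kubeflow Trainer v2 Python
# SDK and may differ in the released SDK; the runtime name
# "xgboost-distributed" is an assumption matching xgboost_distributed.yaml.
from kubeflow.trainer import CustomTrainer, TrainerClient


def train_xgboost():
    # Placeholder training function; the real example would load a data
    # shard per node and call xgb.train() inside the collective context.
    print("distributed XGBoost training goes here")


client = TrainerClient()

job_name = client.train(
    runtime=client.get_runtime("xgboost-distributed"),  # assumed runtime name
    trainer=CustomTrainer(
        func=train_xgboost,
        num_nodes=2,                                     # illustrative values
        resources_per_node={"cpu": 2, "memory": "4Gi"},
    ),
)
print(f"Created TrainJob: {job_name}")
```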

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2598

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Labreo changed the title from "feat(runtime): Update generated files" to "feat:add xgboost runtime" on Sep 20, 2025
@Electronic-Waste
Member

@Labreo Thanks for this great feature! We're looking forward to this!

/cc @kubeflow/kubeflow-trainer-team
/ok-to-test

@google-oss-prow

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-trainer-team.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@Labreo Thanks for this great feature! We're looking forward to this!

/cc @kubeflow/kubeflow-trainer-team
/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member

Any ideas on how we can test this new runtime? @kubeflow/kubeflow-trainer-team

@coveralls

Pull Request Test Coverage Report for Build 17851206844

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.137%

Totals

  • Change from base Build 17772927799: 0.0%
  • Covered Lines: 1084
  • Relevant Lines: 1966

💛 - Coveralls

@andreyvelich
Member

Thanks for this @Labreo!
Please can you explain why you want to leverage mpirun to start the distributed XGBoost training job? Are there any benefits, and what distributed environment needs to be set up for XGBoost?

Any ideas on how we can test this new runtime? @kubeflow/kubeflow-trainer-team

We can add E2E tests to verify it.

@Labreo
Author

Labreo commented Sep 24, 2025

Hello @andreyvelich. I mostly used MPI since the other runtimes, deepspeed and mlx, already use the same policy, so it was the easiest to implement and does not require any additional custom code in the Go controller. As for the distributed environment, as far as I have learned, the requirements of rabit are fulfilled by MPI. Again, this is a draft PR and I am very open to feedback.
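
To make that concrete, each node would run a worker script roughly like the one below once it is launched (by mpirun through the launcher, or any other mechanism). This is only a sketch: the XGBOOST_TRACKER_ARGS environment variable is hypothetical, and the exact xgboost.collective arguments and how the tracker address reaches each worker depend on the XGBoost version we pin.

```python
# Sketch of a worker-side script using the xgboost.collective module
# (XGBoost >= 2.0). XGBOOST_TRACKER_ARGS is a hypothetical env var that is
# assumed to carry the tracker's connection arguments as a JSON object; how
# those details actually reach each worker is still an open question here.
import json
import os

import numpy as np
import xgboost as xgb
from xgboost import collective

tracker_args = json.loads(os.environ.get("XGBOOST_TRACKER_ARGS", "{}"))

with collective.CommunicatorContext(**tracker_args):
    rank = collective.get_rank()

    # Each worker loads only its own shard of the training data
    # (the path layout is illustrative).
    X = np.load(f"/data/features-part-{rank}.npy")
    y = np.load(f"/data/labels-part-{rank}.npy")
    dtrain = xgb.DMatrix(X, label=y)

    # With the collective initialized, xgb.train() aggregates gradient
    # histograms across all workers during tree construction.
    booster = xgb.train(
        {"objective": "binary:logistic", "tree_method": "hist"},
        dtrain,
        num_boost_round=100,
    )

    # The resulting model is identical on every worker; persist it once.
    if rank == 0:
        booster.save_model("/output/model.json")
```

As far as I understand, the main requirement on the environment is that the workers and the tracker can reach each other; apart from rank 0 saving the model, all workers run identical code.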

@andreyvelich
Member

As for the distributed environment, as far as I have learned, the requirements of rabit are fulfilled by MPI. Again, this is a draft PR and I am very open to feedback.

@Labreo Please can you explore how distributed training works in XGBoost these days?
Do you need to distinguish master and worker Pod templates like we do for DeepSpeed, or can we just use multiple nodes with the same template, as for Torch?

Maybe @nqvuong1998 or @terrytangyuan can help us with that?

@terrytangyuan
Member

terrytangyuan commented Sep 27, 2025

MPI should work. There are references here that might be useful: https://github.com/kubeflow/xgboost-operator/tree/master/config/samples
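
For example, one common pattern keeps every pod identical and lets MPI primitives do the coordination: rank 0 starts the tracker and broadcasts its address, then every rank joins as a worker. The sketch below only pins down the MPI side; start_tracker is a hypothetical stand-in for whichever tracker API the pinned XGBoost version exposes (e.g. something from xgboost.tracker).

```python
# Illustration of why plain MPI is enough: every pod runs the same script.
# start_tracker() is a hypothetical placeholder for the tracker API of the
# pinned XGBoost version; only the mpi4py usage is meant to be concrete.
from mpi4py import MPI
from xgboost import collective


def start_tracker(n_workers: int) -> dict:
    """Hypothetical helper: start the XGBoost collective tracker on this host
    and return the connection arguments workers pass to the communicator."""
    raise NotImplementedError("depends on the pinned XGBoost version")


comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

# Rank 0 starts the tracker; every rank (including 0) then joins as a worker,
# so no separate master Pod template is needed.
tracker_args = start_tracker(world_size) if rank == 0 else None
tracker_args = comm.bcast(tracker_args, root=0)

with collective.CommunicatorContext(**tracker_args):
    pass  # per-worker training as sketched earlier in the thread
```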

@andreyvelich
Member

Hi @Labreo, did you get a chance to work on this?
We are planning to release Trainer v2.1 in October, and including the XGBoost runtime would be really nice!

@nqvuong1998

Hi @Labreo, any update on this feature?

Development

Successfully merging this pull request may close these issues: Support XGBoost/LightGBM runtime and examples
