Conversation

@Labreo commented Sep 19, 2025

What this PR does / why we need it:

This PR introduces a new ClusterTrainingRuntime to support distributed XGBoost training on Kubeflow Trainer. This allows users to easily run distributed XGBoost jobs, expanding the frameworks supported by the project.

The implementation follows the existing MPI-based pattern used by the deepspeed and mlx runtimes. It reuses the mpi policy for a consistent and robust design that requires no changes to the core controller API.

This is a work-in-progress to get early feedback on the runtime definition and Dockerfile structure.

What is included in this PR?

  • New xgboost_distributed.yaml ClusterTrainingRuntime manifest.
  • New Dockerfile for the xgboost-runtime image.
  • requirements.txt with pinned versions for the XGBoost environment.

What is still to come?

  • An example notebook demonstrating how to use the new runtime.
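
For a rough sense of what that notebook would show, the intended shape is something like the snippet below. This is only a sketch: the runtime name xgboost-distributed, the SDK class and parameter names, and the resource values are assumptions based on how the existing runtimes are consumed, and may change.

```python
# Rough sketch of the planned example, not final code. The class and
# parameter names follow my understanding of the Kubeflow Trainer v2 Python
# SDK and may differ in the released SDK; the runtime name
# "xgboost-distributed" is an assumption matching xgboost_distributed.yaml.
from kubeflow.trainer import CustomTrainer, TrainerClient


def train_xgboost():
    # Placeholder training function; the real example would load a data
    # shard per node and call xgb.train() inside the collective context.
    print("distributed XGBoost training goes here")


client = TrainerClient()

job_name = client.train(
    runtime=client.get_runtime("xgboost-distributed"),  # assumed runtime name
    trainer=CustomTrainer(
        func=train_xgboost,
        num_nodes=2,                                     # illustrative values
        resources_per_node={"cpu": 2, "memory": "4Gi"},
    ),
)
print(f"Created TrainJob: {job_name}")
```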

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2598

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Labreo changed the title from "feat(runtime): Update generated files" to "feat:add xgboost runtime" on Sep 20, 2025
@Electronic-Waste
Member

@Labreo Thanks for this great feature! We're looking forward to this!

/cc @kubeflow/kubeflow-trainer-team
/ok-to-test

@google-oss-prow

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-trainer-team.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@Labreo Thanks for this great feature! We're looking forward to this!

/cc @kubeflow/kubeflow-trainer-team
/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste
Member

Any ideas on how we can test this new runtime? @kubeflow/kubeflow-trainer-team

@coveralls

Pull Request Test Coverage Report for Build 17851206844

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.137%

Totals

  • Change from base Build 17772927799: 0.0%
  • Covered Lines: 1084
  • Relevant Lines: 1966

💛 - Coveralls

@andreyvelich
Member

Thanks for this @Labreo!
Please can you explain why you want to leverage mpirun to start the distributed XGBoost training job? Are there any benefits, and what distributed environment needs to be set up for XGBoost?

Any ideas on how we can test this new runtime? @kubeflow/kubeflow-trainer-team

We can add E2E tests to verify it.

@Labreo
Author

Labreo commented Sep 24, 2025

Hello @andreyvelich. I mostly used MPI since the other runtimes, deepspeed and mlx, already use the same policy, so it was the easiest to implement and does not require any additional custom code in the Go controller. As for the distributed environment, as far as I have learned, the requirements of rabit are fulfilled by MPI. Again, this is a draft PR and I am very open to feedback.
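
To make that concrete, each node would run a worker script roughly like the one below once it is launched (by mpirun through the launcher, or any other mechanism). This is only a sketch: the XGBOOST_TRACKER_ARGS environment variable is hypothetical, and the exact xgboost.collective arguments and how the tracker address reaches each worker depend on the XGBoost version we pin.

```python
# Sketch of a worker-side script using the xgboost.collective module
# (XGBoost >= 2.0). XGBOOST_TRACKER_ARGS is a hypothetical env var that is
# assumed to carry the tracker's connection arguments as a JSON object; how
# those details actually reach each worker is still an open question here.
import json
import os

import numpy as np
import xgboost as xgb
from xgboost import collective

tracker_args = json.loads(os.environ.get("XGBOOST_TRACKER_ARGS", "{}"))

with collective.CommunicatorContext(**tracker_args):
    rank = collective.get_rank()

    # Each worker loads only its own shard of the training data
    # (the path layout is illustrative).
    X = np.load(f"/data/features-part-{rank}.npy")
    y = np.load(f"/data/labels-part-{rank}.npy")
    dtrain = xgb.DMatrix(X, label=y)

    # With the collective initialized, xgb.train() aggregates gradient
    # histograms across all workers during tree construction.
    booster = xgb.train(
        {"objective": "binary:logistic", "tree_method": "hist"},
        dtrain,
        num_boost_round=100,
    )

    # The resulting model is identical on every worker; persist it once.
    if rank == 0:
        booster.save_model("/output/model.json")
```

As far as I understand, the main requirement on the environment is that the workers and the tracker can reach each other; apart from rank 0 saving the model, all workers run identical code.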

@andreyvelich
Member

As for the distributed environment, as far as I have learned, the requirements of rabit are fulfilled by MPI. Again, this is a draft PR and I am very open to feedback.

@Labreo Please can you explore how distributed training works in XGBoost these days?
Do you need to distinguish master and worker Pod templates like we do for DeepSpeed, or can we just use multiple nodes with the same template, as for Torch?

Maybe @nqvuong1998 or @terrytangyuan can help us with that?

@terrytangyuan
Member

terrytangyuan commented Sep 27, 2025

MPI should work. There are references here that might be useful: https://github.com/kubeflow/xgboost-operator/tree/master/config/samples
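
For example, one common pattern keeps every pod identical and lets MPI primitives do the coordination: rank 0 starts the tracker and broadcasts its address, then every rank joins as a worker. The sketch below only pins down the MPI side; start_tracker is a hypothetical stand-in for whichever tracker API the pinned XGBoost version exposes (e.g. something from xgboost.tracker).

```python
# Illustration of why plain MPI is enough: every pod runs the same script.
# start_tracker() is a hypothetical placeholder for the tracker API of the
# pinned XGBoost version; only the mpi4py usage is meant to be concrete.
from mpi4py import MPI
from xgboost import collective


def start_tracker(n_workers: int) -> dict:
    """Hypothetical helper: start the XGBoost collective tracker on this host
    and return the connection arguments workers pass to the communicator."""
    raise NotImplementedError("depends on the pinned XGBoost version")


comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

# Rank 0 starts the tracker; every rank (including 0) then joins as a worker,
# so no separate master Pod template is needed.
tracker_args = start_tracker(world_size) if rank == 0 else None
tracker_args = comm.bcast(tracker_args, root=0)

with collective.CommunicatorContext(**tracker_args):
    pass  # per-worker training as sketched earlier in the thread
```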

@andreyvelich
Member

Hi @Labreo, did you get a chance to work on this?
We are planning to release Trainer v2.1 in October, and including the XGBoost runtime would be really nice!

@nqvuong1998

Hi @Labreo, any update on this feature?

Development

Successfully merging this pull request may close these issues: Support XGBoost/LightGBM runtime and examples
