feat(runtimes): Add XGBoost runtime(KEP-2598) #3200

Open
Krishna-kg732 wants to merge 3 commits into kubeflow:master from Krishna-kg732:xgboost-runtime-implementation
Conversation

@Krishna-kg732 Krishna-kg732 commented Feb 12, 2026

What this PR does

Implements the XGBoost runtime plugin for Kubeflow Trainer V2, as proposed in KEP-2598. This plugin enables distributed XGBoost training using Rabit/Collective coordination by automatically injecting DMLC environment variables into trainer containers.

Changes

New Files

  • pkg/runtime/framework/plugins/xgboost/xgboost.go — Plugin implementing EnforceMLPolicyPlugin and CustomValidationPlugin. Injects DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_TASK_ID, DMLC_NUM_WORKER env vars and auto-derives numWorkersPerNode from GPU resources (1 worker per GPU, or 1 per node for CPU).
  • pkg/runtime/framework/plugins/xgboost/xgboost_test.go — Unit tests covering EnforceMLPolicy (nil guards, single/multi-node CPU, GPU resources, numNodes override) and Validate (reserved DMLC_* env name rejection).
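
The injection described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in xgboost.go: the EnvVar type, the dmlcEnv helper, the tracker host and port, and the use of a node index as DMLC_TASK_ID are all assumptions made for the example. Computing DMLC_NUM_WORKER as numNodes times workers per node is also assumed; the description only states the per-node rule (1 worker per GPU, or 1 per node for CPU).

```go
package main

import (
	"fmt"
	"strconv"
)

// EnvVar is a simplified, hypothetical stand-in for a container env var.
type EnvVar struct {
	Name  string
	Value string
}

// workersPerNode applies the rule from the PR description:
// one worker per GPU, or a single worker per node when no GPU is requested.
func workersPerNode(gpusPerNode int) int {
	if gpusPerNode > 0 {
		return gpusPerNode
	}
	return 1
}

// dmlcEnv builds the four DMLC_* variables for one node.
// All parameters are illustrative inputs; the total worker count
// (numNodes * workersPerNode) is an assumption of this sketch.
func dmlcEnv(trackerHost string, trackerPort, numNodes, gpusPerNode, nodeIndex int) []EnvVar {
	total := numNodes * workersPerNode(gpusPerNode)
	return []EnvVar{
		{Name: "DMLC_TRACKER_URI", Value: trackerHost},
		{Name: "DMLC_TRACKER_PORT", Value: strconv.Itoa(trackerPort)},
		{Name: "DMLC_TASK_ID", Value: strconv.Itoa(nodeIndex)},
		{Name: "DMLC_NUM_WORKER", Value: strconv.Itoa(total)},
	}
}

func main() {
	// Hypothetical two-node job with four GPUs per node, seen from node 1.
	for _, e := range dmlcEnv("tracker.example.svc", 9091, 2, 4, 1) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

Keeping the per-node rule in its own function makes the CPU and GPU cases easy to cover separately in a table-driven test.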

Modified Files

  • pkg/apis/trainer/v1alpha1/trainingruntime_types.go — Added XGBoostMLPolicySource struct, XGBoost field to MLPolicySource, and updated CEL mutual exclusion validation rule.
  • pkg/constants/constants.go — Added XGBoost/Rabit constants and XGBoostReservedEnvNames set.
  • pkg/runtime/framework/plugins/registry.go — Registered the XGBoost plugin.
  • pkg/runtime/framework/plugins/plainml/plainml.go — Added XGBoost to the PlainML fallback guard.
  • pkg/runtime/framework/core/framework_test.go — Updated TestNew to include XGBoost in expected plugin lists.
  • pkg/util/testing/wrapper.go — Added XGBoostPolicy() test helper.

How was this tested?

  • go test ./pkg/runtime/framework/plugins/xgboost/... ✅ (9 test cases)
  • go test ./pkg/runtime/framework/core/ -run TestNew
  • go test ./pkg/runtime/framework/plugins/... ✅ (all plugins pass)

TODO (follow-up PRs)

  • Add E2E tests
  • Add ClusterTrainingRuntime YAML manifests
  • Add example notebook

/kind feature
/area runtime

Copilot AI review requested due to automatic review settings February 12, 2026 04:24
@google-oss-prow

@Krishna-kg732: The label(s) area/runtime cannot be applied, because the repository doesn't have them.

Details

In response to this:

What this PR does

Adds the XGBoost runtime plugin scaffold to the Trainer V2 framework. This is the foundational PR for KEP-2598: XGBoost Runtime — it introduces the plugin structure and API types without the full implementation, which will follow in a subsequent PR.

Changes

New Files

  • pkg/runtime/framework/plugins/xgboost/xgboost.go — Plugin scaffold implementing EnforceMLPolicyPlugin with a stub EnforceMLPolicy (Rabit env injection TODO)

Modified Files

  • pkg/apis/trainer/v1alpha1/trainingruntime_types.go — Added XGBoostMLPolicySource struct and XGBoost field to MLPolicySource
  • pkg/constants/constants.go — Added XGBoost/Rabit constants (DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_TASK_ID, DMLC_NUM_WORKER) and reserved env set
  • pkg/runtime/framework/plugins/registry.go — Registered the XGBoost plugin
  • pkg/runtime/framework/plugins/plainml/plainml.go — Added XGBoost to the PlainML fallback guard

What's NOT in this PR (intentionally)

  • EnforceMLPolicy implementation (Rabit env var injection) — will be in a follow-up PR
  • Unit tests and E2E tests — will accompany the implementation PR
  • ClusterTrainingRuntime YAML manifests

How was this tested?

  • go build ./pkg/runtime/framework/plugins/...
  • go vet ./pkg/runtime/framework/plugins/xgboost/...

/kind feature
/area runtime

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@Krishna-kg732 Krishna-kg732 changed the title from "feat(runtime): Add XGBoost runtime plugin scaffold (KEP-2598)" to "feat(runtime): Add XGBoost runtime(KEP-2598)" Feb 12, 2026
@Krishna-kg732 Krishna-kg732 changed the title from "feat(runtime): Add XGBoost runtime(KEP-2598)" to "feat(runtimes): Add XGBoost runtime(KEP-2598)" Feb 12, 2026
Copilot AI left a comment

Pull request overview

Adds an initial XGBoost runtime plugin scaffold to the Trainer V2 runtime framework (per KEP-2598), along with the API wiring and constants needed to support a future Rabit env var injection implementation.

Changes:

  • Introduces an xgboost runtime plugin scaffold implementing EnforceMLPolicyPlugin (stubbed behavior for now).
  • Extends the TrainingRuntime API (MLPolicySource) with an xgboost policy source and updates the “only one policy” validation rule.
  • Adds XGBoost/Rabit-related env var constants and registers the plugin in the runtime plugin registry (and updates PlainML fallback guard).
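
The "only one policy" rule is enforced in the API through a CEL validation rule; the equivalent logic can be sketched in plain Go as below. The struct here is a simplified stand-in with empty placeholder types, not the real MLPolicySource definition, and the field set shown is an assumption for illustration.

```go
package main

import "fmt"

// Empty placeholder types standing in for the real policy source structs.
type (
	TorchSource   struct{}
	MPISource     struct{}
	JAXSource     struct{}
	XGBoostSource struct{}
)

// MLPolicySource is a simplified, hypothetical mirror of the API type:
// each framework is an optional pointer, and at most one may be set.
type MLPolicySource struct {
	Torch   *TorchSource
	MPI     *MPISource
	JAX     *JAXSource
	XGBoost *XGBoostSource
}

// configuredSources lists the names of all non-nil policy sources.
func configuredSources(s MLPolicySource) []string {
	var set []string
	if s.Torch != nil {
		set = append(set, "torch")
	}
	if s.MPI != nil {
		set = append(set, "mpi")
	}
	if s.JAX != nil {
		set = append(set, "jax")
	}
	if s.XGBoost != nil {
		set = append(set, "xgboost")
	}
	return set
}

// validateExclusive rejects a source with more than one policy configured.
func validateExclusive(s MLPolicySource) error {
	if set := configuredSources(s); len(set) > 1 {
		return fmt.Errorf("only one ML policy source may be set, got %v", set)
	}
	return nil
}

func main() {
	fmt.Println(validateExclusive(MLPolicySource{XGBoost: &XGBoostSource{}}))
	fmt.Println(validateExclusive(MLPolicySource{Torch: &TorchSource{}, XGBoost: &XGBoostSource{}}))
}
```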

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| pkg/runtime/framework/plugins/xgboost/xgboost.go | New XGBoost plugin scaffold (EnforceMLPolicy stub + plugin name/factory). |
| pkg/runtime/framework/plugins/registry.go | Registers the XGBoost plugin in the plugin factory registry. |
| pkg/runtime/framework/plugins/plainml/plainml.go | Ensures PlainML no-ops when XGBoost (and JAX) ML policy sources are configured. |
| pkg/constants/constants.go | Adds Rabit/XGBoost env var constants + reserved env name set. |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Adds XGBoostMLPolicySource + MLPolicySource.XGBoost, and updates ML policy exclusivity validation. |

Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
@Krishna-kg732 Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 729c8be to 49c768a Compare February 12, 2026 04:33
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Feb 14, 2026
@Krishna-kg732 Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from a57f8a6 to 985eaf4 Compare February 14, 2026 05:09
Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
@Krishna-kg732 Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 985eaf4 to e5c552e Compare February 14, 2026 05:10
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Feb 16, 2026
@coveralls
coveralls commented Feb 16, 2026

Pull Request Test Coverage Report for Build 22064651978

Details

  • 77 of 84 (91.67%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.2%) to 57.176%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/runtime/framework/plugins/registry.go | 0 | 1 | 0.0% |
| pkg/runtime/framework/plugins/xgboost/xgboost.go | 72 | 78 | 92.31% |

Totals

  • Change from base Build 22051165353: 1.2%
  • Covered Lines: 1466
  • Relevant Lines: 2564

💛 - Coveralls

@Krishna-kg732 Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 1d98811 to 7ec359f Compare February 16, 2026 13:16
Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
@Krishna-kg732 Krishna-kg732 force-pushed the xgboost-runtime-implementation branch from 7ec359f to 38e1f5a Compare February 16, 2026 13:29