feat: customize R generation to include GPUs #3325
vsoch wants to merge 3 commits into kubeflow:master from
Conversation
Flux detection of GPUs depends on hwloc plugins, and a newer version. To get around any edge cases where the dependencies are missing, we can easily generate the R from the expected resource spec (cores and GPUs). We can also add a shared memory mount to ensure the job's MPI gets all available shared memory of the host. The current default is 64M (set automatically by the container runtime), which can hurt MPI performance. Signed-off-by: vsoch <vsoch@users.noreply.github.com>
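As a rough illustration of the idea, here is a self-contained Go sketch of how per-node core/GPU counts could be turned into idset range flags for `flux R encode` (the helper name `buildRSpec` and the counts are hypothetical; the actual plugin derives them from the job's resource spec):

```go
package main

import "fmt"

// buildRSpec is a hypothetical helper: given per-node core and GPU counts
// from the expected resource spec, emit idset range flags for
// `flux R encode` instead of relying on hwloc-based detection.
func buildRSpec(cores, gpus int32) string {
	spec := fmt.Sprintf("--cores=0-%d", cores-1)
	if gpus > 0 {
		spec += fmt.Sprintf(" --gpus=0-%d", gpus-1)
	}
	return spec
}

func main() {
	fmt.Println(buildRSpec(8, 2)) // --cores=0-7 --gpus=0-1
	fmt.Println(buildRSpec(4, 0)) // --cores=0-3
}
```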
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Pull request overview
Updates the Flux runtime plugin to generate a Flux resource file (`flux R encode`) based on derived per-node CPU/GPU topology and introduces a new in-memory EmptyDir volume intended for shared memory.
Changes:
- Parameterize `templates/entrypoint.sh` so `flux R encode` can receive a computed resource spec (cores/GPU ranges).
- Refactor Flux entrypoint generation to build explicit Flux flags (`-N`/`-n` and optional `-g`) and derive an `Rspec` string for resource encoding.
- Add a new `shared-memory` volume constant and include a memory-backed EmptyDir volume in Flux view volumes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `pkg/runtime/framework/plugins/flux/templates/entrypoint.sh` | Accepts a formatted resource spec argument for `flux R encode`. |
| `pkg/runtime/framework/plugins/flux/flux.go` | Builds `Rspec`/flags for Flux execution and adds a memory-backed EmptyDir volume to the pod spec. |
| `pkg/constants/constants.go` | Introduces the `FluxMemoryVolumeName` constant for the new shared-memory volume. |
```go
memoryVolumeAC := corev1ac.Volume().
	WithName(constants.FluxMemoryVolumeName).
	WithEmptyDir(corev1ac.EmptyDirVolumeSource().
		WithMedium(corev1.StorageMediumMemory))
fluxVolumeAC := corev1ac.Volume().
```
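For reference, the applyconfiguration above should render in the pod spec roughly like this (a sketch; the volume name assumes `constants.FluxMemoryVolumeName` is `shared-memory`, per the constants change):

```yaml
volumes:
  - name: shared-memory   # assumed value of constants.FluxMemoryVolumeName
    emptyDir:
      medium: Memory      # tmpfs-backed, lifting the 64M /dev/shm default from the container runtime
```

Note that a memory-medium EmptyDir counts against the container's memory accounting, so a `sizeLimit` could be added later if unbounded shared memory is a concern.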
```go
} else {
	tasks = fmt.Sprintf("-N %d -n %d", nodes, *info.RuntimePolicy.MLPolicySource.Flux.NumProcPerNode*nodes)
	tasks = *info.RuntimePolicy.MLPolicySource.Flux.NumProcPerNode
}
```
```go
// Path for Flux curve path
FluxCurveVolumePath = "/curve"
```
```go
// Ensure MPI has full memory of the host
```
```diff
 # Generate host resources
 hosts=$(cat ${configroot}/etc/flux/system/hostlist)
-flux R encode --hosts=${hosts} --local > /tmp/R
+flux R encode --hosts=${hosts} %s > /tmp/R
```
Also update the volume to be added to the container. This is still pending testing on AWS! Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@andreyvelich I'm going for a quick run and will test with GPU when I am back! Would you like an example added to the examples/flux directory that uses GPUs? My plan is to test LAMMPS with GPU on AWS.
Sure, let's add another example in the Flux subdirectory.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>

What this PR does / why we need it:
This will close #3321
I would like to test this with GPUs on AWS today before we do any kind of merge. Thank you!
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Fixes #3321
Checklist: