
feat: customize R generation to include GPUs#3325

Open
vsoch wants to merge 3 commits into kubeflow:master from converged-computing:add-r-generation

Conversation

@vsoch
Contributor

@vsoch vsoch commented Mar 13, 2026

What this PR does / why we need it:

Flux detection of GPUs depends on hwloc plugins and a newer version. To get around any edge cases where the dependencies are missing, we can easily generate R from the expected resource spec (cores and GPUs). We can also add a shared memory mount to ensure the job's MPI processes get all available shared memory on the host. The current default is 64M (set automatically by the container runtime), which can have implications for MPI performance.
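As a rough sketch of the idea (the helper name is hypothetical, not the plugin's actual code), the core/GPU ranges for `flux R encode` can be derived from the expected per-node resource spec instead of relying on hwloc detection:

```go
package main

import "fmt"

// rspec is a hypothetical helper: it builds the --cores/--gpus idset
// arguments for `flux R encode` from the expected per-node counts,
// rather than probing hwloc at runtime.
func rspec(coresPerNode, gpusPerNode int) string {
	spec := fmt.Sprintf("--cores=0-%d", coresPerNode-1)
	if gpusPerNode > 0 {
		spec += fmt.Sprintf(" --gpus=0-%d", gpusPerNode-1)
	}
	return spec
}

func main() {
	fmt.Println(rspec(8, 2)) // --cores=0-7 --gpus=0-1
}
```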

This will close #3321

I would like to test this with GPUs on AWS today before we do any kind of merge. Thank you!

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #3321

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings March 13, 2026 11:13
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Updates the Flux runtime plugin to generate a Flux resource file (flux R encode) based on derived per-node CPU/GPU topology and introduces a new in-memory EmptyDir volume intended for shared memory.

Changes:

  • Parameterize templates/entrypoint.sh so flux R encode can receive a computed resource spec (cores/GPU ranges).
  • Refactor Flux entrypoint generation to build explicit Flux flags (-N/-n and optional -g) and derive an Rspec string for resource encoding.
  • Add a new shared-memory volume constant and include a memory-backed EmptyDir volume in Flux view volumes.
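The flag derivation described in the second bullet could look roughly like the following (an illustrative sketch based only on the review summary; the function name and the exact `-g` semantics are assumptions, not the plugin's code):

```go
package main

import "fmt"

// fluxFlags sketches building explicit Flux flags from the resource
// spec: -N for nodes, -n for total tasks, and an optional -g when
// GPUs are requested (per the review summary above).
func fluxFlags(nodes, procsPerNode, gpusPerTask int) string {
	flags := fmt.Sprintf("-N %d -n %d", nodes, nodes*procsPerNode)
	if gpusPerTask > 0 {
		flags += fmt.Sprintf(" -g %d", gpusPerTask)
	}
	return flags
}

func main() {
	fmt.Println(fluxFlags(2, 4, 1)) // -N 2 -n 8 -g 1
}
```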

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
pkg/runtime/framework/plugins/flux/templates/entrypoint.sh Accepts a formatted resource spec argument for flux R encode.
pkg/runtime/framework/plugins/flux/flux.go Builds Rspec/flags for Flux execution and adds a memory-backed EmptyDir volume to the pod spec.
pkg/constants/constants.go Introduces FluxMemoryVolumeName constant for the new shared-memory volume.
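For reference, the memory-backed EmptyDir that the new constant names corresponds to a pod-spec fragment like this (a sketch; the literal volume name and mount path are illustrative, not taken from the PR):

```yaml
volumes:
  - name: flux-memory          # illustrative; the PR uses constants.FluxMemoryVolumeName
    emptyDir:
      medium: Memory           # tmpfs backed by host memory, not the 64M runtime default
containers:
  - name: node
    volumeMounts:
      - name: flux-memory
        mountPath: /dev/shm    # illustrative mount path for MPI shared memory
```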


Comment on lines +298 to 302 (pkg/runtime/framework/plugins/flux/flux.go):

```go
memoryVolumeAC := corev1ac.Volume().
	WithName(constants.FluxMemoryVolumeName).
	WithEmptyDir(corev1ac.EmptyDirVolumeSource().
		WithMedium(corev1.StorageMediumMemory))
fluxVolumeAC := corev1ac.Volume().
```

Related hunks from the diff:

pkg/runtime/framework/plugins/flux/flux.go:

```diff
 } else {
-	tasks = *info.RuntimePolicy.MLPolicySource.Flux.NumProcPerNode
+	tasks = fmt.Sprintf("-N %d -n %d", nodes, *info.RuntimePolicy.MLPolicySource.Flux.NumProcPerNode*nodes)
 }
```

pkg/constants/constants.go:

```go
// Path for Flux curve path
FluxCurveVolumePath = "/curve"

// Ensure MPI has full memory of the host
```

pkg/runtime/framework/plugins/flux/templates/entrypoint.sh:

```diff
 # Generate host resources
 hosts=$(cat ${configroot}/etc/flux/system/hostlist)
-flux R encode --hosts=${hosts} --local > /tmp/R
+flux R encode --hosts=${hosts} %s > /tmp/R
```
Also update the volume to be added to the container. This is still pending testing on AWS!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Copy link
Contributor Author

vsoch commented Mar 13, 2026

@andreyvelich I'm going for a quick run and will test with GPU when I am back! Would you like an example added to the examples/flux directory that uses GPUs? My plan is to test on AWS, lammps with GPU.

@andreyvelich
Member

> @andreyvelich I'm going for a quick run and will test with GPU when I am back! Would you like an example added to the examples/flux directory that uses GPUs? My plan is to test on AWS, lammps with GPU.

Sure, let's add another example in Flux subdirectory.

@vsoch
Contributor Author

vsoch commented Mar 13, 2026

GPU working is a go!

[image: screenshot of the working GPU run]

I have some changes I will push shortly, and it looks like I need to look into those failing tests.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the add-r-generation branch from 378d38a to 764a765 on March 13, 2026 22:58

Development

Successfully merging this pull request may close these issues.

[feat] optimizations for flux

3 participants