nvidia-k8s-device-plugin: add ldcache parsing for aarch64 patch #501

Merged

@sky1122 sky1122 commented May 12, 2025

Description of changes:
Add a patch to fix ldcache parsing on aarch64.

k8s-device-plugin carries its own nvidia-container-toolkit for now and uses nvidia-ctk to generate the CDI specifications.

The architecture flag for aarch64 is currently missing from the list of supported architecture flags. This omission causes the getEntries function to exclude all libraries found on aarch64 hosts. As a result, helper programs like nvidia-ctk are unable to generate CDI specifications for the aarch64 architecture.
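
To illustrate the failure mode, here is a minimal sketch of how an ld.so.cache entry filter of this kind behaves. It is not the toolkit's actual code: the names `cacheEntry`, `archBits`, and `filterEntries` are hypothetical, and the supported-architecture set is an assumption made for the example. The flag values follow glibc's ld.so.cache entry flags, where the high byte encodes the architecture and aarch64 is `0x0a00`.

```go
package main

import "fmt"

// ld.so.cache entry flags: the low byte is the object type and the
// high byte is the architecture (values as defined by glibc).
const (
	flagTypeMask = 0x00ff
	flagTypeELF  = 0x0001

	flagArchMask    = 0xff00
	flagArchI386    = 0x0000
	flagArchX8664   = 0x0300
	flagArchPpc64le = 0x0500
	flagArchX32     = 0x0800
	flagArchAarch64 = 0x0a00 // the kind of entry the patch adds
)

// cacheEntry is a hypothetical, simplified view of one cache record.
type cacheEntry struct {
	name  string
	flags int32
}

// archBits reports the word size for a supported architecture flag.
// Without the aarch64 case, every aarch64 entry would fall through to
// the default branch and be reported as unsupported.
func archBits(flags int32) (int, bool) {
	switch flags & flagArchMask {
	case flagArchX8664, flagArchPpc64le, flagArchAarch64:
		return 64, true
	case flagArchI386, flagArchX32:
		return 32, true
	default:
		return 0, false
	}
}

// filterEntries keeps only ELF entries with a recognized architecture.
func filterEntries(entries []cacheEntry) []string {
	var kept []string
	for _, e := range entries {
		if e.flags&flagTypeMask&flagTypeELF == 0 {
			continue // not an ELF object
		}
		if _, ok := archBits(e.flags); !ok {
			continue // unknown architecture: entry is dropped
		}
		kept = append(kept, e.name)
	}
	return kept
}

func main() {
	entries := []cacheEntry{
		{"libcuda.so.1", 0x0a03}, // aarch64 ELF entry
		{"libother.so", 0x0403},  // architecture not in the list
	}
	fmt.Println(filterEntries(entries)) // prints [libcuda.so.1]
}
```

Removing `flagArchAarch64` from the first `case` reproduces the behavior described above: on an aarch64 host the filter returns no libraries at all, so nothing is left to build a CDI specification from.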

Testing done:

  • testing with part of this PR

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

sky1122 commented May 13, 2025

The build succeeds on my build machine but fails here. Looking into why applying the patch errored.

sky1122 commented May 13, 2025

One comment was added, which is why the patch failed to apply. Will regenerate the patch.

@sky1122 sky1122 force-pushed the add-two-patches-k8s-device-plugin branch from 7675c81 to 4d5836b Compare May 14, 2025 00:20

sky1122 commented May 14, 2025

Forgot to commit the changes; will force-push again.

@sky1122 sky1122 force-pushed the add-two-patches-k8s-device-plugin branch from 4d5836b to d6500a9 Compare May 14, 2025 00:28

sky1122 commented May 14, 2025

While making the git changes I lost one of the spec file changes; will force-push again.

@sky1122 sky1122 force-pushed the add-two-patches-k8s-device-plugin branch 2 times, most recently from 8a18963 to a1e3f78 Compare May 14, 2025 00:53
k8s-device-plugin carries its own nvidia-container-toolkit and uses
nvidia-ctk to generate the CDI specifications.

The architecture flag for aarch64 is currently missing from the
supported architecture flags list. This omission causes the getEntries
function to exclude all libraries found on aarch64 hosts. As a result
helper programs like nvidia-ctk are unable to generate CDI
specifications for the aarch64 architecture.

This fix adds the missing aarch64 architecture flag, using the same
value as defined in libnvidia-container[1], which maintains a more
comprehensive list of supported architectures.

[1]: https://github.com/NVIDIA/libnvidia-container/blob/a198166e1c1166f4847598438115ea97dacc7a92/src/ldcache.h#L21

Signed-off-by: Jingwei Wang <[email protected]>
@sky1122 sky1122 force-pushed the add-two-patches-k8s-device-plugin branch from a1e3f78 to cca2da5 Compare May 14, 2025 17:53
@sky1122 sky1122 changed the title nvidia-k8s-device-plugin: add two patches nvidia-k8s-device-plugin: add ldcache parsing for aarch64 patch May 14, 2025

sky1122 commented May 14, 2025

Force-pushed to adopt the new changes for the patch.

@arnaldo2792

The previous patch to generate the devices was removed. Instead of generating the specs for the CDI device through the device plugin, we will use the generate-cdi-specs.service systemd unit to provide them. This aligns with what the GPU operator does to provide the same CDI specifications.

@arnaldo2792 arnaldo2792 marked this pull request as ready for review May 14, 2025 22:48
@@ -0,0 +1,49 @@
From be4ba83b821eea9050eefdb7e67df2d757c3795a Mon Sep 17 00:00:00 2001
@ytsssun ytsssun May 14, 2025

Nice work on this one! Q: Do we have a plan to upstream this?

They are already working on fixes 🎉 !

NVIDIA/nvidia-container-toolkit#1046

@sky1122 sky1122 merged commit 1f90504 into bottlerocket-os:develop May 15, 2025
2 checks passed