fix: use configurable GPU resource name instead of hardcoded nvidia.com/gpu #7237

Open
nuthalapativarun wants to merge 18 commits into flyteorg:master from nuthalapativarun:fix/gpu-resource-name-hardcoding-6746

@nuthalapativarun

Tracking issue

Closes #6746

Why are the changes needed?

Two code paths hardcoded "nvidia.com/gpu" when building Kubernetes ResourceList entries from Flyte GPU resource requirements. This hardcoding took precedence over the GpuResourceName field already configurable in K8sPluginConfig (via the gpu-resource-name Helm/propeller config value), making it impossible to use alternative GPU resource types such as amd.com/gpu or intel.com/gpu.

The affected paths were:

  • flyteplugins/go/tasks/pluginmachinery/flytek8s/utils.go: ToK8sResourceList used the package-level ResourceNvidiaGPU constant.
  • flytepropeller/pkg/controller/nodes/task/taskexec_context.go: convertTaskResourcesToRequirements used utils.ResourceNvidiaGPU.

Note that SanitizeGPUResourceRequirements and ApplyResourceOverrides in container_helper.go already correctly read from config.GetK8sPluginConfig().GpuResourceName. This PR brings the two remaining call sites into alignment.
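
As a rough illustration of the difference between the two behaviors, here is a minimal, self-contained sketch. It does not use the real Flyte or Kubernetes types: ResourceName, PluginConfig, Entry, and both conversion functions are hypothetical stand-ins, and only the GpuResourceName field name and the nvidia.com/gpu and amd.com/gpu values come from this PR.

```go
package main

import "fmt"

// ResourceName is a hypothetical stand-in for the Kubernetes resource name type.
type ResourceName string

const (
	ResourceCPU       ResourceName = "cpu"
	ResourceNvidiaGPU ResourceName = "nvidia.com/gpu"
)

// PluginConfig is a hypothetical stand-in that models only the
// GpuResourceName field of K8sPluginConfig.
type PluginConfig struct {
	GpuResourceName ResourceName
}

// Entry is a simplified resource request: a kind ("CPU" or "GPU") and a quantity.
type Entry struct {
	Kind  string
	Value string
}

// toResourceListHardcoded models the old behavior: every GPU request is
// keyed by nvidia.com/gpu, whatever the configuration says.
func toResourceListHardcoded(entries []Entry) map[ResourceName]string {
	out := make(map[ResourceName]string, len(entries))
	for _, e := range entries {
		switch e.Kind {
		case "CPU":
			out[ResourceCPU] = e.Value
		case "GPU":
			out[ResourceNvidiaGPU] = e.Value
		}
	}
	return out
}

// toResourceListConfigured models the proposed behavior: the GPU key is
// read from the configuration.
func toResourceListConfigured(cfg PluginConfig, entries []Entry) map[ResourceName]string {
	out := make(map[ResourceName]string, len(entries))
	for _, e := range entries {
		switch e.Kind {
		case "CPU":
			out[ResourceCPU] = e.Value
		case "GPU":
			out[cfg.GpuResourceName] = e.Value
		}
	}
	return out
}

func main() {
	cfg := PluginConfig{GpuResourceName: "amd.com/gpu"}
	entries := []Entry{{Kind: "CPU", Value: "2"}, {Kind: "GPU", Value: "1"}}

	fmt.Println("hardcoded: ", toResourceListHardcoded(entries))
	fmt.Println("configured:", toResourceListConfigured(cfg, entries))
}
```

In the sketch the configuration is passed as an argument to keep the example free of global state. The PR itself reads it through config.GetK8sPluginConfig().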

What changes were proposed in this pull request?

  • flyteplugins/.../flytek8s/utils.go: Import flytek8s/config and replace ResourceNvidiaGPU with config.GetK8sPluginConfig().GpuResourceName in ToK8sResourceList.
  • flytepropeller/.../taskexec_context.go: Import flytek8s/config and replace utils.ResourceNvidiaGPU with flytek8sConfig.GetK8sPluginConfig().GpuResourceName in convertTaskResourcesToRequirements. Remove now-unused flytepropeller/pkg/utils import.
  • Tests: Update both test files to assert against config.GetK8sPluginConfig().GpuResourceName rather than the hardcoded constant, confirming the behavior is config-driven.
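
The test change can be pictured with the following sketch, which is an assumption-laden illustration and not the PR's test code: pluginConfig, gpuResourceList, and checkGPUKey are hypothetical helpers. It shows an assertion that takes the expected key from the same configuration as the code under test, and why a fixture that still expects nvidia.com/gpu stops matching when the configuration is left empty (the situation a later commit in this PR describes for TestBuildResourceRayCustomK8SPod).

```go
package main

import "fmt"

// pluginConfig is a hypothetical stand-in for K8sPluginConfig.
type pluginConfig struct {
	GpuResourceName string
}

// gpuResourceList builds a one-entry resource list keyed by the
// configured GPU resource name.
func gpuResourceList(cfg pluginConfig, quantity string) map[string]string {
	return map[string]string{cfg.GpuResourceName: quantity}
}

// checkGPUKey returns an error if the list has no entry under wantKey or
// if the quantity stored there differs from wantQty.
func checkGPUKey(list map[string]string, wantKey, wantQty string) error {
	got, ok := list[wantKey]
	if !ok {
		return fmt.Errorf("missing key %q in %v", wantKey, list)
	}
	if got != wantQty {
		return fmt.Errorf("key %q: got %q, want %q", wantKey, got, wantQty)
	}
	return nil
}

func main() {
	// Config-driven assertion: the expected key comes from the same config
	// that the code under test reads.
	cfg := pluginConfig{GpuResourceName: "amd.com/gpu"}
	fmt.Println(checkGPUKey(gpuResourceList(cfg, "1"), cfg.GpuResourceName, "1"))

	// With an empty config the list is keyed by "", so a fixture that
	// expects "nvidia.com/gpu" no longer matches.
	empty := pluginConfig{}
	fmt.Println(checkGPUKey(gpuResourceList(empty, "1"), "nvidia.com/gpu", "1"))
}
```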

How was this patch tested?

  • go test ./go/tasks/pluginmachinery/flytek8s/... in flyteplugins/ — passes.
  • go test ./pkg/controller/nodes/task/... in flytepropeller/ — passes.

Labels

  • fixed

  • I updated the documentation accordingly.

  • All new and existing tests passed.

  • All commits are signed-off.

@Sovietaced
Member

I've fixed this in our company's fork and I believe it was more involved than this due to handling of pod templates and needing to normalize the dummy GPU resource name to the configured one. If you can wait a week or two I can probably pull that work into master.

@Sovietaced
Member

> I've fixed this in our company's fork and I believe it was more involved than this due to handling of pod templates and needing to normalize the dummy GPU resource name to the configured one. If you can wait a week or two I can probably pull that work into master.

Oh this looks like some drive by AI assisted pull request so I'll wait to review this until I get back from vacation.

@codecov

codecov Bot commented Apr 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.95%. Comparing base (e938b03) to head (1ee6ed9).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7237      +/-   ##
==========================================
- Coverage   56.96%   56.95%   -0.01%     
==========================================
  Files         931      931              
  Lines       58246    58247       +1     
==========================================
- Hits        33178    33173       -5     
- Misses      22014    22020       +6     
  Partials     3054     3054              
| Flag | Coverage Δ |
|---|---|
| unittests-datacatalog | 53.51% <ø> (ø) |
| unittests-flyteadmin | 53.10% <ø> (-0.04%) ⬇️ |
| unittests-flytecopilot | 43.06% <ø> (ø) |
| unittests-flytectl | 64.14% <ø> (ø) |
| unittests-flyteidl | 75.71% <ø> (ø) |
| unittests-flyteplugins | 60.17% <100.00%> (ø) |
| unittests-flytepropeller | 53.71% <100.00%> (+<0.01%) ⬆️ |
| unittests-flytestdlib | 62.61% <ø> (ø) |

@nuthalapativarun nuthalapativarun force-pushed the fix/gpu-resource-name-hardcoding-6746 branch from 18bd4af to 3522587 Compare April 19, 2026 08:53
@kumare3
Contributor

kumare3 commented Apr 19, 2026

This has been greatly improved in v2, check the v2 branch - close to release

@nuthalapativarun
Author

Thanks for the context, @Sovietaced — happy to expand the scope to cover pod template normalization and dummy GPU resource name handling as well. If you can point me to the relevant code paths (or describe what your fork's fix looks like at a high level), I can incorporate that before your review. No rush given the vacation timeline.

@kumare3 — do you mean this should target the v2 branch instead of master, or that the issue is already resolved there and this PR should be closed? Happy to follow whichever path makes more sense for the project.

@nuthalapativarun nuthalapativarun force-pushed the fix/gpu-resource-name-hardcoding-6746 branch from c59e019 to 315c993 Compare April 23, 2026 15:12
@github-actions github-actions Bot added the flyte label May 1, 2026
nuthalapativarun and others added 18 commits May 1, 2026 12:19
…om/gpu

Two code paths hardcoded "nvidia.com/gpu" when mapping Flyte GPU resource
requirements to Kubernetes ResourceList entries, ignoring the
GpuResourceName field available in K8sPluginConfig. This prevented users
from using alternative GPU resource types (e.g. amd.com/gpu,
intel.com/gpu) even when correctly configured via the gpu-resource-name
Helm value.

- flyteplugins: ToK8sResourceList now reads GpuResourceName from
  config.GetK8sPluginConfig() instead of using the ResourceNvidiaGPU
  constant directly.
- flytepropeller: convertTaskResourcesToRequirements now reads
  GpuResourceName from flytek8sConfig.GetK8sPluginConfig() instead of
  utils.ResourceNvidiaGPU.
- Update tests in both packages to assert against the config-driven key.

Closes flyteorg#6746

Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
- Fix gci import ordering in taskexec_context_test.go (flytek8sConfig must
  come before io/mocks alphabetically)
- Set GpuResourceName explicitly in TestBuildResourceRayCustomK8SPod to
  match resourceRequirements fixture which uses nvidia.com/gpu; empty
  K8sPluginConfig caused key mismatch after ToK8sResourceList was updated
  to use configurable GPU resource name

Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
…lyteorg#7193)

Signed-off-by: Fabio Grätz <fabio@cusp.ai>
Co-authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
…g#7256)

Bumps the go_modules group with 1 update in the / directory: [github.com/jackc/pgx/v5](https://github.com/jackc/pgx).
Bumps the go_modules group with 1 update in the /datacatalog directory: [github.com/jackc/pgx/v5](https://github.com/jackc/pgx).
Bumps the go_modules group with 1 update in the /flyteadmin directory: [github.com/jackc/pgx/v5](https://github.com/jackc/pgx).
Bumps the go_modules group with 1 update in the /flytestdlib directory: [github.com/jackc/pgx/v5](https://github.com/jackc/pgx).

Updates `github.com/jackc/pgx/v5` from 5.9.0 to 5.9.2
- [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md)
- [Commits](jackc/pgx@v5.9.0...v5.9.2)

Updates `github.com/jackc/pgx/v5` from 5.9.0 to 5.9.2
- [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md)
- [Commits](jackc/pgx@v5.9.0...v5.9.2)

Updates `github.com/jackc/pgx/v5` from 5.9.0 to 5.9.2
- [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md)
- [Commits](jackc/pgx@v5.9.0...v5.9.2)

Updates `github.com/jackc/pgx/v5` from 5.9.0 to 5.9.2
- [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md)
- [Commits](jackc/pgx@v5.9.0...v5.9.2)

---
updated-dependencies:
- dependency-name: github.com/jackc/pgx/v5
  dependency-version: 5.9.2
  dependency-type: indirect
  dependency-group: go_modules
- dependency-name: github.com/jackc/pgx/v5
  dependency-version: 5.9.2
  dependency-type: indirect
  dependency-group: go_modules
- dependency-name: github.com/jackc/pgx/v5
  dependency-version: 5.9.2
  dependency-type: direct:production
  dependency-group: go_modules
- dependency-name: github.com/jackc/pgx/v5
  dependency-version: 5.9.2
  dependency-type: direct:production
  dependency-group: go_modules
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Added comments regarding database secret usage and password path.

Signed-off-by: Sam <78538841+spwoodcock@users.noreply.github.com>
Co-authored-by: Jason Parraga <Sovietaced@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com>
Co-authored-by: Jason Parraga <Sovietaced@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.6.0 to 2.6.3.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](urllib3/urllib3@2.6.0...2.6.3)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.6.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Flyte-Bot <admin@flyte.org>
Signed-off-by: Jason Parraga <sovietaced@gmail.com>
Co-authored-by: Sovietaced <Sovietaced@users.noreply.github.com>
Co-authored-by: Jason Parraga <sovietaced@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Jason Parraga <sovietaced@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: madiyar-wayve <madiyar.aitzhanov@wayve.ai>
Co-authored-by: Jason Parraga <Sovietaced@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Configures the Probot DCO app to permit individual and third-party
remediation commits, enabling sign-off backfill for branches with
historical commits that lack matching Signed-off-by lines.

Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Muskan Kumari <er.muskan09@gmail.com>
Co-authored-by: Kevin Su <pingsutw@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
…7311)

During a recent issue w/ woven, we noticed the informer cache was off causing for pods to not have their finalizers cleared. There have been rare instances across customer clusters in which pods aren't able to terminate due to not having their finalizers cleared.

This change clears finalizers using Patch (merge patch) instead of Update as to reduce instances of stale state in the informer cache causing conflicts when updating the pod. Also adding a metric to track failures when clearing finalizers.

ran in sandbox + dogfood

managed-cluster-all

Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F).
- [x] To be upstreamed to OSS

ref: https://linear.app/unionai/issue/BB-6030/finalizers-preventing-pods-from-terminating

* [ ] Added tests
* [ ] Ran a deploy dry run and shared the terraform plan
* [ ] Added logging and metrics
* [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list)
* [ ] Updated documentation

(cherry picked from commit 515ffb6)

Signed-off-by: Paul Dittamo <pvdittamo@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
* added: update to contributing file
Signed-off-by:amaechi hope amaechihope20@gmail.com

Signed-off-by: amaechi hope <amaechihope20@gmail.com>

* update typo
Signed-off-by: amaechi hope amaechihope20@gmail.com

Signed-off-by: amaechi hope <amaechihope20@gmail.com>

* uodate text content#
Signed-off-by:  amaechi hope amaechihope20@gmail.com

Signed-off-by: amaechi hope <amaechihope20@gmail.com>

* hotfix: update email address

Signed-off-by: amaechi hope <amaechihope20@gmail.com>

---------

Signed-off-by: amaechi hope <amaechihope20@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Kevin Liao <q85292542000@gmail.com>
Co-authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
@nuthalapativarun nuthalapativarun force-pushed the fix/gpu-resource-name-hardcoding-6746 branch from 091f5cd to 7885723 Compare May 1, 2026 19:20
@nuthalapativarun nuthalapativarun requested a review from ppiegaze as a code owner May 1, 2026 19:20

Development

Successfully merging this pull request may close these issues.

[BUG] Unable to use GPUs besides "nvidia.com/gpu" due to hardcoding