fix: use configurable GPU resource name instead of hardcoded nvidia.com/gpu#7237
fix: use configurable GPU resource name instead of hardcoded nvidia.com/gpu#7237nuthalapativarun wants to merge 18 commits intoflyteorg:masterfrom
Conversation
|
I've fixed this in our company's fork and I believe it was more involved than this due to handling of pod templates and needing to normalize the dummy GPU resource name to the configured one. If you can wait a week or two I can probably pull that work into master. |
Oh this looks like some drive by AI assisted pull request so I'll wait to review this until I get back from vacation. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7237 +/- ##
==========================================
- Coverage 56.96% 56.95% -0.01%
==========================================
Files 931 931
Lines 58246 58247 +1
==========================================
- Hits 33178 33173 -5
- Misses 22014 22020 +6
Partials 3054 3054
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
18bd4af to
3522587
Compare
|
This has been greatly improved in v2, check the v2 branch - close |
|
Thanks for the context, @Sovietaced — happy to expand the scope to cover pod template normalization and dummy GPU resource name handling as well. If you can point me to the relevant code paths (or describe what your fork's fix looks like at a high level), I can incorporate that before your review. No rush given the vacation timeline. @kumare3 — do you mean this should target the v2 branch instead of master, or that the issue is already resolved there and this PR should be closed? Happy to follow whichever path makes more sense for the project. |
c59e019 to
315c993
Compare
…om/gpu Two code paths hardcoded "nvidia.com/gpu" when mapping Flyte GPU resource requirements to Kubernetes ResourceList entries, ignoring the GpuResourceName field available in K8sPluginConfig. This prevented users from using alternative GPU resource types (e.g. amd.com/gpu, intel.com/gpu) even when correctly configured via the gpu-resource-name Helm value. - flyteplugins: ToK8sResourceList now reads GpuResourceName from config.GetK8sPluginConfig() instead of using the ResourceNvidiaGPU constant directly. - flytepropeller: convertTaskResourcesToRequirements now reads GpuResourceName from flytek8sConfig.GetK8sPluginConfig() instead of utils.ResourceNvidiaGPU. - Update tests in both packages to assert against the config-driven key. Closes flyteorg#6746 Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
- Fix gci import ordering in taskexec_context_test.go (flytek8sConfig must come before io/mocks alphabetically) - Set GpuResourceName explicitly in TestBuildResourceRayCustomK8SPod to match resourceRequirements fixture which uses nvidia.com/gpu; empty K8sPluginConfig caused key mismatch after ToK8sResourceList was updated to use configurable GPU resource name Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
…lyteorg#7193) Signed-off-by: Fabio Grätz <fabio@cusp.ai> Co-authored-by: Kevin Su <pingsutw@apache.org> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
…g#7256) Bumps the go_modules group with 1 update in the / directory: [github.com/jackc/pgx/v5](https://github.com/jackc/pgx). Bumps the go_modules group with 1 update in the /datacatalog directory: [github.com/jackc/pgx/v5](https://github.com/jackc/pgx). Bumps the go_modules group with 1 update in the /flyteadmin directory: [github.com/jackc/pgx/v5](https://github.com/jackc/pgx). Bumps the go_modules group with 1 update in the /flytestdlib directory: [github.com/jackc/pgx/v5](https://github.com/jackc/pgx). Updates `github.com/jackc/pgx/v5` from 5.9.0 to 5.9.2 - [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md) - [Commits](jackc/pgx@v5.9.0...v5.9.2) Updates `github.com/jackc/pgx/v5` from 5.9.0 to 5.9.2 - [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md) - [Commits](jackc/pgx@v5.9.0...v5.9.2) Updates `github.com/jackc/pgx/v5` from 5.9.0 to 5.9.2 - [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md) - [Commits](jackc/pgx@v5.9.0...v5.9.2) Updates `github.com/jackc/pgx/v5` from 5.9.0 to 5.9.2 - [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md) - [Commits](jackc/pgx@v5.9.0...v5.9.2) --- updated-dependencies: - dependency-name: github.com/jackc/pgx/v5 dependency-version: 5.9.2 dependency-type: indirect dependency-group: go_modules - dependency-name: github.com/jackc/pgx/v5 dependency-version: 5.9.2 dependency-type: indirect dependency-group: go_modules - dependency-name: github.com/jackc/pgx/v5 dependency-version: 5.9.2 dependency-type: direct:production dependency-group: go_modules - dependency-name: github.com/jackc/pgx/v5 dependency-version: 5.9.2 dependency-type: direct:production dependency-group: go_modules ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Added comments regarding database secret usage and password path. Signed-off-by: Sam <78538841+spwoodcock@users.noreply.github.com> Co-authored-by: Jason Parraga <Sovietaced@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Spyros Trigazis <spyros.trigazis@verda.com> Co-authored-by: Jason Parraga <Sovietaced@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.6.0 to 2.6.3. - [Release notes](https://github.com/urllib3/urllib3/releases) - [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst) - [Commits](urllib3/urllib3@2.6.0...2.6.3) --- updated-dependencies: - dependency-name: urllib3 dependency-version: 2.6.3 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Flyte-Bot <admin@flyte.org> Signed-off-by: Jason Parraga <sovietaced@gmail.com> Co-authored-by: Sovietaced <Sovietaced@users.noreply.github.com> Co-authored-by: Jason Parraga <sovietaced@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Jason Parraga <sovietaced@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: madiyar-wayve <madiyar.aitzhanov@wayve.ai> Co-authored-by: Jason Parraga <Sovietaced@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Configures the Probot DCO app to permit individual and third-party remediation commits, enabling sign-off backfill for branches with historical commits that lack matching Signed-off-by lines. Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Muskan Kumari <er.muskan09@gmail.com> Co-authored-by: Kevin Su <pingsutw@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
…7311) During a recent issue w/ woven, we noticed the informer cache was off causing for pods to not have their finalizers cleared. There have been rare instances across customer clusters in which pods aren't able to terminate due to not having their finalizers cleared. This change clears finalizers using Patch (merge patch) instead of Update as to reduce instances of stale state in the informer cache causing conflicts when updating the pod. Also adding a metric to track failures when clearing finalizers. ran in sandbox + dogfood managed-cluster-all Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F). - [x] To be upstreamed to OSS ref: https://linear.app/unionai/issue/BB-6030/finalizers-preventing-pods-from-terminating * [ ] Added tests * [ ] Ran a deploy dry run and shared the terraform plan * [ ] Added logging and metrics * [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list) * [ ] Updated documentation (cherry picked from commit 515ffb6) Signed-off-by: Paul Dittamo <pvdittamo@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
* added: update to contributing file Signed-off-by:amaechi hope amaechihope20@gmail.com Signed-off-by: amaechi hope <amaechihope20@gmail.com> * update typo Signed-off-by: amaechi hope amaechihope20@gmail.com Signed-off-by: amaechi hope <amaechihope20@gmail.com> * uodate text content# Signed-off-by: amaechi hope amaechihope20@gmail.com Signed-off-by: amaechi hope <amaechihope20@gmail.com> * hotfix: update email address Signed-off-by: amaechi hope <amaechihope20@gmail.com> --------- Signed-off-by: amaechi hope <amaechihope20@gmail.com> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Kevin Liao <q85292542000@gmail.com> Co-authored-by: Kevin Su <pingsutw@apache.org> Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
091f5cd to
7885723
Compare
Tracking issue
Closes #6746
Why are the changes needed?
Two code paths hardcoded
"nvidia.com/gpu"when building KubernetesResourceListentries from Flyte GPU resource requirements. This hardcoding took precedence over theGpuResourceNamefield already configurable inK8sPluginConfig(via thegpu-resource-nameHelm/propeller config value), making it impossible to use alternative GPU resource types such asamd.com/gpuorintel.com/gpu.The affected paths were:
flyteplugins/go/tasks/pluginmachinery/flytek8s/utils.go—ToK8sResourceListused the package-levelResourceNvidiaGPUconstant.flytepropeller/pkg/controller/nodes/task/taskexec_context.go—convertTaskResourcesToRequirementsusedutils.ResourceNvidiaGPU.Note that
SanitizeGPUResourceRequirementsandApplyResourceOverridesincontainer_helper.goalready correctly read fromconfig.GetK8sPluginConfig().GpuResourceName. This PR brings the two remaining call sites into alignment.What changes were proposed in this pull request?
flyteplugins/.../flytek8s/utils.go: Importflytek8s/configand replaceResourceNvidiaGPUwithconfig.GetK8sPluginConfig().GpuResourceNameinToK8sResourceList.flytepropeller/.../taskexec_context.go: Importflytek8s/configand replaceutils.ResourceNvidiaGPUwithflytek8sConfig.GetK8sPluginConfig().GpuResourceNameinconvertTaskResourcesToRequirements. Remove now-unusedflytepropeller/pkg/utilsimport.config.GetK8sPluginConfig().GpuResourceNamerather than the hardcoded constant, confirming the behavior is config-driven.How was this patch tested?
go test ./go/tasks/pluginmachinery/flytek8s/...inflyteplugins/— passes.go test ./pkg/controller/nodes/task/...inflytepropeller/— passes.Labels
fixed
I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.