feat: move external cronjob to run timely cleanup to goroutine+configable for internval of cleanup#934
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR removes the maas-api-key-cleanup CronJob and associated NetworkPolicy from kustomize resources and overlays, deletes the related image param and replacement entries, updates deployment and test params, and adjusts controller image-key mappings. The API server now runs an internal 15-minute ticker goroutine calling store.DeleteExpiredEphemeral(ctx) and defers store.Close() until after shutdown/waitgroup completion. TODO comments were added around the /internal/v1/api-keys/cleanup handler and the ephemeral cleanup service. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
d0e9818 to
0a1a951
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/hack/cleanup-odh.sh:
- Line 118: cleanup_maas_resources() currently only deletes the newer
NetworkPolicy name (maas-authorino-allow) and omits the legacy
maas-api-cleanup-restrict; update the teardown to also remove the older policy
by adding a deletion command for the NetworkPolicy named
"maas-api-cleanup-restrict" (similar to the existing kubectl delete
networkpolicy maas-authorino-allow -n "$ns" --ignore-not-found 2>/dev/null ||
true) so older installs cannot leave stale objects behind.
In `@maas-api/cmd/main.go`:
- Around line 124-140: The periodic cleanup goroutine can call
store.DeleteExpiredEphemeral while the store is being closed, causing a shutdown
race; modify the lifecycle so the worker stops before store teardown by
introducing a cancelable child context or a dedicated shutdown channel and
waiting for the goroutine to exit (e.g., with a sync.WaitGroup or <-done) prior
to calling store.Close/teardown; ensure the goroutine listens for that
cancellation (in addition to ctx.Done()) and returns, and then use that join
point to guarantee no further calls to DeleteExpiredEphemeral occur after store
shutdown.
- Line 244: The POST route registration internalRoutes.POST("/api-keys/cleanup",
apiKeyHandler.CleanupExpiredEphemeralKeys) exposes an unauthenticated
maintenance endpoint; remove this route registration from main.go (delete the
call that binds CleanupExpiredEphemeralKeys) if cleanup is handled in-process,
or alternatively wrap it with proper authentication/authorization middleware
that verifies callers before invoking apiKeyHandler.CleanupExpiredEphemeralKeys
(e.g., attach existing auth middleware or a check function on internalRoutes) to
eliminate the unauthenticated access vector.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4860570f-df2a-43be-be82-e9b1626259d7
📒 Files selected for processing (16)
.github/hack/cleanup-odh.shdeployment/base/maas-api/core/cronjob-cleanup.yamldeployment/base/maas-api/core/kustomization.yamldeployment/base/maas-api/core/networkpolicy-cleanup.yamldeployment/base/maas-api/overlays/tls/cronjob-cleanup-patch.yamldeployment/base/maas-api/overlays/tls/kustomization.yamldeployment/base/maas-api/overlays/tls/networkpolicy-cleanup-patch.yamldeployment/base/maas-controller/manager/manager.yamldeployment/components/shared-patches/kustomization.yamldeployment/overlays/odh/params.envmaas-api/cmd/main.gomaas-api/deploy/overlays/odh/params.envmaas-api/internal/api_keys/handler.gomaas-api/internal/api_keys/service.gomaas-controller/pkg/platform/tenantreconcile/constants.gotest/e2e/scripts/local-deploy.sh
💤 Files with no reviewable changes (12)
- deployment/base/maas-api/overlays/tls/cronjob-cleanup-patch.yaml
- deployment/base/maas-api/core/cronjob-cleanup.yaml
- deployment/base/maas-api/core/networkpolicy-cleanup.yaml
- deployment/base/maas-controller/manager/manager.yaml
- maas-api/deploy/overlays/odh/params.env
- deployment/components/shared-patches/kustomization.yaml
- deployment/base/maas-api/overlays/tls/networkpolicy-cleanup-patch.yaml
- maas-controller/pkg/platform/tenantreconcile/constants.go
- deployment/overlays/odh/params.env
- test/e2e/scripts/local-deploy.sh
- deployment/base/maas-api/core/kustomization.yaml
- deployment/base/maas-api/overlays/tls/kustomization.yaml
| internalRoutes := router.Group("/internal/v1") | ||
| internalRoutes.POST("/api-keys/validate", apiKeyHandler.ValidateAPIKeyHandler) | ||
| internalRoutes.POST("/api-keys/cleanup", apiKeyHandler.CleanupExpiredEphemeralKeys) | ||
| internalRoutes.POST("/api-keys/cleanup", apiKeyHandler.CleanupExpiredEphemeralKeys) // TODO: consider remove endpoint if not public access |
There was a problem hiding this comment.
Unauthenticated cleanup route is a security gap (CWE-306, CWE-862).
Severity: Major.
Exploit scenario: any actor with network reachability to this service can repeatedly invoke /internal/v1/api-keys/cleanup, forcing untrusted maintenance operations and avoidable DB load.
If cleanup is now in-process, remove this route; otherwise gate it with explicit authn/authz.
Suggested patch (remove route)
- internalRoutes.POST("/api-keys/cleanup", apiKeyHandler.CleanupExpiredEphemeralKeys) // TODO: consider remove endpoint if not public accessAs per coding guidelines **: REVIEW PRIORITIES: Security vulnerabilities.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| internalRoutes.POST("/api-keys/cleanup", apiKeyHandler.CleanupExpiredEphemeralKeys) // TODO: consider remove endpoint if not public access |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@maas-api/cmd/main.go` at line 244, The POST route registration
internalRoutes.POST("/api-keys/cleanup",
apiKeyHandler.CleanupExpiredEphemeralKeys) exposes an unauthenticated
maintenance endpoint; remove this route registration from main.go (delete the
call that binds CleanupExpiredEphemeralKeys) if cleanup is handled in-process,
or alternatively wrap it with proper authentication/authorization middleware
that verifies callers before invoking apiKeyHandler.CleanupExpiredEphemeralKeys
(e.g., attach existing auth middleware or a check function on internalRoutes) to
eliminate the unauthenticated access vector.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
maas-api/cmd/main.go (2)
244-244:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUnauthenticated cleanup endpoint enables unauthorized maintenance actions (CWE-306, CWE-862).
Severity: Major.
Exploit scenario: Line 244 exposes/internal/v1/api-keys/cleanupunder unauthenticated internal routes (Line 241), so any reachable in-cluster actor can trigger cleanup repeatedly, forcing DB work and premature ephemeral-key invalidation.Suggested fix (remove route if goroutine is the source of truth)
- internalRoutes.POST("/api-keys/cleanup", apiKeyHandler.CleanupExpiredEphemeralKeys) // TODO: consider remove endpoint if not public accessAs per coding guidelines
**: REVIEW PRIORITIES: Security vulnerabilities (provide severity, exploit scenario, and remediation code).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@maas-api/cmd/main.go` at line 244, The POST route registration internalRoutes.POST("/api-keys/cleanup", apiKeyHandler.CleanupExpiredEphemeralKeys) exposes an unauthenticated maintenance action; remove or protect it: either delete this route and call apiKeyHandler.CleanupExpiredEphemeralKeys from a trusted background goroutine/ticker that runs inside the service, or wrap the route with an authentication/authorization middleware that ensures only an internal, authenticated caller can invoke it (e.g., validate mutual TLS or internal service token) and log/rate-limit invocations; update tests and docs to reflect the chosen approach.
124-140:⚠️ Potential issue | 🟠 Major | ⚡ Quick winShutdown race: cleanup worker can call store during teardown.
Line 124 starts a long-lived worker, but
store.Close()is deferred separately (Line 118) and the worker is not joined before teardown. This can triggerDeleteExpiredEphemeralwhile the store is closing, causing nondeterministic shutdown failures.Suggested fix
- defer func() { - if err := store.Close(); err != nil { - log.Error("Failed to close token store", "error", err) - } - }() + cleanupCtx, stopCleanup := context.WithCancel(ctx) + cleanupDone := make(chan struct{}) go func() { + defer close(cleanupDone) ticker := time.NewTicker(15 * time.Minute) defer ticker.Stop() for { select { case <-ticker.C: - count, err := store.DeleteExpiredEphemeral(ctx) + count, err := store.DeleteExpiredEphemeral(cleanupCtx) if err != nil { log.Error("Failed to cleanup expired ephemeral keys", "error", err) } else if count > 0 { log.Info("Ephemeral key cleanup completed", "deletedCount", count) } - case <-ctx.Done(): // just to not wait gracefully + case <-cleanupCtx.Done(): return } } }() + defer func() { + stopCleanup() + <-cleanupDone + if err := store.Close(); err != nil { + log.Error("Failed to close token store", "error", err) + } + }()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@maas-api/cmd/main.go` around lines 124 - 140, The background cleanup goroutine may call store.DeleteExpiredEphemeral while store.Close() is running, causing a race; stop the worker and wait for it to exit before closing the store by wiring its lifecycle to the same cancellable context/stop mechanism and joining it (e.g., use a sync.WaitGroup, an errgroup, or a dedicated stop channel) so that the goroutine listening on ctx.Done() returns deterministically; ensure the goroutine checks the provided context (the same one used for shutdown) and the main routine waits for that goroutine to finish before calling store.Close() (references: the anonymous goroutine that creates ticker, DeleteExpiredEphemeral, ctx.Done(), and store.Close()).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@maas-api/cmd/main.go`:
- Line 244: The POST route registration internalRoutes.POST("/api-keys/cleanup",
apiKeyHandler.CleanupExpiredEphemeralKeys) exposes an unauthenticated
maintenance action; remove or protect it: either delete this route and call
apiKeyHandler.CleanupExpiredEphemeralKeys from a trusted background
goroutine/ticker that runs inside the service, or wrap the route with an
authentication/authorization middleware that ensures only an internal,
authenticated caller can invoke it (e.g., validate mutual TLS or internal
service token) and log/rate-limit invocations; update tests and docs to reflect
the chosen approach.
- Around line 124-140: The background cleanup goroutine may call
store.DeleteExpiredEphemeral while store.Close() is running, causing a race;
stop the worker and wait for it to exit before closing the store by wiring its
lifecycle to the same cancellable context/stop mechanism and joining it (e.g.,
use a sync.WaitGroup, an errgroup, or a dedicated stop channel) so that the
goroutine listening on ctx.Done() returns deterministically; ensure the
goroutine checks the provided context (the same one used for shutdown) and the
main routine waits for that goroutine to finish before calling store.Close()
(references: the anonymous goroutine that creates ticker,
DeleteExpiredEphemeral, ctx.Done(), and store.Close()).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: fced3fba-f33b-4a6a-9725-04eadd7e603b
📒 Files selected for processing (2)
maas-api/cmd/main.gomaas-controller/pkg/platform/tenantreconcile/constants.go
196a598 to
bd0e922
Compare
|
Hi @zdtsw , there is a lint failure that may need your attention. Thanks! |
thanks! fixed lint |
- save resource, no spin up job every 15mins - less image to pull down for disconnect - should cleanup endpoint later if needed - less security concern as one less networkpolicy and patch wors Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- for race condition case, use sync.WaitGroup to let goroutine finish Signed-off-by: Wen Zhou <wenzhou@redhat.com>
This reverts commit 08e67f1.
- either pass as flag --cleanup-internval to main or env variable CLEANUP_INTERNVAL_MINUTES in deployment for customized value - if not set any or set to '0', use default 15mins - if set to a negative value, skip cleanup goroutine totally Signed-off-by: Wen Zhou <wenzhou@redhat.com>
08e67f1 to
62c0be9
Compare
|
@zdtsw: The following test has Succeeded: OCI Artifact Browser URLInspecting Test Artifacts ManuallyTo inspect your test artifacts manually, follow these steps:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:maas-group-test-qc86q |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jland-redhat, zdtsw The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…onJob Remove stale CronJob patching, deploy params, and RBAC from opendatahub-io#934 fallout. Add PruneLegacyCleanupResources on tenant reconcile, run maas-controller CI on every PR, and update docs/E2E for CLEANUP_INTERVAL_MINUTES.
…able for internval of cleanup (opendatahub-io#934) <!--- Provide a general summary of your changes in the Title above --> ## Description <!--- Describe your changes in detail --> - save resource, no spin up job every 15mins - less image to pull down for disconnect - should cleanup endpoint later if needed - less security concern as one less networkpolicy and patch - make internval configable: - either pass as flag `--cleanup-internval `to main or env variable `CLEANUP_INTERNVAL_MINUTES` in deployment for customized value - if not set any or set to '0', use default 15mins - if set to a negative value, skip cleanup goroutine totally ## How Has This Been Tested? <!--- Please describe in detail how you tested your changes. --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [ ] The commits are squashed in a cohesive manner and have meaningful messages. - [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [ ] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * API key cleanup moved from external scheduled jobs into a built-in background task within the service. * Removed external cleanup scheduling and related network restrictions; deployment wiring simplified and related image/parameter removed. * Added internal TODOs and small runtime shutdown sequencing improvements. * **Chores** * Cleaned up deployment overlays, patches, and params to remove now-unused cleanup artifacts. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Wen Zhou <wenzhou@redhat.com>
## Promotion: main → stable Automated promotion of **10 commit(s)** from `main` to `stable`. | Detail | Value | | --- | --- | | Promotion | `main` → `stable` | | Commits to merge | **10** | | Conflict check | ✅ Passed | ### Commits included ```` 81628d8 feat: add pre-auth ext_proc stage for body-based model routing (#948) 5127624 fix: use tiny-gpt2 URI for premium simulator in CI (#961) 3bb5b77 fix: route subscription latency to odh monitoring stack (#906) 06ffb2d fix: revert "feat: move external cronjob to run timely cleanup to goroutin… (#953) 44613ee feat: move external cronjob to run timely cleanup to goroutine+configable for internval of cleanup (#934) a7f20f0 fix: update ODH version in scripts to get up-to-date TLS feature (#945) dfb6bb3 test: enable OIDC by default in E2E and expand test coverage (3→16 tests) (#679) 37e9be9 fix: remove namespace field from model refs in GET /subscriptions response (#915) ae511cc feat: alert on authorino maas-api metadata evaluator failures (#895) d1cfced docs: add payload processing namespace configuration reference (#822) ```` ---⚠️ **Merge this PR with a merge commit** (do not squash or rebase).
## Promotion: stable → rhoai Automated promotion of **11 commit(s)** from `stable` to `rhoai`. | Detail | Value | | --- | --- | | Promotion | `stable` → `rhoai` | | Commits to merge | **11** | | Conflict check | ✅ Passed | ### Commits included ```` 81628d8 feat: add pre-auth ext_proc stage for body-based model routing (#948) 5127624 fix: use tiny-gpt2 URI for premium simulator in CI (#961) 3bb5b77 fix: route subscription latency to odh monitoring stack (#906) 06ffb2d fix: revert "feat: move external cronjob to run timely cleanup to goroutin… (#953) 44613ee feat: move external cronjob to run timely cleanup to goroutine+configable for internval of cleanup (#934) a7f20f0 fix: update ODH version in scripts to get up-to-date TLS feature (#945) dfb6bb3 test: enable OIDC by default in E2E and expand test coverage (3→16 tests) (#679) 37e9be9 fix: remove namespace field from model refs in GET /subscriptions response (#915) ae511cc feat: alert on authorino maas-api metadata evaluator failures (#895) d1cfced docs: add payload processing namespace configuration reference (#822) ```` ---⚠️ **Merge this PR with a merge commit** (do not squash or rebase).
…onJob Remove stale CronJob patching, deploy params, and RBAC from opendatahub-io#934 fallout. Add PruneLegacyCleanupResources on tenant reconcile, run maas-controller CI on every PR, and update docs/E2E for CLEANUP_INTERVAL_MINUTES.
Description
--cleanup-internvalto main or env variableCLEANUP_INTERNVAL_MINUTESindeployment for customized value
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit
Refactor
Chores