chore: add SAR Admin cache by yu-teo · Pull Request #786 · opendatahub-io/models-as-a-service

yu-teo · 2026-04-21T18:31:06Z

https://redhat.atlassian.net/browse/RHOAIENG-54975

Description

This PR introduces new files:

maas-api/internal/auth/cached_admin_checker.go — CachedAdminChecker wrapping SARAdminChecker with TTL-based cache (30s), sync.RWMutex+map for thread safety, lazy eviction of expired entries, and Prometheus counters (sar_cache_hits_total, sar_cache_misses_total)
maas-api/internal/auth/cached_admin_checker_test.go — 15 tests covering cache hits/misses, TTL expiry, group-order invariance, fail-closed guards, metrics, concurrent access, eviction, and constructor panics

As well as, I modified files:

maas-api/internal/config/cluster_config.go — AdminChecker field changed to *auth.CachedAdminChecker, wraps SARAdminChecker with 30s TTL
maas-api/cmd/main.go — added /metrics endpoint via promhttp.Handler()
maas-api/go.mod / go.sum — prometheus/client_golang promoted to direct dependency

How Has This Been Tested?

Newly introduced tests provide coverage.

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

New Features
- Prometheus metrics endpoint added (GET /metrics).
Performance
- Admin authorization checks cached with a 30s TTL (shorter negative TTL) to reduce repeated lookups.
Bug Fixes
- Admin-check failures are now surfaced as errors and cause handlers to return 500 instead of being treated as non-admin.
Tests
- Comprehensive tests covering caching, TTL expiry, concurrency, error handling, and metrics counters.

This reverts commit 3a1a1f6.

openshift-ci · 2026-04-21T18:31:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yu-teo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [yu-teo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-04-21T18:31:22Z

Warning

Rate limit exceeded

@yu-teo has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 58 minutes and 59 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 58 minutes and 59 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: ccb7e067-0380-45ae-9fa5-0762cd3a1c71

📥 Commits

Reviewing files that changed from the base of the PR and between eb97650 and 9939a59.

📒 Files selected for processing (7)

maas-api/cmd/main.go
maas-api/internal/api_keys/handler.go
maas-api/internal/auth/cached_admin_checker.go
maas-api/internal/auth/cached_admin_checker_test.go
maas-api/internal/config/cluster_config.go
maas-api/internal/config/config.go
maas-api/internal/constant/const.go

📝 Walkthrough

Walkthrough

This change adds a Prometheus metrics endpoint GET /metrics to the HTTP router, promotes Prometheus modules from indirect to direct in go.mod, introduces a TTL-based in-memory caching wrapper CachedAdminChecker around the existing SAR-based admin checker (with Prometheus hit/miss counters and clock injection), integrates the cached checker into ClusterConfig using a 30s positive TTL and 2s negative TTL, updates admin-checking APIs to return (bool, error) and propagates errors to callers, and adds comprehensive tests covering caching semantics, concurrency, TTL expiry, eviction, and metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Security & Implementation Issues

CWE-200: Unauthenticated metrics endpoint exposure

The /metrics route is exposed without authentication. Metrics can disclose operational and environment details. Action: restrict access (bind to management interface, require mTLS/Bearer token, or firewall rule).

CWE-399: Unbounded cache growth (DoS vector)

Cache is a plain map keyed by username+groups and is unbounded; many distinct principals can exhaust memory. Action: enforce a maximum size (LRU or capped map), evict by policy, or add memory-pressure safeguards and instrumentation.

CWE-327: Cache key ambiguity with NUL delimiter

Cache keys are constructed by joining username and groups with NUL; if inputs may contain NULs this can lead to collisions. Action: canonicalize safely (length-prefix, JSON/base64 encoding) or validate/reject NUL-containing inputs.

CWE-770: Prometheus collector registration collisions

Counters are registered on the provided/global registry and may cause duplicate-registration panics across instances or tests. Action: use an instance-local Registerer, register metrics idempotently, or detect AlreadyRegistered and reuse safely (already partially implemented but ensure test/process isolation).

CWE-613 / Operational: Eviction only on write path; stale entries retained

Expired entries are removed only during writes, allowing expired entries to persist if writes are rare. Action: check TTL on reads and optionally run periodic background cleanup.

Behavioral correctness (actionable): negativeTTL is much shorter than positive TTL (2s vs 30s) — ensure this asymmetry is intentional; document trade-offs and potential for brief false-negative windows. Constructor panics on nil delegate or non-positive TTLs — ensure callers cannot supply runtime-configurable zero/negative values or handle panics accordingly.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'chore: add SAR Admin cache' accurately describes the main change: introducing a caching layer for the SAR-based admin checker.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@maas-api/cmd/main.go`:
- Around line 141-144: The /metrics endpoint is currently public in
registerHandlers; restrict access by adding an authorization or internal-only
guard: implement a metricsAuthMiddleware (as sketched) that reads a configured
scrape token (from cfg or cluster) and performs a constant-time compare, then
register the metrics route with that middleware (e.g., router.GET("/metrics",
metricsAuthMiddleware(cfg.PromToken), gin.WrapH(promhttp.Handler()))), or
alternatively move the route off the public router onto an internal-only
listener; ensure the token/config key is plumbed from config.Config/cluster and
return 403 on failure.

In `@maas-api/internal/auth/cached_admin_checker.go`:
- Around line 112-116: The cacheKey implementation in function cacheKey
currently concatenates user.Username and user.Groups with NULs which can be
ambiguous if username/groups contain NULs; change to an unambiguous,
length-prefixed or structured encoding: build the key from token.UserContext by
prefixing each component with its length (e.g., "<len>|<username>" then for each
group "<len>|<group>") or serialize a stable structure (e.g., JSON or gob) of
{username, groups} after sorting groups; update cacheKey to sort user.Groups (as
it already does), then produce and return the length-prefixed or serialized
string to ensure no collisions between different combinations of username and
groups.
- Around line 73-90: Replace the current "check cache -> release RLock -> call
c.delegate.IsAdmin -> write cache" pattern with singleflight to coalesce
duplicate inflight misses: add a golang.org/x/sync/singleflight.Group field
(e.g., c.sf) to the struct, and in IsAdmin use the existing read-locked fast
path unchanged; on miss call c.sf.Do(key, func() (interface{}, error) { res :=
c.delegate.IsAdmin(ctx, user); return res, nil }) to ensure only one
delegate.IsAdmin runs per key, then in the Do caller convert the returned
interface{} to bool, acquire c.mu.Lock(), write cache[key] = cacheEntry{isAdmin:
result, expiresAt: now.Add(c.ttl)}, call c.evictExpiredLocked(now) and unlock;
keep c.hits.Inc() only on cache hit and return the boolean result from the
singleflight result. Ensure you capture "now" consistently and use the same
key/user values when calling c.delegate.IsAdmin inside the singleflight closure.
- Around line 48-61: The metric registration in NewCachedAdminChecker (the
creation of the hits and misses counters on CachedAdminChecker) must be changed
to explicitly construct prometheus.NewCounter(...) and register them against the
provided reg using a helper that handles prometheus.AlreadyRegisteredError (e.g.
registerOrReuseCounter), returning the existing collector when already
registered instead of panicking; implement registerOrReuseCounter(reg
prometheus.Registerer, c prometheus.Counter) prometheus.Counter that attempts
reg.Register(c), on AlreadyRegisteredError extracts and returns the existing
counter, otherwise panics on other errors, and use it to set
CachedAdminChecker.hits and .misses instead of
promauto.With(reg).NewCounter(...).

In `@maas-api/internal/config/cluster_config.go`:
- Around line 113-115: The current use of auth.NewCachedAdminChecker(sarChecker,
30*time.Second, nil) stores positive admin responses for 30s (adminCheckerVal),
allowing revoked users to retain admin access; change this by either configuring
the cache to only store deny results (do not cache positive authorizations) or
wire Kubernetes RoleBinding/ClusterRoleBinding and MaaSAuthPolicy informer
events to immediately invalidate affected cache entries on change by calling the
cache invalidation API on the cached checker (the instance returned by
auth.NewCachedAdminChecker) whenever
RoleBinding/ClusterRoleBinding/MaaSAuthPolicy updates affect
subscriptionNamespace or the subject; locate sarChecker
(auth.NewSARAdminChecker) and adminCheckerVal to implement the chosen fix.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 3c68d26c-3297-47f4-b3f4-da55cb6fe299

📥 Commits

Reviewing files that changed from the base of the PR and between e746008 and cac09c5.

📒 Files selected for processing (5)

maas-api/cmd/main.go
maas-api/go.mod
maas-api/internal/auth/cached_admin_checker.go
maas-api/internal/auth/cached_admin_checker_test.go
maas-api/internal/config/cluster_config.go

coderabbitai · 2026-04-21T18:38:02Z

 func registerHandlers(ctx context.Context, log *logger.Logger, router *gin.Engine, cfg *config.Config, cluster *config.ClusterConfig, store api_keys.MetadataStore) error {
 	router.GET("/health", handlers.NewHealthHandler().HealthCheck)
+	router.GET("/metrics", gin.WrapH(promhttp.Handler()))



⚠️ Potential issue | 🟠 Major

Gate /metrics; it is currently public on the API router.

CWE-200. Any caller that can reach this API can scrape process/runtime metrics and SAR cache hit/miss activity. Exploit scenario: a tenant or external client polls /metrics to infer admin-check traffic patterns and service internals. Put this route on an internal-only listener or require a Prometheus scrape credential.

Remediation sketch

- router.GET("/metrics", gin.WrapH(promhttp.Handler())) + metricsRoutes := router.Group("/metrics", metricsAuthMiddleware(cfg.MetricsBearerToken)) + metricsRoutes.GET("", gin.WrapH(promhttp.Handler()))

func metricsAuthMiddleware(expected string) gin.HandlerFunc { return func(c *gin.Context) { token := strings.TrimPrefix(c.GetHeader("Authorization"), "Bearer ") if expected == "" || subtle.ConstantTimeCompare([]byte(token), []byte(expected)) != 1 { c.AbortWithStatus(http.StatusForbidden) return } c.Next() } }

As per coding guidelines, “REVIEW PRIORITIES: 1. Security vulnerabilities.”

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@maas-api/cmd/main.go` around lines 141 - 144, The /metrics endpoint is currently public in registerHandlers; restrict access by adding an authorization or internal-only guard: implement a metricsAuthMiddleware (as sketched) that reads a configured scrape token (from cfg or cluster) and performs a constant-time compare, then register the metrics route with that middleware (e.g., router.GET("/metrics", metricsAuthMiddleware(cfg.PromToken), gin.WrapH(promhttp.Handler()))), or alternatively move the route off the public router onto an internal-only listener; ensure the token/config key is plumbed from config.Config/cluster and return 403 on failure.

This finding doesn't need a fix. The /metrics endpoint is placed alongside /health on the main router, but in production this service runs inside a Kubernetes cluster where access is controlled at the network level via NetworkPolicy — the same pattern already used for the /internal/v1/* routes (lines 178-182). Those internal routes have no auth middleware either; the comment on line 177 says // Internal routes (no auth required - called by Authorino / CronJob).

Adding a scrape token would complicate the Prometheus scrape config for no real security gain — the network boundary is the access control layer. This is the standard pattern in Kubernetes-native services.

@yu-teo, understood — NetworkPolicy as the enforcement boundary for internal/metrics endpoints is a standard Kubernetes-native pattern, and the existing /internal/v1/* routes already establish that precedent in this codebase. I'll note this for future reviews.

( (
\ ( \
\ \
\ \\
\ '.\
\ \
\ \
.-' \
/ \ ~ curious about your setup ~
`--.______\

📝 Learnings were identified and not saved (knowledge base disabled). Enable

coderabbitai · 2026-04-21T18:38:02Z

+	c.mu.RLock()
+	entry, ok := c.cache[key]
+	c.mu.RUnlock()
+
+	if ok && now.Before(entry.expiresAt) {
+		c.hits.Inc()
+		return entry.isAdmin
+	}
+
+	result := c.delegate.IsAdmin(ctx, user)
+
+	c.mu.Lock()
+	c.cache[key] = cacheEntry{
+		isAdmin:   result,
+		expiresAt: now.Add(c.ttl),
+	}
+	c.evictExpiredLocked(now)
+	c.mu.Unlock()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Verify x/sync is available and whether duplicate-miss suppression already exists elsewhere. rg -nP 'golang\.org/x/sync|singleflight' -C3 rg -nP 'delegate\.IsAdmin\s*\(|type CachedAdminChecker' -C3

Repository: opendatahub-io/models-as-a-service

Length of output: 5962

🏁 Script executed:

cat -n maas-api/internal/auth/cached_admin_checker.go

Repository: opendatahub-io/models-as-a-service

Length of output: 3420

Use singleflight to suppress duplicate cache misses for the same key.

The RLock is released at line 75 before the delegate call at line 82. Concurrent requests for the same key with cache misses or expired entries will all call c.delegate.IsAdmin(), issuing multiple Kubernetes SAR requests instead of coalescing them. This defeats the stated load-shedding goal under bursty traffic.

Remediation sketch

Add golang.org/x/sync/singleflight (already in dependencies at v0.19.0) to suppress duplicate in-flight requests:

type CachedAdminChecker struct { delegate adminChecker ttl time.Duration now func() time.Time + inflight singleflight.Group mu sync.RWMutex cache map[string]cacheEntry

- result := c.delegate.IsAdmin(ctx, user) + value, _, _ := c.inflight.Do(key, func() (any, error) { + // Re-check after waiting for any earlier caller. + now := c.now() + c.mu.RLock() + entry, ok := c.cache[key] + c.mu.RUnlock() + if ok && now.Before(entry.expiresAt) { + c.hits.Inc() + return entry.isAdmin, nil + } + + result := c.delegate.IsAdmin(ctx, user) + c.mu.Lock() + c.cache[key] = cacheEntry{isAdmin: result, expiresAt: now.Add(c.ttl)} + c.evictExpiredLocked(now) + c.mu.Unlock() + c.misses.Inc() + return result, nil + }) + return value.(bool) - - c.mu.Lock() - c.cache[key] = cacheEntry{ - isAdmin: result, - expiresAt: now.Add(c.ttl), - } - c.evictExpiredLocked(now) - c.mu.Unlock() - - c.misses.Inc() - return result

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@maas-api/internal/auth/cached_admin_checker.go` around lines 73 - 90, Replace the current "check cache -> release RLock -> call c.delegate.IsAdmin -> write cache" pattern with singleflight to coalesce duplicate inflight misses: add a golang.org/x/sync/singleflight.Group field (e.g., c.sf) to the struct, and in IsAdmin use the existing read-locked fast path unchanged; on miss call c.sf.Do(key, func() (interface{}, error) { res := c.delegate.IsAdmin(ctx, user); return res, nil }) to ensure only one delegate.IsAdmin runs per key, then in the Do caller convert the returned interface{} to bool, acquire c.mu.Lock(), write cache[key] = cacheEntry{isAdmin: result, expiresAt: now.Add(c.ttl)}, call c.evictExpiredLocked(now) and unlock; keep c.hits.Inc() only on cache hit and return the boolean result from the singleflight result. Ensure you capture "now" consistently and use the same key/user values when calling c.delegate.IsAdmin inside the singleflight closure.

This finding doesn't need a fix. The current behavior (two concurrent misses for the same user both make SAR calls) was a deliberate design choice, not an oversight.
The duplicate-call window is narrow — it only occurs on the first request for a user or the instant after TTL expiry. Once either call completes and writes the cache, all subsequent requests hit the fast path. SAR calls are lightweight local Kubernetes API calls, not expensive external requests.
Adding singleflight would introduce:

An interface{} type assertion in the hot path

Context cancellation semantics where one caller's cancelled context could affect waiters

More code complexity for a scenario that self-resolves within milliseconds

@yu-teo, understood — the trade-off is well-reasoned. The narrow duplicate-call window, the lightweight nature of local Kubernetes SAR calls, and the added complexity of singleflight (especially around context cancellation propagation to waiters) are all valid justifications for keeping the current design. I'll record this for future reviews.

📝 Learnings were identified and not saved (knowledge base disabled). Enable

coderabbitai · 2026-04-21T18:38:02Z

+func cacheKey(user *token.UserContext) string {
+	sorted := make([]string, len(user.Groups))
+	copy(sorted, user.Groups)
+	slices.Sort(sorted)
+	return user.Username + "\x00" + strings.Join(sorted, "\x00")


⚠️ Potential issue | 🔴 Critical

Make the authorization cache key unambiguous.

CWE-20 / CWE-863. NUL-delimited concatenation can collide if token-derived usernames or groups contain \x00; e.g. username a with groups ["b","c"] collides with username a\x00b and groups ["c"]. An attacker controlling token claims could receive another principal’s cached admin result. Use length-prefixing or structured encoding.

Remediation sketch

import ( "context" + "strconv" "slices" "strings" "sync" "time" @@ func cacheKey(user *token.UserContext) string { sorted := make([]string, len(user.Groups)) copy(sorted, user.Groups) slices.Sort(sorted) - return user.Username + "\x00" + strings.Join(sorted, "\x00") + + var b strings.Builder + writePart := func(value string) { + b.WriteString(strconv.Itoa(len(value))) + b.WriteByte(':') + b.WriteString(value) + b.WriteByte('|') + } + writePart(user.Username) + for _, group := range sorted { + writePart(group) + } + return b.String() }

As per coding guidelines, “REVIEW PRIORITIES: 1. Security vulnerabilities.”

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@maas-api/internal/auth/cached_admin_checker.go` around lines 112 - 116, The cacheKey implementation in function cacheKey currently concatenates user.Username and user.Groups with NULs which can be ambiguous if username/groups contain NULs; change to an unambiguous, length-prefixed or structured encoding: build the key from token.UserContext by prefixing each component with its length (e.g., "<len>|<username>" then for each group "<len>|<group>") or serialize a stable structure (e.g., JSON or gob) of {username, groups} after sorting groups; update cacheKey to sort user.Groups (as it already does), then produce and return the length-prefixed or serialized string to ensure no collisions between different combinations of username and groups.

This finding doesn't need a fix. Kubernetes usernames and group names cannot contain NUL bytes — they are validated by the API server and follow DNS/RBAC naming conventions (alphanumeric, -, ., @, :). A NUL separator is unambiguous for this domain.
Adding length-prefixed encoding or JSON serialization would add complexity and allocations to a hot path for a collision scenario that cannot occur with valid Kubernetes identities.

@yu-teo, that's a reasonable rebuttal for the specific deployment context. I'll concede the finding here.

One nuance worth noting for the record: groups, like users, are represented by a string, and that string has no format requirements, other than that the prefix system: is reserved — the Kubernetes API server itself imposes no formal character-set restriction on user/group strings produced by authenticator modules (OIDC, webhook, etc.). The DNS/RBAC naming conventions apply to Kubernetes object names, not to authentication identity strings.

That said, if the identity providers in this deployment are known to produce well-constrained strings (OpenShift OAuth, etc.), the NUL separator is indeed safe in practice, and the complexity trade-off of length-prefixed encoding is not justified. I'll drop the finding.

📝 Learnings were identified and not saved (knowledge base disabled). Enable

coderabbitai · 2026-04-21T18:38:02Z

+	// Results are cached for 30s to reduce K8s API server load under high traffic.
+	sarChecker := auth.NewSARAdminChecker(clientset, subscriptionNamespace)
+	adminCheckerVal := auth.NewCachedAdminChecker(sarChecker, 30*time.Second, nil)


⚠️ Potential issue | 🟠 Major

Add invalidation before caching positive admin authorization.

CWE-613 / CWE-863. A user removed from the admin RBAC binding can keep cached admin privileges for up to 30 seconds and continue API-key admin actions. Remediate by either caching only deny results, or wiring RoleBinding/ClusterRoleBinding/MaaSAuthPolicy informer events to invalidate affected cache entries immediately.

As per coding guidelines, “REVIEW PRIORITIES: 1. Security vulnerabilities.”

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@maas-api/internal/config/cluster_config.go` around lines 113 - 115, The current use of auth.NewCachedAdminChecker(sarChecker, 30*time.Second, nil) stores positive admin responses for 30s (adminCheckerVal), allowing revoked users to retain admin access; change this by either configuring the cache to only store deny results (do not cache positive authorizations) or wire Kubernetes RoleBinding/ClusterRoleBinding and MaaSAuthPolicy informer events to immediately invalidate affected cache entries on change by calling the cache invalidation API on the cached checker (the instance returned by auth.NewCachedAdminChecker) whenever RoleBinding/ClusterRoleBinding/MaaSAuthPolicy updates affect subscriptionNamespace or the subject; locate sarChecker (auth.NewSARAdminChecker) and adminCheckerVal to implement the chosen fix.

This finding doesn't need a fix. A 30-second cache TTL for positive admin results is the intended behavior, explicitly called out in the acceptance criteria:
"Given a user's RBAC permissions change, When the cache TTL expires, Then the new permissions are reflected in subsequent admin checks."
The acceptance criteria accept a TTL-bounded delay for permission changes — that's the fundamental tradeoff of caching. A 30-second window is consistent with the Authorino authorization cache TTLs already used elsewhere in this project (60s default in maasauthpolicy_controller.go).

@yu-teo, understood — the 30-second TTL for positive admin results is an explicit, accepted tradeoff per the acceptance criteria, and the delay window is consistent with the existing Authorino authorization cache TTLs (60s) elsewhere in the project. I'll note this for future reviews.

📝 Learnings were identified and not saved (knowledge base disabled). Enable

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

maas-api/internal/auth/cached_admin_checker.go (1)
104-110: Full-map scan on every miss.

evictExpiredLocked runs under the write lock on every cache miss and walks the entire map. With N distinct principals within a TTL window, each miss is O(N) while blocking all readers. Fine at current scale, but consider amortizing (e.g. evict at most K entries per call, or only run eviction every M misses / once per TTL via a timestamp) if the admin population grows. Not a blocker.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maas-api/internal/auth/cached_admin_checker.go` around lines 104 - 110,
evictExpiredLocked currently iterates the entire c.cache map under the write
lock on every miss, causing O(N) work and blocking readers; change
evictExpiredLocked (method on CachedAdminChecker) to amortize work by either (a)
evicting at most K entries per call (track progress with an iterator index or by
stopping after K deletions), or (b) run full eviction only once per TTL using a
lastEviction timestamp or a miss counter (increment a miss counter on cache
misses and only perform eviction when counter % M == 0), and ensure the method
still removes expired entries from c.cache but performs bounded work per
invocation while holding the write lock.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@maas-api/internal/auth/cached_admin_checker.go`:
- Around line 69-71: Before calling the delegate and before persisting the
result, check ctx.Err() and short-circuit if the context is canceled: in
cached_admin_checker.go, at the same place you validate c and user (the if c ==
nil || user == nil || user.Username == "" block) add a guard that returns false
without touching the cache when ctx.Err() != nil; similarly, when handling the
delegate response, skip writing the negative/positive result to the cache if
ctx.Err() != nil (i.e., do not persist results derived from a canceled context).
Ensure you reference the existing delegate invocation and the cache-write code
paths (the delegate call and the cache write block) and only avoid caching when
ctx.Err() != nil.
- Around line 85-96: The current CachedAdminChecker calls
c.delegate.IsAdmin(ctx, user) which returns only bool, causing transient SAR
failures to be cached as "not admin"; change the delegate contract (the IsAdmin
method used by CachedAdminChecker) to return (bool, error), and update
CachedAdminChecker.IsAdmin to: 1) treat any non-nil error or ctx.Err()!=nil as a
non-cacheable transient failure (do not write to c.cache and return the error up
to caller), 2) when the delegate returns successfully, cache true with the
existing long TTL (e.g., c.ttl) but cache false with a short negative-TTL (e.g.,
1–2s) to implement asymmetric TTLs; update references to c.cache, cacheEntry
creation, c.misses/c.hits handling and any callers to propagate the error where
appropriate.

---

Nitpick comments:
In `@maas-api/internal/auth/cached_admin_checker.go`:
- Around line 104-110: evictExpiredLocked currently iterates the entire c.cache
map under the write lock on every miss, causing O(N) work and blocking readers;
change evictExpiredLocked (method on CachedAdminChecker) to amortize work by
either (a) evicting at most K entries per call (track progress with an iterator
index or by stopping after K deletions), or (b) run full eviction only once per
TTL using a lastEviction timestamp or a miss counter (increment a miss counter
on cache misses and only perform eviction when counter % M == 0), and ensure the
method still removes expired entries from c.cache but performs bounded work per
invocation while holding the write lock.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 2af9b4aa-e3f9-46ae-b3f8-f327062ff6bd

📥 Commits

Reviewing files that changed from the base of the PR and between cac09c5 and 6fb985c.

📒 Files selected for processing (3)

maas-api/internal/auth/cached_admin_checker.go
maas-api/internal/auth/cached_admin_checker_test.go
maas-api/internal/config/cluster_config.go

🚧 Files skipped from review as they are similar to previous changes (2)

maas-api/internal/config/cluster_config.go
maas-api/internal/auth/cached_admin_checker_test.go

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@maas-api/internal/api_keys/handler.go`:
- Around line 117-123: The error path currently returns HTTP 500 when
h.isAuthorizedForKey(ctx, user, tok.Username) errors, which leaks the existence
of keys during SAR/API-server outages; change the handler to treat authErr the
same as unauthorized by logging the error but responding with the same 404 used
for non-authorized access (i.e., return the not-found response used when
authorized == false). Update the code around the h.isAuthorizedForKey call (and
the analogous block at the other occurrence around lines 298-304) so both
authErr and authorized==false follow the identical 404 response path while still
logging the internal error for operators.

In `@maas-api/internal/auth/cached_admin_checker.go`:
- Around line 32-33: The SAR result cache (cache map[string]cacheEntry with mu
sync.RWMutex) is unbounded and each miss can take a write lock and scan the map;
change it to a bounded cache to avoid DoS: introduce a maxSize constant and
either implement an LRU (e.g., using container/list or
github.com/hashicorp/golang-lru) keyed by the same cache keys or, if saturated,
skip inserting new entries; update code paths that read/insert into cache to use
RLock for lookups, only take Lock when inserting/evicting, check current size
before inserting, and on saturation either evict the least-recently-used entry
or return without caching (ensure cacheEntry and all functions that access cache
/ mu are updated accordingly).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 59bd8e05-bf27-4e3e-9737-cb39fe783cb1

📥 Commits

Reviewing files that changed from the base of the PR and between 6fb985c and eb97650.

📒 Files selected for processing (7)

maas-api/internal/api_keys/handler.go
maas-api/internal/api_keys/handler_test.go
maas-api/internal/auth/cached_admin_checker.go
maas-api/internal/auth/cached_admin_checker_test.go
maas-api/internal/auth/sar_admin_checker.go
maas-api/internal/auth/sar_admin_checker_test.go
maas-api/internal/config/cluster_config.go

rhods-ci-bot · 2026-04-22T19:24:35Z

@yu-teo: The following test has Succeeded:

OCI Artifact Browser URL

View in Artifact Browser

Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

Install ORAS (see the ORAS installation guide).
Download artifacts with the following commands:

mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:maas-group-test-vmg4s

Yuriy Teodorovych added 30 commits March 25, 2026 13:42

add provider as "test"

3a1a1f6

Revert "add provider as "test""

192c530

This reverts commit 3a1a1f6.

Merge remote-tracking branch 'upstream/main'

580f9df

Merge remote-tracking branch 'upstream/main'

75b4a49

Merge remote-tracking branch 'upstream/main'

cb63f60

Merge remote-tracking branch 'upstream/main'

8b3c6c1

Merge remote-tracking branch 'upstream/main'

5694f11

Merge remote-tracking branch 'upstream/main'

8b4443f

Merge remote-tracking branch 'upstream/main'

4959fb7

Merge remote-tracking branch 'upstream/main'

a27b033

Merge remote-tracking branch 'upstream/main'

67695fd

Merge remote-tracking branch 'upstream/main'

45bd2cb

Merge remote-tracking branch 'upstream/main'

0ccc5eb

Merge remote-tracking branch 'upstream/main'

e42ad0a

Merge remote-tracking branch 'upstream/main'

2a2a1e4

Merge remote-tracking branch 'upstream/main'

f439e9f

Merge remote-tracking branch 'upstream/main'

c099a19

Merge remote-tracking branch 'upstream/main'

6f2c694

Merge remote-tracking branch 'upstream/main'

db1a715

Merge remote-tracking branch 'upstream/main'

e6be668

Merge remote-tracking branch 'upstream/main'

010360a

Merge remote-tracking branch 'upstream/main'

7969fb2

Merge remote-tracking branch 'upstream/main'

925364c

Merge remote-tracking branch 'upstream/main'

f0509d1

Merge remote-tracking branch 'upstream/main'

9723ba3

Merge remote-tracking branch 'upstream/main'

f554f27

Merge remote-tracking branch 'upstream/main'

487897a

Merge remote-tracking branch 'upstream/main'

8c1e001

Merge remote-tracking branch 'upstream/main'

4bb4753

Merge remote-tracking branch 'upstream/main'

c0537af

Yuriy Teodorovych added 10 commits April 15, 2026 13:02

Merge remote-tracking branch 'upstream/main'

527a703

Merge remote-tracking branch 'upstream/main'

5c4b78e

Merge remote-tracking branch 'upstream/main'

9f8fa45

Merge remote-tracking branch 'upstream/main'

3ea597b

Merge remote-tracking branch 'upstream/main'

1147a3b

Merge remote-tracking branch 'upstream/main'

0d8c6b5

Merge remote-tracking branch 'upstream/main'

c18a43e

Create cached_admin_checker.go

9d9a764

add tests

6af9da2

change main && improve tests

cac09c5

openshift-ci Bot requested review from EgorLu and yossiovadia April 21, 2026 18:31

openshift-ci Bot added the approved label Apr 21, 2026

Yuriy Teodorovych added 3 commits April 21, 2026 14:36

switch to using k8s.io/utils/clock/testing

03a8e73

Merge remote-tracking branch 'upstream/main'

f27c5a8

Merge branch 'main' into yt-sad-admin-cache

7361ec8

coderabbitai Bot reviewed Apr 21, 2026

View reviewed changes

Yuriy Teodorovych added 4 commits April 21, 2026 14:49

fix linter && address codeRabbit review

79e4466

fix lint

4723d8f

Merge remote-tracking branch 'upstream/main'

6d65fe7

Merge branch 'main' into yt-sad-admin-cache

6fb985c

coderabbitai Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread maas-api/internal/auth/cached_admin_checker.go

Comment thread maas-api/internal/auth/cached_admin_checker.go Outdated

use asymmetric caching

eb97650

coderabbitai Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread maas-api/internal/api_keys/handler.go

Comment thread maas-api/internal/auth/cached_admin_checker.go

Yuriy Teodorovych added 3 commits April 22, 2026 14:27

add sarCacheMaxSize ** change the handler to address codeRabbit review

fdac5f1

Merge remote-tracking branch 'upstream/main'

7113161

Merge branch 'main' into yt-sad-admin-cache

9939a59

Conversation

yu-teo commented Apr 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

How Has This Been Tested?

Merge criteria:

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented Apr 21, 2026

Uh oh!

coderabbitai Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Estimated code review effort

Security & Implementation Issues

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yu-teo Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yu-teo Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yu-teo Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yu-teo Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

rhods-ci-bot commented Apr 22, 2026

OCI Artifact Browser URL

Inspecting Test Artifacts Manually

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

yu-teo commented Apr 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading

coderabbitai Bot Apr 21, 2026 •

edited

Loading

coderabbitai Bot Apr 21, 2026 •

edited

Loading

coderabbitai Bot Apr 21, 2026 •

edited

Loading

coderabbitai Bot Apr 21, 2026 •

edited

Loading