Skip to content

chore: add SAR Admin cache#786

Open
yu-teo wants to merge 55 commits intoopendatahub-io:mainfrom
yu-teo:yt-sad-admin-cache
Open

chore: add SAR Admin cache#786
yu-teo wants to merge 55 commits intoopendatahub-io:mainfrom
yu-teo:yt-sad-admin-cache

Conversation

@yu-teo
Copy link
Copy Markdown
Contributor

@yu-teo yu-teo commented Apr 21, 2026

https://redhat.atlassian.net/browse/RHOAIENG-54975

Description

This PR introduces new files:

  • maas-api/internal/auth/cached_admin_checker.go — CachedAdminChecker wrapping SARAdminChecker with TTL-based cache (30s), sync.RWMutex+map for thread safety, lazy eviction of expired entries, and Prometheus counters (sar_cache_hits_total, sar_cache_misses_total)
  • maas-api/internal/auth/cached_admin_checker_test.go — 15 tests covering cache hits/misses, TTL expiry, group-order invariance, fail-closed guards, metrics, concurrent access, eviction, and constructor panics

As well as, I modified files:

  • maas-api/internal/config/cluster_config.go — AdminChecker field changed to *auth.CachedAdminChecker, wraps SARAdminChecker with 30s TTL
  • maas-api/cmd/main.go — added /metrics endpoint via promhttp.Handler()
  • maas-api/go.mod / go.sum — prometheus/client_golang promoted to direct dependency

How Has This Been Tested?

Newly introduced tests provide coverage.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • Prometheus metrics endpoint added (GET /metrics).
  • Performance

    • Admin authorization checks cached with a 30s TTL (shorter negative TTL) to reduce repeated lookups.
  • Bug Fixes

    • Admin-check failures are now surfaced as errors and cause handlers to return 500 instead of being treated as non-admin.
  • Tests

    • Comprehensive tests covering caching, TTL expiry, concurrency, error handling, and metrics counters.

Yuriy Teodorovych added 30 commits March 25, 2026 13:42
@openshift-ci openshift-ci Bot requested review from EgorLu and yossiovadia April 21, 2026 18:31
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yu-teo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 21, 2026

Warning

Rate limit exceeded

@yu-teo has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 58 minutes and 59 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 58 minutes and 59 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: ccb7e067-0380-45ae-9fa5-0762cd3a1c71

📥 Commits

Reviewing files that changed from the base of the PR and between eb97650 and 9939a59.

📒 Files selected for processing (7)
  • maas-api/cmd/main.go
  • maas-api/internal/api_keys/handler.go
  • maas-api/internal/auth/cached_admin_checker.go
  • maas-api/internal/auth/cached_admin_checker_test.go
  • maas-api/internal/config/cluster_config.go
  • maas-api/internal/config/config.go
  • maas-api/internal/constant/const.go
📝 Walkthrough

Walkthrough

This change adds a Prometheus metrics endpoint GET /metrics to the HTTP router, promotes Prometheus modules from indirect to direct in go.mod, introduces a TTL-based in-memory caching wrapper CachedAdminChecker around the existing SAR-based admin checker (with Prometheus hit/miss counters and clock injection), integrates the cached checker into ClusterConfig using a 30s positive TTL and 2s negative TTL, updates admin-checking APIs to return (bool, error) and propagates errors to callers, and adds comprehensive tests covering caching semantics, concurrency, TTL expiry, eviction, and metrics.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Security & Implementation Issues

CWE-200: Unauthenticated metrics endpoint exposure

  • The /metrics route is exposed without authentication. Metrics can disclose operational and environment details. Action: restrict access (bind to management interface, require mTLS/Bearer token, or firewall rule).

CWE-399: Unbounded cache growth (DoS vector)

  • Cache is a plain map keyed by username+groups and is unbounded; many distinct principals can exhaust memory. Action: enforce a maximum size (LRU or capped map), evict by policy, or add memory-pressure safeguards and instrumentation.

CWE-327: Cache key ambiguity with NUL delimiter

  • Cache keys are constructed by joining username and groups with NUL; if inputs may contain NULs this can lead to collisions. Action: canonicalize safely (length-prefix, JSON/base64 encoding) or validate/reject NUL-containing inputs.

CWE-770: Prometheus collector registration collisions

  • Counters are registered on the provided/global registry and may cause duplicate-registration panics across instances or tests. Action: use an instance-local Registerer, register metrics idempotently, or detect AlreadyRegistered and reuse safely (already partially implemented but ensure test/process isolation).

CWE-613 / Operational: Eviction only on write path; stale entries retained

  • Expired entries are removed only during writes, allowing expired entries to persist if writes are rare. Action: check TTL on reads and optionally run periodic background cleanup.

Behavioral correctness (actionable): negativeTTL is much shorter than positive TTL (2s vs 30s) — ensure this asymmetry is intentional; document trade-offs and potential for brief false-negative windows. Constructor panics on nil delegate or non-positive TTLs — ensure callers cannot supply runtime-configurable zero/negative values or handle panics accordingly.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'chore: add SAR Admin cache' accurately describes the main change: introducing a caching layer for the SAR-based admin checker.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@maas-api/cmd/main.go`:
- Around line 141-144: The /metrics endpoint is currently public in
registerHandlers; restrict access by adding an authorization or internal-only
guard: implement a metricsAuthMiddleware (as sketched) that reads a configured
scrape token (from cfg or cluster) and performs a constant-time compare, then
register the metrics route with that middleware (e.g., router.GET("/metrics",
metricsAuthMiddleware(cfg.PromToken), gin.WrapH(promhttp.Handler()))), or
alternatively move the route off the public router onto an internal-only
listener; ensure the token/config key is plumbed from config.Config/cluster and
return 403 on failure.

In `@maas-api/internal/auth/cached_admin_checker.go`:
- Around line 112-116: The cacheKey implementation in function cacheKey
currently concatenates user.Username and user.Groups with NULs which can be
ambiguous if username/groups contain NULs; change to an unambiguous,
length-prefixed or structured encoding: build the key from token.UserContext by
prefixing each component with its length (e.g., "<len>|<username>" then for each
group "<len>|<group>") or serialize a stable structure (e.g., JSON or gob) of
{username, groups} after sorting groups; update cacheKey to sort user.Groups (as
it already does), then produce and return the length-prefixed or serialized
string to ensure no collisions between different combinations of username and
groups.
- Around line 73-90: Replace the current "check cache -> release RLock -> call
c.delegate.IsAdmin -> write cache" pattern with singleflight to coalesce
duplicate inflight misses: add a golang.org/x/sync/singleflight.Group field
(e.g., c.sf) to the struct, and in IsAdmin use the existing read-locked fast
path unchanged; on miss call c.sf.Do(key, func() (interface{}, error) { res :=
c.delegate.IsAdmin(ctx, user); return res, nil }) to ensure only one
delegate.IsAdmin runs per key, then in the Do caller convert the returned
interface{} to bool, acquire c.mu.Lock(), write cache[key] = cacheEntry{isAdmin:
result, expiresAt: now.Add(c.ttl)}, call c.evictExpiredLocked(now) and unlock;
keep c.hits.Inc() only on cache hit and return the boolean result from the
singleflight result. Ensure you capture "now" consistently and use the same
key/user values when calling c.delegate.IsAdmin inside the singleflight closure.
- Around line 48-61: The metric registration in NewCachedAdminChecker (the
creation of the hits and misses counters on CachedAdminChecker) must be changed
to explicitly construct prometheus.NewCounter(...) and register them against the
provided reg using a helper that handles prometheus.AlreadyRegisteredError (e.g.
registerOrReuseCounter), returning the existing collector when already
registered instead of panicking; implement registerOrReuseCounter(reg
prometheus.Registerer, c prometheus.Counter) prometheus.Counter that attempts
reg.Register(c), on AlreadyRegisteredError extracts and returns the existing
counter, otherwise panics on other errors, and use it to set
CachedAdminChecker.hits and .misses instead of
promauto.With(reg).NewCounter(...).

In `@maas-api/internal/config/cluster_config.go`:
- Around line 113-115: The current use of auth.NewCachedAdminChecker(sarChecker,
30*time.Second, nil) stores positive admin responses for 30s (adminCheckerVal),
allowing revoked users to retain admin access; change this by either configuring
the cache to only store deny results (do not cache positive authorizations) or
wire Kubernetes RoleBinding/ClusterRoleBinding and MaaSAuthPolicy informer
events to immediately invalidate affected cache entries on change by calling the
cache invalidation API on the cached checker (the instance returned by
auth.NewCachedAdminChecker) whenever
RoleBinding/ClusterRoleBinding/MaaSAuthPolicy updates affect
subscriptionNamespace or the subject; locate sarChecker
(auth.NewSARAdminChecker) and adminCheckerVal to implement the chosen fix.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 3c68d26c-3297-47f4-b3f4-da55cb6fe299

📥 Commits

Reviewing files that changed from the base of the PR and between e746008 and cac09c5.

📒 Files selected for processing (5)
  • maas-api/cmd/main.go
  • maas-api/go.mod
  • maas-api/internal/auth/cached_admin_checker.go
  • maas-api/internal/auth/cached_admin_checker_test.go
  • maas-api/internal/config/cluster_config.go

Comment thread maas-api/cmd/main.go
Comment on lines 141 to 144
func registerHandlers(ctx context.Context, log *logger.Logger, router *gin.Engine, cfg *config.Config, cluster *config.ClusterConfig, store api_keys.MetadataStore) error {
router.GET("/health", handlers.NewHealthHandler().HealthCheck)
router.GET("/metrics", gin.WrapH(promhttp.Handler()))

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Gate /metrics; it is currently public on the API router.

CWE-200. Any caller that can reach this API can scrape process/runtime metrics and SAR cache hit/miss activity. Exploit scenario: a tenant or external client polls /metrics to infer admin-check traffic patterns and service internals. Put this route on an internal-only listener or require a Prometheus scrape credential.

Remediation sketch
-	router.GET("/metrics", gin.WrapH(promhttp.Handler()))
+	metricsRoutes := router.Group("/metrics", metricsAuthMiddleware(cfg.MetricsBearerToken))
+	metricsRoutes.GET("", gin.WrapH(promhttp.Handler()))
func metricsAuthMiddleware(expected string) gin.HandlerFunc {
	return func(c *gin.Context) {
		token := strings.TrimPrefix(c.GetHeader("Authorization"), "Bearer ")
		if expected == "" || subtle.ConstantTimeCompare([]byte(token), []byte(expected)) != 1 {
			c.AbortWithStatus(http.StatusForbidden)
			return
		}
		c.Next()
	}
}

As per coding guidelines, “REVIEW PRIORITIES: 1. Security vulnerabilities.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maas-api/cmd/main.go` around lines 141 - 144, The /metrics endpoint is
currently public in registerHandlers; restrict access by adding an authorization
or internal-only guard: implement a metricsAuthMiddleware (as sketched) that
reads a configured scrape token (from cfg or cluster) and performs a
constant-time compare, then register the metrics route with that middleware
(e.g., router.GET("/metrics", metricsAuthMiddleware(cfg.PromToken),
gin.WrapH(promhttp.Handler()))), or alternatively move the route off the public
router onto an internal-only listener; ensure the token/config key is plumbed
from config.Config/cluster and return 403 on failure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This finding doesn't need a fix. The /metrics endpoint is placed alongside /health on the main router, but in production this service runs inside a Kubernetes cluster where access is controlled at the network level via NetworkPolicy — the same pattern already used for the /internal/v1/* routes (lines 178-182). Those internal routes have no auth middleware either; the comment on line 177 says // Internal routes (no auth required - called by Authorino / CronJob).

Adding a scrape token would complicate the Prometheus scrape config for no real security gain — the network boundary is the access control layer. This is the standard pattern in Kubernetes-native services.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yu-teo, understood — NetworkPolicy as the enforcement boundary for internal/metrics endpoints is a standard Kubernetes-native pattern, and the existing /internal/v1/* routes already establish that precedent in this codebase. I'll note this for future reviews.

( (
\ ( \
\ \
\ \\
\ '.\
\ \
\ \
.-' \
/ \ ~ curious about your setup ~
`--.
______\


📝 Learnings were identified and not saved (knowledge base disabled). Enable

Comment thread maas-api/internal/auth/cached_admin_checker.go Outdated
Comment on lines +73 to +90
c.mu.RLock()
entry, ok := c.cache[key]
c.mu.RUnlock()

if ok && now.Before(entry.expiresAt) {
c.hits.Inc()
return entry.isAdmin
}

result := c.delegate.IsAdmin(ctx, user)

c.mu.Lock()
c.cache[key] = cacheEntry{
isAdmin: result,
expiresAt: now.Add(c.ttl),
}
c.evictExpiredLocked(now)
c.mu.Unlock()
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify x/sync is available and whether duplicate-miss suppression already exists elsewhere.
rg -nP 'golang\.org/x/sync|singleflight' -C3
rg -nP 'delegate\.IsAdmin\s*\(|type CachedAdminChecker' -C3

Repository: opendatahub-io/models-as-a-service

Length of output: 5962


🏁 Script executed:

cat -n maas-api/internal/auth/cached_admin_checker.go

Repository: opendatahub-io/models-as-a-service

Length of output: 3420


Use singleflight to suppress duplicate cache misses for the same key.

The RLock is released at line 75 before the delegate call at line 82. Concurrent requests for the same key with cache misses or expired entries will all call c.delegate.IsAdmin(), issuing multiple Kubernetes SAR requests instead of coalescing them. This defeats the stated load-shedding goal under bursty traffic.

Remediation sketch

Add golang.org/x/sync/singleflight (already in dependencies at v0.19.0) to suppress duplicate in-flight requests:

 type CachedAdminChecker struct {
 	delegate adminChecker
 	ttl      time.Duration
 	now      func() time.Time
+	inflight singleflight.Group
 
 	mu    sync.RWMutex
 	cache map[string]cacheEntry
-	result := c.delegate.IsAdmin(ctx, user)
+	value, _, _ := c.inflight.Do(key, func() (any, error) {
+		// Re-check after waiting for any earlier caller.
+		now := c.now()
+		c.mu.RLock()
+		entry, ok := c.cache[key]
+		c.mu.RUnlock()
+		if ok && now.Before(entry.expiresAt) {
+			c.hits.Inc()
+			return entry.isAdmin, nil
+		}
+
+		result := c.delegate.IsAdmin(ctx, user)
+		c.mu.Lock()
+		c.cache[key] = cacheEntry{isAdmin: result, expiresAt: now.Add(c.ttl)}
+		c.evictExpiredLocked(now)
+		c.mu.Unlock()
+		c.misses.Inc()
+		return result, nil
+	})
+	return value.(bool)
-
-	c.mu.Lock()
-	c.cache[key] = cacheEntry{
-		isAdmin:   result,
-		expiresAt: now.Add(c.ttl),
-	}
-	c.evictExpiredLocked(now)
-	c.mu.Unlock()
-
-	c.misses.Inc()
-	return result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maas-api/internal/auth/cached_admin_checker.go` around lines 73 - 90, Replace
the current "check cache -> release RLock -> call c.delegate.IsAdmin -> write
cache" pattern with singleflight to coalesce duplicate inflight misses: add a
golang.org/x/sync/singleflight.Group field (e.g., c.sf) to the struct, and in
IsAdmin use the existing read-locked fast path unchanged; on miss call
c.sf.Do(key, func() (interface{}, error) { res := c.delegate.IsAdmin(ctx, user);
return res, nil }) to ensure only one delegate.IsAdmin runs per key, then in the
Do caller convert the returned interface{} to bool, acquire c.mu.Lock(), write
cache[key] = cacheEntry{isAdmin: result, expiresAt: now.Add(c.ttl)}, call
c.evictExpiredLocked(now) and unlock; keep c.hits.Inc() only on cache hit and
return the boolean result from the singleflight result. Ensure you capture "now"
consistently and use the same key/user values when calling c.delegate.IsAdmin
inside the singleflight closure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This finding doesn't need a fix. The current behavior (two concurrent misses for the same user both make SAR calls) was a deliberate design choice, not an oversight.
The duplicate-call window is narrow — it only occurs on the first request for a user or the instant after TTL expiry. Once either call completes and writes the cache, all subsequent requests hit the fast path. SAR calls are lightweight local Kubernetes API calls, not expensive external requests.
Adding singleflight would introduce:

  • An interface{} type assertion in the hot path
  • Context cancellation semantics where one caller's cancelled context could affect waiters
  • More code complexity for a scenario that self-resolves within milliseconds

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yu-teo, understood — the trade-off is well-reasoned. The narrow duplicate-call window, the lightweight nature of local Kubernetes SAR calls, and the added complexity of singleflight (especially around context cancellation propagation to waiters) are all valid justifications for keeping the current design. I'll record this for future reviews.


📝 Learnings were identified and not saved (knowledge base disabled). Enable

Comment on lines +112 to +116
func cacheKey(user *token.UserContext) string {
sorted := make([]string, len(user.Groups))
copy(sorted, user.Groups)
slices.Sort(sorted)
return user.Username + "\x00" + strings.Join(sorted, "\x00")
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Make the authorization cache key unambiguous.

CWE-20 / CWE-863. NUL-delimited concatenation can collide if token-derived usernames or groups contain \x00; e.g. username a with groups ["b","c"] collides with username a\x00b and groups ["c"]. An attacker controlling token claims could receive another principal’s cached admin result. Use length-prefixing or structured encoding.

Remediation sketch
 import (
 	"context"
+	"strconv"
 	"slices"
 	"strings"
 	"sync"
 	"time"
@@
 func cacheKey(user *token.UserContext) string {
 	sorted := make([]string, len(user.Groups))
 	copy(sorted, user.Groups)
 	slices.Sort(sorted)
-	return user.Username + "\x00" + strings.Join(sorted, "\x00")
+
+	var b strings.Builder
+	writePart := func(value string) {
+		b.WriteString(strconv.Itoa(len(value)))
+		b.WriteByte(':')
+		b.WriteString(value)
+		b.WriteByte('|')
+	}
+	writePart(user.Username)
+	for _, group := range sorted {
+		writePart(group)
+	}
+	return b.String()
 }

As per coding guidelines, “REVIEW PRIORITIES: 1. Security vulnerabilities.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maas-api/internal/auth/cached_admin_checker.go` around lines 112 - 116, The
cacheKey implementation in function cacheKey currently concatenates
user.Username and user.Groups with NULs which can be ambiguous if
username/groups contain NULs; change to an unambiguous, length-prefixed or
structured encoding: build the key from token.UserContext by prefixing each
component with its length (e.g., "<len>|<username>" then for each group
"<len>|<group>") or serialize a stable structure (e.g., JSON or gob) of
{username, groups} after sorting groups; update cacheKey to sort user.Groups (as
it already does), then produce and return the length-prefixed or serialized
string to ensure no collisions between different combinations of username and
groups.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This finding doesn't need a fix. Kubernetes usernames and group names cannot contain NUL bytes — they are validated by the API server and follow DNS/RBAC naming conventions (alphanumeric, -, ., @, :). A NUL separator is unambiguous for this domain.
Adding length-prefixed encoding or JSON serialization would add complexity and allocations to a hot path for a collision scenario that cannot occur with valid Kubernetes identities.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yu-teo, that's a reasonable rebuttal for the specific deployment context. I'll concede the finding here.

One nuance worth noting for the record: groups, like users, are represented by a string, and that string has no format requirements, other than that the prefix system: is reserved — the Kubernetes API server itself imposes no formal character-set restriction on user/group strings produced by authenticator modules (OIDC, webhook, etc.). The DNS/RBAC naming conventions apply to Kubernetes object names, not to authentication identity strings.

That said, if the identity providers in this deployment are known to produce well-constrained strings (OpenShift OAuth, etc.), the NUL separator is indeed safe in practice, and the complexity trade-off of length-prefixed encoding is not justified. I'll drop the finding.


📝 Learnings were identified and not saved (knowledge base disabled). Enable

Comment on lines +113 to +115
// Results are cached for 30s to reduce K8s API server load under high traffic.
sarChecker := auth.NewSARAdminChecker(clientset, subscriptionNamespace)
adminCheckerVal := auth.NewCachedAdminChecker(sarChecker, 30*time.Second, nil)
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add invalidation before caching positive admin authorization.

CWE-613 / CWE-863. A user removed from the admin RBAC binding can keep cached admin privileges for up to 30 seconds and continue API-key admin actions. Remediate by either caching only deny results, or wiring RoleBinding/ClusterRoleBinding/MaaSAuthPolicy informer events to invalidate affected cache entries immediately.

As per coding guidelines, “REVIEW PRIORITIES: 1. Security vulnerabilities.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maas-api/internal/config/cluster_config.go` around lines 113 - 115, The
current use of auth.NewCachedAdminChecker(sarChecker, 30*time.Second, nil)
stores positive admin responses for 30s (adminCheckerVal), allowing revoked
users to retain admin access; change this by either configuring the cache to
only store deny results (do not cache positive authorizations) or wire
Kubernetes RoleBinding/ClusterRoleBinding and MaaSAuthPolicy informer events to
immediately invalidate affected cache entries on change by calling the cache
invalidation API on the cached checker (the instance returned by
auth.NewCachedAdminChecker) whenever
RoleBinding/ClusterRoleBinding/MaaSAuthPolicy updates affect
subscriptionNamespace or the subject; locate sarChecker
(auth.NewSARAdminChecker) and adminCheckerVal to implement the chosen fix.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This finding doesn't need a fix. A 30-second cache TTL for positive admin results is the intended behavior, explicitly called out in the acceptance criteria:
"Given a user's RBAC permissions change, When the cache TTL expires, Then the new permissions are reflected in subsequent admin checks."
The acceptance criteria accept a TTL-bounded delay for permission changes — that's the fundamental tradeoff of caching. A 30-second window is consistent with the Authorino authorization cache TTLs already used elsewhere in this project (60s default in maasauthpolicy_controller.go).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yu-teo, understood — the 30-second TTL for positive admin results is an explicit, accepted tradeoff per the acceptance criteria, and the delay window is consistent with the existing Authorino authorization cache TTLs (60s) elsewhere in the project. I'll note this for future reviews.


📝 Learnings were identified and not saved (knowledge base disabled). Enable

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
maas-api/internal/auth/cached_admin_checker.go (1)

104-110: Full-map scan on every miss.

evictExpiredLocked runs under the write lock on every cache miss and walks the entire map. With N distinct principals within a TTL window, each miss is O(N) while blocking all readers. Fine at current scale, but consider amortizing (e.g. evict at most K entries per call, or only run eviction every M misses / once per TTL via a timestamp) if the admin population grows. Not a blocker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maas-api/internal/auth/cached_admin_checker.go` around lines 104 - 110,
evictExpiredLocked currently iterates the entire c.cache map under the write
lock on every miss, causing O(N) work and blocking readers; change
evictExpiredLocked (method on CachedAdminChecker) to amortize work by either (a)
evicting at most K entries per call (track progress with an iterator index or by
stopping after K deletions), or (b) run full eviction only once per TTL using a
lastEviction timestamp or a miss counter (increment a miss counter on cache
misses and only perform eviction when counter % M == 0), and ensure the method
still removes expired entries from c.cache but performs bounded work per
invocation while holding the write lock.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@maas-api/internal/auth/cached_admin_checker.go`:
- Around line 69-71: Before calling the delegate and before persisting the
result, check ctx.Err() and short-circuit if the context is canceled: in
cached_admin_checker.go, at the same place you validate c and user (the if c ==
nil || user == nil || user.Username == "" block) add a guard that returns false
without touching the cache when ctx.Err() != nil; similarly, when handling the
delegate response, skip writing the negative/positive result to the cache if
ctx.Err() != nil (i.e., do not persist results derived from a canceled context).
Ensure you reference the existing delegate invocation and the cache-write code
paths (the delegate call and the cache write block) and only avoid caching when
ctx.Err() != nil.
- Around line 85-96: The current CachedAdminChecker calls
c.delegate.IsAdmin(ctx, user) which returns only bool, causing transient SAR
failures to be cached as "not admin"; change the delegate contract (the IsAdmin
method used by CachedAdminChecker) to return (bool, error), and update
CachedAdminChecker.IsAdmin to: 1) treat any non-nil error or ctx.Err()!=nil as a
non-cacheable transient failure (do not write to c.cache and return the error up
to caller), 2) when the delegate returns successfully, cache true with the
existing long TTL (e.g., c.ttl) but cache false with a short negative-TTL (e.g.,
1–2s) to implement asymmetric TTLs; update references to c.cache, cacheEntry
creation, c.misses/c.hits handling and any callers to propagate the error where
appropriate.

---

Nitpick comments:
In `@maas-api/internal/auth/cached_admin_checker.go`:
- Around line 104-110: evictExpiredLocked currently iterates the entire c.cache
map under the write lock on every miss, causing O(N) work and blocking readers;
change evictExpiredLocked (method on CachedAdminChecker) to amortize work by
either (a) evicting at most K entries per call (track progress with an iterator
index or by stopping after K deletions), or (b) run full eviction only once per
TTL using a lastEviction timestamp or a miss counter (increment a miss counter
on cache misses and only perform eviction when counter % M == 0), and ensure the
method still removes expired entries from c.cache but performs bounded work per
invocation while holding the write lock.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 2af9b4aa-e3f9-46ae-b3f8-f327062ff6bd

📥 Commits

Reviewing files that changed from the base of the PR and between cac09c5 and 6fb985c.

📒 Files selected for processing (3)
  • maas-api/internal/auth/cached_admin_checker.go
  • maas-api/internal/auth/cached_admin_checker_test.go
  • maas-api/internal/config/cluster_config.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • maas-api/internal/config/cluster_config.go
  • maas-api/internal/auth/cached_admin_checker_test.go

Comment thread maas-api/internal/auth/cached_admin_checker.go
Comment thread maas-api/internal/auth/cached_admin_checker.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@maas-api/internal/api_keys/handler.go`:
- Around line 117-123: The error path currently returns HTTP 500 when
h.isAuthorizedForKey(ctx, user, tok.Username) errors, which leaks the existence
of keys during SAR/API-server outages; change the handler to treat authErr the
same as unauthorized by logging the error but responding with the same 404 used
for non-authorized access (i.e., return the not-found response used when
authorized == false). Update the code around the h.isAuthorizedForKey call (and
the analogous block at the other occurrence around lines 298-304) so both
authErr and authorized==false follow the identical 404 response path while still
logging the internal error for operators.

In `@maas-api/internal/auth/cached_admin_checker.go`:
- Around line 32-33: The SAR result cache (cache map[string]cacheEntry with mu
sync.RWMutex) is unbounded and each miss can take a write lock and scan the map;
change it to a bounded cache to avoid DoS: introduce a maxSize constant and
either implement an LRU (e.g., using container/list or
github.com/hashicorp/golang-lru) keyed by the same cache keys or, if saturated,
skip inserting new entries; update code paths that read/insert into cache to use
RLock for lookups, only take Lock when inserting/evicting, check current size
before inserting, and on saturation either evict the least-recently-used entry
or return without caching (ensure cacheEntry and all functions that access cache
/ mu are updated accordingly).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 59bd8e05-bf27-4e3e-9737-cb39fe783cb1

📥 Commits

Reviewing files that changed from the base of the PR and between 6fb985c and eb97650.

📒 Files selected for processing (7)
  • maas-api/internal/api_keys/handler.go
  • maas-api/internal/api_keys/handler_test.go
  • maas-api/internal/auth/cached_admin_checker.go
  • maas-api/internal/auth/cached_admin_checker_test.go
  • maas-api/internal/auth/sar_admin_checker.go
  • maas-api/internal/auth/sar_admin_checker_test.go
  • maas-api/internal/config/cluster_config.go

Comment thread maas-api/internal/api_keys/handler.go
Comment thread maas-api/internal/auth/cached_admin_checker.go
@rhods-ci-bot
Copy link
Copy Markdown

@yu-teo: The following test has Succeeded:

OCI Artifact Browser URL

View in Artifact Browser

Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:maas-group-test-vmg4s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants