[sync] upstream llm-d main branch b745c47 [2026-03-26] #40
zdtsw merged 32 commits into opendatahub-io:main from
Conversation
Replaced by centralized nightly checks in llm-d/llm-d-infra#83. The nightly-org-checks workflow now runs typo, link, and upstream checks across all repos in both llm-d and llm-d-incubation orgs. Signed-off-by: Andrew Anderson <andy@clubanderson.com>
…lm-d#863) Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
* docs: recommend one-shot setup and document InferencePool CRD race condition
In docs/developer-guide/development.md:
- Recommend 'CREATE_CLUSTER=true make deploy-wva-emulated-on-kind' as the preferred one-shot setup to avoid a CRD timing race
- Preserve the two-step alternative with a note pointing to Known Setup Issues
- Add a 'Known Setup Issues' section with symptom, cause, and two remediation options for the InferencePool CRD-not-found error
Signed-off-by: Srujan Reddy <srjnreddy33@gmail.com>
* dummy commit
---------
Signed-off-by: Srujan Reddy <srjnreddy33@gmail.com>
…scalingv2 (llm-d#871) Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
* Basic model tuner functionality
* queue analyzer engine skeleton
* tuner integration init
* initial implementation of computeAllVariantCapacities
* Register queueing model metrics queries for arrival rate and max batch size
* integration v2
* wire queueing model analyzer to engine v1
* handle error cases
* set go version to 1.24.0
* code refactoring
* add comments and rearrange code
* fix openshift ci e2e gate for fork (llm-d#843)
* fix openshift ci e2e gate for fork
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
* clarify openshift e2e triggers
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
* update comments to be explicit
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
---------
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
* document wva release process (llm-d#845)
* document wva release process
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
* nit
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
---------
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
* wire queueing model analyzer to engine v1
* set go version to 1.24.0
* add QueueingModelScalingConfig struct
* add QueueingModelConfigMapName constant and helper
* add config map functionality that lets user set config for queue model
* rename QueueingModel config types and functions to QMAnalyzer
Rename types, struct fields, and functions to use the QMAnalyzer naming convention instead of QueueingModel for config-related code in internal/config and internal/controller packages, plus callers in internal/engines/saturation.
* Renames the internal analyzer config type QueueingModelConfig to QMConfig for consistency with the QMAnalyzer naming used in the config/controller layer.
* guess slo using all variants of model
* added unit tests for parameters and utils
* fix lint-and-test errors
* add model_name filtering to dispatch rate query and use 1m rate window
* added pod based tuning and weighted sum of metrics
* move queueing model configmap to deploy/ directory
---------
Signed-off-by: vishakha-ramani <92736776+vishakha-ramani@users.noreply.github.com>
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
Co-authored-by: tantawi <tantawi@us.ibm.com>
Co-authored-by: Mohammed Munir Abdi <abdimamy@gmail.com>
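The queueing-model commits above (arrival-rate metrics queries, computeAllVariantCapacities, SLO guessing) revolve around turning an observed request arrival rate into a replica count. A minimal sketch of that idea, where the function name and the utilization-bound formula are illustrative assumptions rather than the actual WVA implementation:

```go
package main

import (
	"fmt"
	"math"
)

// requiredReplicas is a hypothetical sketch of a queueing-model capacity
// estimate: given an arrival rate (req/s), a per-replica service rate
// (req/s), and a target utilization bound, it returns the smallest replica
// count that keeps utilization at or below the bound.
func requiredReplicas(arrivalRate, serviceRate, maxUtilization float64) int {
	if arrivalRate <= 0 || serviceRate <= 0 || maxUtilization <= 0 {
		return 0
	}
	// utilization = arrivalRate / (replicas * serviceRate) must stay <= maxUtilization
	return int(math.Ceil(arrivalRate / (serviceRate * maxUtilization)))
}

func main() {
	// 120 req/s arriving, each replica serves 25 req/s, keep utilization <= 0.8
	fmt.Println(requiredReplicas(120, 25, 0.8)) // 120/(25*0.8) = 6
}
```

A real analyzer would also fold in batch-size limits and SLO latency targets; this only captures the throughput dimension.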
* feat: add priority, roles, multi-analyzer scoring, and GreedyByScore optimizer
Implement five interconnected enhancements to the V2 saturation engine:
1. Per-model priority config with "{modelID}#{namespace}" lookup and
default fallback via resolveSaturationConfig()
2. P/D role types (prefill/decode/both) on VariantReplicaState,
VariantDecision, and VariantCapacity, injected from deployment
labels (llm-d.ai/role)
3. Per-role capacity aggregation in the V2 analyzer via
aggregateByRole(), with scheduler queue demand split by role:
prefill gets inputTokens, decode gets inputTokens+outputTokens
4. Multi-analyzer scoring infrastructure with AnalyzerScoreConfig
and composite score = priority * sum(requiredCapacity_i * score_i)
5. Rename GreedyBySaturationOptimizer to GreedyByScoreOptimizer with
score-based priority ordering, per-role work units for P/D
disaggregated models, and proportional P/D balancing
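The composite score in item 4 above can be sketched as follows; the struct and function names here are illustrative stand-ins inferred from the commit message, not the actual AnalyzerScoreConfig API:

```go
package main

import "fmt"

// analyzerResult is a hypothetical per-analyzer output pairing the capacity
// an analyzer demands with its configured score weight.
type analyzerResult struct {
	RequiredCapacity float64
	Score            float64
}

// compositeScore implements the rule described above:
// score = priority * sum(requiredCapacity_i * score_i)
func compositeScore(priority float64, results []analyzerResult) float64 {
	sum := 0.0
	for _, r := range results {
		sum += r.RequiredCapacity * r.Score
	}
	return priority * sum
}

func main() {
	// priority 2.0; two analyzers: 3 capacity units at weight 1.0, 1 unit at weight 0.5
	fmt.Println(compositeScore(2.0, []analyzerResult{
		{RequiredCapacity: 3, Score: 1.0},
		{RequiredCapacity: 1, Score: 0.5},
	})) // 2 * (3 + 0.5) = 7
}
```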
* refactor: demand-proportional P/D distribution in GreedyByScoreOptimizer
Replace post-hoc proportionalPDBalance() with demand-proportional
allocation integrated into the fair-share algorithm itself.
Previously, disaggregated models created separate per-role work units
that competed independently in fair-share, then a post-processing step
patched up the P:D ratio based on initial replica counts.
Now each model enters fair-share as a single entity. When allocating
replicas, allocateByRole() distributes them between prefill/decode
roles proportional to their per-role RequiredCapacity. If one role
can't use GPUs (e.g. accelerator exhausted), subsequent iterations
let the other role absorb its share naturally.
Removed: proportionalPDBalance, initTargetsForRole, mergeWorkTargets
Added: allocateByRole, allocateToVariants, roleDemands on modelWork
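The demand-proportional split described above can be sketched as follows; this illustrates the proportional-allocation idea only (names and rounding policy are assumptions, not the actual allocateByRole):

```go
package main

import (
	"fmt"
	"math"
)

// allocateByRole splits a batch of replicas between prefill and decode in
// proportion to each role's required capacity, with decode absorbing the
// rounding remainder. If neither role has demand, everything goes to decode.
func allocateByRole(replicas int, prefillDemand, decodeDemand float64) (prefill, decode int) {
	total := prefillDemand + decodeDemand
	if total <= 0 {
		return 0, replicas
	}
	prefill = int(math.Round(float64(replicas) * prefillDemand / total))
	if prefill > replicas {
		prefill = replicas
	}
	return prefill, replicas - prefill
}

func main() {
	// 10 replicas; prefill needs 2 capacity units, decode needs 3
	p, d := allocateByRole(10, 2, 3)
	fmt.Println(p, d) // 4 6
}
```

In the design above, when one role cannot actually place its share (accelerator exhausted), later fair-share iterations let the other role absorb the remainder; this sketch only shows the proportional split itself.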
* fix: prevent role absorption in demand-proportional P/D allocation
When one role cannot fully allocate (e.g., accelerator exhausted), consume
its unallocated share from remaining so it does not overflow to other roles
in subsequent fair-share iterations.
* refactor: use an Analyzers list for V2 selection with per-analyzer threshold overrides
Replace analyzerName-based V2 detection with IsV2() that checks for
Analyzers list (new-style) or analyzerName (backward compat). Add
per-analyzer ScaleUpThreshold/ScaleDownBoundary overrides on
AnalyzerScoreConfig with global fallback via EffectiveScaleUpThreshold/
EffectiveScaleDownBoundary methods.
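The global-fallback pattern this commit names (EffectiveScaleUpThreshold) can be sketched as follows; the struct shape is inferred from the commit message, not copied from the source:

```go
package main

import "fmt"

// analyzerScoreConfig sketches a per-analyzer config where a nil threshold
// means "fall back to the global value".
type analyzerScoreConfig struct {
	ScaleUpThreshold *float64 // per-analyzer override; nil = use global
}

// EffectiveScaleUpThreshold returns the per-analyzer override when set,
// otherwise the global default, mirroring the method named in the commit.
func (c analyzerScoreConfig) EffectiveScaleUpThreshold(global float64) float64 {
	if c.ScaleUpThreshold != nil {
		return *c.ScaleUpThreshold
	}
	return global
}

func main() {
	override := 0.9
	withOverride := analyzerScoreConfig{ScaleUpThreshold: &override}
	withFallback := analyzerScoreConfig{}
	fmt.Println(withOverride.EffectiveScaleUpThreshold(0.8)) // 0.9
	fmt.Println(withFallback.EffectiveScaleUpThreshold(0.8)) // 0.8
}
```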
* refactor: convert interfaces tests to Ginkgo
Convert saturation_scaling_test.go from standard testing to Ginkgo
Describe/It/Expect style with DescribeTable for validation cases.
Add suite_test.go for Ginkgo test runner bootstrap.
* refactor: unify enforcer to single EnforcePolicyOnDecisions entry point
Remove V1-specific EnforcePolicy (map-based) and its helpers (applyScaleToZero,
ensureMinimumReplicas). The V1 path now converts targets to decisions first,
then uses EnforcePolicyOnDecisions — the same path as V2. This eliminates the
dependency on VariantSaturationAnalysis for cost data in the enforcer.
* refactor: move SaturationScalingConfig types from interfaces to config package
Move SaturationScalingConfig, AnalyzerScoreConfig, and related constants
(DefaultPriority, DefaultScaleUpThreshold, DefaultScaleDownBoundary) from
internal/interfaces/ to internal/config/ where they belong alongside the
rest of the configuration types. Update all imports across 15 files.
)
* 🐛 Enable scale-from-zero on CKS and OCP with KEDA support
- Remove environment skip in scale_from_zero_test.go: the test now runs on all platforms (KEDA must be pre-installed on the cluster)
- Add retry logic to detect_inference_pool_api_group() to handle the race where InferencePool instances haven't been created yet after helmfile deploy
- Make deploy_keda() skip helm install when the KEDA CRD already exists (pre-installed on OCP via the CMA operator, on CKS via helm)
- Remove environment guard on SCALER_BACKEND=keda: supported everywhere
Signed-off-by: Andy Anderson <andy@clubanderson.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
* 🐛 Increase deploy wait timeout from 60s to 600s for model loading
The kubectl wait --timeout=60s for all deployments in the llm-d namespace was too short for model-serving pods (vLLM) that need to download and load large models (e.g. Meta-Llama-3.1-8B) into GPU memory. This caused both OCP and CKS nightly E2E to fail at the "Deploy guide via WVA install.sh" step. Default is now 600s (10 min), overridable via the DEPLOY_WAIT_TIMEOUT env var. The vLLM startupProbe already allows up to 30 minutes.
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
* 🐛 Address Copilot review feedback on KEDA and scale-from-zero
- deploy_keda(): Check operator pods + APIService, not just the CRD, to avoid a false skip when a stale CRD remains after a prior uninstall
- detect_inference_pool_api_group(): Implement actual namespace-first then cluster-wide fallback (the comment said fallback but the code didn't)
- Pin KEDA chart version (KEDA_CHART_VERSION, default 2.19.0) for reproducible installs
- Fix ENABLE_SCALE_TO_ZERO default inconsistency in helm --set
- Add a Skip guard in the scale-from-zero test for non-KEDA environments where HPA rejects minReplicas=0
- Fix a misleading comment that said scale-from-zero requires KEDA
- Document per-environment KEDA_NAMESPACE values in suite_test.go
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
---------
Signed-off-by: Andy Anderson <andy@clubanderson.com>
Signed-off-by: Andrew Anderson <andy@clubanderson.com>
…-d#864)
* api: add minReplicas and maxReplicas to VariantAutoscalingSpec
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
* api: add behavior field to VariantAutoscalingConfigSpec for HPA scaling policies
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
* test: fix VA fixtures for maxReplicas validation and add CRD field tests for minReplicas, maxReplicas, and behavior
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
* refactor(api): change default maxReplicas from 10 to 2
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
* refactor(api): remove behavior field to align with release plan
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
---------
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
* feat: add min/max replicas as VA annotations with optimizer integration
Add per-variant min/max replica bounds via VA annotations (wva.llmd.ai/min-replicas, wva.llmd.ai/max-replicas) and integrate them into both V1 and V2 scaling paths:
- Parse bounds from VA annotations in BuildVariantStates
- Respect maxReplicas in CostAwareOptimizer (spillover to the next variant)
- Respect minReplicas in costAwareScaleDown (hard floor per variant)
- Respect maxReplicas in GreedyBySaturationOptimizer allocateForModel
- Respect min/max in the V1 limiter allocateForDecision
- Clamp targets in V1 CalculateSaturationTargets
- Disable scale-to-zero enforcement when any variant has minReplicas > 0
- Propagate bounds through VariantDecision for observability
* refactor: use VA spec fields for min/max replicas instead of annotations
Remove annotation-based min/max replica bounds (wva.llmd.ai/min-replicas, wva.llmd.ai/max-replicas) and read directly from the VA spec.MinReplicas and spec.MaxReplicas fields added in llm-d#864. This eliminates the annotation parsing layer and aligns with the CRD as the single source of truth.
* fix(e2e): align VA min/max replicas with test expectations
- Set explicit MinReplicas=1 and MaxReplicas=10 in VA builder defaults (was implicit MinReplicas via kubebuilder default and MaxReplicas=2)
- Add VAOption functional options (WithMinReplicas, WithMaxReplicas) for tests that need custom replica bounds
- Scale-to-zero and scale-from-zero tests now create VAs with MinReplicas=0 so the engine allows scaling to zero replicas
- MaxReplicas raised from 2 to 10 to match HPA maxReplicas and avoid artificially capping scale-up in load tests
* fix: address PR review: enforce minReplicas in GreedyByScore scale-down
- Pass stateMap to costAwareScaleDown in GreedyByScoreOptimizer so minReplicas is respected during scale-down (was missing)
- Update doc comments: "VA annotation" → "VA spec field" in VariantDecision, VariantReplicaState, and the saturation analyzer
- Add tests verifying mixed-minReplicas behavior: a variant with minReplicas=0 scales to zero while a sibling with minReplicas>0 is preserved (CostAware and GreedyByScore)
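The min/max bounds handling described in these commits amounts to clamping an optimizer target into the [MinReplicas, MaxReplicas] interval from the VA spec, where a nil bound means "unbounded". A hypothetical sketch (clampReplicas is an illustrative helper, not the actual WVA code):

```go
package main

import "fmt"

// clampReplicas clamps a target replica count to optional per-variant
// bounds; nil means "no bound". When both bounds are set and conflict,
// the max cap wins because it is applied last.
func clampReplicas(target int32, minR, maxR *int32) int32 {
	if minR != nil && target < *minR {
		target = *minR
	}
	if maxR != nil && target > *maxR {
		target = *maxR
	}
	return target
}

func main() {
	lo, hi := int32(1), int32(10)
	fmt.Println(clampReplicas(0, &lo, &hi))  // floored to 1
	fmt.Println(clampReplicas(25, &lo, &hi)) // capped to 10
	fmt.Println(clampReplicas(5, nil, nil))  // unbounded: 5
}
```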
…eAffinity (llm-d#882)
* Auto discover accelerator name from deployment nodeSelector/nodeAffinity
* address comments
* add back comment
* add comment
* nil check
* fix lint/merge error
Signed-off-by: Sum Huynh <31661254+shuynh2017@users.noreply.github.com>
…lm-d#894)
* api: change NumReplicas to *int32 pointer for scale-to-zero support
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
Made-with: Cursor
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
* fix: dereference *int32 NumReplicas in fmt prints and revert unrelated go fmt changes
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
---------
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
unparseable is a valid English variant of unparsable, used intentionally in configmap_helpers.go. Signed-off-by: Andrew Anderson <andy@clubanderson.com>
* stabilizationWindow recommendation
* address comment
Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com> Signed-off-by: Vivek Karunai Kiri Ragavan <60690810+vivekk16@users.noreply.github.com>
…d#886)
Adds a third value for SCALER_BACKEND in deploy/install.sh:
- 'none': skips all scaler backend deployment (no Prometheus Adapter install, no KEDA install, no APIService patching)
Use case: llmd benchmark scripts that install WVA on GPU clusters where KEDA is already pre-installed. Previously, the default SCALER_BACKEND=prometheus-adapter would deploy Prometheus Adapter and patch the v1beta1.external.metrics.k8s.io APIService away from KEDA, breaking HPA scaling for all cluster workloads.
Changes:
- deploy/install.sh: add a 'none' branch in the main deploy flow, verify_deployment, and print_summary; update the SCALER_BACKEND comment and env var docs
- Makefile: update the SCALER_BACKEND inline comment to document all three values
Usage: SCALER_BACKEND=none make deploy-wva-emulated-on-kind
Signed-off-by: Srujan Reddy <srjnreddy33@gmail.com>
…in "config" for OCP (llm-d#852)
* fix: cert name and make mount optional
- keep only one replica for the deployment via Kustomize
- add the secret for the SA as an optional mount: if our EPP disables --metrics-endpoint-auth it is not needed, but GIE sets it true by default, so we need it or metrics collection will fail
- fix cert from CM openshift-service-ca.crt
- add missing RBAC on inferencepool and deployment/scale
- add missing rolebinding on the controller for the monitoring view
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
* update: code review
- remove replica set in the deployment
- make mount not optional
- use existing path in the pod, skip configmap and mount
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
* update: code review - remove comments
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
---------
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
…for new VAs (llm-d#922)
* fix: resolve accelerator bootstrapping deadlock for new VAs in multi-VA E2E test
New VAs without prior metrics get stuck in applySaturationDecisions() because acceleratorName cannot be resolved from an empty VA status or currentAllocations. This prevents HPA metric emission, creating a 3+ minute blind period where the HPA cannot scale the deployment. Add fallback accelerator resolution from deployment nodeSelector/nodeAffinity and the VA label. Pre-load the guidellm image into the Kind cluster to eliminate runtime pull delays. Increase the load job timeout from 5 to 8 minutes to account for tokenizer download on first run.
* fix: use burst load for multi-VA test to trigger simulator KV cache tracking
The simulator only tracks KV cache for /v1/completions requests. guidellm defaults to /v1/chat/completions, which bypasses KV cache tracking entirely. This causes avgSpareKv to remain high (0.8) despite active load, preventing the saturation engine from triggering scale-up. Switch to burst load (curl) targeting /v1/completions directly, matching the pattern used by the working smoke scale-up test. Use 2400 prompts with 400 output tokens to sustain load across multiple engine cycles.
* fix: wait for scale-up instead of job completion in multi-VA E2E test
The burst load jobs send 2400 requests at ~42s each, which takes ~84 minutes to complete, far exceeding the 10-minute test timeout. On Kind the lower network latency masks this, but on OpenShift the jobs always time out. Match the proven smoke test pattern: verify load jobs are running, wait for the saturation engine to detect load and scale up VA-A, then check the cost preference assertion.
… is provided (llm-d#925) Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
* skip multiple va test
Signed-off-by: Lionel Villard <villard@us.ibm.com>
* remove Focus
Signed-off-by: Lionel Villard <villard@us.ibm.com>
---------
Signed-off-by: Lionel Villard <villard@us.ibm.com>
* Add shuynh2017 to the OWNERS list
* update go to 1.25.0
* update workflows docker docs
* fix e2e test
* Update .github/workflows/ci-e2e-openshift.yaml
Co-authored-by: Mohammed Munir Abdi <abdimamy@gmail.com>
Signed-off-by: Sum Huynh <31661254+shuynh2017@users.noreply.github.com>
* Update .github/workflows/ci-e2e-openshift.yaml
Co-authored-by: Mohammed Munir Abdi <abdimamy@gmail.com>
Signed-off-by: Sum Huynh <31661254+shuynh2017@users.noreply.github.com>
* fix format
* fix format
---------
Signed-off-by: Sum Huynh <31661254+shuynh2017@users.noreply.github.com>
Co-authored-by: Mohammed Munir Abdi <abdimamy@gmail.com>
Add a check-code-changes job to the OpenShift E2E workflow that detects documentation-only PRs and skips the entire E2E pipeline, saving GPU runner time. Uses the same dorny/paths-filter config as ci-pr-checks.yaml. Comment triggers (/ok-to-test, /retest) and workflow_dispatch always run. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Unify /ok-to-test as single trigger for all E2E suites
Replace /trigger-e2e-full, /test-e2e-full, and /test-full with /ok-to-test in ci-pr-checks.yaml so maintainers use one command to trigger both the full Kind E2E and OpenShift E2E workflows. /retest remains as an OpenShift-only re-run command.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Address review comments: fix permission wording and clarify concurrency
- Docs: change "Maintainers/admins only" to "Users with write access" to match the actual workflow permission check (admin/maintain/write)
- Clarify the concurrency header comment to note workflow_dispatch falls back to run_id since no PR number is available
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lionel Villard <villard@us.ibm.com>
---------
Signed-off-by: Lionel Villard <villard@us.ibm.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…m-d#909) Signed-off-by: Vivek Karunai Kiri Ragavan <vkarunai@redhat.com>
# Conflicts:
#	.github/workflows/ci-e2e-openshift.yaml
#	.github/workflows/ci-pr-checks.yaml
#	Dockerfile
#	OWNERS
#	api/v1alpha1/variantautoscaling_types.go
#	api/v1alpha1/zz_generated.deepcopy.go
#	deploy/install.sh
#	docs/developer-guide/testing.md
#	internal/controller/configmap_bootstrap.go
#	test/e2e/fixtures/va_builder.go
#	test/e2e/scale_from_zero_test.go
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
📝 Walkthrough
The PR introduces Go 1.25 support, upgrades the Kubernetes autoscaling API from v1 to v2, adds min/max replica bounds to the VariantAutoscaling CRD, and implements a queueing-model analyzer for SLO-driven capacity analysis. It refactors saturation configuration from
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable Issues
1. Queueing Model Analyzer Parameter Storage Race Condition
File:
The
CWE-366: Improper Locking or Synchronization
2. Accelerator Resolution Precedence Not Documented
Files:
Risk: Silent acceptance of conflicting infrastructure intent.
3. MinReplicas/MaxReplicas Validation Gap in CRD
File:
CEL validation rule allows
CWE-99: Improper Control of Resource Identifiers
4.
Actionable comments posted: 1
Note: Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
test/e2e/parallel_load_scaleup_test.go (1)
369-379: ⚠️ Potential issue | 🟠 Major: Use a local optimized replica count inside this poll.
`scaledOptimized` survives across `Eventually` iterations and is only overwritten on non-nil reads. If `NumReplicas` drops back to nil on a later reconcile, this assertion reuses the previous value and can pass on stale state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/parallel_load_scaleup_test.go` around lines 369-379, the poll uses a persisted variable `scaledOptimized` that can carry stale values across `Eventually` iterations; reset or use a local variable inside the poll closure so each iteration starts fresh: declare a new local `optimizedCount` (or reassign `scaledOptimized = 0`) immediately before reading `va.Status.DesiredOptimizedAlloc.NumReplicas`, set it only when `NumReplicas != nil`, and use that local variable in `GinkgoWriter.Printf` and in the `g.Expect` assertion (referencing `scaledOptimized`/`optimizedCount` and `va.Status.DesiredOptimizedAlloc.NumReplicas`) so a nil read won't reuse a previous non-nil value.
internal/controller/configmap_helpers.go (1)
37-49: ⚠️ Potential issue | 🟠 Major: Decode these ConfigMaps in strict mode.
Both parsers use `yaml.Unmarshal`, which silently drops unknown keys. `QueueingModelScalingConfig.Validate()` accepts zero values, so a typo can turn into a no-op config that is still counted as valid and applied cluster-wide. Use a `yaml.Decoder` with `KnownFields(true)` before `Validate()`.
As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".
Suggested fix:
+import "strings"
...
-	if err := yaml.Unmarshal([]byte(yamlStr), &qmConfig); err != nil {
+	dec := yaml.NewDecoder(strings.NewReader(yamlStr))
+	dec.KnownFields(true)
+	if err := dec.Decode(&qmConfig); err != nil {
		logger.Error(err, "Failed to parse queueing model config entry", "key", key)
		continue
	}
Also applies to: 60-70
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/configmap_helpers.go` around lines 37-49, the code uses yaml.Unmarshal to decode ConfigMap entries into config.SaturationScalingConfig (and similarly for QueueingModelScalingConfig), which silently ignores unknown fields and can let typos produce zero-value/no-op configs that still pass Validate(); replace the unmarshal call with a yaml.Decoder created from the entry bytes, call decoder.KnownFields(true) before Decode to enforce strict decoding, handle and log decode errors, then proceed to call satConfig.Validate() (and the analogous Validate() for QueueingModelScalingConfig) and only count/store the config if both decode and validate succeed.
internal/engines/pipeline/enforcer.go (1)
147-169: ⚠️ Potential issue | 🟠 Major: Minimum-replica recovery ignores `MinReplicas`/`MaxReplicas`.
Lines 147-159 can pick the cheapest variant even when its `MaxReplicas` is 0, and line 163 hardcodes the fallback to 1 instead of the configured minimum. Either this violates an explicit bound, or a later bounds clamp puts the model back at zero and defeats the no-scale-to-zero policy. Filter out variants that cannot legally run one replica, and seed the survivor to its `MinReplicas` when present.
Suggested fix:
-	// Total is 0 — find cheapest variant and set it to 1
+	// Total is 0 — find the cheapest variant that can legally keep replicas
	cheapestIdx := -1
	cheapestCost := float64(-1)
	for i := range decisions {
		d := &decisions[i]
		if d.ModelID != modelID || d.Namespace != namespace {
			continue
		}
+		if d.MaxReplicas != nil && *d.MaxReplicas < 1 {
+			continue
+		}
		cost := d.Cost
		if cost <= 0 {
			cost = saturation.DefaultVariantCost
		}
		if cheapestCost < 0 || cost < cheapestCost || (cost == cheapestCost && d.VariantName < decisions[cheapestIdx].VariantName) {
@@
	}
	if cheapestIdx >= 0 {
-		decisions[cheapestIdx].TargetReplicas = 1
+		target := 1
+		if decisions[cheapestIdx].MinReplicas != nil && *decisions[cheapestIdx].MinReplicas > target {
+			target = *decisions[cheapestIdx].MinReplicas
+		}
+		decisions[cheapestIdx].TargetReplicas = target
		updateDecisionAction(&decisions[cheapestIdx], optimizerName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engines/pipeline/enforcer.go` around lines 147-169, the current cheapest-variant selection ignores variant replica bounds and then forces TargetReplicas=1; instead, when scanning decisions for the cheapest match (use decisions[i], ModelID, Namespace, cost logic as in the current loop), skip any variant whose MaxReplicas == 0 (or MaxReplicas < 1) so you only consider variants that can legally run replicas; after choosing cheapestIdx, compute target := decisions[cheapestIdx].MinReplicas if MinReplicas > 0 else 1, then clamp target to decisions[cheapestIdx].MaxReplicas (i.e., target = min(target, MaxReplicas)); set decisions[cheapestIdx].TargetReplicas = target, call updateDecisionAction(&decisions[cheapestIdx], optimizerName), and log using logger.Info with the variant name and resulting target/cost (retain cheapestCost and tie-break behavior).
.github/workflows/ci-e2e-openshift.yaml (1)
291-315: ⚠️ Potential issue | 🟠 Major: Fork PRs on pull_request events bypass approval requirements; no blocking status is created.
When the gate job sets `should_run='false'` for fork PRs, the e2e-openshift job skips via its job condition (`if: needs.gate.outputs.should_run == 'true'`). Skipped jobs report as "Success" and do not block required checks per GitHub's branch protection behavior. Additionally, the `report-status` job only runs on `issue_comment` events (line 1180), meaning no commit status is created on the initial `pull_request` event. As a result, a branch protection rule requiring "e2e-openshift" has no status to enforce on pull_request, allowing the PR to merge without approval.
Create an explicit pending commit status in the gate job on pull_request for fork PRs using the exact context name ("CI - OpenShift E2E Tests / e2e-openshift") that branch protection will check. This ensures the merge is blocked until a maintainer comments `/ok-to-test` and the e2e-openshift job completes successfully via issue_comment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci-e2e-openshift.yaml around lines 291 - 315, When detecting a fork PR where you set core.setOutput('should_run','false') (the external-contributor branch in the gate logic around tryPostComment and botComment), add an explicit GitHub commit status with state "pending" using the exact context "CI - OpenShift E2E Tests / e2e-openshift" so branch protection sees a blocking status; implement this by calling the GitHub Statuses API (via octokit.rest.repos.createCommitStatus or actions/github-script) for the PR head SHA only on pull_request events, include a short description and optional target_url pointing at the workflow run, and keep the existing comment/posting behavior unchanged so maintainers can later trigger /ok-to-test which will flip the status when e2e-openshift runs.
🟠 Major comments (23)
Dockerfile-2-2 (1)
2-2: ⚠️ Potential issue | 🟠 Major: Pin the Go builder image by digest.
`quay.io/projectquay/golang:1.25` is mutable, so a retagged or compromised upstream image can change the toolchain under the same commit and produce different controller binaries (CWE-494). Use an immutable digest here, like the Dockerfile.konflux already does.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` at line 2, the FROM line currently uses a mutable tag "quay.io/projectquay/golang:1.25 AS builder"; replace this with an immutable image digest form (e.g., "quay.io/projectquay/golang@sha256:<digest> AS builder") so the Go builder image is pinned; obtain the correct sha256 digest for the 1.25 image (as done in Dockerfile.konflux) and update the FROM instruction accordingly to ensure reproducible builds.
deploy/kind-emulator/install.sh-223-239 (1)
223-239: ⚠️ Potential issue | 🟠 Major: Validate cached GUIDELLM_IMG architecture matches the target platform.
The code computes a target `platform` based on `KIND_IMAGE_PLATFORM` or the host architecture, but if the image tag already exists locally (`docker image inspect` succeeds on line 232), it skips the `docker pull --platform` call entirely and loads whatever is cached. On mixed-architecture setups, a cached image from a previous run with a different architecture can cause `kind load docker-image` to preload the wrong image, resulting in `exec format error` at runtime.
Either always pull `GUIDELLM_IMG` with `--platform "$platform"` before loading (unconditional pull), or validate that the cached image's architecture matches the target:
docker image inspect --format '{{.Architecture}}' "$GUIDELLM_IMG"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/kind-emulator/install.sh` around lines 223-239, the script currently skips pulling GUIDELLM_IMG when a local image exists, which can leave a wrong-architecture image loaded into Kind; update the logic around GUIDELLM_IMG to either always pull with --platform "$platform" before loading or, if keeping the cache check, validate the cached image architecture using docker image inspect --format '{{.Architecture}}' "$GUIDELLM_IMG" and compare it to the desired $platform (map linux/arm64 -> arm64, linux/amd64 -> amd64) and only reuse the cached image when they match; otherwise perform docker pull --platform "$platform" "$GUIDELLM_IMG" and proceed to load.
internal/saturation/analyzer_test.go-8-8 (1)
8-8: 🛠️ Refactor suggestion | 🟠 Major: Variable shadows imported package name.
`config := config.SaturationScalingConfig{...}` shadows the `config` package import. This compiles but is confusing and prevents accessing other `config` exports in the same scope. Rename the local variable.
Proposed fix (apply to all occurrences at lines 20, 83, 145, 197, 231, 271, 328):
-	config := config.SaturationScalingConfig{
+	satCfg := config.SaturationScalingConfig{
		KvCacheThreshold:     0.80,
		QueueLengthThreshold: 5,
		KvSpareTrigger:       0.10,
		QueueSpareTrigger:    3,
	}
	// ... later in the test:
-	config,
+	satCfg,
Also applies to: 20-25
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/saturation/analyzer_test.go` at line 8, the test file declares a local variable named "config" that shadows the imported config package (e.g., config := config.SaturationScalingConfig{...}); rename the local variable (suggestion: cfg or localCfg) wherever it is used (all occurrences constructing SaturationScalingConfig) so code references to the imported package (config.SaturationScalingConfig, other config exports) are not shadowed and remain accessible.
internal/engines/pipeline/greedy_saturation_algorithm.go-112-120 (1)
112-120: ⚠️ Potential issue | 🟠 Major: Apply the `MinReplicas` floor before GPU allocation.
By the time lines 114-115 raise `TargetReplicas`, the allocator has already been asked for the smaller pre-floor delta. A case like current=1, target=2, min=3 returns `TargetReplicas=3` with one replica's worth of GPUs and `WasLimited=false`, even if spare GPUs exist. Fold the min floor into `replicasNeeded` before `TryAllocate`, or mark the post-floor delta as limited.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engines/pipeline/greedy_saturation_algorithm.go` around lines 112-120, the MinReplicas floor is applied after GPU allocation, which causes under-requesting of GPUs; update the logic so MinReplicas is folded into replicasNeeded before calling TryAllocate (or adjust replicasNeeded = max(replicasNeeded, *d.MinReplicas)) so the allocator is asked for the post-floor delta, and then set d.TargetReplicas from that result; ensure WasLimited is computed relative to the post-floor comparison (replicasAllocated < replicasNeeded) so the limited flag is correct. Reference: MinReplicas, TargetReplicas, replicasNeeded, replicasAllocated, TryAllocate, WasLimited.
test/e2e/parallel_load_scaleup_test.go-239-245 (1)
239-245: ⚠️ Potential issue | 🟠 Major

Don't treat a nil optimizer result as stabilized when `minReplicas` is 0.

When scale-to-zero is enabled, this coerces a missing `NumReplicas` to `0`, so the `Eventually` can pass before the VA has published any optimizer output. That turns the later assertions into a race on uninitialized status.

Fix

```diff
- var optimized int32
- if currentVA.Status.DesiredOptimizedAlloc.NumReplicas != nil {
-     optimized = *currentVA.Status.DesiredOptimizedAlloc.NumReplicas
- }
+ g.Expect(currentVA.Status.DesiredOptimizedAlloc.NumReplicas).NotTo(BeNil(),
+     "VA should publish DesiredOptimizedAlloc before stabilization")
+ optimized := *currentVA.Status.DesiredOptimizedAlloc.NumReplicas
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/parallel_load_scaleup_test.go` around lines 239 - 245, the test is treating a nil DesiredOptimizedAlloc.NumReplicas as 0 which lets the Eventually pass when hpaMinReplicas==0 before the optimizer actually published a value; change the check to require NumReplicas to be non-nil before evaluating its value: in the block that reads currentVA.Status.DesiredOptimizedAlloc.NumReplicas (and computes optimized) ensure you first assert/currently wait that NumReplicas != nil (or explicitly fail/wait when nil), then read *NumReplicas and compare it to hpaMinReplicas (update the GinkgoWriter.Printf and the Expect that references optimized and hpaMinReplicas accordingly) so the test won't consider an absent optimizer result as stabilized.

test/e2e/scale_from_zero_test.go-39-45 (1)
39-45: ⚠️ Potential issue | 🟠 Major

Skip when flow-control prerequisites are absent.

This gate only checks backend/flag state. If GIE flow control or the required `InferenceObjective` is missing, the spec still runs and then spends several minutes waiting for a queue metric that can never appear. Probe the prerequisite resource/metric up front and `Skip(...)` early.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/scale_from_zero_test.go` around lines 39 - 45, update the scale-from-zero test gate to also detect missing GIE flow-control and the required InferenceObjective before running the spec: in the same pre-test check near the existing cfg.ScalerBackend / cfg.ScaleToZeroEnabled block in scale_from_zero_test.go, probe (using the cluster test client used elsewhere in tests) whether the GIE flow-control resource/metric is present and whether the named InferenceObjective exists; if either is absent, call Skip(...) with a clear message (similar style to the existing Skip) so the test exits early instead of waiting for a metric that cannot appear.

charts/workload-variant-autoscaler/values.yaml-87-90 (1)
87-90: ⚠️ Potential issue | 🟠 Major

Do not ship `H100` as the default accelerator fallback.

With the new auto-discovery path, a hardcoded fallback masks discovery failures and misclassifies unresolved workloads as `H100`. That skews cost-based optimization and allocation on any non-`H100` cluster. Leave this empty by default and require an explicit override only when operators really need a manual fallback.

As per coding guidelines, "REVIEW PRIORITIES: 2. Architectural issues and anti-patterns".

Suggested fix

```diff
- accelerator: H100
+ accelerator: ""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/workload-variant-autoscaler/values.yaml` around lines 87 - 90, the values.yaml currently hardcodes accelerator: H100 which masks auto-discovery failures; remove the hardcoded fallback so the accelerator key is empty/unset by default (or set to null/""), forcing explicit operator override when needed; update any docs/comments near the accelerator key and any code reading .Values.accelerator to treat empty/null as "no fallback" and fall back to auto-discovery logic in functions that reference accelerator.

test/e2e/smoke_test.go-124-132 (1)
124-132: ⚠️ Potential issue | 🟠 Major

Scope the pre-test cleanup to this suite's resources.

Line 125 deletes every `VariantAutoscaling` in `cfg.LLMDNamespace`, and Lines 126-132 only warn if that sweep fails. In a shared namespace or parallel run, this can remove another spec's fixture and make the suite order-dependent. Limit cleanup to the objects this suite created (for example by label or name prefix), and fail setup if that scoped cleanup cannot complete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/smoke_test.go` around lines 124 - 132, the pre-test cleanup currently calls utils.DeleteAllVariantAutoscalings(ctx, crClient, cfg.LLMDNamespace) which removes every VariantAutoscaling in the namespace; change this to a scoped cleanup that only targets objects created by this suite (e.g., filter by a label or name prefix), by either using or adding a helper such as utils.DeleteVariantAutoscalingsByLabel(ctx, crClient, cfg.LLMDNamespace, "<suite-label>") or adding an optional selector parameter to DeleteAllVariantAutoscalings; also make the setup fail (return error / call Fail/Expect) if the scoped cleanup returns an error instead of merely logging a warning so test ordering/shared namespaces cannot be impacted. Ensure you update the test invocation to pass the suite-specific label/prefix and replace the current GinkgoWriter warning branch with a hard failure on error.

test/e2e/fixtures/va_builder.go-136-145 (1)
136-145: ⚠️ Potential issue | 🟠 Major

The new default `MinReplicas` blocks scale-to-zero callers from opting out.

Lines 136-145 now hardcode `Spec.MinReplicas=1` for every fixture VA, but the `*WithDefaults` helpers still don't accept `VAOption`s. That means callers can set the HPA minimum to `0` while the controller-side VA bound remains `1`, so min-replica enforcement will clamp the variant back above zero.

Suggested fix

```diff
 func CreateVariantAutoscalingWithDefaults(
     ctx context.Context,
     crClient client.Client,
     namespace, name, deploymentName, modelID, accelerator string,
-    controllerInstance string,
+    controllerInstance string,
+    opts ...VAOption,
 ) error {
-    return CreateVariantAutoscaling(ctx, crClient, namespace, name, deploymentName, modelID, accelerator, 30.0, controllerInstance)
+    return CreateVariantAutoscaling(ctx, crClient, namespace, name, deploymentName, modelID, accelerator, 30.0, controllerInstance, opts...)
 }
@@
 func EnsureVariantAutoscalingWithDefaults(
     ctx context.Context,
     crClient client.Client,
     namespace, name, deploymentName, modelID, accelerator string,
-    controllerInstance string,
+    controllerInstance string,
+    opts ...VAOption,
 ) error {
-    return EnsureVariantAutoscaling(ctx, crClient, namespace, name, deploymentName, modelID, accelerator, 30.0, controllerInstance)
+    return EnsureVariantAutoscaling(ctx, crClient, namespace, name, deploymentName, modelID, accelerator, 30.0, controllerInstance, opts...)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/fixtures/va_builder.go` around lines 136 - 145, the fixture currently hardcodes Spec.MinReplicas = ptr.To(int32(1)) which prevents callers from opting into scale-to-zero; update the VA builder so MinReplicas is left nil by default and only set to 1 if no VAOption overrides it: remove the immediate assignment of MinReplicas in the va literal (the va variable), apply opts(va) as currently done, then after applying options set va.Spec.MinReplicas = ptr.To(int32(1)) only when it is still nil; ensure this change is reflected in the WithDefaults helpers and that VAOption functions can set va.Spec.MinReplicas to 0 to opt out.

test/e2e/smoke_test.go-736-738 (1)
736-738: ⚠️ Potential issue | 🟠 Major

Reset `desiredReplicas` on each poll before reading the pointer.

Lines 736-738 only update the outer `desiredReplicas` when `NumReplicas` is non-nil. If the status goes nil after a previous non-nil sample, the assertion keeps using the stale value and the `Eventually` can pass on old data.

Suggested fix

```diff
- if va.Status.DesiredOptimizedAlloc.NumReplicas != nil {
-     desiredReplicas = int(*va.Status.DesiredOptimizedAlloc.NumReplicas)
- }
+ currentDesired := 0
+ if va.Status.DesiredOptimizedAlloc.NumReplicas != nil {
+     currentDesired = int(*va.Status.DesiredOptimizedAlloc.NumReplicas)
+ }
+ desiredReplicas = currentDesired
@@
- scaleUpDetected := desiredReplicas > int(initialOptimized)
+ scaleUpDetected := currentDesired > int(initialOptimized)
@@
- g.Expect(desiredReplicas).To(BeNumerically(">", int(initialOptimized)),
-     fmt.Sprintf("VA should recommend more replicas than initial under load (current: %d, initial: %d, elapsed: %v)", desiredReplicas, initialOptimized, elapsed))
+ g.Expect(currentDesired).To(BeNumerically(">", int(initialOptimized)),
+     fmt.Sprintf("VA should recommend more replicas than initial under load (current: %d, initial: %d, elapsed: %v)", currentDesired, initialOptimized, elapsed))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/smoke_test.go` around lines 736 - 738, reset desiredReplicas at the start of each poll iteration before reading va.Status.DesiredOptimizedAlloc.NumReplicas: ensure the polling loop sets desiredReplicas = 0 (or another sentinel) before checking va.Status.DesiredOptimizedAlloc.NumReplicas so that if the pointer is nil on a later sample you don't reuse a stale value; update the block that currently checks va.Status.DesiredOptimizedAlloc.NumReplicas to assign into desiredReplicas only when the pointer is non-nil and rely on the reset to represent nil cases for the Eventually assertion.

internal/engines/saturation/engine_queueing_model.go-157-165 (1)
157-165: ⚠️ Potential issue | 🟠 Major

Apply default TTFT/ITL targets from the `default` QM entry.

Lines 157-165 only copy `TuningEnabled` and `SLOMultiplier` from `allConfigs["default"]`. A valid default `targetTTFT`/`targetITL` pair is silently dropped, so cluster-wide SLO overrides never reach the analyzer.

Suggested fix

```diff
 if defaultCfg, ok := allConfigs["default"]; ok {
     if defaultCfg.TuningEnabled != nil {
         cfg.TuningEnabled = *defaultCfg.TuningEnabled
     }
     if defaultCfg.SLOMultiplier > 1.0 {
         cfg.SLOMultiplier = defaultCfg.SLOMultiplier
     }
+    if defaultCfg.TargetTTFT > 0 && defaultCfg.TargetITL > 0 {
+        modelKey := queueingmodel.MakeModelKey(namespace, modelID)
+        cfg.SLOTargets = map[string]*queueingmodel.SLOTarget{
+            modelKey: {
+                TargetTTFT: defaultCfg.TargetTTFT,
+                TargetITL:  defaultCfg.TargetITL,
+            },
+        }
+    }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engines/saturation/engine_queueing_model.go` around lines 157 - 165, the default QM entry handling currently only copies TuningEnabled and SLOMultiplier from allConfigs["default"]; extend that block to also propagate default targetTTFT and targetITL into the working cfg so cluster-wide SLO targets reach the analyzer. Concretely, inside the if defaultCfg, ok := allConfigs["default"] { ... } block check the defaultCfg fields (e.g., TargetTTFT and TargetITL, or their pointer equivalents) and assign them to cfg.TargetTTFT and cfg.TargetITL when the defaults are present (respect existing non-zero/non-nil values on cfg so you don't overwrite explicit per-QM targets). Ensure you use the same presence checks used for TuningEnabled and SLOMultiplier to detect whether the default targets are set.

.github/workflows/ci-pr-checks.yaml-18-20 (1)
18-20: ⚠️ Potential issue | 🟠 Major

Concurrency guard substring matching allows DoS from untrusted commenters.

Line 19 uses `contains(...)` for substring matching in the concurrency condition, but the `check-full-tests` job (line 208) validates with exact-match `validCommands.includes(comment)` on the trimmed body. A comment like `"please /ok-to-test this"` passes the concurrency check (substring present, joins shared PR group) but fails validation (exact match required). This lets untrusted commenters cancel legitimate runs even though their comments are rejected: a trivial CI DoS. Use exact-match validation in the concurrency condition to match the command validator.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci-pr-checks.yaml around lines 18 - 20, the concurrency condition currently uses contains(github.event.comment.body, '/ok-to-test') which allows substring matches; change it to perform the same exact-trimmed match used by the check-full-tests job (which uses validCommands.includes(comment)). Replace the contains(...) check with an exact comparison against the trimmed comment body (e.g., compare github.event.comment.body trimmed to the literal '/ok-to-test' or mirror the validCommands.includes(comment) logic) so the concurrency guard and the check-full-tests validation use identical exact-match semantics.

internal/interfaces/queueing_model_scaling.go-54-67 (1)
54-67: ⚠️ Potential issue | 🟠 Major

Reject non-finite SLO values during validation (CWE-20).

Lines 54-67 only do range checks, so YAML values like `.nan` or `.inf` still pass `Validate()`: comparisons with `NaN` are false, and `+Inf` is greater than zero. That lets invalid config propagate into the analyzer and turn scaling math into NaNs/Infs.

Suggested fix

```diff
-import "fmt"
+import (
+    "fmt"
+    "math"
+)
@@
 // SLOMultiplier: 0 = use default, >1 = valid, <=1 = invalid
+if math.IsNaN(c.SLOMultiplier) || math.IsInf(c.SLOMultiplier, 0) {
+    return fmt.Errorf("sloMultiplier must be finite, got %v", c.SLOMultiplier)
+}
 if c.SLOMultiplier != 0 && c.SLOMultiplier <= 1.0 {
     return fmt.Errorf("sloMultiplier must be > 1.0, got %.2f (k=1 means rho=0, no load tolerance; k<=1 is physically meaningless)", c.SLOMultiplier)
 }
+if math.IsNaN(float64(c.TargetTTFT)) || math.IsInf(float64(c.TargetTTFT), 0) {
+    return fmt.Errorf("targetTTFT must be finite, got %v", c.TargetTTFT)
+}
 if c.TargetTTFT < 0 {
     return fmt.Errorf("targetTTFT must be >= 0, got %.2f", c.TargetTTFT)
 }
+if math.IsNaN(float64(c.TargetITL)) || math.IsInf(float64(c.TargetITL), 0) {
+    return fmt.Errorf("targetITL must be finite, got %v", c.TargetITL)
+}
 if c.TargetITL < 0 {
     return fmt.Errorf("targetITL must be >= 0, got %.2f", c.TargetITL)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/interfaces/queueing_model_scaling.go` around lines 54 - 67, in the Validate() method (file queueing_model_scaling.go) add non-finite checks for SLOMultiplier, TargetTTFT and TargetITL before the existing range checks: reject values that are NaN or +/-Inf (use math.IsNaN and math.IsInf or equivalent) and return an error naming the offending field (SLOMultiplier, TargetTTFT, TargetITL) so configs like .nan or .inf are rejected outright before the existing comparisons and SLO target co-dependence check.

.github/workflows/ci-pr-checks.yaml-281-292 (1)
281-292: ⚠️ Potential issue | 🟠 Major

Verify the Kind binary before installing it as root (CWE-494).

The binary download lacks integrity verification. Lines 289-291 fetch an executable and install it with elevated privileges without checking authenticity. Kind releases include SHA256 checksums; they should be downloaded and verified before execution.
Suggested fix
```diff
- curl -Lo ./kind "https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-${KIND_ARCH}"
+ KIND_VERSION="v0.25.0"
+ curl -fsSLo ./kind "https://kind.sigs.k8s.io/dl/${KIND_VERSION}/kind-linux-${KIND_ARCH}"
+ curl -fsSLo ./kind.sha256sum "https://kind.sigs.k8s.io/dl/${KIND_VERSION}/kind-linux-${KIND_ARCH}.sha256sum"
+ echo "$(cat ./kind.sha256sum) ./kind" | sha256sum -c -
  chmod +x ./kind
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci-pr-checks.yaml around lines 281 - 292, the "Install Kind" step downloads and installs the Kind binary without integrity checks; update the step to fetch the official SHA256 checksum for the matching KIND_ARCH and verify the downloaded binary before running chmod/sudo mv. Specifically, after computing KIND_ARCH (from ARCH) and downloading with curl, download the corresponding SHA256SUM (or .sha256) for v0.25.0, verify the binary with sha256sum -c (or compare computed sha256sum) and exit non-zero on mismatch, only then run chmod +x ./kind and sudo mv ./kind /usr/local/bin/kind; ensure the verification failure path logs an error and stops the workflow.

internal/utils/utils.go-494-495 (1)
494-495: ⚠️ Potential issue | 🟠 Major

Reject ambiguous "In" operators with multiple values to prevent incorrect capacity tracking.

When `NodeSelectorOpIn` specifies multiple GPU products, the Kubernetes scheduler may select any of them, but the code returns `expr.Values[0]` unconditionally. This bakes an arbitrary accelerator choice into capacity calculations and allocation decisions, causing wrong recommendations.

Only accept single-value expressions:

Fix

```diff
- if expr.Operator == corev1.NodeSelectorOpIn && len(expr.Values) > 0 {
-     return expr.Values[0]
+ if expr.Operator == corev1.NodeSelectorOpIn {
+     if len(expr.Values) == 1 {
+         return expr.Values[0]
+     }
+     return ""
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/utils/utils.go` around lines 494 - 495, the code currently returns expr.Values[0] for NodeSelectorOpIn unconditionally; change the branch that checks expr.Operator == corev1.NodeSelectorOpIn to only accept a single value: if len(expr.Values) == 1 return that value, otherwise treat it as unsupported (return an error or a clear "ambiguous selector" sentinel) instead of picking the first entry; update the surrounding function that handles expr.Operator/expr.Values to propagate this error/unsupported result so callers don't bake an arbitrary accelerator choice into capacity calculations.

internal/utils/utils.go-475-481 (1)
475-481: ⚠️ Potential issue | 🟠 Major

Remove preferred node affinity from accelerator inference logic.

Lines 475-481 use `PreferredDuringSchedulingIgnoredDuringExecution` to infer the accelerator type. This is unreliable: Kubernetes allows the scheduler to ignore preferred affinity and place the pod on a different node type. When that occurs, the inferred accelerator (e.g., MI300 from affinity) will not match the actual GPU on the scheduled node (e.g., A100), causing GPU accounting errors and incorrect scaling decisions downstream. Restrict inference to hard constraints (`nodeSelector` and `RequiredDuringSchedulingIgnoredDuringExecution` only), or validate against the actual scheduled pod/node labels before using the value.

Additionally, line 494 (`expr.Values[0]`) selects the first value from a multi-value `In` operator without justification. All values in `In` are semantically equivalent alternatives; there is no specified precedence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/utils/utils.go` around lines 475 - 481, remove use of nodeAffinity.PreferredDuringSchedulingIgnoredDuringExecution in accelerator inference: stop extracting GPU from the preferred list (delete the loop that calls extractGPUFromNodeSelectorTerm for PreferredDuringSchedulingIgnoredDuringExecution) and only infer accelerators from hard constraints (nodeSelector and nodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution) or otherwise validate against the actual scheduled node labels before using the inferred value; also replace the direct use of expr.Values[0] (in the code handling an "In" operator) with logic that handles all expr.Values (e.g., iterate over expr.Values or otherwise treat them as equivalent alternatives) instead of arbitrarily choosing the first value so multi-value "In" selectors are handled correctly.

internal/engines/pipeline/greedy_score_optimizer.go-112-115 (1)
112-115: ⚠️ Potential issue | 🟠 Major

`remaining` changes meaning from score to raw capacity.

Lines 112-115 seed `remaining` with `req.Result.Score`, but Lines 347-349 subtract raw `PerReplicaCapacity`. Once priority or analyzer weights differ from `1`, high-score models never converge correctly and can keep absorbing GPUs until caps are hit. Track score ordering separately from capacity still needed, or decrement by the same weighted contribution used to build `Score`.

Also applies to: 347-349
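A minimal sketch of the separation this finding asks for; every name here (`allocateReplicas`, `remainingScore`, `remainingCapacity`, `weight`) is illustrative and not the optimizer's real field, and the loop is a simplification under the assumption that score is built as weighted capacity:

```go
package main

import "fmt"

// allocateReplicas keeps the weighted score (selection ordering) and the raw
// capacity still needed (bookkeeping) in two separate variables, so a weight
// other than 1 cannot make a high-score model absorb GPUs past its need.
func allocateReplicas(score, requiredCapacity, perReplicaCapacity, weight float64, maxReplicas int) int {
	remainingScore := score               // ordering / urgency signal
	remainingCapacity := requiredCapacity // raw capacity still needed
	replicas := 0
	for replicas < maxReplicas && remainingCapacity > 0 {
		replicas++
		remainingCapacity -= perReplicaCapacity
		// Decrement the score by the same weighted contribution used to
		// build it, keeping the weighted and raw views consistent.
		remainingScore -= perReplicaCapacity * weight
	}
	_ = remainingScore // a real optimizer would re-rank candidates on this
	return replicas
}

func main() {
	// 10 units of capacity needed, 4 per replica: stops after 3 replicas
	// regardless of the weight inflating the score.
	fmt.Println(allocateReplicas(20.0, 10.0, 4.0, 2.0, 8))
}
```

The point of the design: convergence is decided by `remainingCapacity` alone, so a weight of 2 (or any analyzer priority) changes ordering but never the stopping condition.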
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engines/pipeline/greedy_score_optimizer.go` around lines 112 - 115, the variable remaining currently conflates score and raw capacity (seeded from req.Result.Score or RequiredCapacity) but later is decremented by raw PerReplicaCapacity, causing incorrect behavior when weights differ; fix by introducing two separate variables (e.g., remainingScore and remainingCapacity) so ordering/selection uses remainingScore (seeded from req.Result.Score) while capacity bookkeeping uses remainingCapacity (seeded from req.Result.RequiredCapacity), and when assigning replicas decrement remainingScore by the exact weighted contribution used to compute Score (the priority/analyzer-weighted delta) and decrement remainingCapacity by PerReplicaCapacity; update the code paths that subtract (the block referencing PerReplicaCapacity and the seeding logic around req.Result.Score/RequiredCapacity) to use the appropriate variable.

deploy/install.sh-536-543 (1)
536-543: ⚠️ Potential issue | 🟠 Major

Do not infer the pool group from other namespaces.

If Lines 527-534 race before the current release creates its `InferencePool`, Lines 537-541 fall back to `-A` and can pick a stale pool from another namespace. On shared clusters that upgrades WVA to the wrong `poolGroup`, and scale-from-zero watches the wrong resource type. Prefer API discovery/CRD detection, or only trust objects from `LLMD_NS`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/install.sh` around lines 536 - 543, the fallback currently uses cluster-wide `kubectl get ... -A` which can pick up stale `InferencePool` from other namespaces; change the logic so `DETECTED_POOL_GROUP` is only set from resources in the target namespace (`LLMD_NS`) or determined via API/CRD discovery instead of `-A`: first try `kubectl get inferencepools.inference.networking.k8s.io -n "$LLMD_NS"` and `kubectl get inferencepools.inference.networking.x-k8s.io -n "$LLMD_NS"`, and if neither exists, use API discovery (e.g., `kubectl api-resources` / CRD lookup) to detect the correct group rather than scanning all namespaces. Ensure all checks reference `LLMD_NS` and only fall back to CRD-based detection when the namespaced checks fail, then set `DETECTED_POOL_GROUP` accordingly.

deploy/install.sh-1117-1127 (1)
1117-1127: ⚠️ Potential issue | 🟠 Major

This KEDA short-circuit does not verify who owns `external.metrics.k8s.io`.

Line 1122 only proves that the APIService object exists. If Prometheus Adapter (or another provider) still owns that APIService, this returns early and `SCALER_BACKEND=keda` comes up without a working KEDA metrics server. Check `.spec.service.name` and `.spec.service.namespace` before treating KEDA as operational.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/install.sh` around lines 1117 - 1127, the current short-circuit only checks that the APIService object exists; update the apiservice check (the kubectl get apiservice v1beta1.external.metrics.k8s.io branch) to fetch and validate .spec.service.name and .spec.service.namespace (e.g. via kubectl get apiservice v1beta1.external.metrics.k8s.io -o jsonpath=...) and ensure they point to the KEDA metrics service (compare against the expected service name/namespace used by the KEDA operator deployment, or derive them from the operator pod/service labels), and only call log_success "KEDA CRD, operator, and metrics APIService detected — skipping helm install" and return when those fields match; otherwise emit the existing log_warning and proceed with helm install.

internal/engines/pipeline/greedy_score_optimizer.go-331-339 (1)
331-339: ⚠️ Potential issue | 🟠 Major

Explicit `maxReplicas: 0` is ignored.

Line 332 only enforces the bound when `*state.MaxReplicas > 0`. With the new semantics, `nil` means "no cap", so a non-nil zero should still forbid allocation. Drop the `> 0` guard.

Suggested fix

```diff
- if state.MaxReplicas != nil && *state.MaxReplicas > 0 {
+ if state.MaxReplicas != nil {
      maxAdd := *state.MaxReplicas - w.targets[vc.VariantName]
      if maxAdd <= 0 {
          continue // already at max
      }
      if n > maxAdd {
          n = maxAdd
      }
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/engines/pipeline/greedy_score_optimizer.go` around lines 331 - 339, the current cap logic in the greedy scorer skips enforcement when state.MaxReplicas is zero because of the `> 0` guard; change the check in the block that references state.MaxReplicas, w.targets[vc.VariantName], vc.VariantName and variable n so that any non-nil state.MaxReplicas (including 0) is enforced: remove the `> 0` condition, compute maxAdd := *state.MaxReplicas - w.targets[vc.VariantName], and keep the existing behavior of continuing when maxAdd <= 0 and setting n = maxAdd when n > maxAdd.

.github/workflows/ci-e2e-openshift.yaml-83-89 (1)
83-89: ⚠️ Potential issue | 🟠 Major

Pin the new actions to immutable SHAs.

`actions/checkout@v4` and `dorny/paths-filter@v3` are tag-based refs, vulnerable to retagging attacks (CWE-927). Use full commit SHAs for both `uses:` entries before merging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci-e2e-openshift.yaml around lines 83 - 89, the workflow uses tag-based action refs "actions/checkout@v4" and "dorny/paths-filter@v3" which are mutable; replace each with the corresponding full immutable commit SHA for the desired release (pin the "uses:" values to the full commit SHAs), e.g., update the "uses:" entries for actions/checkout and dorny/paths-filter to their canonical commit SHAs so the CI references an immutable revision.

deploy/install.sh-1756-1759 (1)
1756-1759: ⚠️ Potential issue | 🟠 Major

Fail fast for `SCALER_BACKEND=none` when no metrics API is present.

This branch only logs an assumption. On a misconfigured cluster the script still exits green, but HPA can never resolve `external.metrics.k8s.io`. Add an explicit APIService check here and abort if it is missing.

Suggested fix

```diff
 if [ "$SCALER_BACKEND" = "keda" ]; then
     deploy_keda
 elif [ "$SCALER_BACKEND" = "none" ]; then
+    if ! kubectl get apiservice v1beta1.external.metrics.k8s.io >/dev/null 2>&1; then
+        log_error "SCALER_BACKEND=none requires a working external.metrics.k8s.io APIService"
+    fi
     log_info "Skipping scaler backend deployment (SCALER_BACKEND=none)"
     log_info "Assumes an external metrics API (e.g. KEDA) is already installed on the cluster"
 elif [ "$DEPLOY_PROMETHEUS_ADAPTER" = "true" ]; then
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/install.sh` around lines 1756 - 1759, when SCALER_BACKEND is "none" the script only logs an assumption but should fail if the external metrics API is not present; update the branch handling SCALER_BACKEND="none" to verify the Kubernetes APIService for "v1beta1.external.metrics.k8s.io" (or the canonical external.metrics.k8s.io APIService name used in the cluster) using kubectl/oc and, if the APIService is absent or not Available, call log_error and exit non-zero instead of continuing; keep the existing log_info messages but add the APIService existence/availability check before returning control so HPA resolution failures are caught early.

internal/actuator/actuator.go-41-53 (1)
41-53: ⚠️ Potential issue | 🟠 Major

`deployment.Status.Replicas >= 0` will always be true, causing the fallback logic to never execute.

Since `Status.Replicas` is defined as an `int32` that should be non-negative per the Kubernetes API spec, the comparison at line 47 always succeeds, even for newly created Deployments where the controller hasn't yet populated the status. This causes the function to return `0` for a fresh Deployment, bypassing the intended fallback to `spec.replicas` at lines 52–53.

Validate that the Deployment controller has observed the current spec generation before trusting the status:
Suggested fix
```diff
 func (a *Actuator) GetCurrentDeploymentReplicasFromDeployment(va *llmdOptv1alpha1.VariantAutoscaling, deployment *appsv1.Deployment) (int32, error) {
     if deployment == nil {
         return 0, fmt.Errorf("deployment cannot be nil for %s/%s", va.Namespace, va.GetScaleTargetName())
     }
-    // Prefer status replicas (actual current state)
-    if deployment.Status.Replicas >= 0 {
+    // Prefer status once the deployment controller has observed this spec.
+    if deployment.Status.ObservedGeneration >= deployment.Generation {
         return deployment.Status.Replicas, nil
     }
-    // Fallback to spec if status not ready
+    // Fallback to spec if status is still stale.
     if deployment.Spec.Replicas != nil {
         return *deployment.Spec.Replicas, nil
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/actuator/actuator.go` around lines 41 - 53, The GetCurrentDeploymentReplicasFromDeployment function currently trusts deployment.Status.Replicas unconditionally; change the logic to only use status.Replicas when the controller has observed the current generation (compare deployment.Status.ObservedGeneration to deployment.Generation) and otherwise fall back to deployment.Spec.Replicas (or 0/error if spec is nil). Update the checks inside GetCurrentDeploymentReplicasFromDeployment to: validate deployment non-nil, if Status.ObservedGeneration == Generation then return Status.Replicas, else if Spec.Replicas != nil return *Spec.Replicas, otherwise return an appropriate error or 0; keep references to the function name GetCurrentDeploymentReplicasFromDeployment and the VariantAutoscaling parameter to locate the code.
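The ObservedGeneration-based fallback described in this finding can be sketched outside the controller; `deploymentView` is a hypothetical stand-in for the relevant `appsv1.Deployment` fields, not the real API type:

```go
package main

import "fmt"

// deploymentView mimics the Deployment fields the fix relies on.
type deploymentView struct {
	Generation         int64  // metadata.generation: bumped on each spec change
	ObservedGeneration int64  // status.observedGeneration: last spec the controller saw
	StatusReplicas     int32  // status.replicas
	SpecReplicas       *int32 // spec.replicas
}

// currentReplicas trusts status only once the deployment controller has
// observed the latest spec generation; otherwise it falls back to spec.
func currentReplicas(d deploymentView) (int32, error) {
	if d.ObservedGeneration >= d.Generation {
		return d.StatusReplicas, nil
	}
	if d.SpecReplicas != nil {
		return *d.SpecReplicas, nil
	}
	return 0, fmt.Errorf("replicas not yet known")
}

func main() {
	three := int32(3)
	// Fresh object: status not yet populated, so spec wins.
	fresh := deploymentView{Generation: 1, ObservedGeneration: 0, StatusReplicas: 0, SpecReplicas: &three}
	n, _ := currentReplicas(fresh)
	fmt.Println(n)
	// Settled object: status has caught up and is authoritative.
	settled := deploymentView{Generation: 2, ObservedGeneration: 2, StatusReplicas: 5, SpecReplicas: &three}
	n, _ = currentReplicas(settled)
	fmt.Println(n)
}
```

The guard replaces the always-true `Status.Replicas >= 0` check: a fresh Deployment reports generation 1 with observed generation 0, so the spec value is returned instead of a misleading status zero.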
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 85a478fb-508d-4cae-896b-4036aa1b3335
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (114)
.github/aw/github-agentic-workflows.md, .github/workflows/ci-e2e-openshift.yaml, .github/workflows/ci-pr-checks.yaml, .github/workflows/link-checker.lock.yml, .github/workflows/link-checker.md, .github/workflows/typo-checker.lock.yml, .github/workflows/typo-checker.md, .github/workflows/upstream-monitor.lock.yml, .github/workflows/upstream-monitor.md, CONTRIBUTING.md, Dockerfile, Dockerfile.konflux, Makefile, OWNERS, README.md, _typos.toml, api/v1alpha1/variantautoscaling_types.go, api/v1alpha1/variantautoscaling_types_test.go, api/v1alpha1/zz_generated.deepcopy.go, charts/workload-variant-autoscaler/README.md, charts/workload-variant-autoscaler/crds/llmd.ai_variantautoscalings.yaml, charts/workload-variant-autoscaler/templates/manager/wva-configmap.yaml, charts/workload-variant-autoscaler/templates/manager/wva-deployment-controller-manager.yaml, charts/workload-variant-autoscaler/templates/prometheus-ca-configmap-prom.yaml, charts/workload-variant-autoscaler/templates/prometheus-ca-configmap-wva.yaml, charts/workload-variant-autoscaler/values-dev.yaml, charts/workload-variant-autoscaler/values.yaml, config/crd/bases/llmd.ai_variantautoscalings.yaml, config/samples/hpa-integration.yaml, deploy/README.md, deploy/configmap-queueing-model.yaml, deploy/install.sh, deploy/kind-emulator/install.sh, deploy/kubernetes/README.md, deploy/openshift/README.md, docs/developer-guide/development.md, docs/developer-guide/testing.md, docs/user-guide/crd-reference.md, go.mod, internal/actuator/actuator.go, internal/actuator/actuator_test.go, internal/collector/registration/queueing_model.go, internal/collector/replica_metrics.go, internal/collector/source/pod_va_mapper_test.go, internal/config/config.go, internal/config/config_test.go, internal/config/helpers.go, internal/config/helpers_test.go, internal/config/loader_test.go, internal/config/saturation_scaling.go, internal/config/saturation_scaling_test.go, internal/config/suite_test.go, internal/controller/configmap_bootstrap.go, internal/controller/configmap_helpers.go, internal/controller/configmap_reconciler.go, internal/controller/indexers/indexers.go, internal/controller/indexers/indexers_test.go, internal/controller/predicates.go, internal/controller/variantautoscaling_controller.go, internal/controller/variantautoscaling_controller_test.go, internal/engines/analyzers/queueingmodel/analyzer.go, internal/engines/analyzers/queueingmodel/config.go, internal/engines/analyzers/queueingmodel/defaults.go, internal/engines/analyzers/queueingmodel/parameters.go, internal/engines/analyzers/queueingmodel/parameters_test.go, internal/engines/analyzers/queueingmodel/tuner/defaults.go, internal/engines/analyzers/queueingmodel/tuner/utils.go, internal/engines/analyzers/queueingmodel/utils.go, internal/engines/analyzers/queueingmodel/utils_test.go, internal/engines/analyzers/saturation_v2/analyzer.go, internal/engines/analyzers/saturation_v2/analyzer_test.go, internal/engines/common/cache.go, internal/engines/common/cache_test.go, internal/engines/pipeline/cost_aware_optimizer.go, internal/engines/pipeline/cost_aware_optimizer_test.go, internal/engines/pipeline/enforcer.go, internal/engines/pipeline/enforcer_test.go, internal/engines/pipeline/greedy_saturation_algorithm.go, internal/engines/pipeline/greedy_saturation_algorithm_test.go, internal/engines/pipeline/greedy_saturation_optimizer.go, internal/engines/pipeline/greedy_score_optimizer.go, internal/engines/pipeline/greedy_score_optimizer_test.go, internal/engines/pipeline/optimizer_interfaces.go, internal/engines/saturation/engine.go, internal/engines/saturation/engine_queueing_model.go, internal/engines/saturation/engine_test.go, internal/engines/saturation/engine_v2.go, internal/engines/saturation/engine_v2_test.go, internal/engines/scalefromzero/engine.go, internal/engines/scalefromzero/engine_test.go, internal/interfaces/analyzer.go, internal/interfaces/queueing_model_scaling.go, internal/interfaces/queueing_model_scaling_test.go, internal/interfaces/saturation_analyzer.go, internal/interfaces/saturation_scaling_test.go, internal/saturation/analyzer.go, internal/saturation/analyzer_test.go, internal/utils/allocation.go, internal/utils/
tls.gointernal/utils/tls_test.gointernal/utils/utils.gointernal/utils/utils_test.gointernal/utils/variant.gointernal/utils/variant_test.gotest/e2e/fixtures/va_builder.gotest/e2e/limiter_test.gotest/e2e/parallel_load_scaleup_test.gotest/e2e/saturation_test.gotest/e2e/scale_from_zero_test.gotest/e2e/scale_to_zero_test.gotest/e2e/smoke_test.gotest/e2e/suite_test.gotest/utils/debug_helpers.gotest/utils/e2eutils.go
💤 Files with no reviewable changes (11)
- deploy/openshift/README.md
- .github/aw/github-agentic-workflows.md
- .github/workflows/upstream-monitor.lock.yml
- .github/workflows/typo-checker.md
- deploy/kubernetes/README.md
- .github/workflows/upstream-monitor.md
- .github/workflows/link-checker.md
- internal/interfaces/saturation_scaling_test.go
- .github/workflows/link-checker.lock.yml
- internal/engines/pipeline/greedy_saturation_optimizer.go
- .github/workflows/typo-checker.lock.yml
NIS        float64     // Normalized Innovation Squared
Covariance [][]float64 // state covariance matrix
Return and store copies, not shared *LearnedParameters.
Get hands out the in-map pointer and Set stores the caller's pointer verbatim. That lets any goroutine mutate Covariance, NIS, or LastUpdated after the lock is released, so this store can still race and corrupt shared state.
Proposed fix
+func cloneLearnedParameters(in *LearnedParameters) *LearnedParameters {
+ if in == nil {
+ return nil
+ }
+
+ out := *in
+ if in.Covariance != nil {
+ out.Covariance = make([][]float64, len(in.Covariance))
+ for i := range in.Covariance {
+ out.Covariance[i] = append([]float64(nil), in.Covariance[i]...)
+ }
+ }
+ return &out
+}
+
// Get retrieves parameters for a variant (nil if does not exist)
func (s *ParameterStore) Get(namespace, variantName string) *LearnedParameters {
s.mu.RLock()
defer s.mu.RUnlock()
key := makeVariantKey(namespace, variantName)
- return s.params[key]
+ return cloneLearnedParameters(s.params[key])
}
// Set stores parameters for a variant (overrides any earlier parameters)
func (s *ParameterStore) Set(namespace, variantName string, params *LearnedParameters) {
s.mu.Lock()
defer s.mu.Unlock()
key := makeVariantKey(namespace, variantName)
- s.params[key] = params
+ s.params[key] = cloneLearnedParameters(params)
}Also applies to: 35-47
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/engines/analyzers/queueingmodel/parameters.go` around lines 21 - 22,
Get currently returns the pointer from the map and Set stores the caller's
pointer directly, allowing callers to mutate shared state; change the Get and
Set methods for the LearnedParameters map to always produce and consume deep
copies: when getting, allocate a new LearnedParameters, copy primitive fields
(NIS, LastUpdated), and deep-copy Covariance by allocating outer and inner
slices and copying floats; when setting, make the same deep copy of the incoming
*LearnedParameters before storing it in the map so the map never holds or
exposes caller-owned slice pointers.
@shuynh2017, agreed — this is a real race condition (CWE-362). Both Get and Set hand out/store raw pointers to LearnedParameters, so any goroutine can mutate Covariance, NIS, or LastUpdated after the lock is released. The cloneLearnedParameters helper correctly isolates ownership on both sides.
- Add `cloneLearnedParameters` helper that deep-copies `Covariance` (nested slice)
- Update `Get` to return a clone instead of the raw map pointer
- Update `Set` to store a clone instead of the caller's pointer
Do you want to open a PR upstream for this fix?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
/approve
Description
Syncs the llm-d/llm-d-workload-variant-autoscaler main branch into the ODH main branch, with two updates:
created by update: add agent to do upstream sync #41
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit
New Features
Documentation
Removals