Skip to content

Collect Kueue diagnostics on test failure#928

Merged
openshift-merge-bot[bot] merged 1 commit into
mainfrom
kueue-diagnostics-on-failure
Jun 22, 2026
Merged

Collect Kueue diagnostics on test failure#928
openshift-merge-bot[bot] merged 1 commit into
mainfrom
kueue-diagnostics-on-failure

Conversation

@sutaakar

@sutaakar sutaakar commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds StoreKueueDiagnostics() to collect Kueue CR state, full DSC, and Kueue controller-manager pod logs when Kueue integration tests fail
  • Wired into SetupKueue() via t.T().Cleanup() gated on t.T().Failed() — zero overhead on passing tests
  • Searches openshift-kueue-operator and kueue-system namespaces for controller pods using control-plane=controller-manager,app.kubernetes.io/name=kueue labels

Motivation

When Kueue integration tests fail (e.g. "default LocalQueue not found"), the current namespace-scoped artifacts (pod logs, events) don't capture the operator-level state needed to diagnose why LocalQueues aren't being auto-created.

Artifacts produced on failure

File Content
kueue-cr-state.log Full Kueue CR JSON (spec + status + conditions)
dsc-state.log Full DataScienceCluster JSON (spec + status)
kueue-operator-{pod}-{container}.log Last 500 lines from each Kueue controller pod

Test plan

  • go vet ./tests/common/support/... — compiles clean
  • golangci-lint — 0 issues
  • verify-imports — passes
  • Verified on cluster: passing test produces no diagnostic artifacts
  • Verified on cluster: failing test collects all three artifact types correctly

🤖 Generated with Claude Code

When Kueue integration tests fail, the existing namespace-scoped artifacts
(pod logs, events) don't capture the operator-level state needed to diagnose
issues like missing default LocalQueue auto-creation. This adds automatic
collection of Kueue CR state, full DSC, and Kueue controller-manager logs
via SetupKueue's cleanup hook, gated on test failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

SetupKueue in tests/common/support/kueue_operator.go gains a t.T().Cleanup callback that invokes the new StoreKueueDiagnostics when the test has failed. StoreKueueDiagnostics fetches and JSON-serializes the Kueue CR and DataScienceCluster object, then calls helpers to discover operator pods across two hardcoded namespaces via a fixed label selector, tails the last 500 log lines per container, and writes all artifacts to the test output directory. New imports (encoding/json, fmt, io, corev1) are added to support these operations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


Direct findings — no praise:

CWE-532 (Information Exposure Through Log Files): StoreKueueDiagnostics writes operator pod logs and CR state (which may include secrets or sensitive config) to the output directory unconditionally when any test fails. Verify the output directory is not world-readable and is excluded from artifact uploads in CI pipelines.

CWE-20 (Improper Input Validation) — hardcoded namespace list in findKueueOperatorPods: Two namespaces are baked in with a fixed label selector. If the operator is deployed under a different namespace (e.g., in a forked or customized environment), the function silently returns an empty result — no error, no warning. This is a silent failure path.

Log tail truncation — tail=500: For long-running or crash-looping containers, 500 lines may discard the earliest relevant failure context. The value is hardcoded with no override mechanism.

Error handling in storeKueueOperatorLogs: Per-container retrieval and read errors are logged and skipped (continue), meaning partial diagnostics can be written without any top-level failure signal. Callers receive no indication that the collection was incomplete.

🚥 Pre-merge checks | ✅ 9 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
No Sensitive Data In Logs ⚠️ Warning Code persists full Kueue CR, DSC, and raw operator pod logs without redacting sensitive fields (passwords, tokens, API keys, secrets) at lines 148, 164, and 193 (CWE-532, CWE-200). Apply redaction function before WriteToOutputDir calls: redact Password, Token, APIKey, Secret fields and base64-encoded values from marshaled JSON and pod logs.
✅ Passed checks (9 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change: adding diagnostic collection for Kueue on test failure, which matches the primary purpose of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Contribution Quality And Spam Detection ✅ Passed PR adds diagnostic collection to Kueue test setup; no spam signals meet the required threshold. Commit explicitly co-authored by Claude; substantial codebase history shows legitimate contributor; c...
No Hardcoded Secrets ✅ Passed No hardcoded secrets, API keys, tokens, passwords, or credentials detected. All string literals are legitimate Kubernetes resource names, log messages, and configuration values.
No Weak Cryptography ✅ Passed No weak cryptographic primitives (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons detected. File uses only non-cryptographic Go std...
No Injection Vectors ✅ Passed No injection vectors detected. Code uses only trusted inputs from Kubernetes API and hardcoded constants. Test file exception applies (CWE-78, CWE-89, CWE-94, CWE-502, CWE-79).
No Privileged Containers ✅ Passed PR modifies only Go test code (tests/common/support/kueue_operator.go), not Kubernetes/OpenShift manifests, Helm templates, or Dockerfiles; check is not applicable.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@sutaakar sutaakar requested review from ChughShilpa and removed request for kryanbeane June 22, 2026 07:02

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/common/support/kueue_operator.go`:
- Around line 148-149: The code persists full resource JSON and operator logs
through WriteToOutputDir calls at lines 148, 164, and 193 without redacting
sensitive fields like Password, Token, APIKey, and Secret.Data, which can expose
credentials in CI artifacts. Create a helper function to redact these sensitive
field values from the data before it is written to the output directory, then
apply this redaction function to the data parameter in all three
WriteToOutputDir calls (the kueue-cr-state, operator log, and any other artifact
writes) to ensure sensitive information is masked in the diagnostic output.
- Around line 180-187: The io.ReadAll(stream) call on line 186 reads pod logs
without any byte limit, which can cause memory exhaustion if logs grow
unbounded. Wrap the stream returned from GetLogs().Stream() with io.LimitReader
to set a maximum byte limit before passing it to io.ReadAll. This prevents
unbounded memory allocation during pod log retrieval in the test support
function.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d5d53227-7e4e-48fe-8ae8-e5efa84ea2d4

📥 Commits

Reviewing files that changed from the base of the PR and between 2241bf3 and 26bcf98.

📒 Files selected for processing (1)
  • tests/common/support/kueue_operator.go

Comment thread tests/common/support/kueue_operator.go
Comment thread tests/common/support/kueue_operator.go

@ChughShilpa ChughShilpa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Karel

@sutaakar

Copy link
Copy Markdown
Contributor Author

/approve

@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ChughShilpa, sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 36f9003 into main Jun 22, 2026
11 checks passed
@sutaakar sutaakar deleted the kueue-diagnostics-on-failure branch June 22, 2026 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants