Adding CD daemon ready to not ready test #1051

Open
visheshtanksale wants to merge 2 commits into kubernetes-sigs:main from visheshtanksale:test-cd
Conversation

@visheshtanksale
Contributor

@visheshtanksale visheshtanksale commented Apr 17, 2026

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the labels do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) on Apr 17, 2026.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: visheshtanksale
Once this PR has been reviewed and has the lgtm label, please assign shivamerla for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the labels needs-kind (indicates a PR lacks a `kind/foo` label and requires one) and size/M (denotes a PR that changes 30-99 lines, ignoring generated files) on Apr 17, 2026.
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
@k8s-ci-robot
Contributor

@visheshtanksale: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-dra-driver-nvidia-gpu-e2e-lambda-gpu | e2da476 | link | false | `/test pull-dra-driver-nvidia-gpu-e2e-lambda-gpu` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Good companion to the existing shutdown-path test at `test_cd_misc.bats:216`. This one locks in the transient-unready path (entry retained with `status: NotReady`), as opposed to the shutdown path (entry removed). A few asks before merge.

1. The three `sleep 4` calls are flaky under CI load. All three sites (pre-STOP baseline, post-STOP NotReady assert, post-CONT recovery assert) are proxies for "controller has observed and aggregated". Prefer `kubectl wait --for=jsonpath='{.status.status}'=NotReady ... --timeout=30s` against the ComputeDomain, the same pattern you already use with `--for=condition=Ready` on the pod side. That keeps the test fast when the system is fast and honest when it's slow.

2. Single-pod scope leaves the multi-writer convergence path uncovered. The test uses one worker and one `$DAEMON_POD`, so it cannot exercise the `syncNodeInfoToCD` lost-update race that #1049 reproduces (that bug needs N≥2 daemons). Worth either linking #1049 in the PR description as a known uncovered scenario, or adding an N≥2 sibling test (bats already has multi-node demos in `test_cd_mnnvl_workload.bats`) that asserts `len(.status.nodes) == numNodes` after the rollout.

3. Which probe flips readiness? SIGSTOP doesn't map cleanly to a single failure mode: startup, readiness, and liveness probes each react differently, and a real IMEX hang (stalled gRPC, GPU-topology read blocked on driver locks) behaves differently from a stopped process. Could you name, in a code comment or in the test description, which probe's contract is under test here? That makes the boundary of what this test guarantees explicit.

4. Assertion depth. The test stops at `.status.status == NotReady` and doesn't check downstream effects (peer daemon signalling, claim invalidation, workload-pod status propagation). That's probably correct for this driver's current surface, but calling it out in the PR body helps future readers understand where this driver's node-bad state machine currently ends.

5. Consider asserting a Condition instead of a string field. A single-string `.status.status` is less idiomatic than a typed Condition with `Reason` / `Message` / `LastTransitionTime`. Not blocking, but a Condition-based surface would dovetail with downstream propagation if and when anything reads this into Pod status.

6. Recovery identity. On SIGCONT, the test confirms `.status.status == Ready` but doesn't assert that the node entry keeps its `Index` / `CliqueID`. CD-daemon DNS-name stability relies on a stable `Index`. Worth a one-line assertion.
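
For items 2 and 6, here is a rough sketch of the kind of helpers I have in mind. The function names, the `computedomain` resource spelling, and the `name` / `index` / `cliqueID` JSON field names are my assumptions and are not taken from this PR, so adjust them to match the CRD.

```shell
# Hypothetical helpers; resource and field names are assumptions.

# Print the number of entries in .status.nodes of a ComputeDomain.
# $1: ComputeDomain name, $2: namespace
cd_node_count() {
  local names
  names=$(kubectl get computedomain "$1" -n "$2" \
    -o jsonpath='{.status.nodes[*].name}')
  # Count the whitespace-separated node names.
  set -- $names
  echo "$#"
}

# Print "<index>/<cliqueID>" for one node entry so it can be compared
# before and after the SIGSTOP / SIGCONT cycle.
# $1: ComputeDomain name, $2: namespace, $3: node name
cd_node_identity() {
  kubectl get computedomain "$1" -n "$2" \
    -o jsonpath="{.status.nodes[?(@.name==\"$3\")].index}/{.status.nodes[?(@.name==\"$3\")].cliqueID}"
}

# Intended use (variable names are placeholders):
#   before=$(cd_node_identity "$CD_NAME" "$CD_NS" "$NODE")
#   ... stop and resume the daemon ...
#   [ "$(cd_node_identity "$CD_NAME" "$CD_NS" "$NODE")" = "$before" ]
#   [ "$(cd_node_count "$CD_NAME" "$CD_NS")" -eq "$NUM_NODES" ]
```

Comparing the combined `index/cliqueID` string keeps the recovery check to a single line, and the count helper gives the N≥2 sibling test its `len(.status.nodes) == numNodes` assertion without adding a `jq` dependency.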

Nice test. Happy to move toward approval once the sleeps are replaced with `kubectl wait` and the #1049-coverage intent is clarified.
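
To make the main ask concrete, here is a minimal sketch of the wait I'm suggesting in item 1. `wait_cd_status` is a hypothetical helper name, and the `computedomain` resource spelling and the name/namespace arguments are assumptions on my side. The `--for=jsonpath` form itself is standard in recent kubectl releases.

```shell
# Hypothetical helper: block until the ComputeDomain reports the wanted
# .status.status value, or fail once the timeout expires.
# $1: ComputeDomain name, $2: namespace, $3: wanted status,
# $4: timeout (optional, defaults to 30s)
wait_cd_status() {
  kubectl wait "computedomain/$1" -n "$2" \
    --for=jsonpath='{.status.status}'="$3" \
    --timeout="${4:-30s}"
}

# Intended use, replacing each `sleep 4` (variable names are placeholders):
#   wait_cd_status "$CD_NAME" "$CD_NS" Ready      # pre-STOP baseline
#   ... send SIGSTOP to the daemon ...
#   wait_cd_status "$CD_NAME" "$CD_NS" NotReady   # post-STOP assert
#   ... send SIGCONT to the daemon ...
#   wait_cd_status "$CD_NAME" "$CD_NS" Ready      # post-CONT recovery
```

Because `kubectl wait` returns as soon as the field matches and exits non-zero on timeout, the same call serves as both the synchronization point and the assertion.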

Labels

cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels)
needs-kind (indicates a PR lacks a `kind/foo` label and requires one)
size/M (denotes a PR that changes 30-99 lines, ignoring generated files)

Projects

Status: Backlog

3 participants