Adding CD daemon ready to not ready test#1051
Adding CD daemon ready to not ready test#1051visheshtanksale wants to merge 2 commits intokubernetes-sigs:mainfrom
Conversation
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: visheshtanksale The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
00532ad to
93990c1
Compare
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
|
@visheshtanksale: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Good companion to the existing shutdown-path test in test_cd_misc.bats:216: this one locks in the transient-unready path (entry retained with status: NotReady) vs. the shutdown path (entry removed). A few asks before merge.
1. sleep 4 × 3 is flaky under CI load. All three sites (pre-STOP baseline, post-STOP NotReady assert, post-CONT recovery assert) are proxies for "controller has observed and aggregated". Prefer kubectl wait --for=jsonpath='{.status.status}'=NotReady ... --timeout=30s against the ComputeDomain — same pattern you already use for --for=condition=Ready on the pod side. Keeps the test fast when the system is fast and honest when it's slow.
2. Single-pod scope leaves the multi-writer convergence path uncovered. The test uses one worker / one $DAEMON_POD, so it cannot exercise the syncNodeInfoToCD lost-update race #1049 reproduces (that bug needs N≥2 daemons). Worth either linking #1049 in the PR description as a known uncovered scenario, or adding an N≥2 sibling (bats already has multi-node demos in test_cd_mnnvl_workload.bats) that asserts len(.status.nodes) == numNodes after the rollout.
3. Which probe flips readiness? SIGSTOP doesn't map cleanly to a single failure mode: startup / readiness / liveness probes interact differently, and a real IMEX hang (stalled gRPC, GPU-topology read blocked on driver locks) behaves differently from a stopped process. Could you name in a comment (or in the test description) which probe's contract is under test here? That makes the boundary of what this test guarantees explicit.
4. Assertion depth. The test stops at .status.status == NotReady and doesn't check downstream effects (peer daemon signalling, claim invalidation, workload-pod status propagation). That's probably correct for this driver's current surface, but calling it out in the PR body helps future readers understand where this driver's node-bad state machine currently ends.
5. Consider asserting a Condition instead of a string field. .status.status as a single string is less idiomatic than a typed Condition with Reason / Message / LastTransitionTime. Not blocking, but a Condition-based surface would dovetail with downstream propagation if/when anything reads this into Pod status.
6. Recovery identity. On SIGCONT, the test confirms .status.status == Ready but doesn't assert the node entry keeps its Index / CliqueID. The CD-daemon DNS-name stability relies on stable Index. Worth a one-line assertion.
Nice test. Happy to move toward approval once the sleeps are replaced with kubectl wait and the #1049-coverage intent is clarified.
Uh oh!
There was an error while loading. Please reload this page.