CNF-19588: CI: clockClass recovery test after cloud-event-proxy crash#171
CNF-19588: CI: clockClass recovery test after cloud-event-proxy crash#171jzding wants to merge 3 commits intok8snetworkplumbingwg:mainfrom
Conversation
Summary of ChangesHello @jzding, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a critical conformance test designed to validate the robustness of PTP clock synchronization. The test specifically ensures that the PTP clockClass, a key indicator of clock quality, consistently maintains its expected value of 6 even when the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/ok-to-test |
There was a problem hiding this comment.
Code Review
This pull request adds a new test case to verify that clockClass is maintained after the cloud-event-proxy process is restarted. The changes include the new test logic and a helper function verifyClockClassViaEventAPI. The implementation is mostly correct, but I've found an opportunity to improve the timeout handling in the new helper function to make it more predictable and robust.
235cb37 to
f76ee4e
Compare
56d3ce6 to
a690541
Compare
3cad3ef to
67d9c5b
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new conformance test to verify that the PTP clockClass recovers correctly via the Event API after a cloud-event-proxy crash. It also refactors the PushInitialEvent function to use context-based timeouts for log scanning. A review comment suggests replacing a fixed time.Sleep with a more robust Eventually block to avoid potential test flakiness.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new conformance test to verify clockClass recovery after a cloud-event-proxy crash. It also enhances test reliability by replacing fixed time.Sleep calls with a polling mechanism that waits for the consumer app's health endpoint and refactors PushInitialEvent to use context-based timeouts for log scanning. A suggestion was provided to use pkill instead of kill -9 $(pgrep ...) for more robust process termination within the test environment.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new conformance test to verify that the PTP clock class state is correctly recovered via the Event API after a cloud-event-proxy crash. It adds helper functions for verifying clock class via PMC and the Event API's current state endpoint, and improves the consumer app's readiness check by replacing a hardcoded sleep with a polling mechanism. Feedback was provided to improve the robustness of the tests by using more flexible regexes for process management, adding timeouts to manual polling loops to prevent indefinite blocking, and ensuring proper error handling after log scanning loops.
d7eef32 to
0462031
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new conformance test to verify clock class recovery via the Event API after a cloud-event-proxy crash. Key changes include the addition of helper functions for verifying clock class via PMC and the Event API, the replacement of hardcoded sleeps with a more robust WaitForConsumerReady check, and an update to the retrieval of initial events to run concurrently. Feedback focuses on handling ignored errors from PushInitialEvent to prevent silent failures and optimizing redundant log scanning when subscribing to multiple event topics to improve efficiency.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new conformance test to verify that the PTP clockClass correctly recovers via the Event API after a cloud-event-proxy crash. It includes significant improvements to the test infrastructure, such as replacing static sleeps with a robust WaitForConsumerReady check and refactoring log scanning to handle multiple event topics more efficiently. One piece of feedback suggests that the hardcoded 1-minute timeout for the publisher health check may be too short under high load and should be made configurable or extended to prevent flakiness.
Add crash recovery test verifying clockClass is recovered via event API after cloud-event-proxy process restart. Covers OC, BC, TGM, and dual-follower configs. Fix PushInitialEvent hang by using context.WithTimeout. Add scanner.Err() check for better error reporting. Add --max-time to publisher health curl. WaitForConsumerReady polls the publisher service health URL from within the consumer pod, then waits for the consumer app to log "waiting for events", ensuring the full event pipeline is ready. Run PushInitialEvent calls concurrently in SubscribeToGMChangeEvents to avoid sequential timeouts. verifyClockClassCurrentState subscribes only to clock class topic instead of all three event types. Signed-off-by: Jack Ding <jackding@gmail.com>
In BC mode, clock class events contain two values (one per ptp4l instance) both with data_type "metric". createStoredEvent used data_type as the map key, causing the second value (255) to overwrite the first (6). Fix by using ResourceAddress as the map key in createStoredEvent, which is unique per value. verifyClockClassCurrentState now iterates all values looking for a match instead of a single key lookup. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Jack Ding <jackding@gmail.com>
When no internal GM solution exists but external GM solutions are available, automatically fall back to external GM mode instead of failing. Applies to OC, BC, DualNicBC, and DualFollower config generation. This enables tests to run on single-node clusters with an external grandmaster without requiring EXTERNAL_GM=true. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Jack Ding <jackding@gmail.com>
Update: clockClass recovery test after cloud-event-proxy crash
Added a new test
"Should recover clockClass via event API after cloud-event-proxy crash"that verifies clock class is correctly reported through three independent channels after killing and restarting the cloud-event-proxy sidecar process.Test flow
gm.ClockClassfromPARENT_DATA_SET), and Event APIkill -9)Clock class expectations by mode
gm.ClockClassNew helper
checkClockClassViaPMC()— runspmc -b 0 -u -f /var/run/ptp4l.0.config GET PARENT_DATA_SET, parsesgm.ClockClassvia regex, and asserts the expected value withEventually.pmcClockClassRe— compiled regexgm\.ClockClass\s+(\d+)for parsing PMC output.Mode support
Test now runs on all relevant PTP modes: BC, DNBC, DNBC-HA, T-GM, OC, and DualFollower. Other modes are skipped.