Skip to content

CNF-19588: CI: clockClass recovery test after cloud-event-proxy crash#171

Open
jzding wants to merge 3 commits intok8snetworkplumbingwg:mainfrom
jzding:ci-clockclass
Open

CNF-19588: CI: clockClass recovery test after cloud-event-proxy crash#171
jzding wants to merge 3 commits intok8snetworkplumbingwg:mainfrom
jzding:ci-clockclass

Conversation

@jzding
Copy link
Copy Markdown
Contributor

@jzding jzding commented Jan 31, 2026

Update: clockClass recovery test after cloud-event-proxy crash

Added a new test "Should recover clockClass via event API after cloud-event-proxy crash" that verifies clock class is correctly reported through three independent channels after killing and restarting the cloud-event-proxy sidecar process.

Test flow

  1. Determine expected clock class based on discovered PTP mode
  2. Deploy event consumer app
  3. Pre-crash verification — check clockClass via metrics, PMC (gm.ClockClass from PARENT_DATA_SET), and Event API
  4. Kill cloud-event-proxy process (kill -9)
  5. Wait for process restart + health endpoint recovery
  6. Post-crash verification — re-check clockClass via all three channels

Clock class expectations by mode

Mode Metrics / Event API PMC gm.ClockClass
BC, DNBC, DNBC-HA, T-GM 6 6
OC, DualFollower (≥4.21) 255 (SlaveOnly) 6
OC, DualFollower (<4.21) 6 6

New helper

  • checkClockClassViaPMC() — runs pmc -b 0 -u -f /var/run/ptp4l.0.config GET PARENT_DATA_SET, parses gm.ClockClass via regex, and asserts the expected value with Eventually.
  • pmcClockClassRe — compiled regex gm\.ClockClass\s+(\d+) for parsing PMC output.

Mode support

Test now runs on all relevant PTP modes: BC, DNBC, DNBC-HA, T-GM, OC, and DualFollower. Other modes are skipped.

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello @jzding, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical conformance test designed to validate the robustness of PTP clock synchronization. The test specifically ensures that the PTP clockClass, a key indicator of clock quality, consistently maintains its expected value of 6 even when the cloud-event-proxy component undergoes a process restart. This enhancement significantly strengthens the reliability of the PTP solution by confirming its ability to recover gracefully from transient service interruptions.

Highlights

  • New Conformance Test: A new test case has been added to verify that the PTP clockClass remains at 6 after the cloud-event-proxy process is killed and restarted. This ensures the stability and resilience of the PTP synchronization mechanism.
  • Verification Methods: The new test utilizes both the metrics API and the Event Fast Notification API to confirm the clockClass state before and after the cloud-event-proxy restart.
  • New Helper Function: A new helper function, verifyClockClassViaEventAPI, was introduced to specifically verify the clockClass using the Event Fast Notification API, enhancing test utility.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@jzding
Copy link
Copy Markdown
Contributor Author

jzding commented Jan 31, 2026

/ok-to-test

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a new test case to verify that clockClass is maintained after the cloud-event-proxy process is restarted. The changes include the new test logic and a helper function verifyClockClassViaEventAPI. The implementation is mostly correct, but I've found an opportunity to improve the timeout handling in the new helper function to make it more predictable and robust.

@jzding jzding changed the title CI: check clockclass after cloud-event-proxy process restart [WIP] CI: check clockclass after cloud-event-proxy process restart Feb 10, 2026
@jzding jzding force-pushed the ci-clockclass branch 2 times, most recently from 235cb37 to f76ee4e Compare February 12, 2026 18:31
@jzding jzding changed the title [WIP] CI: check clockclass after cloud-event-proxy process restart [WIP] CI: clockClass recovery test after cloud-event-proxy crash Feb 12, 2026
@jzding jzding force-pushed the ci-clockclass branch 4 times, most recently from 3cad3ef to 67d9c5b Compare March 30, 2026 19:36
@jzding
Copy link
Copy Markdown
Contributor Author

jzding commented Mar 30, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new conformance test to verify that the PTP clockClass recovers correctly via the Event API after a cloud-event-proxy crash. It also refactors the PushInitialEvent function to use context-based timeouts for log scanning. A review comment suggests replacing a fixed time.Sleep with a more robust Eventually block to avoid potential test flakiness.

@jzding
Copy link
Copy Markdown
Contributor Author

jzding commented Mar 30, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new conformance test to verify clockClass recovery after a cloud-event-proxy crash. It also enhances test reliability by replacing fixed time.Sleep calls with a polling mechanism that waits for the consumer app's health endpoint and refactors PushInitialEvent to use context-based timeouts for log scanning. A suggestion was provided to use pkill instead of kill -9 $(pgrep ...) for more robust process termination within the test environment.

@jzding jzding changed the title [WIP] CI: clockClass recovery test after cloud-event-proxy crash [WIP] CNF-19588: CI: clockClass recovery test after cloud-event-proxy crash Mar 30, 2026
@jzding jzding changed the title [WIP] CNF-19588: CI: clockClass recovery test after cloud-event-proxy crash CNF-19588: CI: clockClass recovery test after cloud-event-proxy crash Mar 31, 2026
@jzding
Copy link
Copy Markdown
Contributor Author

jzding commented Apr 1, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new conformance test to verify that the PTP clock class state is correctly recovered via the Event API after a cloud-event-proxy crash. It adds helper functions for verifying clock class via PMC and the Event API's current state endpoint, and improves the consumer app's readiness check by replacing a hardcoded sleep with a polling mechanism. Feedback was provided to improve the robustness of the tests by using more flexible regexes for process management, adding timeouts to manual polling loops to prevent indefinite blocking, and ensuring proper error handling after log scanning loops.

@jzding jzding force-pushed the ci-clockclass branch 2 times, most recently from d7eef32 to 0462031 Compare April 1, 2026 17:42
@jzding
Copy link
Copy Markdown
Contributor Author

jzding commented Apr 1, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new conformance test to verify clock class recovery via the Event API after a cloud-event-proxy crash. Key changes include the addition of helper functions for verifying clock class via PMC and the Event API, the replacement of hardcoded sleeps with a more robust WaitForConsumerReady check, and an update to the retrieval of initial events to run concurrently. Feedback focuses on handling ignored errors from PushInitialEvent to prevent silent failures and optimizing redundant log scanning when subscribing to multiple event topics to improve efficiency.

@jzding
Copy link
Copy Markdown
Contributor Author

jzding commented Apr 1, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new conformance test to verify that the PTP clockClass correctly recovers via the Event API after a cloud-event-proxy crash. It includes significant improvements to the test infrastructure, such as replacing static sleeps with a robust WaitForConsumerReady check and refactoring log scanning to handle multiple event topics more efficiently. One piece of feedback suggests that the hardcoded 1-minute timeout for the publisher health check may be too short under high load and should be made configurable or extended to prevent flakiness.

Add crash recovery test verifying clockClass is recovered via event API
after cloud-event-proxy process restart. Covers OC, BC, TGM, and
dual-follower configs.

Fix PushInitialEvent hang by using context.WithTimeout. Add scanner.Err()
check for better error reporting. Add --max-time to publisher health curl.

WaitForConsumerReady polls the publisher service health URL from within the
consumer pod, then waits for the consumer app to log "waiting for events",
ensuring the full event pipeline is ready.

Run PushInitialEvent calls concurrently in SubscribeToGMChangeEvents to
avoid sequential timeouts. verifyClockClassCurrentState subscribes only
to clock class topic instead of all three event types.

Signed-off-by: Jack Ding <jackding@gmail.com>
In BC mode, clock class events contain two values (one per ptp4l
instance) both with data_type "metric". createStoredEvent used
data_type as the map key, causing the second value (255) to overwrite
the first (6).

Fix by using ResourceAddress as the map key in createStoredEvent,
which is unique per value. verifyClockClassCurrentState now iterates
all values looking for a match instead of a single key lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jack Ding <jackding@gmail.com>
When no internal GM solution exists but external GM solutions are
available, automatically fall back to external GM mode instead of
failing. Applies to OC, BC, DualNicBC, and DualFollower config
generation. This enables tests to run on single-node clusters with
an external grandmaster without requiring EXTERNAL_GM=true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jack Ding <jackding@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant