Add webhook service endpoint readiness check before creating default PtpOperatorConfig#207
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces an exponential backoff retry mechanism for the creation of the default PtpOperatorConfig in main.go. Feedback highlights a potential data race on the err variable shared with the main function scope. Additionally, suggestions were made to optimize the retry loop by using the existing restConfig instead of repeatedly calling ctrl.GetConfigOrDie() and to incorporate context cancellation checks to ensure the goroutine terminates gracefully during shutdown.
9bbbb13 to
fa5b3a0
Compare
main.go
Outdated
| setupLog.Info("waiting for validating webhook to be ready") | ||
| err = waitForWebhookServer(checker) | ||
| if err != nil { | ||
| if err := waitForWebhookServer(checker); err != nil { |
There was a problem hiding this comment.
From the commit message it seems that the issue is that waitForWebhookServer needs to wait for certain endpoints to be populated. Would it make more sense to add that check to waitForWebhookServer?
It seems to me that it is a bug if waitForWebhookServer returns before the webhook is fully set up.
There was a problem hiding this comment.
Agreed I'll take a look :)
fa5b3a0 to
bbf85fd
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request enhances the webhook server readiness logic by increasing the timeout, adding a reachability check for the service via cluster DNS, and incorporating context support. It also refactors the operator configuration initialization to use an existing REST configuration. The reviewer suggested several improvements: implementing the intended exponential backoff strategy for configuration creation, externalizing hardcoded service parameters, and ensuring all network operations and sleep intervals are context-aware to facilitate graceful shutdowns.
| dialTimeout = 2 * time.Second | ||
| ) | ||
| start := time.Now() | ||
| webhookServiceAddr := fmt.Sprintf("webhook-service.%s.svc:%d", names.Namespace, 443) |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request enhances the webhook readiness check by ensuring the webhook service is reachable via cluster DNS before proceeding with the operator configuration. It introduces context-awareness to the waitForWebhookServer function and adds a secondary polling loop to verify network connectivity. The review feedback suggests improving the responsiveness of these polling loops to context cancellation by replacing time.Sleep with select statements and utilizing DialContext for network operations to ensure the process can exit promptly when the context is canceled.
| for { | ||
| select { | ||
| case <-ctx.Done(): | ||
| return ctx.Err() | ||
| default: | ||
| } | ||
| if err = checker(req); err == nil { | ||
| return nil | ||
| } else if time.Since(start) > timeout { | ||
| break | ||
| } | ||
| if time.Since(start) > timeout { | ||
| return fmt.Errorf("timeout waiting for webhook server to start") | ||
| } | ||
| time.Sleep(pollingFreq) // Poll every second | ||
| time.Sleep(pollingFreq) | ||
| } |
There was a problem hiding this comment.
The polling loop uses time.Sleep, which makes it unresponsive to context cancellation during the sleep period. Using a select block with time.After allows the loop to exit immediately when the context is canceled.
for {
if err = checker(req); err == nil {
break
}
if time.Since(start) > timeout {
return fmt.Errorf("timeout waiting for webhook server to start")
}
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(pollingFreq):
}
}| for { | ||
| select { | ||
| case <-ctx.Done(): | ||
| return ctx.Err() | ||
| default: | ||
| } | ||
| conn, err := net.DialTimeout("tcp", webhookServiceAddr, dialTimeout) | ||
| if err == nil { | ||
| conn.Close() | ||
| setupLog.Info("webhook service endpoints are ready") | ||
| return nil | ||
| } | ||
| if time.Since(start) > timeout { | ||
| return fmt.Errorf("timeout waiting for webhook service endpoints to be ready") | ||
| } | ||
| time.Sleep(pollingFreq) | ||
| } |
There was a problem hiding this comment.
Similar to the first loop, using a select block with time.After improves responsiveness to shutdown signals. Additionally, replacing net.DialTimeout with net.Dialer.DialContext is preferred as it respects the parent context's cancellation during the dial operation.
| for { | |
| select { | |
| case <-ctx.Done(): | |
| return ctx.Err() | |
| default: | |
| } | |
| conn, err := net.DialTimeout("tcp", webhookServiceAddr, dialTimeout) | |
| if err == nil { | |
| conn.Close() | |
| setupLog.Info("webhook service endpoints are ready") | |
| return nil | |
| } | |
| if time.Since(start) > timeout { | |
| return fmt.Errorf("timeout waiting for webhook service endpoints to be ready") | |
| } | |
| time.Sleep(pollingFreq) | |
| } | |
| for { | |
| dialer := net.Dialer{Timeout: dialTimeout} | |
| conn, err := dialer.DialContext(ctx, "tcp", webhookServiceAddr) | |
| if err == nil { | |
| conn.Close() | |
| setupLog.Info("webhook service endpoints are ready") | |
| return nil | |
| } | |
| if time.Since(start) > timeout { | |
| return fmt.Errorf("timeout waiting for webhook service endpoints to be ready") | |
| } | |
| select { | |
| case <-ctx.Done(): | |
| return ctx.Err() | |
| case <-time.After(pollingFreq): | |
| } | |
| } |
bbf85fd to
45b8e79
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request enhances the webhook readiness check by adding a verification step for the service's DNS reachability and integrating context-based cancellation. The waitForWebhookServer function now utilizes http.NewRequestWithContext and a non-blocking polling mechanism. Review feedback recommends further improving lifecycle management by propagating the context into createDefaultOperatorConfig and replacing net.DialTimeout with net.DialContext to ensure network operations respect the provided context.
…PtpOperatorConfig The operator could fail to create the default PtpOperatorConfig on startup due to a race condition: the local webhook server is ready but the Kubernetes Service endpoints are not yet populated, causing the API server's validating webhook call to fail with "no endpoints available". Without retry, the default config is never created and the linuxptp-daemon DaemonSet is never spawned. Extend waitForWebhookServer to also verify the webhook-service is reachable via cluster DNS (using a TCP dial) before proceeding to create the default config. This ensures the endpoint controller has had time to populate the Service endpoints. Use goroutine-local err variables to avoid a data race with mgr.Start, and check ctx.Done() so the goroutine terminates cleanly on manager shutdown. Generated-by: Cursor
45b8e79 to
f935349
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request improves the webhook server readiness check by ensuring that the service endpoints are reachable via cluster DNS before the operator proceeds. It introduces context propagation, increases the timeout duration, and adds a network dialing loop to verify connectivity. The review feedback suggests enhancing the error messages by including the last captured error when a timeout occurs, which would improve observability and troubleshooting.
| if err = checker(req); err == nil { | ||
| return nil | ||
| } else if time.Since(start) > timeout { | ||
| break | ||
| } | ||
| if time.Since(start) > timeout { | ||
| return fmt.Errorf("timeout waiting for webhook server to start") | ||
| } |
There was a problem hiding this comment.
| conn, err := (&net.Dialer{Timeout: dialTimeout}).DialContext(ctx, "tcp", webhookServiceAddr) | ||
| if err == nil { | ||
| conn.Close() | ||
| setupLog.Info("webhook service endpoints are ready") | ||
| return nil | ||
| } | ||
| if time.Since(start) > timeout { | ||
| return fmt.Errorf("timeout waiting for webhook service endpoints to be ready") | ||
| } |
There was a problem hiding this comment.
Similar to the local readiness check, if the TCP dial to the webhook service endpoints times out, the underlying error (e.g., DNS resolution failure, connection refused, or timeout) is not included in the returned error. Wrapping the last dial error in the timeout message would provide better visibility into why the endpoints are not becoming ready.
The operator could fail to create the default PtpOperatorConfig on
startup due to a race condition: the local webhook server is ready
but the Kubernetes Service endpoints are not yet populated, causing
the API server's validating webhook call to fail with "no endpoints
available". Without retry, the default config is never created and
the linuxptp-daemon DaemonSet is never spawned.
Extend waitForWebhookServer to also verify the webhook-service is
reachable via cluster DNS (using a TCP dial) before proceeding to
create the default config. This ensures the endpoint controller has
had time to populate the Service endpoints. Use goroutine-local err
variables to avoid a data race with mgr.Start, and check ctx.Done()
so the goroutine terminates cleanly on manager shutdown.