Skip to content

Add webhook service endpoint readiness check before creating default PtpOperatorConfig#207

Open
nocturnalastro wants to merge 1 commit intok8snetworkplumbingwg:mainfrom
nocturnalastro:fix_operator_webhook_race
Open

Add webhook service endpoint readiness check before creating default PtpOperatorConfig#207
nocturnalastro wants to merge 1 commit intok8snetworkplumbingwg:mainfrom
nocturnalastro:fix_operator_webhook_race

Conversation

@nocturnalastro
Copy link
Copy Markdown
Collaborator

@nocturnalastro nocturnalastro commented Apr 1, 2026

The operator could fail to create the default PtpOperatorConfig on
startup due to a race condition: the local webhook server is ready
but the Kubernetes Service endpoints are not yet populated, causing
the API server's validating webhook call to fail with "no endpoints
available". Without retry, the default config is never created and
the linuxptp-daemon DaemonSet is never spawned.

Extend waitForWebhookServer to also verify the webhook-service is
reachable via cluster DNS (using a TCP dial) before proceeding to
create the default config. This ensures the endpoint controller has
had time to populate the Service endpoints. Use goroutine-local err
variables to avoid a data race with mgr.Start, and check ctx.Done()
so the goroutine terminates cleanly on manager shutdown.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an exponential backoff retry mechanism for the creation of the default PtpOperatorConfig in main.go. Feedback highlights a potential data race on the err variable shared with the main function scope. Additionally, suggestions were made to optimize the retry loop by using the existing restConfig instead of repeatedly calling ctrl.GetConfigOrDie() and to incorporate context cancellation checks to ensure the goroutine terminates gracefully during shutdown.

main.go Outdated
setupLog.Info("waiting for validating webhook to be ready")
err = waitForWebhookServer(checker)
if err != nil {
if err := waitForWebhookServer(checker); err != nil {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the commit message it seems that the issue is that waitForWebhookServer needs to wait for certain endpoints to be populated. Would it make more sense to add that check to waitForWebhookServer?

It seems to me that it is a bug if waitForWebhookServer returns before the webhook is fully set up.

Copy link
Copy Markdown
Collaborator Author

@nocturnalastro nocturnalastro Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed I'll take a look :)

sebsoto
sebsoto previously approved these changes Apr 1, 2026
Copy link
Copy Markdown
Collaborator

@sebsoto sebsoto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@josephdrichard
Copy link
Copy Markdown
Collaborator

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the webhook server readiness logic by increasing the timeout, adding a reachability check for the service via cluster DNS, and incorporating context support. It also refactors the operator configuration initialization to use an existing REST configuration. The reviewer suggested several improvements: implementing the intended exponential backoff strategy for configuration creation, externalizing hardcoded service parameters, and ensuring all network operations and sleep intervals are context-aware to facilitate graceful shutdowns.

dialTimeout = 2 * time.Second
)
start := time.Now()
webhookServiceAddr := fmt.Sprintf("webhook-service.%s.svc:%d", names.Namespace, 443)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The webhook service name webhook-service and port 443 are hardcoded. It would be better to define these as constants in pkg/names or derive them from the operator's configuration to improve maintainability.

@nocturnalastro nocturnalastro changed the title Add exponential backoff retry for default PtpOperatorConfig creation Add webhook service endpoint readiness check before creating default PtpOperatorConfig Apr 1, 2026
@nocturnalastro
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the webhook readiness check by ensuring the webhook service is reachable via cluster DNS before proceeding with the operator configuration. It introduces context-awareness to the waitForWebhookServer function and adds a secondary polling loop to verify network connectivity. The review feedback suggests improving the responsiveness of these polling loops to context cancellation by replacing time.Sleep with select statements and utilizing DialContext for network operations to ensure the process can exit promptly when the context is canceled.

Comment on lines 342 to +355
for {
select {
case <-ctx.Done():
return ctx.Err()
default:
}
if err = checker(req); err == nil {
return nil
} else if time.Since(start) > timeout {
break
}
if time.Since(start) > timeout {
return fmt.Errorf("timeout waiting for webhook server to start")
}
time.Sleep(pollingFreq) // Poll every second
time.Sleep(pollingFreq)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The polling loop uses time.Sleep, which makes it unresponsive to context cancellation during the sleep period. Using a select block with time.After allows the loop to exit immediately when the context is canceled.

	for {
		if err = checker(req); err == nil {
			break
		}
		if time.Since(start) > timeout {
			return fmt.Errorf("timeout waiting for webhook server to start")
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(pollingFreq):
		}
	}

Comment on lines +359 to 375
for {
select {
case <-ctx.Done():
return ctx.Err()
default:
}
conn, err := net.DialTimeout("tcp", webhookServiceAddr, dialTimeout)
if err == nil {
conn.Close()
setupLog.Info("webhook service endpoints are ready")
return nil
}
if time.Since(start) > timeout {
return fmt.Errorf("timeout waiting for webhook service endpoints to be ready")
}
time.Sleep(pollingFreq)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the first loop, using a select block with time.After improves responsiveness to shutdown signals. Additionally, replacing net.DialTimeout with net.Dialer.DialContext is preferred as it respects the parent context's cancellation during the dial operation.

Suggested change
for {
select {
case <-ctx.Done():
return ctx.Err()
default:
}
conn, err := net.DialTimeout("tcp", webhookServiceAddr, dialTimeout)
if err == nil {
conn.Close()
setupLog.Info("webhook service endpoints are ready")
return nil
}
if time.Since(start) > timeout {
return fmt.Errorf("timeout waiting for webhook service endpoints to be ready")
}
time.Sleep(pollingFreq)
}
for {
dialer := net.Dialer{Timeout: dialTimeout}
conn, err := dialer.DialContext(ctx, "tcp", webhookServiceAddr)
if err == nil {
conn.Close()
setupLog.Info("webhook service endpoints are ready")
return nil
}
if time.Since(start) > timeout {
return fmt.Errorf("timeout waiting for webhook service endpoints to be ready")
}
select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(pollingFreq):
}
}

@nocturnalastro
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the webhook readiness check by adding a verification step for the service's DNS reachability and integrating context-based cancellation. The waitForWebhookServer function now utilizes http.NewRequestWithContext and a non-blocking polling mechanism. Review feedback recommends further improving lifecycle management by propagating the context into createDefaultOperatorConfig and replacing net.DialTimeout with net.DialContext to ensure network operations respect the provided context.

…PtpOperatorConfig

The operator could fail to create the default PtpOperatorConfig on
startup due to a race condition: the local webhook server is ready
but the Kubernetes Service endpoints are not yet populated, causing
the API server's validating webhook call to fail with "no endpoints
available". Without retry, the default config is never created and
the linuxptp-daemon DaemonSet is never spawned.

Extend waitForWebhookServer to also verify the webhook-service is
reachable via cluster DNS (using a TCP dial) before proceeding to
create the default config. This ensures the endpoint controller has
had time to populate the Service endpoints. Use goroutine-local err
variables to avoid a data race with mgr.Start, and check ctx.Done()
so the goroutine terminates cleanly on manager shutdown.

Generated-by: Cursor
@nocturnalastro
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves the webhook server readiness check by ensuring that the service endpoints are reachable via cluster DNS before the operator proceeds. It introduces context propagation, increases the timeout duration, and adds a network dialing loop to verify connectivity. The review feedback suggests enhancing the error messages by including the last captured error when a timeout occurs, which would improve observability and troubleshooting.

Comment on lines 343 to 348
if err = checker(req); err == nil {
return nil
} else if time.Since(start) > timeout {
break
}
if time.Since(start) > timeout {
return fmt.Errorf("timeout waiting for webhook server to start")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When the webhook server readiness check times out, the specific error returned by the checker function is lost. Including the last error in the timeout message would significantly improve troubleshooting when the webhook server fails to start or returns unexpected errors.

Comment on lines +359 to +367
conn, err := (&net.Dialer{Timeout: dialTimeout}).DialContext(ctx, "tcp", webhookServiceAddr)
if err == nil {
conn.Close()
setupLog.Info("webhook service endpoints are ready")
return nil
}
if time.Since(start) > timeout {
return fmt.Errorf("timeout waiting for webhook service endpoints to be ready")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the local readiness check, if the TCP dial to the webhook service endpoints times out, the underlying error (e.g., DNS resolution failure, connection refused, or timeout) is not included in the returned error. Wrapping the last dial error in the timeout message would provide better visibility into why the endpoints are not becoming ready.

Copy link
Copy Markdown
Collaborator

@sebsoto sebsoto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants