Depend on the gRPC layer for connection state changes by mnafees · Pull Request #2977 · hatchet-dev/hatchet

mnafees · 2026-02-09T20:06:07Z

Description

Instead of sleeps during reconnection attempts in the Go SDK listener, we want to depend on the native gRPC connection state.

Type of change

Chore (changes which are not directly related to any business logic)

Copilot

Pull request overview

This PR updates the Go SDK’s workflow run listener reconnection behavior to rely on gRPC’s native connectivity state transitions instead of using fixed sleep-based delays during retry/reconnect loops.

Changes:

Add a grpc.ClientConn reference to the workflow runs listener so it can observe connection state.
Introduce waitForReadyLocked using grpc/connectivity (GetState, Connect, WaitForStateChange) as the retry delay mechanism.
Replace several time.Sleep calls in resubscribe/send/listen retry paths with connection-state waiting.
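
The wait loop described in the list above can be sketched roughly as follows. This is not the PR's code: `State`, `stateConn`, `fakeConn`, and `tryWait` are hypothetical stand-ins so the example runs without the gRPC module, with `stateConn` mirroring the three `*grpc.ClientConn` methods named above (`GetState`, `Connect`, `WaitForStateChange`).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// State is a local stand-in for grpc's connectivity.State (assumption for this sketch).
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
	Shutdown
)

// stateConn is a hypothetical interface mirroring the ClientConn methods named above.
type stateConn interface {
	GetState() State
	Connect()
	WaitForStateChange(ctx context.Context, source State) bool
}

var errShutdown = errors.New("connection shutdown")

// waitForReady blocks until conn is Ready, conn is Shutdown, or ctx ends.
func waitForReady(ctx context.Context, conn stateConn) error {
	for {
		state := conn.GetState()
		switch state {
		case Ready:
			return nil
		case Shutdown:
			return errShutdown
		case Idle:
			conn.Connect() // only an idle channel needs a nudge
		}
		// A false result means ctx ended before the state left `state`.
		if !conn.WaitForStateChange(ctx, state) {
			return ctx.Err()
		}
	}
}

// fakeConn replays a scripted list of states; states[0] is the current one.
type fakeConn struct{ states []State }

func (f *fakeConn) GetState() State { return f.states[0] }
func (f *fakeConn) Connect()        {}
func (f *fakeConn) WaitForStateChange(ctx context.Context, _ State) bool {
	if len(f.states) > 1 {
		f.states = f.states[1:] // advance to the next scripted state
		return true
	}
	<-ctx.Done() // no transitions left: block until the context ends
	return false
}

// tryWait runs waitForReady against a scripted fake with a 20ms budget.
func tryWait(script ...State) string {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	err := waitForReady(ctx, &fakeConn{states: script})
	switch {
	case err == nil:
		return "ready"
	case errors.Is(err, errShutdown):
		return "shutdown"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	}
	return "other"
}

func main() {
	fmt.Println(tryWait(Idle, Connecting, Ready)) // scripted path to Ready
	fmt.Println(tryWait(Shutdown))                // terminal state
	fmt.Println(tryWait(TransientFailure))        // never recovers within the budget
}
```

With a real `*grpc.ClientConn`, the same loop shape would use `connectivity.Ready`, `connectivity.Shutdown`, and `connectivity.Idle` in place of the local constants.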

Copilot · 2026-02-09T20:10:57Z

pkg/client/listener.go

+func (w *WorkflowRunsListener) waitForReadyLocked(ctx context.Context) error {
+	for {
+		state := w.conn.GetState()
+		if state == connectivity.Ready {
+			return nil
+		}
+		if state == connectivity.Shutdown {
+			return errors.New("connection shutdown")
+		}


waitForReadyLocked dereferences w.conn without checking for nil. WorkflowRunsListener can be instantiated without conn (e.g., in unit tests or by external callers), which will cause a panic when this method is hit. Consider validating conn is non-nil when constructing the listener (or guarding here and returning a meaningful error / falling back to time-based wait).
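
A minimal sketch of the two options mentioned above, validating at construction or guarding at the call site with a time-based fallback. The types here (`clientConn`, `listener`, `newListener`) are hypothetical stand-ins, not the SDK's actual `WorkflowRunsListener` or its constructor, and `waitReady` simply simulates the connection-state wait.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// clientConn stands in for *grpc.ClientConn so the sketch has no external dependencies.
type clientConn struct{}

// waitReady simulates a connection that is ready as long as ctx is still live.
func (c *clientConn) waitReady(ctx context.Context) error { return ctx.Err() }

type listener struct {
	conn *clientConn // nil when the listener was built without a connection
}

var errNilConn = errors.New("listener: conn must not be nil")

// Option A: reject a nil conn at construction time.
func newListener(conn *clientConn) (*listener, error) {
	if conn == nil {
		return nil, errNilConn
	}
	return &listener{conn: conn}, nil
}

// Option B: guard at the call site and fall back to a time-based wait.
func (l *listener) waitForReady(ctx context.Context, fallback time.Duration) error {
	if l.conn == nil {
		timer := time.NewTimer(fallback)
		defer timer.Stop()
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-timer.C:
			return nil
		}
	}
	return l.conn.waitReady(ctx)
}

// demoNilConn exercises both options with a nil conn.
func demoNilConn() (constructErr, waitErr error) {
	_, constructErr = newListener(nil)
	bare := &listener{} // e.g. a unit test that skips the constructor
	waitErr = bare.waitForReady(context.Background(), 5*time.Millisecond)
	return constructErr, waitErr
}

func main() {
	constructErr, waitErr := demoNilConn()
	fmt.Println("constructor:", constructErr)
	fmt.Println("guarded wait:", waitErr)
}
```

Option A surfaces the mistake early, while Option B keeps listeners built as bare struct literals working at the cost of silently reverting to the old sleep-style delay.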

Copilot · 2026-02-09T20:10:57Z

pkg/client/listener.go

+			return errors.New("connection shutdown")
+		}
+		// Trigger a connection attempt if the channel is idle
+		w.conn.Connect()


The code calls w.conn.Connect() unconditionally, but the comment says it's only to trigger a connection attempt when the channel is idle. Consider checking state == connectivity.Idle before calling Connect (or update the comment) to avoid unnecessary calls and keep behavior/documentation consistent.

Suggested change

-		w.conn.Connect()
+		if state == connectivity.Idle {
+			w.conn.Connect()
+		}

Copilot · 2026-02-09T20:10:57Z

pkg/client/listener.go

+			if err != nil && ctx.Err() != nil {
+				return ctx.Err()


doRetrySubscribe ignores all waitForReadyLocked errors except when the parent ctx is canceled. If waitForReadyLocked returns a non-timeout error (e.g., "connection shutdown"), the loop will continue immediately and can spin/log aggressively. Consider handling non-deadline errors explicitly (e.g., return on Shutdown, or treat as a retryable error with backoff).

Suggested change

-			if err != nil && ctx.Err() != nil {
-				return ctx.Err()
+			if err != nil {
+				// If the parent context has been canceled, stop retrying.
+				if ctx.Err() != nil {
+					return ctx.Err()
+				}
+				// For non-timeout errors (e.g., connection shutdown), apply a backoff
+				// to avoid a tight retry loop and excessive logging.
+				if !errors.Is(err, context.DeadlineExceeded) {
+					w.l.Warn().Err(err).Msg("waitForReady failed; backing off before retrying subscription")
+					select {
+					case <-ctx.Done():
+						return ctx.Err()
+					case <-time.After(DefaultActionListenerRetryInterval):
+					}
+				}

Copilot · 2026-02-09T20:10:58Z

pkg/client/listener.go

+		waitCtx, cancel := context.WithTimeout(context.Background(), DefaultActionListenerRetryInterval)
+		_ = l.waitForReadyLocked(waitCtx)
+		cancel()


retrySend ignores the error from waitForReadyLocked. If the connection is in Shutdown, waitForReadyLocked returns immediately and this retry loop can become a tight loop (no backoff) while repeatedly attempting resubscribe/send. Consider handling the error (return it on Shutdown, or fall back to a timed wait) to preserve a bounded retry cadence.
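
One way to keep a bounded retry cadence here, sketched under assumptions: the signature of `retrySend` below is hypothetical (the PR's version appears to use `context.Background()` rather than taking a `ctx`), and `send` and `waitReady` are injected functions standing in for the real send and `waitForReadyLocked`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errShutdown = errors.New("connection shutdown")

// retrySend retries send up to maxAttempts times, classifying the wait error
// instead of discarding it. Hypothetical signature for illustration only.
func retrySend(ctx context.Context, maxAttempts int, interval time.Duration,
	send func() error, waitReady func(context.Context) error) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = send(); lastErr == nil {
			return nil
		}
		waitCtx, cancel := context.WithTimeout(ctx, interval)
		err := waitReady(waitCtx)
		cancel()
		switch {
		case err == nil:
			// Connection is ready again: retry right away.
		case errors.Is(err, errShutdown):
			return err // terminal: retrying cannot succeed
		case ctx.Err() != nil:
			return ctx.Err() // caller gave up
		case errors.Is(err, context.DeadlineExceeded):
			// The wait itself already consumed the interval.
		default:
			// Unknown early failure: enforce the delay to avoid a tight loop.
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(interval):
			}
		}
	}
	return fmt.Errorf("send failed after %d attempts: %w", maxAttempts, lastErr)
}

// demoShutdown: every send fails and the connection reports Shutdown.
func demoShutdown() (sends int, err error) {
	send := func() error { sends++; return errors.New("send failed") }
	wait := func(context.Context) error { return errShutdown }
	err = retrySend(context.Background(), 5, 10*time.Millisecond, send, wait)
	return sends, err
}

// demoRecovers: the first two sends fail, the third succeeds.
func demoRecovers() (sends int, err error) {
	send := func() error {
		sends++
		if sends < 3 {
			return errors.New("send failed")
		}
		return nil
	}
	wait := func(context.Context) error { return nil }
	err = retrySend(context.Background(), 5, 10*time.Millisecond, send, wait)
	return sends, err
}

func main() {
	fmt.Println(demoShutdown())
	fmt.Println(demoRecovers())
}
```

The same classification would apply to the other call sites that currently assign the wait result to `_`.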

Copilot · 2026-02-09T20:10:58Z

pkg/client/listener.go

 			if status.Code(err) == codes.Unavailable {
-				l.l.Warn().Err(err).Msg("dispatcher is unavailable, retrying subscribe after 1 second")
-				time.Sleep(1 * time.Second)
+				l.l.Warn().Err(err).Msg("dispatcher is unavailable, waiting for connection to be ready")
+				waitCtx, cancel := context.WithTimeout(ctx, DefaultActionListenerRetryInterval)
+				_ = l.waitForReadyLocked(waitCtx)
+				cancel()
 			}


In the Unavailable path, this now waits up to DefaultActionListenerRetryInterval (currently 5s) instead of the previous fixed 1s, and it also discards any waitForReadyLocked error. If the intent is to keep quick retry behavior on Unavailable, consider using a shorter timeout here (or a dedicated constant) and handle Shutdown/non-timeout errors to avoid spinning.

Copilot · 2026-02-09T20:10:58Z

pkg/client/listener.go


-				time.Sleep(DefaultActionListenerRetryInterval)
+				waitCtx, cancel := context.WithTimeout(ctx, DefaultActionListenerRetryInterval)
+				_ = l.waitForReadyLocked(waitCtx)
+				cancel()


After a failed retrySubscribe, the code ignores waitForReadyLocked errors. Similar to other call sites, this can lead to very fast looping if the underlying connection is Shutdown (no backoff). Consider checking the returned error and aborting (or enforcing a timed delay) on non-timeout errors.

depend on the gRPC layer for connection state changes

080debe

mnafees requested a review from Copilot February 9, 2026 20:06

mnafees self-assigned this Feb 9, 2026

mnafees added the sdk-go Related to the Go SDK label Feb 9, 2026

Copilot started reviewing on behalf of mnafees February 9, 2026 20:06 View session

Copilot AI reviewed Feb 9, 2026

mnafees wants to merge 1 commit into main from nafees/go-sdk-faster-reconnects
