Depend on the gRPC layer for connection state changes #2977

Open
mnafees wants to merge 1 commit into main from nafees/go-sdk-faster-reconnects

mnafees commented Feb 9, 2026

Description

Instead of sleeping between reconnection attempts in the Go SDK listener, we want to depend on the native gRPC connection state to decide when to retry.

Type of change

  • Chore (changes which are not directly related to any business logic)

mnafees requested a review from Copilot Feb 9, 2026 20:06
mnafees self-assigned this Feb 9, 2026
mnafees added the sdk-go (Related to the Go SDK) label Feb 9, 2026

Copilot AI left a comment

Pull request overview

This PR updates the Go SDK’s workflow run listener reconnection behavior to rely on gRPC’s native connectivity state transitions instead of using fixed sleep-based delays during retry/reconnect loops.

Changes:

  • Add a grpc.ClientConn reference to the workflow runs listener so it can observe connection state.
  • Introduce waitForReadyLocked using grpc/connectivity (GetState, Connect, WaitForStateChange) as the retry delay mechanism.
  • Replace several time.Sleep calls in resubscribe/send/listen retry paths with connection-state waiting.


Comment on lines +176 to +184
    func (w *WorkflowRunsListener) waitForReadyLocked(ctx context.Context) error {
        for {
            state := w.conn.GetState()
            if state == connectivity.Ready {
                return nil
            }
            if state == connectivity.Shutdown {
                return errors.New("connection shutdown")
            }

Copilot AI Feb 9, 2026

waitForReadyLocked dereferences w.conn without checking for nil. WorkflowRunsListener can be instantiated without conn (e.g., in unit tests or by external callers), which will cause a panic when this method is hit. Consider validating conn is non-nil when constructing the listener (or guarding here and returning a meaningful error / falling back to time-based wait).

        return errors.New("connection shutdown")
    }
    // Trigger a connection attempt if the channel is idle
    w.conn.Connect()

Copilot AI Feb 9, 2026

The code calls w.conn.Connect() unconditionally, but the comment says it's only to trigger a connection attempt when the channel is idle. Consider checking state == connectivity.Idle before calling Connect (or update the comment) to avoid unnecessary calls and keep behavior/documentation consistent.

Suggested change
-   w.conn.Connect()
+   if state == connectivity.Idle {
+       w.conn.Connect()
+   }

Comment on lines +214 to +215
    if err != nil && ctx.Err() != nil {
        return ctx.Err()

Copilot AI Feb 9, 2026

doRetrySubscribe ignores all waitForReadyLocked errors except when the parent ctx is canceled. If waitForReadyLocked returns a non-timeout error (e.g., "connection shutdown"), the loop will continue immediately and can spin/log aggressively. Consider handling non-deadline errors explicitly (e.g., return on Shutdown, or treat as a retryable error with backoff).

Suggested change
-   if err != nil && ctx.Err() != nil {
-       return ctx.Err()
+   if err != nil {
+       // If the parent context has been canceled, stop retrying.
+       if ctx.Err() != nil {
+           return ctx.Err()
+       }
+       // For non-timeout errors (e.g., connection shutdown), apply a backoff
+       // to avoid a tight retry loop and excessive logging.
+       if !errors.Is(err, context.DeadlineExceeded) {
+           w.l.Warn().Err(err).Msg("waitForReady failed; backing off before retrying subscription")
+           select {
+           case <-ctx.Done():
+               return ctx.Err()
+           case <-time.After(DefaultActionListenerRetryInterval):
+           }
+       }

Comment on lines +339 to +341
    waitCtx, cancel := context.WithTimeout(context.Background(), DefaultActionListenerRetryInterval)
    _ = l.waitForReadyLocked(waitCtx)
    cancel()

Copilot AI Feb 9, 2026

retrySend ignores the error from waitForReadyLocked. If the connection is in Shutdown, waitForReadyLocked returns immediately and this retry loop can become a tight loop (no backoff) while repeatedly attempting resubscribe/send. Consider handling the error (return it on Shutdown, or fall back to a timed wait) to preserve a bounded retry cadence.

Comment on lines 364 to 369
    if status.Code(err) == codes.Unavailable {
-       l.l.Warn().Err(err).Msg("dispatcher is unavailable, retrying subscribe after 1 second")
-       time.Sleep(1 * time.Second)
+       l.l.Warn().Err(err).Msg("dispatcher is unavailable, waiting for connection to be ready")
+       waitCtx, cancel := context.WithTimeout(ctx, DefaultActionListenerRetryInterval)
+       _ = l.waitForReadyLocked(waitCtx)
+       cancel()
    }

Copilot AI Feb 9, 2026

In the Unavailable path, this now waits up to DefaultActionListenerRetryInterval (currently 5s) instead of the previous fixed 1s, and it also discards any waitForReadyLocked error. If the intent is to keep quick retry behavior on Unavailable, consider using a shorter timeout here (or a dedicated constant) and handle Shutdown/non-timeout errors to avoid spinning.

Comment on lines 379 to +382

-   time.Sleep(DefaultActionListenerRetryInterval)
+   waitCtx, cancel := context.WithTimeout(ctx, DefaultActionListenerRetryInterval)
+   _ = l.waitForReadyLocked(waitCtx)
+   cancel()

Copilot AI Feb 9, 2026

After a failed retrySubscribe, the code ignores waitForReadyLocked errors. Similar to other call sites, this can lead to very fast looping if the underlying connection is Shutdown (no backoff). Consider checking the returned error and aborting (or enforcing a timed delay) on non-timeout errors.
