-
Notifications
You must be signed in to change notification settings - Fork 305
Depend on the gRPC layer for connection state changes #2977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -14,6 +14,7 @@ import ( | |||||||||||||||||||||||||||||||||||||
| "golang.org/x/sync/singleflight" | ||||||||||||||||||||||||||||||||||||||
| "google.golang.org/grpc" | ||||||||||||||||||||||||||||||||||||||
| "google.golang.org/grpc/codes" | ||||||||||||||||||||||||||||||||||||||
| "google.golang.org/grpc/connectivity" | ||||||||||||||||||||||||||||||||||||||
| "google.golang.org/grpc/status" | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| dispatchercontracts "github.com/hatchet-dev/hatchet/internal/services/dispatcher/contracts" | ||||||||||||||||||||||||||||||||||||||
|
|
@@ -96,6 +97,8 @@ type WorkflowRunEventHandler func(event WorkflowRunEvent) error | |||||||||||||||||||||||||||||||||||||
| type WorkflowRunsListener struct { | ||||||||||||||||||||||||||||||||||||||
| constructor func(context.Context) (dispatchercontracts.Dispatcher_SubscribeToWorkflowRunsClient, error) | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| conn *grpc.ClientConn | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| client dispatchercontracts.Dispatcher_SubscribeToWorkflowRunsClient | ||||||||||||||||||||||||||||||||||||||
| clientMu sync.Mutex | ||||||||||||||||||||||||||||||||||||||
| generation uint64 | ||||||||||||||||||||||||||||||||||||||
|
|
@@ -124,6 +127,7 @@ func (r *subscribeClientImpl) getWorkflowRunsListener( | |||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| w := &WorkflowRunsListener{ | ||||||||||||||||||||||||||||||||||||||
| constructor: constructor, | ||||||||||||||||||||||||||||||||||||||
| conn: r.conn, | ||||||||||||||||||||||||||||||||||||||
| l: r.l, | ||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
|
|
@@ -165,6 +169,28 @@ func (w *WorkflowRunsListener) getClientSnapshot() (dispatchercontracts.Dispatch | |||||||||||||||||||||||||||||||||||||
| return w.client, w.generation | ||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| // waitForReadyLocked blocks until the gRPC connection is ready or the context is canceled. | ||||||||||||||||||||||||||||||||||||||
| // This uses gRPC's native connectivity state API instead of sleep-based polling, | ||||||||||||||||||||||||||||||||||||||
| // which provides faster reconnection when the server becomes available. | ||||||||||||||||||||||||||||||||||||||
| // This method only accesses the immutable conn field, so it's safe to call while holding clientMu. | ||||||||||||||||||||||||||||||||||||||
| func (w *WorkflowRunsListener) waitForReadyLocked(ctx context.Context) error { | ||||||||||||||||||||||||||||||||||||||
| for { | ||||||||||||||||||||||||||||||||||||||
| state := w.conn.GetState() | ||||||||||||||||||||||||||||||||||||||
| if state == connectivity.Ready { | ||||||||||||||||||||||||||||||||||||||
| return nil | ||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||
| if state == connectivity.Shutdown { | ||||||||||||||||||||||||||||||||||||||
| return errors.New("connection shutdown") | ||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||
| // Trigger a connection attempt if the channel is idle | ||||||||||||||||||||||||||||||||||||||
| w.conn.Connect() | ||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||
| w.conn.Connect() | |
| if state == connectivity.Idle { | |
| w.conn.Connect() | |
| } |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doRetrySubscribe ignores all waitForReadyLocked errors except when the parent ctx is canceled. If waitForReadyLocked returns a non-timeout error (e.g., "connection shutdown"), the loop will continue immediately and can spin/log aggressively. Consider handling non-deadline errors explicitly (e.g., return on Shutdown, or treat as a retryable error with backoff).
| if err != nil && ctx.Err() != nil { | |
| return ctx.Err() | |
| if err != nil { | |
| // If the parent context has been canceled, stop retrying. | |
| if ctx.Err() != nil { | |
| return ctx.Err() | |
| } | |
| // For non-timeout errors (e.g., connection shutdown), apply a backoff | |
| // to avoid a tight retry loop and excessive logging. | |
| if !errors.Is(err, context.DeadlineExceeded) { | |
| w.l.Warn().Err(err).Msg("waitForReady failed; backing off before retrying subscription") | |
| select { | |
| case <-ctx.Done(): | |
| return ctx.Err() | |
| case <-time.After(DefaultActionListenerRetryInterval): | |
| } | |
| } |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retrySend ignores the error from waitForReadyLocked. If the connection is in Shutdown, waitForReadyLocked returns immediately and this retry loop can become a tight loop (no backoff) while repeatedly attempting resubscribe/send. Consider handling the error (return it on Shutdown, or fall back to a timed wait) to preserve a bounded retry cadence.
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Unavailable path, this now waits up to DefaultActionListenerRetryInterval (currently 5s) instead of the previous fixed 1s, and it also discards any waitForReadyLocked error. If the intent is to keep quick retry behavior on Unavailable, consider using a shorter timeout here (or a dedicated constant) and handle Shutdown/non-timeout errors to avoid spinning.
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a failed retrySubscribe, the code ignores waitForReadyLocked errors. Similar to other call sites, this can lead to very fast looping if the underlying connection is Shutdown (no backoff). Consider checking the returned error and aborting (or enforcing a timed delay) on non-timeout errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
waitForReadyLocked dereferences w.conn without checking for nil. WorkflowRunsListener can be instantiated without conn (e.g., in unit tests or by external callers), which will cause a panic when this method is hit. Consider validating conn is non-nil when constructing the listener (or guarding here and returning a meaningful error / falling back to time-based wait).