
Conversation

@clyang82
Contributor

@clyang82 clyang82 commented Dec 24, 2025

Summary

When the message broker is the gRPC broker, the Maestro agent cannot reconnect to the server after the server restarts.

Here is an analysis based on the logs:

  1. 02:50:14.213 - Reconnection succeeds, sends restartReceiverSignal (baseclient.go:92)
  2. 02:50:14.213695 - Subscription goroutine receives signal, logs "restart the cloudevents receiver" (baseclient.go:238)
  3. 02:50:14.213 - Subscription goroutine sets subscribed = false (line 242)
  4. Loop iteration - Goes to line 183: if !subscribed
  5. 02:50:14.220 - Calls transport.Subscribe(ctx) at baseclient.go:185 - THIS BLOCKS because the gRPC connection is failing
  6. 02:50:14.221 - Meanwhile, monitorConnectionState detects TRANSIENT_FAILURE, sends error to errorChan
  7. 02:50:14.221034 - Error handler receives error, logs "disconnected"
  8. 02:50:14.221 - Error handler tries to send stopReceiverSignal at baseclient.go:108
  9. DEADLOCK:
    - Error handler is blocked in sendReceiverSignal waiting for subscription goroutine to read
    - Subscription goroutine is blocked in transport.Subscribe() at line 185, can't read from receiverChan

Logs:

```
I1224 02:50:06.326654       1 baseclient.go:102] ERROR CHANNEL WAS TRIGGERED!
E1224 02:50:06.326700       1 baseclient.go:108] "the cloudevents client is disconnected" err="grpc connection is disconnected (state=IDLE)"
E1224 02:50:06.326707       1 transport.go:209] "subscribe stream failed" err="rpc error: code = Unavailable desc = error reading from server: EOF" subID="3830f06a-dc18-4dbc-b8e4-14a575efd8e9"
I1224 02:50:06.326712       1 transport.go:177] "close grpc transport"
I1224 02:50:06.326733       1 transport.go:287] "STOP monitoring connection state"
E1224 02:50:06.326740       1 baseclient.go:115] "failed to close the cloudevents protocol" err="rpc error: code = Canceled desc = grpc: the client connection is closing"
I1224 02:50:06.326725       1 baseclient.go:248] "stop the cloudevents receiver"
I1224 02:50:06.326770       1 transport.go:165] "stop receiving events" subID="3830f06a-dc18-4dbc-b8e4-14a575efd8e9" subOption="cluster_name:\"80ba4ce3-59b9-4a10-8348-1458f61c667c\"  data_type:\"io.open-cluster-management.works.v1alpha1.manifestbundles\""
I1224 02:50:14.213548       1 baseclient.go:78] "reconnecting the cloudevents client"
I1224 02:50:14.213664       1 transport.go:71] "grpc is connected" grpcURL="maestro-grpc-broker.maestro.svc:8091"
I1224 02:50:14.213674       1 baseclient.go:90] "the cloudevents client is reconnected"
I1224 02:50:14.213695       1 baseclient.go:242] "restart the cloudevents receiver"
I1224 02:50:14.213719       1 transport.go:253] "START monitoring connection state"
W1224 02:50:14.220963       1 logging.go:55] [core] [Channel #4 SubChannel #5] grpc: addrConn.createTransport failed to connect to {Addr: "10.96.234.43:8091", ServerName: "maestro-grpc-broker.maestro.svc:8091", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.96.234.43:8091: connect: connection refused"
I1224 02:50:14.221018       1 baseclient.go:102] ERROR CHANNEL WAS TRIGGERED!
E1224 02:50:14.221034       1 baseclient.go:108] "the cloudevents client is disconnected" err="grpc connection is disconnected (state=TRANSIENT_FAILURE)"
E1224 02:50:14.221042       1 baseclient.go:195] "Unhandled Error" err="failed to resubscribe, rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.96.234.43:8091: connect: connection refused\""
I1224 02:50:14.221075       1 transport.go:287] "STOP monitoring connection state"
I1224 02:50:16.248559       1 baseclient.go:151] "Sending event" eventType="io.open-cluster-management.works.v1alpha1.manifestbundles.status.update_request" extensions={"clustername":"80ba4ce3-59b9-4a10-8348-1458f61c667c","logtracing":"{}","metadata":"{\"name\":\"work-5qvdd\",\"namespace\":\"80ba4ce3-59b9-4a10-8348-1458f61c667c\",\"uid\":\"863d756d-bff1-51c0-94c3-56c8eea2fe3b\",\"generation\":1,\"creationTimestamp\":\"2025-12-24T02:50:05Z\",\"labels\":{\"cloudevents.open-cluster-management.io/originalsource\":\"maestro\"},\"annotations\":{\"cloudevents.open-cluster-management.io/datatype\":\"io.open-cluster-management.works.v1alpha1.manifestbundles\"},\"finalizers\":[\"cluster.open-cluster-management.io/manifest-work-cleanup\"]}","originalsource":"maestro","resourceid":"863d756d-bff1-51c0-94c3-56c8eea2fe3b","resourceversion":1,"sequenceid":"2003659460563701760","statushash":"ff7c517a26e6e147f2b968a4dc575ae6a6d4f7bf4ebfc8532193dc7491f13a05"}
I1224 02:50:16.248643       1 patcher.go:207] "Object is patched" objectType="*v1.ManifestWork" objectName="work-5qvdd" patch="{\"metadata\":{\"resourceVersion\":\"5\",\"uid\":\"863d756d-bff1-51c0-94c3-56c8eea2fe3b\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2025-12-24T02:50:05Z\",\"message\":\"Apply manifest work complete\",\"observedGeneration\":1,\"reason\":\"AppliedManifestWorkComplete\",\"status\":\"True\",\"type\":\"Applied\"},{\"lastTransitionTime\":\"2025-12-24T02:50:16Z\",\"message\":\"1 of 1 resources are not available\",\"observedGeneration\":1,\"reason\":\"ResourcesNotAvailable\",\"status\":\"False\",\"type\":\"Available\"}],\"resourceStatus\":{\"manifests\":[{\"conditions\":[{\"lastTransitionTime\":\"2025-12-24T02:50:05Z\",\"message\":\"Apply manifest complete\",\"reason\":\"AppliedManifestComplete\",\"status\":\"True\",\"type\":\"Applied\"},{\"lastTransitionTime\":\"2025-12-24T02:50:16Z\",\"message\":\"Resource is not available\",\"reason\":\"ResourceNotAvailable\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2025-12-24T02:50:05Z\",\"message\":\"\",\"reason\":\"StatusFeedbackSynced\",\"status\":\"True\",\"type\":\"StatusFeedbackSynced\"}],\"resourceMeta\":{\"group\":\"apps\",\"kind\":\"Deployment\",\"name\":\"nginx-jpbqp\",\"namespace\":\"default\",\"ordinal\":0,\"resource\":\"deployments\",\"version\":\"v1\"},\"statusFeed
```

Related issue(s)

Fixes #

Summary by CodeRabbit

  • Bug Fixes
    • Fixed a potential deadlock condition that could occur in specific scenarios, improving application stability and reliability when handling concurrent operations.


Signed-off-by: clyang82 <[email protected]>
@openshift-ci openshift-ci bot requested review from deads2k and qiujian16 December 24, 2025 07:00
@coderabbitai

coderabbitai bot commented Dec 24, 2025

Walkthrough

Modified the sendReceiverSignal function to use non-blocking channel send with a select statement and default case. Adjusted import ordering for the logging package. No exported function signatures changed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Channel communication refactor**<br>`pkg/cloudevents/generic/clients/baseclient.go` | Modified `sendReceiverSignal` to perform a non-blocking send to `receiverChan` using `select` with a `default` case, dropping signals if the receiver is unavailable. Updated import statement ordering. |
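
The pattern the fix relies on can be sketched as follows (illustrative names, not the actual `baseclient.go` code): a `select` with a `default` case makes the channel send non-blocking, returning immediately and dropping the signal when no reader is waiting.

```go
package main

import "fmt"

// sendSignal sketches the fixed sendReceiverSignal: the select/default makes
// the channel send non-blocking, so the caller can never deadlock on it.
func sendSignal(ch chan int, signal int) bool {
	if ch == nil {
		return false // no receiver channel at all
	}
	select {
	case ch <- signal:
		return true // a reader was waiting (or buffer space was free): delivered
	default:
		return false // receiver busy, e.g. stuck in Subscribe(): drop the signal
	}
}

func main() {
	ch := make(chan int)           // unbuffered, no reader
	fmt.Println(sendSignal(ch, 1)) // prints "false": signal dropped, no deadlock
}
```

The trade-off is that a dropped signal is silently lost, which is why the review comment below suggests logging in the `default` branch.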

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title is vague and lacks specificity about the actual change; ':bug: Resync issue' does not clearly describe what the fix does. | Make the title more specific about the solution, such as ':bug: Fix deadlock in gRPC receiver by using non-blocking signal send' to clearly convey the actual fix. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description provides a comprehensive analysis of the problem, including root cause analysis with logs, timing sequence, and a clear explanation of the deadlock mechanism. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@openshift-ci

openshift-ci bot commented Dec 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clyang82
Once this PR has been reviewed and has the lgtm label, please assign qiujian16 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/cloudevents/generic/clients/baseclient.go (1)

260-267: Excellent fix for the deadlock issue.

The non-blocking send correctly prevents the deadlock described in the PR objectives. When the subscription goroutine is blocked in transport.Subscribe(), it cannot read from receiverChan, so a blocking send would deadlock forever.

However, for consistency with sendReconnectedSignal (lines 289-293) and to aid debugging, consider logging when a signal is dropped.

🔎 Suggested enhancement: log dropped signals

```diff
 func (c *baseClient) sendReceiverSignal(signal int) {
 	c.RLock()
 	defer c.RUnlock()

 	if c.receiverChan != nil {
 		select {
 		case c.receiverChan <- signal:
 			// Signal sent successfully
 		default:
 			// Receiver is busy/blocked, can't send now
 			// This prevents deadlock when receiver is stuck in Subscribe()
+			klog.V(2).InfoS("receiver signal not sent, receiver is busy", "signal", signal)
 		}
 	}
 }
```

Note: You'll need to add ctx context.Context as a parameter to sendReceiverSignal to use klog.FromContext(ctx) (similar to sendReconnectedSignal), or use klog directly as shown above.


📥 Commits

Reviewing files that changed from the base of the PR and between de3e6c6 and e31376e.

📒 Files selected for processing (1)
  • pkg/cloudevents/generic/clients/baseclient.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-16T02:22:20.929Z
Learnt from: skeeey
Repo: open-cluster-management-io/sdk-go PR: 144
File: pkg/cloudevents/generic/options/grpc/protocol/protocol.go:200-213
Timestamp: 2025-09-16T02:22:20.929Z
Learning: In the GRPC CloudEvents protocol implementation, when startEventsReceiver encounters a stream error, it sends the error to reconnectErrorChan. The consumer of this channel handles the error by calling Close() on the protocol, which triggers close(p.closeChan), causing OpenInbound to unblock and call cancel() to properly terminate both the events receiver and heartbeat watcher goroutines.

Applied to files:

  • pkg/cloudevents/generic/clients/baseclient.go
📚 Learning: 2025-09-01T03:34:05.141Z
Learnt from: morvencao
Repo: open-cluster-management-io/sdk-go PR: 138
File: pkg/cloudevents/server/grpc/metrics/metrics.go:231-254
Timestamp: 2025-09-01T03:34:05.141Z
Learning: In open-cluster-management.io/sdk-go gRPC CloudEvents metrics, processing duration metrics should only be recorded for unary RPCs, not stream RPCs. Stream RPCs can be long-lived connections that persist as long as the gRPC server runs, making duration metrics confusing and less useful for operators debugging issues.

Applied to files:

  • pkg/cloudevents/generic/clients/baseclient.go

Member

@morvencao morvencao left a comment


/lgtm

@clyang82
Contributor Author

/assign @qiujian16
