fix(android-driver): bounded gRPC retry with channel rebuild on broken pipe #3316
Open
proksh wants to merge 2 commits into
Why
After #3311 reverted the channel-level retries from #3290, transient socket drops between the worker and the Android emulator (`UNAVAILABLE` / `Broken pipe` in `viewHierarchy` / `deviceInfo` / etc.) surface as `INFRA_ERROR` and fail the job. #3290 hid these but caused 15-min hangs because gRPC's transparent retries shared the same dead channel and the 120s per-call deadline let a wedged RPC drag on.

This PR adds a bounded retry layer above the gRPC client and rebuilds the channel when the socket is dead.
What
- `GrpcRetry.withRetry`: a single retry helper used by `AndroidDriver.runDeviceCall`. Retries `UNAVAILABLE` only (3 attempts, 200 ms backoff). Hard ceiling of 30s total budget, so there is no possibility of a 15-min hang. `DEADLINE_EXCEEDED`, `INVALID_ARGUMENT`, `INTERNAL` are never retried.
- On an `IOException` cause (or `"broken pipe"` / `"io exception"` in the message), `AndroidDriver` swaps in a fresh `ManagedChannel` + stubs before the next retry. This is what `enableRetry()` couldn't do: it kept hammering the dead transport.
- `contentDescriptor` now routes through `runDeviceCall`. View-hierarchy was the call surfacing in alerts but had its own weaker retry path (1 retry, 1s sleep, no channel rebuild). The duplicate `callViewHierarchy` is removed; it now gets the same treatment as every other RPC.

Tests
10 new unit tests in `GrpcRetryTest`. Cover: happy path, retry on `UNAVAILABLE`, exhaustion, non-retryable codes (`DEADLINE_EXCEEDED` / `INVALID_ARGUMENT` / `INTERNAL`), broken-pipe (cause + message) → `onBrokenPipe` fires before retry, plain `UNAVAILABLE` does not rebuild, total budget cuts off retries even when attempts remain.

`./gradlew :maestro-client:test`: 121 passing, 0 failures.

Scope notes
The per-call deadline (`withDeadlineAfter(120s)` on `blockingStubWithTimeout`) is intentionally not changed in this PR. That's a broader behavioral change touching every RPC.
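
For reviewers who want a mental model of the retry helper described under What, here is a minimal sketch written in Java for illustration. It is not the code in this PR. `Code` and `RpcException` are hypothetical stand-ins for gRPC's `Status.Code` and `StatusRuntimeException` so the example compiles without the gRPC dependency, and the class name `GrpcRetrySketch`, the constants, and the `isBrokenPipe` helper are assumptions based only on the numbers and behavior listed above. The sketch checks the total budget between attempts only; it does not interrupt a call that is already in flight.

```java
import java.io.IOException;
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical stand-in for io.grpc.Status.Code, so the sketch compiles without gRPC.
enum Code { UNAVAILABLE, DEADLINE_EXCEEDED, INVALID_ARGUMENT, INTERNAL }

// Hypothetical stand-in for io.grpc.StatusRuntimeException.
class RpcException extends RuntimeException {
    final Code code;

    RpcException(Code code, String message, Throwable cause) {
        super(message, cause);
        this.code = code;
    }
}

final class GrpcRetrySketch {
    // Limits taken from the PR description; the names are assumptions.
    static final int MAX_ATTEMPTS = 3;
    static final long BACKOFF_MS = 200;
    static final long TOTAL_BUDGET_MS = 30_000;

    // True when the failure has an IOException in its cause chain,
    // or its message mentions "broken pipe" / "io exception".
    static boolean isBrokenPipe(RpcException e) {
        for (Throwable t = e.getCause(); t != null; t = t.getCause()) {
            if (t instanceof IOException) return true;
        }
        String msg = e.getMessage() == null ? "" : e.getMessage().toLowerCase(Locale.ROOT);
        return msg.contains("broken pipe") || msg.contains("io exception");
    }

    // Runs `call`, retrying UNAVAILABLE only, within the attempt and time limits.
    // `onBrokenPipe` runs before the next attempt when the socket looks dead.
    static <T> T withRetry(Supplier<T> call, Runnable onBrokenPipe) {
        long start = System.nanoTime();
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RpcException e) {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                boolean budgetLeft = elapsedMs + BACKOFF_MS < TOTAL_BUDGET_MS;
                if (e.code != Code.UNAVAILABLE || attempt >= MAX_ATTEMPTS || !budgetLeft) {
                    throw e; // non-retryable code, attempts exhausted, or budget spent
                }
                if (isBrokenPipe(e)) {
                    onBrokenPipe.run(); // e.g. rebuild the channel and stubs
                }
                try {
                    Thread.sleep(BACKOFF_MS);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

A caller would pass the RPC as the first argument and a channel-rebuild function as the second, for example `GrpcRetrySketch.withRetry(() -> fetchHierarchy(), () -> rebuildChannel())`, where both method names are hypothetical. Keeping the rebuild as a callback leaves the helper unaware of channels, which is one way to make it testable in isolation.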