
fix(android-driver): bounded gRPC retry with channel rebuild on broken pipe #3316

Open. proksh wants to merge 2 commits into main from fix/android-driver-bounded-grpc-retry-with-reconnect

proksh (Contributor) commented May 22, 2026

Why

After #3311 reverted the channel-level retries from #3290, transient socket drops between the worker and the Android emulator (UNAVAILABLE / Broken pipe in viewHierarchy / deviceInfo / etc.) surface as INFRA_ERROR and fail the job. #3290 hid these but caused 15-min hangs because gRPC's transparent retries shared the same dead channel and the 120s per-call deadline let a wedged RPC drag on.

This PR adds a bounded retry layer above the gRPC client and rebuilds the channel when the socket is dead.

What

  • GrpcRetry.withRetry — a single retry helper used by AndroidDriver.runDeviceCall. Retries UNAVAILABLE only (3 attempts, 200 ms backoff). Hard ceiling of 30s total budget — no possibility of a 15-min hang. DEADLINE_EXCEEDED, INVALID_ARGUMENT, INTERNAL are never retried.
  • Channel rebuild on broken pipe — when the failure carries an IOException cause (or "broken pipe" / "io exception" in the message), AndroidDriver swaps in a fresh ManagedChannel + stubs before the next retry. This is what enableRetry() couldn't do: it kept hammering the dead transport.
  • contentDescriptor now routes through runDeviceCall — view-hierarchy was the call surfacing in alerts but had its own weaker retry path (1 retry, 1s sleep, no channel rebuild). The duplicate callViewHierarchy is removed; it now gets the same treatment as every other RPC.
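
The retry policy described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the code in this PR: `RpcException` and `Code` are hypothetical stand-ins for grpc-java's `StatusRuntimeException` and `Status.Code`, the class name `GrpcRetrySketch` is invented, and the choices to check the budget before each retry and to fire the rebuild callback before the backoff sleep are assumptions. The sketch does not interrupt a call that is already in flight; it only decides whether another attempt is allowed.

```java
import java.io.IOException;
import java.util.Locale;
import java.util.function.Supplier;

public final class GrpcRetrySketch {
    /** Stand-in for io.grpc.Status.Code; only the codes named in the PR. */
    public enum Code { UNAVAILABLE, DEADLINE_EXCEEDED, INVALID_ARGUMENT, INTERNAL }

    /** Hypothetical stand-in for io.grpc.StatusRuntimeException. */
    public static final class RpcException extends RuntimeException {
        public final Code code;
        public RpcException(Code code, String message, Throwable cause) {
            super(message, cause);
            this.code = code;
        }
    }

    // Values taken from the PR description.
    static final int MAX_ATTEMPTS = 3;
    static final long BACKOFF_MS = 200;
    static final long TOTAL_BUDGET_MS = 30_000;

    /** True when the failure looks like a dead socket (IOException cause or telltale message). */
    public static boolean isBrokenPipe(RpcException e) {
        if (e.getCause() instanceof IOException) return true;
        String msg = e.getMessage() == null ? "" : e.getMessage().toLowerCase(Locale.ROOT);
        return msg.contains("broken pipe") || msg.contains("io exception");
    }

    public static <T> T withRetry(Supplier<T> call, Runnable onBrokenPipe) {
        return withRetry(call, onBrokenPipe, MAX_ATTEMPTS, BACKOFF_MS, TOTAL_BUDGET_MS);
    }

    /** Parameterized overload so the policy can be exercised with short timings. */
    public static <T> T withRetry(Supplier<T> call, Runnable onBrokenPipe,
                                  int maxAttempts, long backoffMs, long budgetMs) {
        long deadline = System.nanoTime() + budgetMs * 1_000_000L;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RpcException e) {
                // Only UNAVAILABLE is retried; every other code propagates immediately.
                if (e.code != Code.UNAVAILABLE) throw e;
                // Attempts exhausted.
                if (attempt >= maxAttempts) throw e;
                // Total budget would be exceeded by another backoff: stop early.
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= backoffMs) throw e;
                // Dead transport: let the caller swap in a fresh channel before retrying.
                if (isBrokenPipe(e)) onBrokenPipe.run();
                sleep(backoffMs);
            }
        }
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }
}
```

In this sketch the caller owns the channel: the `onBrokenPipe` callback is where a driver would replace its channel and stubs, so the helper itself stays free of transport details and can be unit-tested with plain lambdas.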

Tests

10 new unit tests in GrpcRetryTest cover: the happy path, retry on UNAVAILABLE, exhaustion, non-retryable codes (DEADLINE_EXCEEDED / INVALID_ARGUMENT / INTERNAL), broken pipe (detected via cause and via message) firing onBrokenPipe before the retry, plain UNAVAILABLE not triggering a rebuild, and the total budget cutting off retries even when attempts remain.

./gradlew :maestro-client:test — 121 passing, 0 failures.

Scope notes

The per-call deadline (withDeadlineAfter(120s) on blockingStubWithTimeout) is intentionally not changed in this PR — that's a broader behavioral change touching every RPC.
