fix(android): establish a device-transport boundary (fixes transport-failure misclassification)#3331
fix(android): establish a device-transport boundary (fixes transport-failure misclassification)#3331steviec wants to merge 18 commits into
Conversation
…leException in shell()
…missionInternal instead of swallowing
…achableException in setAllPermissions
| } catch (e: Exception) { | ||
| // Best-effort: if we can't read/parse the APK for any non-transport reason, skip granting. | ||
| logger.debug("Failed to read APK permissions for $appId: ${e.message}") | ||
| null |
There was a problem hiding this comment.
setAllPermissions is invoked during launchApp to grant all requested permissions.
From what I can see, the current behavior swallows the exception. I think this should fail fast instead, because a failure here means Maestro was not able to complete the requested launch setup in expected behaviour.
So, given what we've discovered till today:
- IOException needs to be mapped as DeviceUnreachableException
- Hard fail any other exceptions.
There was a problem hiding this comment.
Are you suggesting we don't use a try / catch at all here, and just let the error bubble up? I support that. The only downside is if we've been swalling permission issues that don't affect customers, then suddenly we're throwing infra errors where we didn't before. So maybe we just log it and monitor first so we can know how common this is? What do you think?
amanjeetsingh150
left a comment
There was a problem hiding this comment.
Does this PR also mean consumers will continue configuring dadb and their write timeouts independently?
I’m wondering if we should eventually centralize this in core, so all consumers use a common dadb configuration instead of each one managing it separately.
That said, I understand my suggestions are likely riskier deploys, so I don’t want to block this PR unnecessarily. I’d love to jam offline on how we want to prioritize this:
- do we take a smaller part of it now?
- or do we plan a bigger revamp, especially if we start treating permissions as a first-class Maestro operation? There are open issues on iOS as well.
So it might be worth discussing the broader direction before we go too deep here.
…imeouts) as DeviceUnreachableException Broaden the original SocketTimeoutException-only handling so every adb transport death surfaces as DeviceUnreachableException (retryable INFRA_ERROR) instead of being laundered into UnableToClearState / UnableToSetPermissions (non-retryable TEST_ERROR). The production failure that motivated this was a `java.net.SocketException: Broken pipe` on the dadb write path during clearAppState — a sibling of SocketTimeoutException under IOException, so the timeout-only catch missed it and the customer saw a spurious test failure. shell(): every IOException thrown by dadb.shell() is a transport failure (broken pipe, reset, timeout, EOF, protocol error) — translate the whole catch to DeviceUnreachableException. Command-level failures return a non-zero exitCode instead and stay a plain IOException, so the test-vs-infra boundary is preserved by origin, not by exception subtype. setAllPermissions: widen the APK-pull transport catch to SocketException as well as SocketTimeoutException; non-transport APK parse failures still fall through to best-effort skip. runDeviceCall(): a dead gRPC channel (UNAVAILABLE + IOException cause) now throws DeviceUnreachableException instead of the raw StatusRuntimeException, for parity on propagating gRPC calls. Tests cover broken pipe, bare IOException, and timeout on clearAppState and setPermissions, plus a boundary guard that a non-zero exitCode stays a plain IOException and is NOT reclassified as DeviceUnreachableException.
…ing it in setPermissions/clearState The launchApp and setPermissions command paths caught Exception broadly and rewrapped it into MaestroException.UnableToSetPermissions/UnableToClearState. Since DeviceUnreachableException is a RuntimeException, it got laundered too, and the worker classifier (MaestroTestRunner.mapToTestExecutionException) matched the wrapper as `is MaestroException` -> TestFailure -> TEST_ERROR before it could reach the `is DeviceUnreachableException` arm -> INFRA_ERROR. So the AndroidDriver transport-timeout translation alone did not move the dominant setPermissions/clearState cases out of TEST_ERROR. Rethrow DeviceUnreachableException ahead of the generic catch at the three wrap sites so it propagates bare to the worker and classifies as a retryable INFRA_ERROR. This also fixes the same latent misclassification on iOS. Adds three IntegrationTest regression tests (one per wrap site) asserting runFlow surfaces DeviceUnreachableException rather than a MaestroException.
…okepoint (TranslatingDadb)
…n at the device boundary Replace the TranslatingDadb decorator (which *was* a Dadb, so the transport type still leaked through anything holding it) with a plain internal DadbConnection that does NOT implement Dadb. It exposes only the per-connection operations the device layer uses (shell/openShell/pull/push/install/uninstall/ open/close), so no consumer can name dadb.Dadb through it. Retype AndroidDriver's constructor and the AndroidAppFiles / chromedevtools helpers to take DadbConnection. The constructor and helpers become internal because they now reference the internal DadbConnection; the public connect() entry point that re-introduces Dadb.create lands in Task 2. Behavior-preserving: DadbConnection performs the same IOException -> DeviceUnreachableException translation TranslatingDadb did (open() stays untranslated as the gRPC socket factory). The full AndroidDriverTest suite stays green.
… maestro-cli off Dadb Introduces the device-layer construction owner AndroidDriver.connect(host, port, ...) (Dadb.create -> DadbConnection -> driver, no open()) and the enumeration utility AndroidDevices.list(host) yielding flat AndroidDeviceDescriptor(id, host, port) plus AndroidDevices.isReachable(host, port). Refactors MaestroSessionManager and McpMaestroSessionManager onto these, so maestro-cli no longer names Dadb anywhere. Collapses to a single TCP path: deletes the vestigial adb-server (port 5038) path — createAdbServerDadb(), AdbServer.listDadbs(5038) in DeviceService, the dead AndroidDeviceProvider test helper, and all dadb.adbserver imports. Every dadb.Dadb reference now lives only inside maestro-client (AndroidDriver.connect, AndroidDevices, DadbConnection, DeviceService). Adds unit tests for descriptor parsing.
…rify, swallow it in teardown autoVerifyWithAppName wrapped the APK pull in runCatching, which swallowed the DeviceUnreachableException now raised by DadbConnection on a dead transport, silently skipping auto-verify instead of surfacing retryable infra. Mirror setAllPermissions: rethrow DUE, best-effort-swallow other read failures. The two teardown methods (uninstallMaestroDriverApp / uninstallMaestroServerApp) caught only IOException; DUE is a RuntimeException and now escaped cleanup. Add a DUE catch around both the package check and the uninstall (including the IOException retry) so a dead device cannot crash teardown.
- launchAppCommand: catch DeviceUnreachableException before the generic Exception catch so a wedged transport propagates as retryable infra instead of being laundered into MaestroException.UnableToLaunchApp (which the worker misclassifies as a customer TEST_ERROR). - AndroidMediaStoreTest: build the driver via the public AndroidDriver.connect factory (the constructor is now internal in another module); keep the raw Dadb for fileExists assertions. - IntegrationTest: add coverage asserting a DUE thrown from the driver's launchApp propagates as DeviceUnreachableException, not UnableToLaunchApp.
…a Dadb.list round-trip Replace the host:port descriptor/parseDescriptor model (which dropped colon-less serials and broke by-serial selection) with AndroidDevices.resolveDadb: re-list and match the verbatim dadb.toString() id, reusing that live connection so reconnect uses the same adb-server transport enumeration used. Rewrite the unit test to assert the round-trip invariant instead of codifying the dropped-serial bug.
Resolve a device id to a connection via AndroidDevices.resolveDadb (adb-server round-trip) and wrap it in DadbConnection, mirroring connect(host, port).
…ort reconnect createAndroid and pickAndroidDevice now resolve a device id through AndroidDriver.connectToDevice (adb-server round-trip), so serial-identified devices select and connect again. The explicit --port path still uses AndroidDriver.connect(host, port). Fixes the 'Unable to find device with id emulator-5554' regression.
…tToDevice kdoc Harden the first-device resolution test to verify the picked connection stays open, and reword connectToDevice doc to avoid an ambiguous [open] link.
…pp regression) getApkFile pulled a blank apk path when an app was not installed, and the resulting 'Sync failed: No such file' IOException was translated by DadbConnection into DeviceUnreachableException. That DUE bypasses Orchestra's MaestroException-based 'optional' suppression and fails the flow terminally, breaking 'launchApp optional: true' for a non-existent app (e2e flow commands_optional_tournee). Detect the not-installed case and throw a plain IllegalStateException, which is wrapped into UnableToLaunchApp and honored by optional — without weakening DeviceUnreachableException classification for real transport failures.
|
|
||
| else -> runCatching { | ||
| (Dadb.list() + AdbServer.listDadbs(adbServerPort = 5038)) | ||
| Dadb.list() |
There was a problem hiding this comment.
Just a heads-up here: every consumer currently creates/configures dadb differently. After this PR, we’ll move toward a single shared dadb, which feels like the right design direction.
The main thing to be careful about is configuration. Cloud currently uses custom timeouts / keep-alive values, while other consumers seem to use raw dadb defaults.
I think we should either:
Keep Cloud’s existing config as the default to avoid breaking existing test cases.
Or intentionally move to the shared config, but acknowledge this is a riskier deployment.
There was a problem hiding this comment.
Totally agreed. I think the right move is 1) sane defaults (current defaults internally are NO timeouts, which I think is bad), and 2) allow the contract to accept dadb configuration parameters that are passed through.
| } catch (unreachable: DeviceUnreachableException) { | ||
| // Teardown runs during cleanup/close; a dead transport (already translated by | ||
| // DadbConnection) must not crash teardown. Log and continue. Wraps both the package | ||
| // check and the uninstall, including the IOException retry below. |
There was a problem hiding this comment.
Optional: should we create an abstraction that catches this unreachable exception to avoid this duplication across driver? Something like runDeviceCall?
There was a problem hiding this comment.
This PR actually does exactly that; the DadbConnection translates all dadb-related IOExceptions into DeviceUnreachableExceptions. This catch right here is correct IMO; we are intentionally ignoring a device disconnection because we're in the teardown sequence.
| runDadbCall("shell: $command") { dadb.shell(command) } | ||
|
|
||
| fun openShell(command: String = ""): AdbShellStream = | ||
| runDadbCall("openShell: $command") { dadb.openShell(command) } |
There was a problem hiding this comment.
Ummm I might be missing something was not the requirement to have logs, metrics observability here? To get:
- dadb pull calls
- dadb shell calls
- dadb push calls
There was a problem hiding this comment.
Ah yeah, that's intentional, sorry I should have linked to this section of your proposal: https://app.notion.com/p/mobile-dev/WIP-Root-Causing-Maestro-s-Timeouts-36f4d3bab4ab80498d95caeebfdee068?source=copy_link#047132e3b9da4b2096eaf3aa7f00a662
I explain why I'm not adding that other stuff yet.
There was a problem hiding this comment.
(though if you think we need it now, I'm fine with that, too)
Problem
The raw Android transport (
dadb.Dadb) is passed by hand through every layer — CLI, session managers, drivers, app-file helpers, evenOrchestra. Nothing separates reaching the device from running a test.So when the transport dies mid-flow (broken pipe / reset / EOF), the bare
IOExceptiondadbthrows looks identical to a failing command — and gets wrapped into aMaestroException(UnableToSetPermissions/UnableToClearState/UnableToLaunchApp). A dead device is reported as a customer test failure instead of retryable infra — the source of the recurring ~15-minute-timeout errors. And with no seam owning the transport, every caller has to construct and tear downDadbitself.Approach — draw the boundary
Introduce one device-transport boundary;
dadb.Dadbnever escapes it.AndroidDriver.connect(host, port)andAndroidDevices.list()(flat descriptors, never aDadb). No caller touches the transport.connectbuildsDadb.create → DadbConnection → AndroidDriver.DadbConnectionwraps one openDadband is deliberately not aDadb(so the type can't leak); it's the single chokepoint whereIOException → DeviceUnreachableException. Onlyopen/closeare carve-outs.DeviceUnreachableException(aRuntimeException= retryable infra) propagates throughOrchestrauntouched, except teardown which swallows-and-logs. A non-zero shell exit stays a plainIOException(test-domain) — command failure ≠ transport death.Usage — before / after
Raw
Dadbis gone from callers — you name an endpoint, not a transport.Before
After
Notes
connect(port = 6520); move logcat/crash/ANR ontoDriver) is a follow-up in the copilot repo after this merges.Test plan
:maestro-client:test(AndroidDriverTest, AndroidDevicesTest),:maestro-testIntegrationTest 160 cases green incl. new launchApp-DUE test (negative-control verified):maestro-cli+:maestro-orchestracompile (main + test)rg TranslatingDadb→ none;rg "dadb.Dadb" maestro-cli→ none