fix(android): establish a device-transport boundary (fixes transport-failure misclassification) by steviec · Pull Request #3331 · mobile-dev-inc/Maestro

steviec · 2026-06-02T18:33:02Z

Problem

The raw Android transport (dadb.Dadb) is passed by hand through every layer — CLI, session managers, drivers, app-file helpers, even Orchestra. Nothing separates reaching the device from running a test.

So when the transport dies mid-flow (broken pipe / reset / EOF), the bare IOException dadb throws looks identical to a failing command — and gets wrapped into a MaestroException (UnableToSetPermissions / UnableToClearState / UnableToLaunchApp). A dead device is reported as a customer test failure instead of retryable infra — the source of the recurring ~15-minute-timeout errors. And with no seam owning the transport, every caller has to construct and tear down Dadb itself.

Approach — draw the boundary

Introduce one device-transport boundary; dadb.Dadb never escapes it.

Consumers name an endpoint, get a driver. AndroidDriver.connect(host, port) and AndroidDevices.list() (flat descriptors, never a Dadb). No caller touches the transport.
Inside the boundary, the transport is born and stays. connect builds Dadb.create → DadbConnection → AndroidDriver. DadbConnection wraps one open Dadb and is deliberately not a Dadb (so the type can't leak); it's the single chokepoint where IOException → DeviceUnreachableException. Only open/close are carve-outs.
Errors classify once, at the edge. DeviceUnreachableException (a RuntimeException = retryable infra) propagates through Orchestra untouched, except teardown which swallows-and-logs. A non-zero shell exit stays a plain IOException (test-domain) — command failure ≠ transport death.

 consumer ─► AndroidDriver.connect(host, port) ─┐   (device layer)
                                                ▼
                              Dadb.create ─► DadbConnection ─► AndroidDriver
                                              (IOException → DeviceUnreachableException)

Usage — before / after

Raw Dadb is gone from callers — you name an endpoint, not a transport.

Before

val dadb = Dadb.list().find { it.toString() == id }
    ?: Dadb.discover()
    ?: error("No device")
val driver = AndroidDriver(rawDadb = dadb, hostPort = hostPort)
val maestro = Maestro.android(driver)
Orchestra(maestro).runFlow(commands)

After

val device = AndroidDevices.list().find { it.id == id }
    ?: error("No device")
val driver = AndroidDriver.connect(host = device.host, port = device.port, hostPort = hostPort)
val maestro = Maestro.android(driver)
Orchestra(maestro).runFlow(commands)

A caller with a known endpoint skips enumeration — the worker uses AndroidDriver.connect(port = 6520).

Notes

Collapsed device discovery to a single TCP path; removed the dead adb-server (port 5038) fallback (nothing provisions a 5038 server).
Worker migration (construct via connect(port = 6520); move logcat/crash/ANR onto Driver) is a follow-up in the copilot repo after this merges.

Test plan

:maestro-client:test (AndroidDriverTest, AndroidDevicesTest), :maestro-test IntegrationTest 160 cases green incl. new launchApp-DUE test (negative-control verified)
:maestro-cli + :maestro-orchestra compile (main + test)
rg TranslatingDadb → none; rg "dadb.Dadb" maestro-cli → none

…leException in shell()

…missionInternal instead of swallowing

…achableException in setAllPermissions

amanjeetsingh150 · 2026-06-03T08:42:16Z

+        } catch (e: Exception) {
+            // Best-effort: if we can't read/parse the APK for any non-transport reason, skip granting.
+            logger.debug("Failed to read APK permissions for $appId: ${e.message}")
+            null


setAllPermissions is invoked during launchApp to grant all requested permissions.

From what I can see, the current behavior swallows the exception. I think this should fail fast instead, because a failure here means Maestro was not able to complete the requested launch setup in expected behaviour.

So, given what we've discovered till today:

IOException needs to be mapped as DeviceUnreachableException

Hard fail any other exceptions.

Are you suggesting we don't use a try / catch at all here, and just let the error bubble up? I support that. The only downside is if we've been swalling permission issues that don't affect customers, then suddenly we're throwing infra errors where we didn't before. So maybe we just log it and monitor first so we can know how common this is? What do you think?

amanjeetsingh150

Does this PR also mean consumers will continue configuring dadb and their write timeouts independently?

I’m wondering if we should eventually centralize this in core, so all consumers use a common dadb configuration instead of each one managing it separately.

That said, I understand my suggestions are likely riskier deploys, so I don’t want to block this PR unnecessarily. I’d love to jam offline on how we want to prioritize this:

do we take a smaller part of it now?
or do we plan a bigger revamp, especially if we start treating permissions as a first-class Maestro operation? There are open issues on iOS as well.

So it might be worth discussing the broader direction before we go too deep here.

…imeouts) as DeviceUnreachableException Broaden the original SocketTimeoutException-only handling so every adb transport death surfaces as DeviceUnreachableException (retryable INFRA_ERROR) instead of being laundered into UnableToClearState / UnableToSetPermissions (non-retryable TEST_ERROR). The production failure that motivated this was a `java.net.SocketException: Broken pipe` on the dadb write path during clearAppState — a sibling of SocketTimeoutException under IOException, so the timeout-only catch missed it and the customer saw a spurious test failure. shell(): every IOException thrown by dadb.shell() is a transport failure (broken pipe, reset, timeout, EOF, protocol error) — translate the whole catch to DeviceUnreachableException. Command-level failures return a non-zero exitCode instead and stay a plain IOException, so the test-vs-infra boundary is preserved by origin, not by exception subtype. setAllPermissions: widen the APK-pull transport catch to SocketException as well as SocketTimeoutException; non-transport APK parse failures still fall through to best-effort skip. runDeviceCall(): a dead gRPC channel (UNAVAILABLE + IOException cause) now throws DeviceUnreachableException instead of the raw StatusRuntimeException, for parity on propagating gRPC calls. Tests cover broken pipe, bare IOException, and timeout on clearAppState and setPermissions, plus a boundary guard that a non-zero exitCode stays a plain IOException and is NOT reclassified as DeviceUnreachableException.

…ing it in setPermissions/clearState The launchApp and setPermissions command paths caught Exception broadly and rewrapped it into MaestroException.UnableToSetPermissions/UnableToClearState. Since DeviceUnreachableException is a RuntimeException, it got laundered too, and the worker classifier (MaestroTestRunner.mapToTestExecutionException) matched the wrapper as `is MaestroException` -> TestFailure -> TEST_ERROR before it could reach the `is DeviceUnreachableException` arm -> INFRA_ERROR. So the AndroidDriver transport-timeout translation alone did not move the dominant setPermissions/clearState cases out of TEST_ERROR. Rethrow DeviceUnreachableException ahead of the generic catch at the three wrap sites so it propagates bare to the worker and classifies as a retryable INFRA_ERROR. This also fixes the same latent misclassification on iOS. Adds three IntegrationTest regression tests (one per wrap site) asserting runFlow surfaces DeviceUnreachableException rather than a MaestroException.

…okepoint (TranslatingDadb)

…TranslatingDadb

…n at the device boundary Replace the TranslatingDadb decorator (which *was* a Dadb, so the transport type still leaked through anything holding it) with a plain internal DadbConnection that does NOT implement Dadb. It exposes only the per-connection operations the device layer uses (shell/openShell/pull/push/install/uninstall/ open/close), so no consumer can name dadb.Dadb through it. Retype AndroidDriver's constructor and the AndroidAppFiles / chromedevtools helpers to take DadbConnection. The constructor and helpers become internal because they now reference the internal DadbConnection; the public connect() entry point that re-introduces Dadb.create lands in Task 2. Behavior-preserving: DadbConnection performs the same IOException -> DeviceUnreachableException translation TranslatingDadb did (open() stays untranslated as the gRPC socket factory). The full AndroidDriverTest suite stays green.

… maestro-cli off Dadb Introduces the device-layer construction owner AndroidDriver.connect(host, port, ...) (Dadb.create -> DadbConnection -> driver, no open()) and the enumeration utility AndroidDevices.list(host) yielding flat AndroidDeviceDescriptor(id, host, port) plus AndroidDevices.isReachable(host, port). Refactors MaestroSessionManager and McpMaestroSessionManager onto these, so maestro-cli no longer names Dadb anywhere. Collapses to a single TCP path: deletes the vestigial adb-server (port 5038) path — createAdbServerDadb(), AdbServer.listDadbs(5038) in DeviceService, the dead AndroidDeviceProvider test helper, and all dadb.adbserver imports. Every dadb.Dadb reference now lives only inside maestro-client (AndroidDriver.connect, AndroidDevices, DadbConnection, DeviceService). Adds unit tests for descriptor parsing.

…rify, swallow it in teardown autoVerifyWithAppName wrapped the APK pull in runCatching, which swallowed the DeviceUnreachableException now raised by DadbConnection on a dead transport, silently skipping auto-verify instead of surfacing retryable infra. Mirror setAllPermissions: rethrow DUE, best-effort-swallow other read failures. The two teardown methods (uninstallMaestroDriverApp / uninstallMaestroServerApp) caught only IOException; DUE is a RuntimeException and now escaped cleanup. Add a DUE catch around both the package check and the uninstall (including the IOException retry) so a dead device cannot crash teardown.

- launchAppCommand: catch DeviceUnreachableException before the generic Exception catch so a wedged transport propagates as retryable infra instead of being laundered into MaestroException.UnableToLaunchApp (which the worker misclassifies as a customer TEST_ERROR). - AndroidMediaStoreTest: build the driver via the public AndroidDriver.connect factory (the constructor is now internal in another module); keep the raw Dadb for fileExists assertions. - IntegrationTest: add coverage asserting a DUE thrown from the driver's launchApp propagates as DeviceUnreachableException, not UnableToLaunchApp.

…a Dadb.list round-trip Replace the host:port descriptor/parseDescriptor model (which dropped colon-less serials and broke by-serial selection) with AndroidDevices.resolveDadb: re-list and match the verbatim dadb.toString() id, reusing that live connection so reconnect uses the same adb-server transport enumeration used. Rewrite the unit test to assert the round-trip invariant instead of codifying the dropped-serial bug.

Resolve a device id to a connection via AndroidDevices.resolveDadb (adb-server round-trip) and wrap it in DadbConnection, mirroring connect(host, port).

…ort reconnect createAndroid and pickAndroidDevice now resolve a device id through AndroidDriver.connectToDevice (adb-server round-trip), so serial-identified devices select and connect again. The explicit --port path still uses AndroidDriver.connect(host, port). Fixes the 'Unable to find device with id emulator-5554' regression.

…tToDevice kdoc Harden the first-device resolution test to verify the picked connection stays open, and reword connectToDevice doc to avoid an ambiguous [open] link.

…pp regression) getApkFile pulled a blank apk path when an app was not installed, and the resulting 'Sync failed: No such file' IOException was translated by DadbConnection into DeviceUnreachableException. That DUE bypasses Orchestra's MaestroException-based 'optional' suppression and fails the flow terminally, breaking 'launchApp optional: true' for a non-existent app (e2e flow commands_optional_tournee). Detect the not-installed case and throw a plain IllegalStateException, which is wrapped into UnableToLaunchApp and honored by optional — without weakening DeviceUnreachableException classification for real transport failures.

amanjeetsingh150 · 2026-06-05T09:01:29Z


            else -> runCatching {
-                (Dadb.list() + AdbServer.listDadbs(adbServerPort = 5038))
+                Dadb.list()


Just a heads-up here: every consumer currently creates/configures dadb differently. After this PR, we’ll move toward a single shared dadb, which feels like the right design direction.

The main thing to be careful about is configuration. Cloud currently uses custom timeouts / keep-alive values, while other consumers seem to use raw dadb defaults.

I think we should either:

Keep Cloud’s existing config as the default to avoid breaking existing test cases.
Or intentionally move to the shared config, but acknowledge this is a riskier deployment.

Totally agreed. I think the right move is 1) sane defaults (current defaults internally are NO timeouts, which I think is bad), and 2) allow the contract to accept dadb configuration parameters that are passed through.

amanjeetsingh150 · 2026-06-05T09:06:21Z

+        } catch (unreachable: DeviceUnreachableException) {
+            // Teardown runs during cleanup/close; a dead transport (already translated by
+            // DadbConnection) must not crash teardown. Log and continue. Wraps both the package
+            // check and the uninstall, including the IOException retry below.


Optional: should we create an abstraction that catches this unreachable exception to avoid this duplication across driver? Something like runDeviceCall?

This PR actually does exactly that; the DadbConnection translates all dadb-related IOExceptions into DeviceUnreachableExceptions. This catch right here is correct IMO; we are intentionally ignoring a device disconnection because we're in the teardown sequence.

amanjeetsingh150 · 2026-06-05T09:11:18Z

+        runDadbCall("shell: $command") { dadb.shell(command) }
+
+    fun openShell(command: String = ""): AdbShellStream =
+        runDadbCall("openShell: $command") { dadb.openShell(command) }


Ummm I might be missing something was not the requirement to have logs, metrics observability here? To get:

dadb pull calls

dadb shell calls

dadb push calls

Ah yeah, that's intentional, sorry I should have linked to this section of your proposal: https://app.notion.com/p/mobile-dev/WIP-Root-Causing-Maestro-s-Timeouts-36f4d3bab4ab80498d95caeebfdee068?source=copy_link#047132e3b9da4b2096eaf3aa7f00a662

I explain why I'm not adding that other stuff yet.

(though if you think we need it now, I'm fine with that, too)

steviec added 3 commits June 2, 2026 11:24

fix(android-driver): translate dadb socket timeout to DeviceUnreachab…

76f0574

…leException in shell()

fix(android-driver): propagate DeviceUnreachableException from setPer…

79ef215

…missionInternal instead of swallowing

fix(android-driver): surface APK-pull transport timeout as DeviceUnre…

7df76eb

…achableException in setAllPermissions

amanjeetsingh150 reviewed Jun 3, 2026

View reviewed changes

steviec changed the title ~~fix(android-driver): classify wedged-adbd timeouts as DeviceUnreachableException~~ fix(android-driver): classify dadb transport failures as DeviceUnreachableException Jun 3, 2026

steviec added 8 commits June 3, 2026 10:10

fix(android-driver): translate every dadb transport failure at one ch…

ea59a54

…okepoint (TranslatingDadb)

fix(android-driver): drop hand-picked Socket exception list, rely on …

9640d94

…TranslatingDadb

docs: record TCP-only / drop-5038 decision in execution plan

5d62540

steviec changed the title ~~fix(android-driver): classify dadb transport failures as DeviceUnreachableException~~ fix(android): device transport boundary — surface all dadb failures as DeviceUnreachableException Jun 4, 2026

chore: remove planning/brainstorm docs from PR

98f82a0

steviec changed the title ~~fix(android): device transport boundary — surface all dadb failures as DeviceUnreachableException~~ fix(android): establish a device-transport boundary (fixes transport-failure misclassification) Jun 4, 2026

steviec added 5 commits June 4, 2026 17:56

feat(android-driver): add connectToDevice for discovery-based selection

59bebc9

Resolve a device id to a connection via AndroidDevices.resolveDadb (adb-server round-trip) and wrap it in DadbConnection, mirroring connect(host, port).

test(device-layer): assert first device is not closed; clarify connec…

dade1a1

…tToDevice kdoc Harden the first-device resolution test to verify the picked connection stays open, and reword connectToDevice doc to avoid an ambiguous [open] link.

amanjeetsingh150 reviewed Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(android): establish a device-transport boundary (fixes transport-failure misclassification)#3331

fix(android): establish a device-transport boundary (fixes transport-failure misclassification)#3331
steviec wants to merge 18 commits into
mainfrom
fix/android-driver-device-unreachable

steviec commented Jun 2, 2026 •

edited

Loading

Uh oh!

amanjeetsingh150 Jun 3, 2026

Uh oh!

steviec Jun 5, 2026

Uh oh!

amanjeetsingh150 left a comment

Uh oh!

amanjeetsingh150 Jun 5, 2026

Uh oh!

steviec Jun 5, 2026

Uh oh!

amanjeetsingh150 Jun 5, 2026

Uh oh!

steviec Jun 5, 2026

Uh oh!

amanjeetsingh150 Jun 5, 2026

Uh oh!

steviec Jun 5, 2026

Uh oh!

steviec Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

steviec commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Approach — draw the boundary

Usage — before / after

Notes

Test plan

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

amanjeetsingh150 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

steviec commented Jun 2, 2026 •

edited

Loading