Skip to content

fix(android): establish a device-transport boundary (fixes transport-failure misclassification)#3331

Open
steviec wants to merge 18 commits into
mainfrom
fix/android-driver-device-unreachable
Open

fix(android): establish a device-transport boundary (fixes transport-failure misclassification)#3331
steviec wants to merge 18 commits into
mainfrom
fix/android-driver-device-unreachable

Conversation

@steviec

@steviec steviec commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Problem

The raw Android transport (dadb.Dadb) is passed by hand through every layer — CLI, session managers, drivers, app-file helpers, even Orchestra. Nothing separates reaching the device from running a test.

So when the transport dies mid-flow (broken pipe / reset / EOF), the bare IOException dadb throws looks identical to a failing command — and gets wrapped into a MaestroException (UnableToSetPermissions / UnableToClearState / UnableToLaunchApp). A dead device is reported as a customer test failure instead of retryable infra — the source of the recurring ~15-minute-timeout errors. And with no seam owning the transport, every caller has to construct and tear down Dadb itself.

Approach — draw the boundary

Introduce one device-transport boundary; dadb.Dadb never escapes it.

  • Consumers name an endpoint, get a driver. AndroidDriver.connect(host, port) and AndroidDevices.list() (flat descriptors, never a Dadb). No caller touches the transport.
  • Inside the boundary, the transport is born and stays. connect builds Dadb.create → DadbConnection → AndroidDriver. DadbConnection wraps one open Dadb and is deliberately not a Dadb (so the type can't leak); it's the single chokepoint where IOException → DeviceUnreachableException. Only open/close are carve-outs.
  • Errors classify once, at the edge. DeviceUnreachableException (a RuntimeException = retryable infra) propagates through Orchestra untouched, except teardown which swallows-and-logs. A non-zero shell exit stays a plain IOException (test-domain) — command failure ≠ transport death.
 consumer ─► AndroidDriver.connect(host, port) ─┐   (device layer)
                                                ▼
                              Dadb.create ─► DadbConnection ─► AndroidDriver
                                              (IOException → DeviceUnreachableException)

Usage — before / after

Raw Dadb is gone from callers — you name an endpoint, not a transport.

Before

val dadb = Dadb.list().find { it.toString() == id }
    ?: Dadb.discover()
    ?: error("No device")
val driver = AndroidDriver(rawDadb = dadb, hostPort = hostPort)
val maestro = Maestro.android(driver)
Orchestra(maestro).runFlow(commands)

After

val device = AndroidDevices.list().find { it.id == id }
    ?: error("No device")
val driver = AndroidDriver.connect(host = device.host, port = device.port, hostPort = hostPort)
val maestro = Maestro.android(driver)
Orchestra(maestro).runFlow(commands)

A caller with a known endpoint skips enumeration — the worker uses AndroidDriver.connect(port = 6520).

Notes

  • Collapsed device discovery to a single TCP path; removed the dead adb-server (port 5038) fallback (nothing provisions a 5038 server).
  • Worker migration (construct via connect(port = 6520); move logcat/crash/ANR onto Driver) is a follow-up in the copilot repo after this merges.

Test plan

  • :maestro-client:test (AndroidDriverTest, AndroidDevicesTest), :maestro-test IntegrationTest 160 cases green incl. new launchApp-DUE test (negative-control verified)
  • :maestro-cli + :maestro-orchestra compile (main + test)
  • rg TranslatingDadb → none; rg "dadb.Dadb" maestro-cli → none

} catch (e: Exception) {
// Best-effort: if we can't read/parse the APK for any non-transport reason, skip granting.
logger.debug("Failed to read APK permissions for $appId: ${e.message}")
null

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setAllPermissions is invoked during launchApp to grant all requested permissions.

From what I can see, the current behavior swallows the exception. I think this should fail fast instead, because a failure here means Maestro was not able to complete the requested launch setup in expected behaviour.

So, given what we've discovered till today:

  • IOException needs to be mapped as DeviceUnreachableException
  • Hard fail any other exceptions.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting we don't use a try / catch at all here, and just let the error bubble up? I support that. The only downside is if we've been swalling permission issues that don't affect customers, then suddenly we're throwing infra errors where we didn't before. So maybe we just log it and monitor first so we can know how common this is? What do you think?

@amanjeetsingh150 amanjeetsingh150 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR also mean consumers will continue configuring dadb and their write timeouts independently?

I’m wondering if we should eventually centralize this in core, so all consumers use a common dadb configuration instead of each one managing it separately.

That said, I understand my suggestions are likely riskier deploys, so I don’t want to block this PR unnecessarily. I’d love to jam offline on how we want to prioritize this:

  • do we take a smaller part of it now?
  • or do we plan a bigger revamp, especially if we start treating permissions as a first-class Maestro operation? There are open issues on iOS as well.

So it might be worth discussing the broader direction before we go too deep here.

…imeouts) as DeviceUnreachableException

Broaden the original SocketTimeoutException-only handling so every adb
transport death surfaces as DeviceUnreachableException (retryable
INFRA_ERROR) instead of being laundered into UnableToClearState /
UnableToSetPermissions (non-retryable TEST_ERROR).

The production failure that motivated this was a `java.net.SocketException:
Broken pipe` on the dadb write path during clearAppState — a sibling of
SocketTimeoutException under IOException, so the timeout-only catch missed
it and the customer saw a spurious test failure.

shell(): every IOException thrown by dadb.shell() is a transport failure
(broken pipe, reset, timeout, EOF, protocol error) — translate the whole
catch to DeviceUnreachableException. Command-level failures return a
non-zero exitCode instead and stay a plain IOException, so the test-vs-infra
boundary is preserved by origin, not by exception subtype.

setAllPermissions: widen the APK-pull transport catch to SocketException as
well as SocketTimeoutException; non-transport APK parse failures still fall
through to best-effort skip.

runDeviceCall(): a dead gRPC channel (UNAVAILABLE + IOException cause) now
throws DeviceUnreachableException instead of the raw StatusRuntimeException,
for parity on propagating gRPC calls.

Tests cover broken pipe, bare IOException, and timeout on clearAppState and
setPermissions, plus a boundary guard that a non-zero exitCode stays a plain
IOException and is NOT reclassified as DeviceUnreachableException.
@steviec steviec changed the title fix(android-driver): classify wedged-adbd timeouts as DeviceUnreachableException fix(android-driver): classify dadb transport failures as DeviceUnreachableException Jun 3, 2026
steviec added 8 commits June 3, 2026 10:10
…ing it in setPermissions/clearState

The launchApp and setPermissions command paths caught Exception broadly and
rewrapped it into MaestroException.UnableToSetPermissions/UnableToClearState.
Since DeviceUnreachableException is a RuntimeException, it got laundered too,
and the worker classifier (MaestroTestRunner.mapToTestExecutionException)
matched the wrapper as `is MaestroException` -> TestFailure -> TEST_ERROR
before it could reach the `is DeviceUnreachableException` arm -> INFRA_ERROR.

So the AndroidDriver transport-timeout translation alone did not move the
dominant setPermissions/clearState cases out of TEST_ERROR. Rethrow
DeviceUnreachableException ahead of the generic catch at the three wrap sites
so it propagates bare to the worker and classifies as a retryable INFRA_ERROR.
This also fixes the same latent misclassification on iOS.

Adds three IntegrationTest regression tests (one per wrap site) asserting
runFlow surfaces DeviceUnreachableException rather than a MaestroException.
…n at the device boundary

Replace the TranslatingDadb decorator (which *was* a Dadb, so the transport
type still leaked through anything holding it) with a plain internal
DadbConnection that does NOT implement Dadb. It exposes only the per-connection
operations the device layer uses (shell/openShell/pull/push/install/uninstall/
open/close), so no consumer can name dadb.Dadb through it.

Retype AndroidDriver's constructor and the AndroidAppFiles /
chromedevtools helpers to take DadbConnection. The constructor and helpers
become internal because they now reference the internal DadbConnection; the
public connect() entry point that re-introduces Dadb.create lands in Task 2.

Behavior-preserving: DadbConnection performs the same IOException ->
DeviceUnreachableException translation TranslatingDadb did (open() stays
untranslated as the gRPC socket factory). The full AndroidDriverTest suite
stays green.
… maestro-cli off Dadb

Introduces the device-layer construction owner AndroidDriver.connect(host, port, ...)
(Dadb.create -> DadbConnection -> driver, no open()) and the enumeration utility
AndroidDevices.list(host) yielding flat AndroidDeviceDescriptor(id, host, port) plus
AndroidDevices.isReachable(host, port). Refactors MaestroSessionManager and
McpMaestroSessionManager onto these, so maestro-cli no longer names Dadb anywhere.

Collapses to a single TCP path: deletes the vestigial adb-server (port 5038) path —
createAdbServerDadb(), AdbServer.listDadbs(5038) in DeviceService, the dead
AndroidDeviceProvider test helper, and all dadb.adbserver imports. Every dadb.Dadb
reference now lives only inside maestro-client (AndroidDriver.connect, AndroidDevices,
DadbConnection, DeviceService). Adds unit tests for descriptor parsing.
…rify, swallow it in teardown

autoVerifyWithAppName wrapped the APK pull in runCatching, which swallowed
the DeviceUnreachableException now raised by DadbConnection on a dead
transport, silently skipping auto-verify instead of surfacing retryable
infra. Mirror setAllPermissions: rethrow DUE, best-effort-swallow other
read failures.

The two teardown methods (uninstallMaestroDriverApp / uninstallMaestroServerApp)
caught only IOException; DUE is a RuntimeException and now escaped cleanup.
Add a DUE catch around both the package check and the uninstall (including
the IOException retry) so a dead device cannot crash teardown.
- launchAppCommand: catch DeviceUnreachableException before the generic
  Exception catch so a wedged transport propagates as retryable infra
  instead of being laundered into MaestroException.UnableToLaunchApp
  (which the worker misclassifies as a customer TEST_ERROR).
- AndroidMediaStoreTest: build the driver via the public
  AndroidDriver.connect factory (the constructor is now internal in
  another module); keep the raw Dadb for fileExists assertions.
- IntegrationTest: add coverage asserting a DUE thrown from the driver's
  launchApp propagates as DeviceUnreachableException, not UnableToLaunchApp.
@steviec steviec changed the title fix(android-driver): classify dadb transport failures as DeviceUnreachableException fix(android): device transport boundary — surface all dadb failures as DeviceUnreachableException Jun 4, 2026
@steviec steviec changed the title fix(android): device transport boundary — surface all dadb failures as DeviceUnreachableException fix(android): establish a device-transport boundary (fixes transport-failure misclassification) Jun 4, 2026
steviec added 5 commits June 4, 2026 17:56
…a Dadb.list round-trip

Replace the host:port descriptor/parseDescriptor model (which dropped
colon-less serials and broke by-serial selection) with AndroidDevices.resolveDadb:
re-list and match the verbatim dadb.toString() id, reusing that live connection
so reconnect uses the same adb-server transport enumeration used. Rewrite the unit
test to assert the round-trip invariant instead of codifying the dropped-serial bug.
Resolve a device id to a connection via AndroidDevices.resolveDadb (adb-server
round-trip) and wrap it in DadbConnection, mirroring connect(host, port).
…ort reconnect

createAndroid and pickAndroidDevice now resolve a device id through
AndroidDriver.connectToDevice (adb-server round-trip), so serial-identified
devices select and connect again. The explicit --port path still uses
AndroidDriver.connect(host, port). Fixes the 'Unable to find device with id
emulator-5554' regression.
…tToDevice kdoc

Harden the first-device resolution test to verify the picked connection stays
open, and reword connectToDevice doc to avoid an ambiguous [open] link.
…pp regression)

getApkFile pulled a blank apk path when an app was not installed, and the
resulting 'Sync failed: No such file' IOException was translated by
DadbConnection into DeviceUnreachableException. That DUE bypasses Orchestra's
MaestroException-based 'optional' suppression and fails the flow terminally,
breaking 'launchApp optional: true' for a non-existent app (e2e flow
commands_optional_tournee). Detect the not-installed case and throw a plain
IllegalStateException, which is wrapped into UnableToLaunchApp and honored by
optional — without weakening DeviceUnreachableException classification for real
transport failures.

else -> runCatching {
(Dadb.list() + AdbServer.listDadbs(adbServerPort = 5038))
Dadb.list()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a heads-up here: every consumer currently creates/configures dadb differently. After this PR, we’ll move toward a single shared dadb, which feels like the right design direction.

The main thing to be careful about is configuration. Cloud currently uses custom timeouts / keep-alive values, while other consumers seem to use raw dadb defaults.

I think we should either:

Keep Cloud’s existing config as the default to avoid breaking existing test cases.
Or intentionally move to the shared config, but acknowledge this is a riskier deployment.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agreed. I think the right move is 1) sane defaults (current defaults internally are NO timeouts, which I think is bad), and 2) allow the contract to accept dadb configuration parameters that are passed through.

} catch (unreachable: DeviceUnreachableException) {
// Teardown runs during cleanup/close; a dead transport (already translated by
// DadbConnection) must not crash teardown. Log and continue. Wraps both the package
// check and the uninstall, including the IOException retry below.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: should we create an abstraction that catches this unreachable exception to avoid this duplication across driver? Something like runDeviceCall?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR actually does exactly that; the DadbConnection translates all dadb-related IOExceptions into DeviceUnreachableExceptions. This catch right here is correct IMO; we are intentionally ignoring a device disconnection because we're in the teardown sequence.

runDadbCall("shell: $command") { dadb.shell(command) }

fun openShell(command: String = ""): AdbShellStream =
runDadbCall("openShell: $command") { dadb.openShell(command) }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ummm I might be missing something was not the requirement to have logs, metrics observability here? To get:

  • dadb pull calls
  • dadb shell calls
  • dadb push calls

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, that's intentional, sorry I should have linked to this section of your proposal: https://app.notion.com/p/mobile-dev/WIP-Root-Causing-Maestro-s-Timeouts-36f4d3bab4ab80498d95caeebfdee068?source=copy_link#047132e3b9da4b2096eaf3aa7f00a662

I explain why I'm not adding that other stuff yet.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(though if you think we need it now, I'm fine with that, too)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants