Skip to content

fix(signaling): parallelize Pigeon IPC calls in _startService() to reduce FGS delay#1101

Merged
SERDUN merged 2 commits into
fix/signaling-fgs-xiaomi-crashfrom
fix/bug-2-parallel-pigeon-start
Apr 10, 2026
Merged

fix(signaling): parallelize Pigeon IPC calls in _startService() to reduce FGS delay#1101
SERDUN merged 2 commits into
fix/signaling-fgs-xiaomi-crashfrom
fix/bug-2-parallel-pigeon-start

Conversation

@SERDUN

@SERDUN SERDUN commented Apr 10, 2026

Copy link
Copy Markdown
Member

Problem

_startService() in plugin.dart called startForegroundService() (via _hostApi.startService()) only after completing three sequential Pigeon await calls. Each await is a full Binder round-trip: Dart → BinaryMessenger → Kotlin main thread → Binder reply → Dart resumes.

Under Xiaomi HyperOS memory pressure, each round-trip adds ~100–300 ms. Three sequential calls add ~300–900 ms before startForegroundService() is even called — consuming a significant portion of Xiaomi's aggressive ~1.2s FGS promotion window.

// Before — each await blocks until Kotlin replies
await _hostApi.initializeServiceCallback(...);  // +100–300 ms
await _hostApi.saveConnectionConfig(...);        // +100–300 ms
await _hostApi.saveTrustedCertificates(...);     // +100–300 ms
await _hostApi.startService(...);                // ← startForegroundService() here

Fix

The three credential writes are independent — they write to separate SharedPreferences keys with no ordering dependency between them. They are parallelized via Future.wait(). startService() remains a separate sequential await after Future.wait() resolves, preserving the guarantee that credentials are persisted before the service starts.

// After — credential writes run concurrently
await Future.wait([
    _hostApi.initializeServiceCallback(...),
    _hostApi.saveConnectionConfig(...),
    _hostApi.saveTrustedCertificates(...),
]);
// startService() only dispatched after all credentials are written
await _hostApi.startService(...);

Result

  • Removes two sequential Binder round-trips (~200–600 ms under memory pressure) before startForegroundService() is called
  • startService() is explicitly sequenced after Future.wait() — no cross-channel ordering assumption
  • A failure in any credential write prevents startService() from being dispatched, preserving the original error-isolation behavior

Part of

Umbrella: #1099

…duce FGS delay

Replace three sequential await calls with Future.wait() so startService()
(which calls startForegroundService()) is dispatched without waiting for
three Binder round-trips first.

All four calls share the same BinaryMessenger, which delivers messages to
the Kotlin main-thread Looper in FIFO order. startService() is always
processed after the credential writes, so synchronizeIsolate() reads
correct data on the first attempt.

Under memory pressure (Xiaomi HyperOS), each sequential Binder round-trip
adds ~100–300 ms. Removing three of them saves ~300–900 ms before
startForegroundService() is called, widening the margin within the
vendor-specific FGS promotion window.

This comment was marked as resolved.

…quential

Address review concerns:
- Separate credential writes (Future.wait) from startService() (sequential await)
  so startService() is only called after all SharedPreferences writes complete
- Removes the cross-channel FIFO ordering assumption: credentials are
  independent of each other and run concurrently; startService() is
  explicitly sequenced after Future.wait() resolves
- A failure in any credential write now prevents startService() from
  being dispatched, preserving the original error-isolation behavior

Still removes two sequential Binder round-trips (~200–600 ms under
memory pressure) before startForegroundService() is called.

This comment was marked as resolved.

@SERDUN SERDUN marked this pull request as ready for review April 10, 2026 20:06
@SERDUN SERDUN merged commit 8f57a38 into fix/signaling-fgs-xiaomi-crash Apr 10, 2026
4 checks passed
@SERDUN SERDUN deleted the fix/bug-2-parallel-pigeon-start branch April 10, 2026 20:06
SERDUN added a commit that referenced this pull request Apr 11, 2026
…n Xiaomi (#1099)

* chore(signaling): umbrella branch for Xiaomi FGS crash fixes

* fix(signaling): call startForeground() first in onCreate() to reduce FGS timeout risk (#1100)

Move startForeground() to be the first call after super.onCreate(), before
any SharedPreferences IPC. Under memory pressure (Xiaomi HyperOS), the
SharedPrefs read in getCallbackDispatcher() can block the main thread for
50-500ms, pushing startForeground() past Xiaomi's aggressive ~1.2s window
and causing ForegroundServiceDidNotStartInTimeException.

startForeground() has no dependency on FlutterEngineHelper or callbackHandle —
the engine starts in onStartCommand(), which runs after onCreate() completes.

* fix(signaling): parallelize Pigeon IPC calls in _startService() to reduce FGS delay (#1101)

* fix(signaling): parallelize Pigeon IPC calls in _startService() to reduce FGS delay

Replace three sequential await calls with Future.wait() so startService()
(which calls startForegroundService()) is dispatched without waiting for
three Binder round-trips first.

All four calls share the same BinaryMessenger, which delivers messages to
the Kotlin main-thread Looper in FIFO order. startService() is always
processed after the credential writes, so synchronizeIsolate() reads
correct data on the first attempt.

Under memory pressure (Xiaomi HyperOS), each sequential Binder round-trip
adds ~100–300 ms. Removing three of them saves ~300–900 ms before
startForegroundService() is called, widening the margin within the
vendor-specific FGS promotion window.

* fix(signaling): parallelize credential writes, keep startService() sequential

Address review concerns:
- Separate credential writes (Future.wait) from startService() (sequential await)
  so startService() is only called after all SharedPreferences writes complete
- Removes the cross-channel FIFO ordering assumption: credentials are
  independent of each other and run concurrently; startService() is
  explicitly sequenced after Future.wait() resolves
- A failure in any credential write now prevents startService() from
  being dispatched, preserving the original error-isolation behavior

Still removes two sequential Binder round-trips (~200–600 ms under
memory pressure) before startForegroundService() is called.

* fix(signaling): persistent mode FGS recovery after OS kill (#1103)

* fix(signaling): persistent mode FGS recovery after OS kill

Part 1 — FCM path: add WebtritSignalingService.restoreService() static
method (Pigeon connect() → Kotlin) called from onPushNotificationSyncCallback
finally block after _disposeContext(). Restores the persistent WebSocket for
future calls when FGS is dead and FCM push is the first trigger.

Part 2 — No-GMS path: add SignalingRestartWorker (WorkManager one-shot)
enqueued from onDestroy (15 s) and onTaskRemoved (1 s) in persistent mode.
Result.retry() handles Android 12+ ForegroundServiceStartNotAllowedException.
SignalingRestartWorker.remove() called first in stopService() to prevent
restart after explicit logout.

Also fixes thread-safety bug: isRunning is now @volatile.

Closes bug-3-no-persistent-mode-recovery.

* fix(signaling): address Copilot review comments on FGS recovery

- cancelUniqueWork instead of cancelAllWorkByTag in SignalingRestartWorker.remove()
- doWork(): retry only on ForegroundServiceStartNotAllowedException, failure() on all others
- doWork(): gate on tenantId/token in addition to coreUrl
- doWork(): Log.e with throwable for full stack trace
- connect() in Plugin: add tenantId/token guards; catch ForegroundServiceStartNotAllowedException
  natively and enqueue WorkManager instead of propagating to Dart
- onDestroy/onTaskRemoved: gate WorkManager enqueue on coreUrl+dispatcher to skip after logout
- background_isolate_callbacks: fix comment (exception handled natively), pass (e, st) to logger
- Add unit tests: plugin_test restoreService channel wiring, signaling_service_test delegation

* fix(signaling): address second round Copilot review comments

- SignalingRestartWorker: update KDoc to reflect actual retry logic
  (retry only on ForegroundServiceStartNotAllowedException, failure() on others)
- SignalingRestartWorker: Log.w for transient ForegroundServiceStartNotAllowedException,
  Log.e only for permanent failures — reduces noise in normal retry scenarios
- SignalingForegroundService.onDestroy: add tenantId/token guards to align with
  connect() and doWork() credential checks
- SignalingForegroundService.onTaskRemoved: same tenantId/token guard alignment

* fix(signaling): remove direct API-31 class reference for minSdk-26 safety

Replace `is ForegroundServiceStartNotAllowedException` checks with
javaClass.name string comparison in WebtritSignalingServicePlugin and
SignalingRestartWorker. Removes direct import of the API-31+ framework
class, eliminating any risk of class-verification errors on pre-31 ART.
Also removes the now-redundant Build.VERSION.SDK_INT guard and imports.

Remove stale inline comment from background_isolate_callbacks.dart.

* fix(signaling): restore idiomatic API-31 exception check via @RequiresApi helper

javaClass.name string comparison is fragile, not idiomatic, and misses
subclasses. Replace with a @RequiresApi(S) helper that holds the `is`
check in API-guarded code, satisfying Lint while keeping type safety.
The outer Build.VERSION.SDK_INT >= S guard means the helper is never
called on pre-31 devices, so no class-loading risk exists.

* fix: push isolate reuses FGS hub WebSocket (1-socket invariant) (#1104)

* fix: push isolate reuses FGS hub WebSocket instead of opening its own

PushNotificationIsolateManager previously always created a SignalingModuleImpl
directly, opening a second WebSocket when the FGS hub was already running.
This violated the 1-WebSocket invariant.

The fix adds createPushIsolateModule() to SignalingServicePlatform and
WebtritSignalingService. On Android, the method checks IsolateNameServer for
a live FGS hub: if found and acknowledged, a SignalingHubModule is returned
and no new connection is opened. When no hub is active (app killed), falls
back to a direct SignalingModuleImpl.

PushNotificationIsolateManager now exposes an init() method for async hub
discovery, called from _getOrInit() before run(). run() skips connect() when
the module reports isConnected (hub reuse path).

* logs: add diagnostic logs for push isolate signaling module resolution

Adds log lines to verify the 1-socket invariant at runtime:
- _initSignaling(): logs module type and isConnected after createPushIsolateModule
  resolves, making hub-reuse vs direct-socket path visible in logcat
- close(): logs module type and pending request count on teardown
- _getOrInit(): logs before and after init() so init timing is traceable

These logs confirm whether the hub path or fallback path is taken on each push
notification, and that teardown completes cleanly.

* fix: push isolate starts FGS and waits for hub instead of direct socket

Previously createPushIsolateModule fell back immediately to a direct
SignalingModuleImpl when no hub was running (pushBound mode, first push
after app close). This violated the 1-WebSocket invariant: both the push
isolate and the Activity would open their own WebSocket connections.

New flow:
1. Hub already running -> reuse (persistent mode or second push)
2. No hub -> startFgsOnly(pushBound) -> poll IsolateNameServer until hub
   registers and acknowledges -> return SignalingHubModule
3. Hub not available after 10 s -> emergency fallback to direct module

The push isolate now always shares the same FGS WebSocket with the
Activity, matching the intended architecture.

* fix: address Copilot review comments on push isolate hub reuse

- _tryConnectHub: yield to event loop after ack so hub replay events
  (SignalingConnected etc.) are processed before returning the module.
  Prevents run() from calling connect() on an already-live hub session
  due to isConnected being transiently false while replay is in flight.

- run(): throw StateError when called before init() instead of silently
  leaving _completer unresolved and hanging the caller until timeout.

- lefthook.yml: pass {1} to commit-msg-check.sh so the hook reads the
  actual commit message file (git hook argv) instead of git log -1.

* refactor: unify push isolate signaling with Activity via WebtritSignalingService

Push isolate now uses WebtritSignalingService(mode: pushBound) directly,
the same mechanism the Activity uses. HubConnectionManager inside the
service handles FGS start, hub discovery, and auto-reconnect if the hub
is killed between push arrival and Activity open.

Removes createPushIsolateModule from platform API — it was a one-shot
workaround that bypassed HubConnectionManager and had no reconnect logic.
Removes _tryConnectHub, _startFgsOnly, and related constants from the
Android plugin. The single code path eliminates the duplication.

Lifecycle boundary remains unchanged:
  push mode  — FGS lives from push arrival through Activity close
  persistent — FGS lives indefinitely

* fix: address remaining Copilot review comments

- Remove unnecessary async from _initSignaling() — nothing is awaited
- Rename log field hubConnected -> isConnected (accurate with WebtritSignalingService)
- Always call connect() in run() — WebtritSignalingService.connect() is
  idempotent via _startPending/_isConnected guard; the old conditional
  was left over from SignalingHubModule direct usage
- Fix class doc: qualify 1-WebSocket claim as Android-only
- Fix _initSignaling doc: remove auto-reconnect claim, HubConnectionManager
  handles FGS start and hub discovery, not reconnect after disconnect
- commit-msg-check.sh: add guard for missing or unreadable $1 argument

* docs: fix stale comments after signaling unification

- init() doc: hub discovery and FGS start happen in connect() via
  HubConnectionManager, not in init() as the old comment stated
- background_isolate_callbacks: same correction for _getOrInit comment
- commit-msg-check.sh: add set -euo pipefail for consistency with other scripts

* refactor: replace nullable signalingModule with late field and remove unnecessary async from init()

- _signalingModule/_signalingSubscription: SignalingModule? → late SignalingModule;
  eliminates null-aware ?. access throughout the class since run() requires init() first
- Add _initialized bool flag so close() can guard disposal without accessing late fields prematurely
- init(): Future<void> async => _initSignaling() → void init() (sync, no await needed)
- run(): null check replaced with !_initialized guard; remove intermediate `module` local
- All ?. accesses on _signalingModule replaced with direct . access
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants