Skip to content

fix(signaling): resolve ForegroundServiceDidNotStartInTimeException on Xiaomi#1099

Merged
SERDUN merged 5 commits into
developfrom
fix/signaling-fgs-xiaomi-crash
Apr 11, 2026
Merged

fix(signaling): resolve ForegroundServiceDidNotStartInTimeException on Xiaomi#1099
SERDUN merged 5 commits into
developfrom
fix/signaling-fgs-xiaomi-crash

Conversation

@SERDUN

@SERDUN SERDUN commented Apr 10, 2026

Copy link
Copy Markdown
Member

Summary

Fixes ForegroundServiceDidNotStartInTimeException crash in SignalingForegroundService on Xiaomi HyperOS (Android 15) under memory pressure.

Xiaomi removes the ServiceRecord after ~1.2s instead of the standard 10s. Three independent issues combine to push startForeground() past that window.

Bugs fixed

# File Issue
Bug 1 SignalingForegroundService.kt startForeground() called after SharedPrefs IPC in onCreate()
Bug 2 plugin.dart startForegroundService() called after 3 sequential Pigeon await
Bug 3 Architecture No service recovery in persistent mode after OS kill

Each bug is delivered as a separate PR targeting this branch.

Test plan

  • Force-stop app on Xiaomi/MIUI device under memory pressure, trigger incoming call via FCM
  • Confirm SignalingForegroundService starts without ForegroundServiceDidNotStartInTimeException
  • Verify signaling connects and call is received correctly
  • Confirm no regression on Pixel / emulator (standard 10s timeout)

…FGS timeout risk (#1100)

Move startForeground() to be the first call after super.onCreate(), before
any SharedPreferences IPC. Under memory pressure (Xiaomi HyperOS), the
SharedPrefs read in getCallbackDispatcher() can block the main thread for
50-500ms, pushing startForeground() past Xiaomi's aggressive ~1.2s window
and causing ForegroundServiceDidNotStartInTimeException.

startForeground() has no dependency on FlutterEngineHelper or callbackHandle —
the engine starts in onStartCommand(), which runs after onCreate() completes.
SERDUN added 2 commits April 10, 2026 23:06
…duce FGS delay (#1101)

* fix(signaling): parallelize Pigeon IPC calls in _startService() to reduce FGS delay

Replace three sequential await calls with Future.wait() so startService()
(which calls startForegroundService()) is dispatched without waiting for
three Binder round-trips first.

All four calls share the same BinaryMessenger, which delivers messages to
the Kotlin main-thread Looper in FIFO order. startService() is always
processed after the credential writes, so synchronizeIsolate() reads
correct data on the first attempt.

Under memory pressure (Xiaomi HyperOS), each sequential Binder round-trip
adds ~100–300 ms. Removing three of them saves ~300–900 ms before
startForegroundService() is called, widening the margin within the
vendor-specific FGS promotion window.

* fix(signaling): parallelize credential writes, keep startService() sequential

Address review concerns:
- Separate credential writes (Future.wait) from startService() (sequential await)
  so startService() is only called after all SharedPreferences writes complete
- Removes the cross-channel FIFO ordering assumption: credentials are
  independent of each other and run concurrently; startService() is
  explicitly sequenced after Future.wait() resolves
- A failure in any credential write now prevents startService() from
  being dispatched, preserving the original error-isolation behavior

Still removes two sequential Binder round-trips (~200–600 ms under
memory pressure) before startForegroundService() is called.
* fix(signaling): persistent mode FGS recovery after OS kill

Part 1 — FCM path: add WebtritSignalingService.restoreService() static
method (Pigeon connect() → Kotlin) called from onPushNotificationSyncCallback
finally block after _disposeContext(). Restores the persistent WebSocket for
future calls when FGS is dead and FCM push is the first trigger.

Part 2 — No-GMS path: add SignalingRestartWorker (WorkManager one-shot)
enqueued from onDestroy (15 s) and onTaskRemoved (1 s) in persistent mode.
Result.retry() handles Android 12+ ForegroundServiceStartNotAllowedException.
SignalingRestartWorker.remove() called first in stopService() to prevent
restart after explicit logout.

Also fixes thread-safety bug: isRunning is now @volatile.

Closes bug-3-no-persistent-mode-recovery.

* fix(signaling): address Copilot review comments on FGS recovery

- cancelUniqueWork instead of cancelAllWorkByTag in SignalingRestartWorker.remove()
- doWork(): retry only on ForegroundServiceStartNotAllowedException, failure() on all others
- doWork(): gate on tenantId/token in addition to coreUrl
- doWork(): Log.e with throwable for full stack trace
- connect() in Plugin: add tenantId/token guards; catch ForegroundServiceStartNotAllowedException
  natively and enqueue WorkManager instead of propagating to Dart
- onDestroy/onTaskRemoved: gate WorkManager enqueue on coreUrl+dispatcher to skip after logout
- background_isolate_callbacks: fix comment (exception handled natively), pass (e, st) to logger
- Add unit tests: plugin_test restoreService channel wiring, signaling_service_test delegation

* fix(signaling): address second round Copilot review comments

- SignalingRestartWorker: update KDoc to reflect actual retry logic
  (retry only on ForegroundServiceStartNotAllowedException, failure() on others)
- SignalingRestartWorker: Log.w for transient ForegroundServiceStartNotAllowedException,
  Log.e only for permanent failures — reduces noise in normal retry scenarios
- SignalingForegroundService.onDestroy: add tenantId/token guards to align with
  connect() and doWork() credential checks
- SignalingForegroundService.onTaskRemoved: same tenantId/token guard alignment

* fix(signaling): remove direct API-31 class reference for minSdk-26 safety

Replace `is ForegroundServiceStartNotAllowedException` checks with
javaClass.name string comparison in WebtritSignalingServicePlugin and
SignalingRestartWorker. Removes direct import of the API-31+ framework
class, eliminating any risk of class-verification errors on pre-31 ART.
Also removes the now-redundant Build.VERSION.SDK_INT guard and imports.

Remove stale inline comment from background_isolate_callbacks.dart.

* fix(signaling): restore idiomatic API-31 exception check via @RequiresApi helper

javaClass.name string comparison is fragile, not idiomatic, and misses
subclasses. Replace with a @RequiresApi(S) helper that holds the `is`
check in API-guarded code, satisfying Lint while keeping type safety.
The outer Build.VERSION.SDK_INT >= S guard means the helper is never
called on pre-31 devices, so no class-loading risk exists.

This comment was marked as resolved.

* fix: push isolate reuses FGS hub WebSocket instead of opening its own

PushNotificationIsolateManager previously always created a SignalingModuleImpl
directly, opening a second WebSocket when the FGS hub was already running.
This violated the 1-WebSocket invariant.

The fix adds createPushIsolateModule() to SignalingServicePlatform and
WebtritSignalingService. On Android, the method checks IsolateNameServer for
a live FGS hub: if found and acknowledged, a SignalingHubModule is returned
and no new connection is opened. When no hub is active (app killed), falls
back to a direct SignalingModuleImpl.

PushNotificationIsolateManager now exposes an init() method for async hub
discovery, called from _getOrInit() before run(). run() skips connect() when
the module reports isConnected (hub reuse path).

* logs: add diagnostic logs for push isolate signaling module resolution

Adds log lines to verify the 1-socket invariant at runtime:
- _initSignaling(): logs module type and isConnected after createPushIsolateModule
  resolves, making hub-reuse vs direct-socket path visible in logcat
- close(): logs module type and pending request count on teardown
- _getOrInit(): logs before and after init() so init timing is traceable

These logs confirm whether the hub path or fallback path is taken on each push
notification, and that teardown completes cleanly.

* fix: push isolate starts FGS and waits for hub instead of direct socket

Previously createPushIsolateModule fell back immediately to a direct
SignalingModuleImpl when no hub was running (pushBound mode, first push
after app close). This violated the 1-WebSocket invariant: both the push
isolate and the Activity would open their own WebSocket connections.

New flow:
1. Hub already running -> reuse (persistent mode or second push)
2. No hub -> startFgsOnly(pushBound) -> poll IsolateNameServer until hub
   registers and acknowledges -> return SignalingHubModule
3. Hub not available after 10 s -> emergency fallback to direct module

The push isolate now always shares the same FGS WebSocket with the
Activity, matching the intended architecture.

* fix: address Copilot review comments on push isolate hub reuse

- _tryConnectHub: yield to event loop after ack so hub replay events
  (SignalingConnected etc.) are processed before returning the module.
  Prevents run() from calling connect() on an already-live hub session
  due to isConnected being transiently false while replay is in flight.

- run(): throw StateError when called before init() instead of silently
  leaving _completer unresolved and hanging the caller until timeout.

- lefthook.yml: pass {1} to commit-msg-check.sh so the hook reads the
  actual commit message file (git hook argv) instead of git log -1.

* refactor: unify push isolate signaling with Activity via WebtritSignalingService

Push isolate now uses WebtritSignalingService(mode: pushBound) directly,
the same mechanism the Activity uses. HubConnectionManager inside the
service handles FGS start, hub discovery, and auto-reconnect if the hub
is killed between push arrival and Activity open.

Removes createPushIsolateModule from platform API — it was a one-shot
workaround that bypassed HubConnectionManager and had no reconnect logic.
Removes _tryConnectHub, _startFgsOnly, and related constants from the
Android plugin. The single code path eliminates the duplication.

Lifecycle boundary remains unchanged:
  push mode  — FGS lives from push arrival through Activity close
  persistent — FGS lives indefinitely

* fix: address remaining Copilot review comments

- Remove unnecessary async from _initSignaling() — nothing is awaited
- Rename log field hubConnected -> isConnected (accurate with WebtritSignalingService)
- Always call connect() in run() — WebtritSignalingService.connect() is
  idempotent via _startPending/_isConnected guard; the old conditional
  was left over from SignalingHubModule direct usage
- Fix class doc: qualify 1-WebSocket claim as Android-only
- Fix _initSignaling doc: remove auto-reconnect claim, HubConnectionManager
  handles FGS start and hub discovery, not reconnect after disconnect
- commit-msg-check.sh: add guard for missing or unreadable $1 argument

* docs: fix stale comments after signaling unification

- init() doc: hub discovery and FGS start happen in connect() via
  HubConnectionManager, not in init() as the old comment stated
- background_isolate_callbacks: same correction for _getOrInit comment
- commit-msg-check.sh: add set -euo pipefail for consistency with other scripts

* refactor: replace nullable signalingModule with late field and remove unnecessary async from init()

- _signalingModule/_signalingSubscription: SignalingModule? → late SignalingModule;
  eliminates null-aware ?. access throughout the class since run() requires init() first
- Add _initialized bool flag so close() can guard disposal without accessing late fields prematurely
- init(): Future<void> async => _initSignaling() → void init() (sync, no await needed)
- run(): null check replaced with !_initialized guard; remove intermediate `module` local
- All ?. accesses on _signalingModule replaced with direct . access

This comment was marked as resolved.

This comment was marked as resolved.

@SERDUN SERDUN marked this pull request as ready for review April 11, 2026 16:13
@SERDUN SERDUN merged commit 1ed2931 into develop Apr 11, 2026
9 checks passed
@SERDUN SERDUN deleted the fix/signaling-fgs-xiaomi-crash branch April 11, 2026 16:13
@SERDUN SERDUN restored the fix/signaling-fgs-xiaomi-crash branch April 16, 2026 19:29
@SERDUN SERDUN deleted the fix/signaling-fgs-xiaomi-crash branch May 22, 2026 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants