Skip to content

fix(signaling): call startForeground() first in onCreate() to reduce FGS timeout risk#1100

Merged
SERDUN merged 1 commit into
fix/signaling-fgs-xiaomi-crashfrom
fix/bug-1-startforeground-oncreate
Apr 10, 2026
Merged

fix(signaling): call startForeground() first in onCreate() to reduce FGS timeout risk#1100
SERDUN merged 1 commit into
fix/signaling-fgs-xiaomi-crashfrom
fix/bug-1-startforeground-oncreate

Conversation

@SERDUN

@SERDUN SERDUN commented Apr 10, 2026

Copy link
Copy Markdown
Member

Problem

SignalingForegroundService.onCreate() called startForeground() as the 5th operation, after a blocking SharedPreferences read (StorageDelegate.getCallbackDispatcher()).

Under Xiaomi HyperOS memory pressure, that SharedPrefs IPC blocks the main thread for 50–500ms. Combined with the main thread being busy during Flutter cold start, the total delay from startForegroundService() to startForeground() exceeded Xiaomi's aggressive ~1.2s window (vs the standard 10s), causing ForegroundServiceDidNotStartInTimeException.

From the crash log:

T+0.000s  startForegroundService()
T+1.205s  Xiaomi removes ServiceRecord
T+2.252s  startForeground() → rejected → crash

Fix

Move startForeground() to be the first call after super.onCreate(), before any IPC.

// Before
override fun onCreate() {
    super.onCreate()
    Log.d(TAG, ...)
    instance = this
    val callbackHandle = StorageDelegate.getCallbackDispatcher(applicationContext)  // SharedPrefs IPC
    flutterEngineHelper = FlutterEngineHelper(...)
    startForeground()  // 5th — too late under pressure
    isRunning = true
}

// After
override fun onCreate() {
    super.onCreate()
    startForeground()  // first — no IPC before it

    Log.d(TAG, ...)
    instance = this
    val callbackHandle = StorageDelegate.getCallbackDispatcher(applicationContext)
    flutterEngineHelper = FlutterEngineHelper(...)
    isRunning = true
}

Safety

  • startForeground() has no dependency on flutterEngineHelper or callbackHandle — the engine starts in onStartCommand(), which runs after onCreate() completes
  • instance is only read by WebtritSignalingServicePlugin.notifyIsolateReady(), which is called after onStartCommand() + Dart isolate ready — well after onCreate() completes
  • isRunning placement at the end of onCreate() is unchanged in meaning

Part of

Umbrella: #1099

…FGS timeout risk

Move startForeground() to be the first call after super.onCreate(), before
any SharedPreferences IPC. Under memory pressure (Xiaomi HyperOS), the
SharedPrefs read in getCallbackDispatcher() can block the main thread for
50-500ms, pushing startForeground() past Xiaomi's aggressive ~1.2s window
and causing ForegroundServiceDidNotStartInTimeException.

startForeground() has no dependency on FlutterEngineHelper or callbackHandle —
the engine starts in onStartCommand(), which runs after onCreate() completes.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Moves startForeground() to the first operation in SignalingForegroundService.onCreate() (immediately after super.onCreate()) to minimize the risk of OEM-specific foreground-service start timeouts caused by blocking IPC during service initialization.

Changes:

  • Call startForeground() at the beginning of onCreate() (before the SharedPreferences read for the callback handle).
  • Keep the rest of initialization order intact (logging, instance, FlutterEngineHelper, isRunning).

@SERDUN SERDUN marked this pull request as ready for review April 10, 2026 19:33
@SERDUN SERDUN merged commit 95acf7e into fix/signaling-fgs-xiaomi-crash Apr 10, 2026
4 checks passed
@SERDUN SERDUN deleted the fix/bug-1-startforeground-oncreate branch April 10, 2026 19:34
SERDUN added a commit that referenced this pull request Apr 11, 2026
…n Xiaomi (#1099)

* chore(signaling): umbrella branch for Xiaomi FGS crash fixes

* fix(signaling): call startForeground() first in onCreate() to reduce FGS timeout risk (#1100)

Move startForeground() to be the first call after super.onCreate(), before
any SharedPreferences IPC. Under memory pressure (Xiaomi HyperOS), the
SharedPrefs read in getCallbackDispatcher() can block the main thread for
50-500ms, pushing startForeground() past Xiaomi's aggressive ~1.2s window
and causing ForegroundServiceDidNotStartInTimeException.

startForeground() has no dependency on FlutterEngineHelper or callbackHandle —
the engine starts in onStartCommand(), which runs after onCreate() completes.

* fix(signaling): parallelize Pigeon IPC calls in _startService() to reduce FGS delay (#1101)

* fix(signaling): parallelize Pigeon IPC calls in _startService() to reduce FGS delay

Replace three sequential await calls with Future.wait() so startService()
(which calls startForegroundService()) is dispatched without waiting for
three Binder round-trips first.

All four calls share the same BinaryMessenger, which delivers messages to
the Kotlin main-thread Looper in FIFO order. startService() is always
processed after the credential writes, so synchronizeIsolate() reads
correct data on the first attempt.

Under memory pressure (Xiaomi HyperOS), each sequential Binder round-trip
adds ~100–300 ms. Removing three of them saves ~300–900 ms before
startForegroundService() is called, widening the margin within the
vendor-specific FGS promotion window.

* fix(signaling): parallelize credential writes, keep startService() sequential

Address review concerns:
- Separate credential writes (Future.wait) from startService() (sequential await)
  so startService() is only called after all SharedPreferences writes complete
- Removes the cross-channel FIFO ordering assumption: credentials are
  independent of each other and run concurrently; startService() is
  explicitly sequenced after Future.wait() resolves
- A failure in any credential write now prevents startService() from
  being dispatched, preserving the original error-isolation behavior

Still removes two sequential Binder round-trips (~200–600 ms under
memory pressure) before startForegroundService() is called.

* fix(signaling): persistent mode FGS recovery after OS kill (#1103)

* fix(signaling): persistent mode FGS recovery after OS kill

Part 1 — FCM path: add WebtritSignalingService.restoreService() static
method (Pigeon connect() → Kotlin) called from onPushNotificationSyncCallback
finally block after _disposeContext(). Restores the persistent WebSocket for
future calls when FGS is dead and FCM push is the first trigger.

Part 2 — No-GMS path: add SignalingRestartWorker (WorkManager one-shot)
enqueued from onDestroy (15 s) and onTaskRemoved (1 s) in persistent mode.
Result.retry() handles Android 12+ ForegroundServiceStartNotAllowedException.
SignalingRestartWorker.remove() called first in stopService() to prevent
restart after explicit logout.

Also fixes thread-safety bug: isRunning is now @volatile.

Closes bug-3-no-persistent-mode-recovery.

* fix(signaling): address Copilot review comments on FGS recovery

- cancelUniqueWork instead of cancelAllWorkByTag in SignalingRestartWorker.remove()
- doWork(): retry only on ForegroundServiceStartNotAllowedException, failure() on all others
- doWork(): gate on tenantId/token in addition to coreUrl
- doWork(): Log.e with throwable for full stack trace
- connect() in Plugin: add tenantId/token guards; catch ForegroundServiceStartNotAllowedException
  natively and enqueue WorkManager instead of propagating to Dart
- onDestroy/onTaskRemoved: gate WorkManager enqueue on coreUrl+dispatcher to skip after logout
- background_isolate_callbacks: fix comment (exception handled natively), pass (e, st) to logger
- Add unit tests: plugin_test restoreService channel wiring, signaling_service_test delegation

* fix(signaling): address second round Copilot review comments

- SignalingRestartWorker: update KDoc to reflect actual retry logic
  (retry only on ForegroundServiceStartNotAllowedException, failure() on others)
- SignalingRestartWorker: Log.w for transient ForegroundServiceStartNotAllowedException,
  Log.e only for permanent failures — reduces noise in normal retry scenarios
- SignalingForegroundService.onDestroy: add tenantId/token guards to align with
  connect() and doWork() credential checks
- SignalingForegroundService.onTaskRemoved: same tenantId/token guard alignment

* fix(signaling): remove direct API-31 class reference for minSdk-26 safety

Replace `is ForegroundServiceStartNotAllowedException` checks with
javaClass.name string comparison in WebtritSignalingServicePlugin and
SignalingRestartWorker. Removes direct import of the API-31+ framework
class, eliminating any risk of class-verification errors on pre-31 ART.
Also removes the now-redundant Build.VERSION.SDK_INT guard and imports.

Remove stale inline comment from background_isolate_callbacks.dart.

* fix(signaling): restore idiomatic API-31 exception check via @RequiresApi helper

javaClass.name string comparison is fragile, not idiomatic, and misses
subclasses. Replace with a @RequiresApi(S) helper that holds the `is`
check in API-guarded code, satisfying Lint while keeping type safety.
The outer Build.VERSION.SDK_INT >= S guard means the helper is never
called on pre-31 devices, so no class-loading risk exists.

* fix: push isolate reuses FGS hub WebSocket (1-socket invariant) (#1104)

* fix: push isolate reuses FGS hub WebSocket instead of opening its own

PushNotificationIsolateManager previously always created a SignalingModuleImpl
directly, opening a second WebSocket when the FGS hub was already running.
This violated the 1-WebSocket invariant.

The fix adds createPushIsolateModule() to SignalingServicePlatform and
WebtritSignalingService. On Android, the method checks IsolateNameServer for
a live FGS hub: if found and acknowledged, a SignalingHubModule is returned
and no new connection is opened. When no hub is active (app killed), falls
back to a direct SignalingModuleImpl.

PushNotificationIsolateManager now exposes an init() method for async hub
discovery, called from _getOrInit() before run(). run() skips connect() when
the module reports isConnected (hub reuse path).

* logs: add diagnostic logs for push isolate signaling module resolution

Adds log lines to verify the 1-socket invariant at runtime:
- _initSignaling(): logs module type and isConnected after createPushIsolateModule
  resolves, making hub-reuse vs direct-socket path visible in logcat
- close(): logs module type and pending request count on teardown
- _getOrInit(): logs before and after init() so init timing is traceable

These logs confirm whether the hub path or fallback path is taken on each push
notification, and that teardown completes cleanly.

* fix: push isolate starts FGS and waits for hub instead of direct socket

Previously createPushIsolateModule fell back immediately to a direct
SignalingModuleImpl when no hub was running (pushBound mode, first push
after app close). This violated the 1-WebSocket invariant: both the push
isolate and the Activity would open their own WebSocket connections.

New flow:
1. Hub already running -> reuse (persistent mode or second push)
2. No hub -> startFgsOnly(pushBound) -> poll IsolateNameServer until hub
   registers and acknowledges -> return SignalingHubModule
3. Hub not available after 10 s -> emergency fallback to direct module

The push isolate now always shares the same FGS WebSocket with the
Activity, matching the intended architecture.

* fix: address Copilot review comments on push isolate hub reuse

- _tryConnectHub: yield to event loop after ack so hub replay events
  (SignalingConnected etc.) are processed before returning the module.
  Prevents run() from calling connect() on an already-live hub session
  due to isConnected being transiently false while replay is in flight.

- run(): throw StateError when called before init() instead of silently
  leaving _completer unresolved and hanging the caller until timeout.

- lefthook.yml: pass {1} to commit-msg-check.sh so the hook reads the
  actual commit message file (git hook argv) instead of git log -1.

* refactor: unify push isolate signaling with Activity via WebtritSignalingService

Push isolate now uses WebtritSignalingService(mode: pushBound) directly,
the same mechanism the Activity uses. HubConnectionManager inside the
service handles FGS start, hub discovery, and auto-reconnect if the hub
is killed between push arrival and Activity open.

Removes createPushIsolateModule from platform API — it was a one-shot
workaround that bypassed HubConnectionManager and had no reconnect logic.
Removes _tryConnectHub, _startFgsOnly, and related constants from the
Android plugin. The single code path eliminates the duplication.

Lifecycle boundary remains unchanged:
  push mode  — FGS lives from push arrival through Activity close
  persistent — FGS lives indefinitely

* fix: address remaining Copilot review comments

- Remove unnecessary async from _initSignaling() — nothing is awaited
- Rename log field hubConnected -> isConnected (accurate with WebtritSignalingService)
- Always call connect() in run() — WebtritSignalingService.connect() is
  idempotent via _startPending/_isConnected guard; the old conditional
  was left over from SignalingHubModule direct usage
- Fix class doc: qualify 1-WebSocket claim as Android-only
- Fix _initSignaling doc: remove auto-reconnect claim, HubConnectionManager
  handles FGS start and hub discovery, not reconnect after disconnect
- commit-msg-check.sh: add guard for missing or unreadable $1 argument

* docs: fix stale comments after signaling unification

- init() doc: hub discovery and FGS start happen in connect() via
  HubConnectionManager, not in init() as the old comment stated
- background_isolate_callbacks: same correction for _getOrInit comment
- commit-msg-check.sh: add set -euo pipefail for consistency with other scripts

* refactor: replace nullable signalingModule with late field and remove unnecessary async from init()

- _signalingModule/_signalingSubscription: SignalingModule? → late SignalingModule;
  eliminates null-aware ?. access throughout the class since run() requires init() first
- Add _initialized bool flag so close() can guard disposal without accessing late fields prematurely
- init(): Future<void> async => _initSignaling() → void init() (sync, no await needed)
- run(): null check replaced with !_initialized guard; remove intermediate `module` local
- All ?. accesses on _signalingModule replaced with direct . access
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants