fix(filesystem): resolve provider activation on registration#17625

Draft
cdamus wants to merge 2 commits into master from issue/17506-activate-provider-resilience

@cdamus cdamus commented Jun 4, 2026

Contributor

What it does

Hardens FileService.activateProvider against a dangling onWillActivateFileSystemProvider listener, which can wedge frontend startup indefinitely.

Fixes #17506

activateProvider tied a scheme's activation to the settlement of WaitUntilEvent.fire(...), i.e. Promise.all of every listener's waitUntil promise. If any one of those promises never settles (the user-storage emitter carries several listeners), the activation, and everything awaiting it, hangs forever. In #17506 this stalls all four User-scope preference providers → UserConfigsPreferenceProvider.ready → PreferenceServiceImpl.initializeProviders → FrontendApplication.start, so the workbench never comes up. As the reporter found, even registering the provider by hand against the live container did not help: the deferred in activations stayed pending.
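The coupling described above can be reproduced in isolation. The following is a minimal sketch, not Theia code: `waitables`, `coupledActivation`, and `hasSettledAfter` are hypothetical names, and the never-settling promise stands in for a dangling listener.

```typescript
// Hypothetical stand-ins for listener waitUntil promises; not Theia code.
const waitables: Promise<void>[] = [
    Promise.resolve(),
    new Promise<void>(() => { /* never settles, e.g. a lost RPC reply */ }),
    Promise.resolve(),
];

// Activation coupled to every listener: settles only if all waitables settle.
const coupledActivation: Promise<void> = Promise.all(waitables).then(() => undefined);

/** Waits `ms` milliseconds, then reports whether `p` has settled in the meantime. */
async function hasSettledAfter(p: Promise<unknown>, ms: number): Promise<boolean> {
    let settled = false;
    const mark = () => { settled = true; };
    p.then(mark, mark);
    await new Promise<void>(resolve => setTimeout(resolve, ms));
    return settled;
}

hasSettledAfter(coupledActivation, 50).then(settled =>
    console.log(settled ? 'activation settled' : 'activation still pending'));
```

Two of the three listeners complete immediately, yet the combined promise stays pending for as long as the third one does, which is the behavior that blocks everything awaiting the activation.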

This PR is split into two independently reviewable commits:

  1. resolve provider activation on registration (core fix, low risk): resolve the pending activation as soon as a provider is registered for the scheme. This decouples activation completion from unrelated listeners and restores registration as a recovery path.
  2. time out provider activation as a backstop (deferrable): if no provider is registered within a timeout, reject the activation so callers fail fast (e.g. readPreferencesFromFile treats the file as absent and startup proceeds degraded) instead of hanging, and remove the scheme from activations so a later attempt can retry once the connection recovers. The default timeout (DEFAULT_PROVIDER_ACTIVATION_TIMEOUT, 90s, overridable via the protected getActivationTimeout()) is not less than the websocket heartbeat detection window (checkAliveTimeout 30s + pingTimeout 60s), so a genuinely dropped connection is detected and rejects in-flight RPCs before this fires. Activation only ever awaits constant-time backend calls (capability handshake, config-directory lookup), so the timeout cannot abort legitimate long-running work.
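As a rough illustration of how the two commits fit together, here is a simplified sketch. It is not the actual FileService code: `MiniFileService`, `Deferred`, `Listener`, and `onWillActivate` are hypothetical names, and the real implementation may differ in how it fires and tracks listeners.

```typescript
/** Hypothetical minimal deferred; a stand-in, not the Theia utility. */
class Deferred<T> {
    readonly promise: Promise<T>;
    resolve!: (value: T) => void;
    reject!: (reason: Error) => void;
    constructor() {
        this.promise = new Promise<T>((resolve, reject) => {
            this.resolve = resolve;
            this.reject = reject;
        });
    }
}

type Listener = (scheme: string) => Promise<void>;

/** Hypothetical miniature of the activation logic described above. */
class MiniFileService {
    private readonly providers = new Map<string, object>();
    private readonly activations = new Map<string, Deferred<void>>();
    private readonly listeners: Listener[] = [];

    // 90s matches the default quoted in the description.
    constructor(private readonly activationTimeout = 90_000) {}

    onWillActivate(listener: Listener): void {
        this.listeners.push(listener);
    }

    registerProvider(scheme: string, provider: object): void {
        this.providers.set(scheme, provider);
        // Commit 1: registration itself settles any pending activation.
        this.activations.get(scheme)?.resolve();
    }

    activateProvider(scheme: string): Promise<void> {
        if (this.providers.has(scheme)) {
            return Promise.resolve();
        }
        const pending = this.activations.get(scheme);
        if (pending) {
            return pending.promise;
        }
        const activation = new Deferred<void>();
        this.activations.set(scheme, activation);
        // Commit 2: reject after the timeout and forget the attempt so a later call can retry.
        const timer = setTimeout(() => {
            this.activations.delete(scheme);
            activation.reject(new Error(`Timed out activating provider for '${scheme}'`));
        }, this.activationTimeout);
        const stopTimer = () => clearTimeout(timer);
        activation.promise.then(stopTimer, stopTimer);
        // Listeners run, but completion no longer depends on all of them settling.
        for (const listener of this.listeners) {
            listener(scheme).catch(() => undefined);
        }
        return activation.promise;
    }
}
```

In this sketch a dangling listener cannot block the activation once some listener (or manual recovery code) registers the provider, and an activation whose provider never arrives rejects instead of hanging.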

Caution

The second commit is a containment/blast-radius measure; the underlying dangling-promise cause (a lost RPC reply on a connection that still appears alive) belongs to the family addressed by #17334 and is left as a follow-up. Reviewers may choose to take only the first commit.

How to test

The root-cause hang is timing-dependent and not reliably reproducible in the example apps, so verification is via the new automated tests, which fail without the fix (TDD).

To gut-check the first test against the bug, revert commit 1 and confirm it goes from passing to a 'PENDING' assertion failure.
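The PR's tests are not shown here. One common way to obtain a 'PENDING' outcome in a test is to race the promise under test against a sentinel; the helper below is a sketch of that technique under this assumption, and `stateOf` is a hypothetical name.

```typescript
type PromiseState = 'PENDING' | 'FULFILLED' | 'REJECTED';

/**
 * Reports the state of `p` as observed after the current macrotask:
 * 'PENDING' if the sentinel timer wins the race, otherwise how `p` settled.
 */
async function stateOf<T>(p: Promise<T>): Promise<PromiseState> {
    const pending = Symbol('pending');
    // A zero-delay timer fires only after already-queued microtasks have run,
    // so a promise that is about to settle still wins the race.
    const sentinel = new Promise<symbol>(resolve => setTimeout(() => resolve(pending), 0));
    try {
        const winner = await Promise.race([p, sentinel]);
        return winner === pending ? 'PENDING' : 'FULFILLED';
    } catch {
        return 'REJECTED';
    }
}
```

A test can then assert that the activation's state is 'FULFILLED' after the provider is registered; without the first commit the same assertion would observe 'PENDING'.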

Follow-ups

  • the true root cause is a waitUntil/RPC promise that never settles on a still-open connection (lost reply, detached continuation), the same class of failure that #17334 (fix(core): guarantee promise rejection on failure to send RPC call) began addressing. Worth a dedicated issue to audit remaining RPC loss points (reply decode failure, a reply routed to an already-closed multiplexer sub-channel) and reject the pending request at the point of loss.
  • a general per-request RPC timeout was considered and rejected: it can be made efficient (a single self-cancelling sweep per active protocol, no per-request timers) but not correct, since a pure age-based timeout cannot distinguish a lost reply from a legitimately long-running backend call and would break such APIs. The scoped activation timeout here is safe precisely because its dependencies are constant-time.

Breaking changes

  • This PR introduces breaking changes and requires careful review. If yes, the breaking changes section in the changelog has been updated.

Attribution

Review checklist

Reminder for reviewers

cdamus added 2 commits June 4, 2026 16:03
FileService::activateProvider coupled the resolution of a scheme's
activation to the settlement of every onWillActivateFileSystemProvider
listener (via WaitUntilEvent.fire / Promise.all(waitables)). A single
listener whose waitUntil promise never settles therefore wedged the
activation, and everything awaiting it, forever. With the user-storage
emitter carrying several listeners, an unrelated dangling listener could
hang preference loading and the whole frontend startup. Registering the
provider directly did not help: the activations deferred stayed pending.

Resolve the pending activation as soon as a provider is registered for
the scheme, decoupling activation completion from unrelated listeners
and restoring registration as a recovery path.

Fixes #17506

Signed-off-by: Christian W. Damus <cdamus@eclipsesource.com>
Even with activation resolving on provider registration, an activation
whose provider never registers (its own onWillActivateFileSystemProvider
listener dangles, e.g. a lost RPC reply on a connection that still
seems to be alive) would hang forever, blocking preference loading and
frontend startup with no recovery.

Add a timeout backstop to FileService.activateProvider: if no provider
is registered for the scheme within the timeout, reject the activation
so that callers fail fast (e.g. readPreferencesFromFile treats the
file as absent and startup proceeds degraded) rather than hanging. On
rejection the scheme is removed from activations so a later attempt
can retry once the connection recovers.

The default timeout (DEFAULT_PROVIDER_ACTIVATION_TIMEOUT, 90s,
overridable via getActivationTimeout) is not less than the websocket
heartbeat detection window (checkAliveTimeout 30s + pingTimeout 60s), so
a real disconnect is detected and rejects in-flight RPCs before this
fires. Activation only awaits constant-time backend calls, so the
timeout cannot abort legitimate long-running work.

For #17506

Signed-off-by: Christian W. Damus <cdamus@eclipsesource.com>
/**
 * The default time, in milliseconds, after which {@link FileService.activateProvider} gives up waiting
 * for a provider to be registered for a scheme and rejects the activation instead of hanging forever.
 */
export const DEFAULT_PROVIDER_ACTIVATION_TIMEOUT = 90_000;

Contributor

This seems like longer than any activation should plausibly take. Perhaps, given the suggested cause, it should be tied to how long the application waits before timing out a connection or declaring disconnection?

Contributor Author

That's what this effectively does: 90s is the expected maximum time it will take for Theia to abandon a stuck RPC socket. I didn't want to add API for this service to get the internal timeout parameter from the RPC service, but maybe that would be better.

I'm still hoping that this commit isn't needed at all by the OP.
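The relationship between the two timeouts discussed in this thread comes down to simple arithmetic. The sketch below uses the values quoted in the PR description; the constant and function names are hypothetical and do not correspond to an existing Theia API.

```typescript
// Values quoted in the PR description, in milliseconds. Names are hypothetical.
const CHECK_ALIVE_TIMEOUT_MS = 30_000;
const PING_TIMEOUT_MS = 60_000;

/** Longest time the heartbeat may need to notice a dropped connection. */
function heartbeatDetectionWindow(checkAliveMs: number, pingMs: number): number {
    return checkAliveMs + pingMs;
}

/**
 * An activation timeout is "safe" in the sense used above if it is not shorter
 * than the detection window, so a real disconnect is reported first.
 */
function isSafeActivationTimeout(timeoutMs: number): boolean {
    return timeoutMs >= heartbeatDetectionWindow(CHECK_ALIVE_TIMEOUT_MS, PING_TIMEOUT_MS);
}

console.log(heartbeatDetectionWindow(CHECK_ALIVE_TIMEOUT_MS, PING_TIMEOUT_MS)); // 90000
console.log(isSafeActivationTimeout(90_000)); // true
console.log(isSafeActivationTimeout(60_000)); // false
```

If the heartbeat parameters were exposed to the file service, the activation timeout could be derived from them instead of being hard-coded, which is the alternative raised in the review comment.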

Labels

None yet

Projects

Status: Waiting on reviewers

Development

Successfully merging this pull request may close these issues.

FileService activateProvider('user-storage') hangs during frontend startup in 1.71

2 participants