fix(filesystem): resolve provider activation on registration#17625
Draft
cdamus wants to merge 2 commits into
Draft
fix(filesystem): resolve provider activation on registration#17625cdamus wants to merge 2 commits into
cdamus wants to merge 2 commits into
Conversation
FileService::activateProvider coupled the resolution of a scheme's activation to the settlement of every onWillActivateFileSystemProvider listener (via WaitUntilEvent.fire / Promise.all(waitables)). A single listener whose waitUntil promise never settles therefore wedged the activation, and everything awaiting it, forever. With the user-storage emitter carrying several listeners, an unrelated dangling listener could hang preference loading and the whole frontend startup. Registering the provider directly did not help: the activations deferred stayed pending. Resolve the pending activation as soon as a provider is registered for the scheme, decoupling activation completion from unrelated listeners and restoring registration as a recovery path. Fixes #17506 Signed-off-by: Christian W. Damus <cdamus@eclipsesource.com>
Even with activation resolving on provider registration, an activation whose provider never registers (its own onWillActivateFileSystemProvider listener dangles, e.g. a lost RPC reply on a connection that still seems to be alive) would hang forever, blocking preference loading and frontend startup with no recovery. Add a timeout backstop to FileService.activateProvider: if no provider is registered for the scheme within the timeout, reject the activation so that callers fail fast (e.g. readPreferencesFromFile treats the file as absent and startup proceeds degraded) rather than hanging. On rejection the scheme is removed from activations so a later attempt can retry once the connection recovers. The default timeout (DEFAULT_PROVIDER_ACTIVATION_TIMEOUT, 90s, overridable via getActivationTimeout) is not less than the websocket heartbeat detection window (checkAliveTimeout 30s + pingTimeout 60s), so a real disconnect is detected and rejects in-flight RPCs before this fires. Activation only awaits constant-time backend calls, so the timeout cannot abort legitimate long-running work. For #17506 Signed-off-by: Christian W. Damus <cdamus@eclipsesource.com>
| * The default time, in milliseconds, after which {@link FileService.activateProvider} gives up waiting | ||
| * for a provider to be registered for a scheme and rejects the activation instead of hanging forever. | ||
| */ | ||
| export const DEFAULT_PROVIDER_ACTIVATION_TIMEOUT = 90_000; |
Contributor
There was a problem hiding this comment.
This seems like longer than any activation should plausibly take. Perhaps, given the suggested cause, it should be tied to how long the application waits before timing out a connection or declaring disconnection?
Contributor
Author
There was a problem hiding this comment.
That's what this effectively does: 90s is the expected maximum time it will take for Theia to abandon a stuck RCP socket. I didn't want to add API for this service to get the internal timeout parameter from the RPC service but maybe that would be better.
I'm still hoping that this commit isn't needed at all by the OP.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What it does
Hardens
FileService.activateProvideragainst a danglingonWillActivateFileSystemProviderlistener, which can wedge frontend startup indefinitely.Fixes #17506
activateProvidertied a scheme's activation to the settlement ofWaitUntilEvent.fire(...), i.e.Promise.allof every listener'swaitUntilpromise. If any one of those promises never settles (theuser-storageemitter carries several listeners), the activation, and everything awaiting it, hangs forever. In #17506 this stalls all fourUser-scope preference providers →UserConfigsPreferenceProvider.ready→PreferenceServiceImpl.initializeProviders→FrontendApplication.start, so the workbench never comes up. As the reporter found, even registering the provider by hand against the live container did not help: theactivationsdeferred stayed pending.This PR is split into two independently reviewable commits:
readPreferencesFromFiletreats the file as absent and startup proceeds degraded) instead of hanging, and remove the scheme fromactivationsso a later attempt can retry once the connection recovers. The default timeout (DEFAULT_PROVIDER_ACTIVATION_TIMEOUT, 90s, overridable via theprotected getActivationTimeout()) is not less than the websocket heartbeat detection window (checkAliveTimeout30s +pingTimeout60s), so a genuinely dropped connection is detected and rejects in-flight RPCs before thisfires. Activation only ever awaits constant-time backend calls (capability handshake, config-directory lookup), so the timeout cannot abort legitimate long-running work.
Caution
The second commit is a containment/blast-radius measure; the underlying dangling-promise cause (a lost RPC reply on a connection that still appears alive) belongs to the family addressed by #17334 and is left as a follow-up. Reviewers may choose to take only the first commit.
How to test
The root-cause hang is timing-dependent and not reliably reproducible in the example apps, so verification is via the new automated tests, which fail without the fix (TDD).
To gut-check the first test against the bug, revert commit 1 and confirm it goes from passing to a 'PENDING' assertion failure.
Follow-ups
waitUntil/RPC promise that never settles on a still-open connection (lost reply, detached continuation), the same class fix(core): guarantee promise rejection on failure to send RPC call #17334 began addressing. Worth a dedicated issue to audit remaining RPC loss points (reply decode failure, a reply routed to an already-closed multiplexer sub-channel) and reject the pending request at the point of loss.Breaking changes
Attribution
Review checklist
nlsservice (for details, please see the Internationalization/Localization section in the Coding Guidelines)Reminder for reviewers