Isolate plugins in an out-of-process COM host #40120
Pull request overview
This PR moves WSL plugins from in-process LoadLibrary inside wslservice.exe to isolated out-of-process wslpluginhost.exe COM local servers, aiming to keep the service alive even if a plugin crashes.
Changes:
- Introduces wslpluginhost.exe (COM local server) that loads one plugin DLL and forwards lifecycle events, while proxying plugin API callbacks back to the service.
- Adds new COM contracts (IWslPluginHost, IWslPluginHostCallback) and consolidates proxy/stub generation into wslserviceproxystub.dll.
- Updates service-side plugin management and adds a new shared_mutex path intended to avoid re-entrancy/deadlock on COM RPC callback threads.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/windows/wslpluginhost/exe/resource.h | Adds resource IDs for new host executable. |
| src/windows/wslpluginhost/exe/PluginHost.h | Declares COM class implementing IWslPluginHost plus API callback stubs. |
| src/windows/wslpluginhost/exe/PluginHost.cpp | Implements plugin DLL loading, lifecycle dispatch, and callback forwarding to service. |
| src/windows/wslpluginhost/exe/main.rc | Adds version/icon resources for wslpluginhost.exe. |
| src/windows/wslpluginhost/exe/main.cpp | Implements COM local server entrypoint and class factory registration. |
| src/windows/wslpluginhost/exe/CMakeLists.txt | Adds build target for wslpluginhost.exe. |
| src/windows/wslpluginhost/CMakeLists.txt | Wires new subdirectory into build. |
| src/windows/service/stub/CMakeLists.txt | Adds MIDL proxy/stub sources for WslPluginHost.idl into wslserviceproxystub. |
| src/windows/service/inc/WslPluginHost.idl | Defines out-of-proc COM interfaces for plugin hosting + callbacks. |
| src/windows/service/inc/CMakeLists.txt | Adds new wslpluginhostidl MIDL generation target. |
| src/windows/service/exe/PluginManager.h | Refactors plugin manager to track out-of-proc hosts and adds callback implementation type. |
| src/windows/service/exe/PluginManager.cpp | Implements COM activation of hosts, job object assignment, and service-side callback handlers. |
| src/windows/service/exe/LxssUserSession.h | Adds shared_mutex and makes plugin-callback methods private/friend-only. |
| src/windows/service/exe/LxssUserSession.cpp | Switches plugin callback locking from m_instanceLock to m_callbackLock and gates VM teardown. |
| src/windows/service/exe/CMakeLists.txt | Adds dependency on wslpluginhostidl. |
| src/windows/common/precomp.h | Adds <shared_mutex> include for new locking. |
| msipackage/package.wix.in | Installs wslpluginhost.exe and registers COM AppID/CLSID/interfaces for activation and proxy/stub. |
| msipackage/CMakeLists.txt | Adds wslpluginhost.exe to packaged binaries and build dependencies. |
| CMakeLists.txt | Adds subdirectory for wslpluginhost and adjusts global include directories. |
Comments suppressed due to low confidence (2)
src/windows/service/exe/LxssUserSession.cpp:3644
- CreateLinuxProcess now only takes m_callbackLock, but it calls _RunningInstance(), which is annotated _Requires_lock_held_(m_instanceLock) and reads m_runningInstances. This is both a locking-contract violation and a potential race with writers that still take only m_instanceLock. Refactor so callback code can safely read the running-instance map (e.g., provide a callback-safe lookup guarded by m_callbackLock, and ensure all writes to m_runningInstances/m_utilityVm also take m_callbackLock exclusively after m_instanceLock, per the stated lock ordering).
    // Shared lock prevents _VmTerminate from destroying the VM or instances
    // while we use them. See MountRootNamespaceFolder for rationale.
    std::shared_lock lock(m_callbackLock);
    RETURN_HR_IF(E_NOT_VALID_STATE, !m_utilityVm);
    if (Distro == nullptr)
    {
        *Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
    }
    else
    {
        const auto distro = _RunningInstance(Distro);
        THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
        const auto wsl2Distro = dynamic_cast<WslCoreInstance*>(distro.get());
src/windows/service/exe/LxssUserSession.cpp:2614
- m_runningInstances is updated here without taking m_callbackLock, but plugin callbacks now read m_runningInstances under m_callbackLock (and intentionally do not take m_instanceLock). Because these are different locks, this doesn’t provide synchronization and can lead to data races/UB when a callback runs concurrently with instance creation/termination. Writers that mutate m_runningInstances (and m_utilityVm if accessed by callbacks) need to also take m_callbackLock (exclusive) in the documented order m_instanceLock → m_callbackLock.
    // This needs to be done before plugins are notified because they might try to run a command inside the distribution.
    m_runningInstances[registration.Id()] = instance;
    if (version == LXSS_WSL_VERSION_2)
    {
        auto cleanupOnFailure =
            wil::scope_exit_log(WI_DIAGNOSTICS_INFO, [&]() { m_runningInstances.erase(registration.Id()); });
        m_pluginManager.OnDistributionStarted(&m_session, instance->DistributionInformation());
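The lock ordering these two comments ask for (m_instanceLock first, then m_callbackLock exclusively for writers; m_callbackLock shared for callback readers) can be sketched in portable C++ as follows. This is a minimal illustration, not the PR's code: `Session`, `Instance`, `AddInstance`, `RemoveInstance`, and `FindForCallback` are hypothetical names, and plain `std::mutex` stands in for whatever lock type the service actually uses for m_instanceLock.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical stand-in for a running distribution instance.
struct Instance
{
    std::string name;
};

// Hypothetical session type; only the locking pattern is the point here.
class Session
{
public:
    // Writer path: m_instanceLock first, then m_callbackLock exclusively,
    // so callback readers never observe the map mid-mutation.
    void AddInstance(const std::string& id, std::shared_ptr<Instance> instance)
    {
        std::lock_guard<std::mutex> instanceGuard(m_instanceLock);
        std::unique_lock<std::shared_mutex> callbackGuard(m_callbackLock);
        m_runningInstances[id] = std::move(instance);
    }

    // Same ordering for removal.
    void RemoveInstance(const std::string& id)
    {
        std::lock_guard<std::mutex> instanceGuard(m_instanceLock);
        std::unique_lock<std::shared_mutex> callbackGuard(m_callbackLock);
        m_runningInstances.erase(id);
    }

    // Callback path: shared m_callbackLock only, never m_instanceLock.
    // Returns a copy of the shared_ptr so the instance outlives the lock.
    std::shared_ptr<Instance> FindForCallback(const std::string& id) const
    {
        std::shared_lock<std::shared_mutex> callbackGuard(m_callbackLock);
        const auto it = m_runningInstances.find(id);
        if (it == m_runningInstances.end())
        {
            return nullptr;
        }
        return it->second;
    }

private:
    std::mutex m_instanceLock;
    mutable std::shared_mutex m_callbackLock;
    std::map<std::string, std::shared_ptr<Instance>> m_runningInstances;
};
```

In this sketch the exclusive hold on m_callbackLock is scoped to the map mutation and released before the function returns. That matters if a plugin hook fires afterwards, because a callback triggered by the hook would otherwise block on the shared lock.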
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/windows/service/exe/LxssUserSession.cpp:3643
- CreateLinuxProcess now only takes m_callbackLock, but it reads m_runningInstances via _RunningInstance(), which is annotated _Requires_lock_held_(m_instanceLock), and the map itself is _Guarded_by_(m_instanceLock). This is a real race/contract violation (and can also break static analysis). You’ll need a consistent locking strategy for callback threads (e.g., make all accesses/mutations of m_runningInstances + m_utilityVm also take m_callbackLock in the documented order, or refactor callbacks to avoid touching m_runningInstances without m_instanceLock).
    // Shared lock prevents _VmTerminate from destroying the VM or instances
    // while we use them. See MountRootNamespaceFolder for rationale.
    std::shared_lock lock(m_callbackLock);
    RETURN_HR_IF(E_NOT_VALID_STATE, !m_utilityVm);
    if (Distro == nullptr)
    {
        *Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
    }
    else
    {
        const auto distro = _RunningInstance(Distro);
        THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
Force-pushed from 3418f7e to 3f03d4f.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
src/windows/service/exe/LxssUserSession.cpp:3653
CreateLinuxProcess now only takes m_callbackLock but calls _RunningInstance(Distro) (which is annotated _Requires_lock_held_(m_instanceLock) and touches m_lockedDistributions). This violates the locking contract and can race with instance state changes or deadlock if someone later adds the required lock. Consider adding a callback-safe lookup that is guarded by m_callbackLock only (and does not call _EnsureNotLocked), or refactoring _RunningInstance / the guarded annotations so callback code never needs m_instanceLock.
    if (Distro == nullptr)
    {
        *Socket = m_utilityVm->CreateRootNamespaceProcess(Path, Arguments).release();
    }
    else
    {
        const auto distro = _RunningInstance(Distro);
        THROW_HR_IF(WSL_E_VM_MODE_INVALID_STATE, !distro);
        const auto wsl2Distro = dynamic_cast<WslCoreInstance*>(distro.get());
        THROW_HR_IF(WSL_E_WSL2_NEEDED, !wsl2Distro);
Hey @benhillis 👋, following up on this PR. It currently has merge conflicts that need to be resolved, and the CI build didn't run (shows action_required). There are also 20 unresolved review threads, including some security-relevant findings (TOCTOU on DLL signature validation,
Force-pushed from f32e629 to 5641289.
CI is all green now and conflicts are resolved. Would love to get some eyes on this when someone has a chance. The main design change is moving plugin loading out of wslservice into separate COM hosts so a bad plugin can't take down the service.
Force-pushed from 3492d6b to 57501d9.
Force-pushed from ed98200 to c66d3d7.
Force-pushed from 5239259 to 8cebe31.
Force-pushed from 8cebe31 to 6ca2d5d.
Force-pushed from 6ca2d5d to f7419fb.
Force-pushed from f7419fb to 4f63862.
Force-pushed from 4f63862 to 99b8998.
Force-pushed from 99b8998 to 74247c2.
Force-pushed from 74247c2 to 8706282.
WSL plugin DLLs are moved out of wslservice.exe into a separate wslpluginhost.exe COM server so plugin code can no longer crash or destabilize the service. Each plugin is activated in its own host process (CLSCTX_LOCAL_SERVER, SYSTEM-only via AppID) and reached through a versioned COM interface defined in WslPluginHost.idl. All hosts are tied to a service-owned job object and terminate when wslservice exits. The plugin API is unchanged; existing plugins run unmodified.

A crashing or disconnected host is classified by IsHostCrash (RPC_E_DISCONNECTED, RPC_E_SERVER_DIED[_DNE], CO_E_OBJNOTCONNECTED, RPC_S_SERVER_UNAVAILABLE, RPC_S_CALL_FAILED[_DNE]); the service logs it and continues instead of treating it as a fatal plugin error. RPC_E_CALL_REJECTED is intentionally excluded as a transient busy state rather than a dead host.

Plugin->service callbacks (MountFolder, ExecuteBinary, and the WSLC session APIs) arrive on a different COM thread than the outbound hook, so they cannot re-enter the lock held during the hook:
- VM path: LxssUserSessionImpl guards callbacks with a shared_mutex (shared for callbacks, exclusive in _VmTerminate after OnVmStopping drains in-flight callbacks before the utility VM is destroyed).
- WSLC path: PluginManager resolves sessions through its own reference map under a dedicated lock, and WSLCSessionManager releases its session lock before any plugin notification fires, so callbacks never re-enter the session lock. A session is registered in the reference map but not published until OnWslcSessionCreated succeeds, so a vetoed or race-lost session is never handed out.

Proxy/stub is consolidated into wslserviceproxystub.dll. One new exe, no new DLLs.

Tests
- HostCrashIsolation: kills wslpluginhost.exe mid-OnVmStarted and verifies the service survives and m_initOnce stays sticky.
- ConcurrentCallbacks: four plugin threads hammer MountFolder and ExecuteBinary, exercising the shared callback lock.
- AsyncApiCallFromWorker: a plugin worker thread calls into the service post-hook (cross-apartment, non-COM-initialized).
- CallbacksDuringTerminationDoNotCrash: worker threads race _VmTerminate's exclusive lock and VM teardown, then wind down deterministically after OnVmStopping signals them and are joined on the next session start.
- Existing WSL1 plugin tests broadened alongside the refactor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
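The host-crash classification described in the commit message can be sketched as a small predicate. This is an illustrative, portable approximation rather than the PR's IsHostCrash: `HResult` is an unsigned stand-in that holds the same bit pattern as a Windows HRESULT, the `k...` constants are local copies of the winerror.h values, and comparing the RPC_S_* Win32 codes in HRESULT_FROM_WIN32 form is an assumption about how the service sees them.

```cpp
#include <cstdint>

// Unsigned stand-in holding the same bit pattern as a Windows HRESULT,
// so the sketch builds without the Windows SDK.
using HResult = std::uint32_t;

// Mirrors HRESULT_FROM_WIN32 for non-zero Win32 error codes.
constexpr HResult HrFromWin32(std::uint32_t code)
{
    return 0x80070000u | (code & 0xFFFFu);
}

// Local copies of the winerror.h values named in the commit message.
constexpr HResult kRpcECallRejected = 0x80010001u;         // RPC_E_CALL_REJECTED
constexpr HResult kRpcEServerDied = 0x80010007u;           // RPC_E_SERVER_DIED
constexpr HResult kRpcEServerDiedDne = 0x80010012u;        // RPC_E_SERVER_DIED_DNE
constexpr HResult kRpcEDisconnected = 0x80010108u;         // RPC_E_DISCONNECTED
constexpr HResult kCoEObjNotConnected = 0x800401FDu;       // CO_E_OBJNOTCONNECTED
constexpr HResult kRpcSServerUnavailable = HrFromWin32(1722); // RPC_S_SERVER_UNAVAILABLE
constexpr HResult kRpcSCallFailed = HrFromWin32(1726);        // RPC_S_CALL_FAILED
constexpr HResult kRpcSCallFailedDne = HrFromWin32(1727);     // RPC_S_CALL_FAILED_DNE

// Returns true only for codes that indicate the host process is gone.
// RPC_E_CALL_REJECTED is deliberately absent: it signals a busy server,
// not a dead one.
constexpr bool IsHostCrash(HResult hr)
{
    switch (hr)
    {
    case kRpcEDisconnected:
    case kRpcEServerDied:
    case kRpcEServerDiedDne:
    case kCoEObjNotConnected:
    case kRpcSServerUnavailable:
    case kRpcSCallFailed:
    case kRpcSCallFailedDne:
        return true;
    default:
        return false;
    }
}
```

The predicate only decides which category a failure falls into. Whether a host crash should then be logged and skipped, or reported as a load failure naming the plugin, is a separate policy choice that the review threads below debate.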
Force-pushed from 8706282 to c13fe64.
    {
        // Only treat plugin-reported errors (from entry point) as fatal.
        // COM infrastructure errors (activation, connectivity) are non-fatal
        // — the plugin is simply unavailable.
The design so far has been to block WSL usage if plugins fail to load (in this case CO_E_SERVER_EXEC_FAILURE could probably be hit if, say, the plugin crashed the process).
There's probably a bigger discussion to be had on how we should handle this, but I think for now keeping the same behavior, recording which plugin crashed, and displaying an error message is the right thing to do.
    if (FAILED(loadResult))
    {
        // Treat host-process crashes and benign COM activation races (server is
Are COM activation races possible here? I'd expect each plugin to be loaded only once per machine, so CO_E_SERVER_STOPPING feels unexpected there.
I also think that we should fail if we hit a host crash, since this is the current behavior (at least now we can give a proper error message).
    // CoReleaseServerProcess on clean shutdown, just not on a service crash.
    EnsureJobObjectCreated();
    wil::unique_handle process;
    const HRESULT getProcessHr = host->GetProcessHandle(&process);
I think we should fail if any of those calls fails (this one, and assigning to the job object). On failure the caller will be able to display a proper error message about which plugin failed to load.
    }

    THROW_IF_FAILED_MSG(
        host->Initialize(plugin.callback.Get(), plugin.path.c_str(), plugin.name.c_str()),
Given that the host process also runs as SYSTEM, could we pass the job object down to Initialize() and get rid of GetProcessHandle?
    {
        if (e.hostCookie == 0)
        {
            continue;
I think we should fail / assert if we ever hit this
(same if we get crashes during plugin calls)
    Session->ApplicationPid,
    Session->UserToken,
    static_cast<DWORD>(sidData.size()),
    sidData.empty() ? nullptr : sidData.data(),
Can sidData ever be empty here?
    // --- IWslPluginHostCallback WSLC implementations (service-side) ---

    THROW_HR_IF(E_ILLEGAL_STATE_CHANGE, g_pluginErrorMessage.has_value());
    DWORD PluginHostCallbackImpl::InsertProcessLocked(wil::com_ptr<IWSLCProcess> process)
Instead of having to keep track of cookies, could we pass an IWSLCProcess reference back to the caller? That way we don't have to deal with this, and collisions across plugins become impossible.
Summary

WSL plugin DLLs are moved out of wslservice.exe into a separate wslpluginhost.exe COM server so plugin code can no longer crash or destabilize the service. Each plugin is activated in its own host process (CLSCTX_LOCAL_SERVER, SYSTEM-only via AppID) and reached through a versioned COM interface defined in WslPluginHost.idl. All hosts are tied to a service-owned job object and terminate when wslservice exits.

The plugin API is unchanged and existing plugins run unmodified. Proxy/stub is consolidated into wslserviceproxystub.dll: one new exe, no new DLLs.

Detailed Description

Host-crash classification

A crashing or disconnected host is classified by IsHostCrash and surfaced as "host died, log and continue" rather than a fatal plugin error: RPC_E_DISCONNECTED, RPC_E_SERVER_DIED, RPC_E_SERVER_DIED_DNE, CO_E_OBJNOTCONNECTED, RPC_S_SERVER_UNAVAILABLE, RPC_S_CALL_FAILED, RPC_S_CALL_FAILED_DNE.

RPC_E_CALL_REJECTED is intentionally not classified as a host crash: it is a transient COM busy state (an STA message filter rejecting a call), not a "server process died" signal. The plugin host is MTA with no message filter, so it should not occur here; treating it as a crash would silently skip future legitimate calls.

Callback re-entrancy

Plugin→service callbacks (MountFolder, ExecuteBinary, and the WSLC session APIs) arrive on a different COM thread than the outbound hook, so they cannot re-enter the lock held during the hook:
- VM path: LxssUserSessionImpl guards callbacks with a shared_mutex: shared for callbacks, exclusive in _VmTerminate after OnVmStopping drains in-flight callbacks before the utility VM is destroyed.
- WSLC path: PluginManager resolves sessions through its own reference map under a dedicated lock, and WSLCSessionManager releases its session lock before any plugin notification fires, so callbacks never re-enter the session lock.

New tests

- HostCrashIsolation: kills wslpluginhost.exe mid-OnVmStarted and verifies the service survives and m_initOnce stays sticky.
- ConcurrentCallbacks: four plugin threads hammer MountFolder + ExecuteBinary, exercising the shared callback lock.
- CallbacksDuringTerminationDoNotCrash: worker threads race _VmTerminate's exclusive lock acquire / VM teardown, exiting deterministically via an OnVmStopping-set stop signal.

Existing WSL1 plugin tests were broadened alongside the refactor.

Validation

- bin\x64\debug\test.bat -f /name:*Plugin*: all plugin tests pass.
- FormatSource.ps1 clean on changed files.
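
The VM-path locking described under Callback re-entrancy (shared for callbacks, exclusive for teardown) can be sketched with standard C++ primitives. This is a simplified model under stated assumptions, not the service's implementation: `VmSession`, `Vm`, `RunCallback`, `Terminate`, and `HandledCount` are hypothetical names, and the boolean return stands in for the E_NOT_VALID_STATE result the real callbacks return when the utility VM is gone.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for the utility VM.
struct Vm
{
};

// Hypothetical session type modelling only the callback/teardown locking.
class VmSession
{
public:
    VmSession() : m_vm(std::make_unique<Vm>())
    {
    }

    // Callback path (for example MountFolder arriving on an RPC thread).
    // The shared lock lets callbacks run concurrently with each other while
    // preventing teardown from destroying the VM mid-call.
    bool RunCallback()
    {
        std::shared_lock<std::shared_mutex> guard(m_callbackLock);
        if (!m_vm)
        {
            return false; // VM already torn down
        }
        m_handled.fetch_add(1);
        return true;
    }

    // Teardown path. Acquiring the lock exclusively blocks until every
    // in-flight callback has released its shared hold, so the VM is only
    // destroyed once no callback is using it.
    void Terminate()
    {
        std::unique_lock<std::shared_mutex> guard(m_callbackLock);
        m_vm.reset();
    }

    int HandledCount() const
    {
        return m_handled.load();
    }

private:
    std::shared_mutex m_callbackLock;
    std::unique_ptr<Vm> m_vm;
    std::atomic<int> m_handled{0};
};
```

Callbacks that arrive after teardown fail cleanly instead of touching a destroyed VM, which is the behavior the termination test above is meant to exercise.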