| 1 | +# Multi-process race reproducers |
| 2 | + |
| 3 | +This folder is a regression-test scaffold, not a shipped artifact. It exists to deterministically reproduce a class of races in `Wrapper.InnerStart` that surface when multiple OS processes share a single LocalDB user instance. |
| 4 | + |
| 5 | +## Symptom |
| 6 | + |
| 7 | +When two test-host processes (e.g. two `dotnet test` invocations, or Rider's runner concurrent with a CLI run) target the same `SqlInstance<T>` for the same Windows user, intermittent failures appear with stack traces like: |
| 8 | + |
| 9 | +``` |
| 10 | +SetUp : Microsoft.Data.SqlClient.SqlException : |
| 11 | + A network-related or instance-specific error occurred while establishing a connection to SQL Server. |
| 12 | + ... error: 50 - Local Database Runtime error occurred. |
| 13 | + The specified LocalDB instance does not exist. |
| 14 | + ----> System.ComponentModel.Win32Exception : Unknown error (0x89c50107) |
| 15 | + at Wrapper.OpenMasterConnection() in C:\projects\localdb\src\LocalDb\Wrapper.cs:line 274 |
| 16 | + at Wrapper.CreateAndDetachTemplate(...) in ...:line 229 |
| 17 | + at Wrapper.CreateDatabaseFromTemplate(String name) in ...:line 83 |
| 18 | + at EfLocalDb.SqlInstance`1.Build(String, IEnumerable`1) in ...:line 73 |
| 19 | + at EfLocalDbNunit.LocalDbTestBase`1.Reset() in ...:line 85 |
| 20 | +``` |
| 21 | + |
| 22 | +The `0x89C50107` native code is `LOCALDB_ERROR_INSTANCE_DOES_NOT_EXIST`. Other manifestations of the same underlying race include SQL deadlocks during `CREATE DATABASE [template]` and `Operating system error 2: cannot find the file specified` on `template.mdf`. |
| 23 | + |
| 24 | +Once a machine is in this state it tends to stay broken: every subsequent `dotnet test` triggers the same race because the wrapper directory is empty (no `template.mdf`), so every process re-runs the destructive `StopAndDelete + CleanStart` branch. |
| 25 | + |
| 26 | +## Root cause |
| 27 | + |
| 28 | +`Wrapper.InnerStart` (LocalDb.csproj, `Wrapper.cs`): |
| 29 | + |
| 30 | +```csharp |
| 31 | +var info = LocalDbApi.GetInstance(instance); |
| 32 | +if (!info.Exists) { CleanStart(); return; } |
| 33 | +if (!info.IsRunning) { LocalDbApi.StartInstance(instance); } |
| 34 | +if (!File.Exists(DataFile)) |
| 35 | +{ |
| 36 | + LocalDbApi.StopAndDelete(instance); |
| 37 | + CleanStart(); // CreateInstance + StartInstance + CreateAndDetachTemplate |
| 38 | + return; |
| 39 | +} |
| 40 | +``` |
| 41 | + |
| 42 | +There are two unsynchronized concurrency surfaces here: |
| 43 | + |
| 44 | +1. **In-process** — `Wrapper.semaphoreSlim` is declared but never `WaitAsync`'d; two `Wrapper` instances for the same instance name running in the same process race on `LocalDbApi.*` calls and on the SQL DDL inside `CreateAndDetachTemplate`. |
| 45 | +2. **Cross-process** — even if (1) were fixed with an in-process lock, the `LocalDbApi.*` calls reach into the per-Windows-user LocalDB metadata, which is shared across all processes belonging to that user. Two processes both running `InnerStart` against the same instance race on `StopAndDelete` / `CreateInstance` / `StartInstance` and on the same master DB. |
| 46 | + |
| 47 | +Both surfaces dissolve under one fix: serialize `InnerStart` per instance name with an in-process lock **and** a named cross-process mutex. |
| 48 | + |
| 49 | +## The three reproducer tests |
| 50 | + |
| 51 | +| Test | Race surface | Failure surfaced | |
| 52 | +|---|---|---| |
| 53 | +| `ConcurrentStartTests.ConcurrentStartWithMissingTemplateShouldNotRace` | In-process (two `Wrapper` instances, one process, no helper exe) | SQL deadlock 1205 during `CREATE DATABASE [template]` | |
| 54 | +| `MultiProcessConcurrentStartTests.MultiProcessConcurrentStartShouldNotRace` | Multi-process, symmetric (3 child processes all running `Wrapper.Start`) | SQL deadlock OR `template.mdf` not found OR `0x89C50107` (varies by timing) | |
| 55 | +| `InstanceDoesNotExistRaceTests.KillerVsVictimSurfacesInstanceDoesNotExist` | Multi-process, asymmetric (one killer hammering `StopAndDelete`, one victim opening `SqlConnection`) | **Exact `0x89C50107` deterministically** — victim only exits 0 when it observes that specific code | |
| 56 | + |
| 57 | +## Why each part exists |
| 58 | + |
| 59 | +### `LocalDb.MultiProcessHelper` project |
| 60 | + |
| 61 | +The asymmetric/multi-process tests need to spawn separate Windows processes via `Process.Start`. A Windows process needs an executable; an executable needs an entry point; that entry point lives in `Program.cs`. |
| 62 | + |
| 63 | +We can't reuse `LocalDb.Tests.exe` for this — its entry point is owned by the test runner (NUnit + Microsoft.Testing.Platform), and we'd have to either fight the runner or invoke `dotnet test --filter` recursively (slow and awkward). A purpose-built console exe is simpler and faster. |
| 64 | + |
| 65 | +### `Program.cs` with three modes (`wrapper-start`, `killer`, `victim`) |
| 66 | + |
| 67 | +Different tests need different child behaviors. Rather than ship three executables, the same exe takes a mode argument: |
| 68 | + |
| 69 | +- **`wrapper-start`** — full `Wrapper.Start` cycle. Used by the symmetric multi-process test where every child runs the same code path. |
| 70 | +- **`killer`** — bare `LocalDbApi.StopAndDelete(name)` in a tight loop. Maximizes the chance of catching a victim mid-handshake. |
| 71 | +- **`victim`** — `SqlConnection.OpenAsync` in a tight loop, walking exception chains for `Win32Exception.NativeErrorCode == 0x89C50107`. Exits 0 the first time it observes that exact code, exits 1/2 otherwise. |
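The victim's exception walk can be sketched as below. This is a minimal illustration, not the helper's actual code: `NativeErrorProbe` and its method name are hypothetical. One detail matters in practice: `Win32Exception.NativeErrorCode` is an `int`, so the constant has to be reinterpreted with `unchecked`, otherwise a comparison against the literal `0x89C50107` never matches.

```csharp
using System;
using System.ComponentModel;

// Hypothetical helper for illustration; not part of the LocalDb codebase.
static class NativeErrorProbe
{
    // 0x89C50107 does not fit in a positive int, so reinterpret the bits as the
    // negative value that Win32Exception.NativeErrorCode actually holds.
    const int InstanceDoesNotExist = unchecked((int)0x89C50107);

    // Follows the InnerException chain and reports whether any link is a
    // Win32Exception carrying the LocalDB "instance does not exist" code.
    public static bool ContainsInstanceDoesNotExist(Exception? exception)
    {
        for (var current = exception; current != null; current = current.InnerException)
        {
            if (current is Win32Exception { NativeErrorCode: InstanceDoesNotExist })
            {
                return true;
            }
        }

        return false;
    }
}
```

The loop follows only `InnerException`, which matches the `SqlException` to `Win32Exception` chain in the stack trace above. An `AggregateException` with several inner exceptions would need its `InnerExceptions` collection walked as well.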
| 72 | + |
| 73 | +Splitting the killer and victim into separate processes is what makes `0x89C50107` reliably reproducible — symmetric children all running `Wrapper.Start` race on multiple things at once and surface a mix of error types; the asymmetric setup isolates the specific race window where the LocalDB API resolves the instance name as "does not exist." |
| 74 | + |
| 75 | +### Strong-name signing (`SignAssembly` + `..\key.snk` in the .csproj) |
| 76 | + |
| 77 | +`Wrapper`, `LocalDbApi`, and `DirectoryFinder` are `internal` types in the LocalDb assembly. The LocalDb assembly is strong-named and grants `InternalsVisibleTo` only to assemblies whose public key matches a specific `PublicKey=...` blob. For the helper to use those internal types, it must be signed with the same key. `..\key.snk` is the existing project-wide signing key (the same one Benchmark uses). |
| 78 | + |
| 79 | +Alternative considered: drive the race entirely through `EfLocalDb.SqlInstance<T>` (a public API). That works but requires defining a `DbContext` and adds EF Core to the helper's dependency surface. Reaching for `Wrapper` directly keeps the helper minimal and exercises exactly the layer where the race lives. |
| 80 | + |
| 81 | +### `InternalsVisibleTo` entry for `LocalDb.MultiProcessHelper` |
| 82 | + |
| 83 | +Standard IVT plumbing — added next to the existing entries in `src/LocalDb/InternalsVisibleTo.cs`. Same `PublicKey=` blob as the others (it's the public half of `key.snk`). |
| 84 | + |
| 85 | +### `<ProjectReference ... ReferenceOutputAssembly="false" Private="false" />` in `LocalDb.Tests.csproj` |
| 86 | + |
| 87 | +The test project does **not** want to link the helper's assembly into its own output — it only wants the helper exe to exist on disk before tests run. `ReferenceOutputAssembly="false"` says "build it, but don't add a reference to its DLL in my compile inputs." `Private="false"` says "don't copy its outputs into my bin folder." With both set, the helper builds whenever the test project does (so a fresh `dotnet test` always finds an up-to-date helper), but there's no compile-time coupling between them. |
| 88 | + |
| 89 | +The test resolves the helper path at runtime via `HelperExeResolver.cs`, which walks up from the test's `bin/<Config>/net10.0/` to find the sibling project's matching `bin/<Config>/net10.0/LocalDb.MultiProcessHelper.exe`. |
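A rough sketch of that lookup follows. It is not the contents of `HelperExeResolver.cs`: the class name `HelperExeLocator` is hypothetical, and the sketch assumes the helper project lives in a folder named `LocalDb.MultiProcessHelper` next to the test project's folder.

```csharp
using System;
using System.IO;

// Hypothetical sketch; the real HelperExeResolver may differ.
static class HelperExeLocator
{
    // testBaseDirectory is expected to look like
    // <src>/LocalDb.Tests/bin/<Config>/net10.0/ (for example AppContext.BaseDirectory).
    public static string Resolve(string testBaseDirectory)
    {
        var trimmed = testBaseDirectory.TrimEnd(
            Path.DirectorySeparatorChar,
            Path.AltDirectorySeparatorChar);

        var frameworkDir = new DirectoryInfo(trimmed);  // net10.0
        var configDir = frameworkDir.Parent;            // <Config>
        var srcDir = configDir?.Parent?.Parent?.Parent; // bin -> test project -> src

        if (configDir == null || srcDir == null)
        {
            throw new DirectoryNotFoundException(
                $"Unexpected test output layout: {testBaseDirectory}");
        }

        // Assumed layout: the sibling project folder is named after the helper project.
        var exe = Path.Combine(
            srcDir.FullName,
            "LocalDb.MultiProcessHelper",
            "bin",
            configDir.Name,
            frameworkDir.Name,
            "LocalDb.MultiProcessHelper.exe");

        if (!File.Exists(exe))
        {
            throw new FileNotFoundException("Helper exe not found; build the helper project first.", exe);
        }

        return exe;
    }
}
```

Reusing the configuration and framework folder names from the test's own output path keeps a Debug test run from picking up a stale Release helper.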
| 90 | + |
| 91 | +### `LocalDb.slnx` entry |
| 92 | + |
| 93 | +Nothing surprising — registers the new project so tooling (Rider, Visual Studio, `dotnet sln` operations) sees it. Without this the project still builds via the test project's `ProjectReference`, but it won't appear in solution-level views. |
| 94 | + |
| 95 | +### Signal-file barrier (`signalFile` argument) |
| 96 | + |
| 97 | +`Process.Start` spin-up jitter is on the order of 100–300 ms, far wider than the actual race window for `0x89C50107`, which is microseconds. If children started their work as soon as they launched, their start times would be spread across that jitter, so they would seldom collide inside the race window and the test would be flaky. |
| 98 | + |
| 99 | +The barrier flips this around: each child spawns, waits in a polling loop for a signal file to appear, and only proceeds once the parent test creates that file. The parent waits 750 ms after spawning all children (giving them enough time to load their CLR and reach the wait loop), then writes the signal — releasing them within a few ms of each other. That's tight enough to land the children in the actual race window reliably. |
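The barrier described above can be sketched in a few lines. This is an illustration under assumptions, not the helper's code: the class and method names, the 1 ms polling interval, and the timeout parameter are all hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

// Hypothetical sketch of the signal-file barrier; names are illustrative.
static class SignalFileBarrier
{
    // Child side: poll until the parent creates the signal file.
    // Returns false if the file never appears within the timeout.
    public static bool WaitForSignal(string signalFile, TimeSpan timeout)
    {
        var stopwatch = Stopwatch.StartNew();
        while (!File.Exists(signalFile))
        {
            if (stopwatch.Elapsed > timeout)
            {
                return false;
            }

            // A short sleep keeps the release skew small without burning a core.
            Thread.Sleep(1);
        }

        return true;
    }

    // Parent side: give the children time to reach their wait loop,
    // then create the file so they are all released close together.
    public static void Release(string signalFile, TimeSpan settleDelay)
    {
        Thread.Sleep(settleDelay);
        File.WriteAllText(signalFile, string.Empty);
    }
}
```

With this shape the parent would call `Release(path, TimeSpan.FromMilliseconds(750))` after spawning the children, and each child would call `WaitForSignal` before starting its work. A timeout on the child side keeps orphaned helper processes from polling forever if the parent dies first.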
| 100 | + |
| 101 | +### `HelperExeResolver` (shared lookup) |
| 102 | + |
| 103 | +Both multi-process tests need to find the helper exe at runtime, and the path resolution is non-trivial enough to want one place to update if the build layout changes. Pulling it out also avoids duplicate logic that could drift between the two tests. |
| 104 | + |
| 105 | +## Suggested fix in `LocalDb` |
| 106 | + |
| 107 | +Wire up the existing `Wrapper.semaphoreSlim` field around `InnerStart`'s body to handle the in-process race, and add a named OS mutex keyed on the instance name (e.g. `Global\LocalDb_Wrapper_InnerStart_{instanceName}`) around the entire `InnerStart` operation to handle the cross-process race. The two `ShouldNotRace` tests in this folder should pass once both locks are in place; if either still fails, the locks aren't covering the right span. |
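One possible shape for that two-level lock is sketched below. It is a sketch under assumptions rather than the actual fix: `InnerStartGate` and its `Run` method are hypothetical, and a standalone `SemaphoreSlim` stands in for `Wrapper.semaphoreSlim` so the example is self-contained.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration; not part of the LocalDb codebase.
static class InnerStartGate
{
    // In-process gate (stand-in for the existing Wrapper.semaphoreSlim field).
    static readonly SemaphoreSlim inProcess = new(1, 1);

    public static void Run(string instanceName, Action innerStart)
    {
        inProcess.Wait();
        try
        {
            // The "Global\" prefix makes the mutex visible across sessions on Windows.
            using var mutex = new Mutex(false, $@"Global\LocalDb_Wrapper_InnerStart_{instanceName}");
            var owned = false;
            try
            {
                try
                {
                    owned = mutex.WaitOne();
                }
                catch (AbandonedMutexException)
                {
                    // A previous owner exited without releasing; ownership
                    // has still been granted to this thread.
                    owned = true;
                }

                innerStart();
            }
            finally
            {
                if (owned)
                {
                    mutex.ReleaseMutex();
                }
            }
        }
        finally
        {
            inProcess.Release();
        }
    }
}
```

The sketch is deliberately synchronous. A `Mutex` must be released by the thread that acquired it, so an `await` between `WaitOne` and `ReleaseMutex` could resume on a different thread and make the release throw. If `InnerStart` is asynchronous, the guarded work would need to run on a single dedicated thread or use a different cross-process primitive.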
| 108 | + |
| 109 | +## Running the tests |
| 110 | + |
| 111 | +```powershell |
| 112 | +dotnet test src/LocalDb.Tests/LocalDb.Tests.csproj ` |
| 113 | + --configuration Release ` |
| 114 | + --filter "FullyQualifiedName~ConcurrentStart|FullyQualifiedName~MultiProcessConcurrentStart|FullyQualifiedName~KillerVsVictim" |
| 115 | +``` |
| 116 | + |
| 117 | +The deterministic `KillerVsVictimSurfacesInstanceDoesNotExist` finishes in ~8 s. The symmetric `MultiProcessConcurrentStartShouldNotRace` finishes in ~15–30 s. The in-process `ConcurrentStartWithMissingTemplateShouldNotRace` finishes in ~2 minutes (it intentionally rebuilds the template 5× for a non-flaky signal). |