| 1 | +# ICM 792529661 — Native Memory OOM with `Encrypt=Strict` |
| 2 | + |
| 3 | +## Executive Summary |
| 4 | + |
| 5 | +When using **`Encrypt=Strict`** (TDS 8.0) with **SQL Authentication** on **Windows native SNI**, each `SqlConnection.Open()` that establishes a new physical connection leaks **~50–100 KB of native memory** that is never reclaimed, even after the connection is closed and disposed. Under workloads that create connections at a sustained high rate, this leads to **out-of-memory crashes**. |
| 6 | + |
| 7 | +**Root Cause:** Windows SChannel's **TLS 1.3 session ticket cache** stores resumption tickets in a process-global, per-credential cache. Each new TLS 1.3 connection receives session tickets from the server, and SChannel caches them indefinitely. There is **no public API** to evict, limit, or disable this cache from user mode. |
| 8 | + |
| 9 | +**Key observation — the leak does NOT occur with:** |
| 10 | +- Managed SNI (uses .NET `SslStream`, which doesn't use SChannel's session cache) |
| 11 | +- TLS 1.2 connections (no session tickets) |
| 12 | +- Non-Strict encryption modes (the TLS handshake is carried inside TDS pre-login packets instead of preceding them) |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Detailed Technical Explanation |
| 17 | + |
| 18 | +### 1. The Connection Flow with `Encrypt=Strict` (TDS 8.0) |
| 19 | + |
| 20 | +In TDS 8.0 ("Strict" encryption), the TLS handshake happens **before any TDS traffic**. The flow is: |
| 21 | + |
| 22 | +``` |
| 23 | +Client SQL Server |
| 24 | + | | |
| 25 | + |--- TCP Connect ----------------->| |
| 26 | + |--- TLS ClientHello ------------>| ← TLS wraps the entire connection |
| 27 | + |<-- TLS ServerHello + Cert ------| |
| 28 | + |--- TLS Finished --------------->| |
| 29 | + |<-- TLS Finished ----------------| |
| 30 | + | | |
| 31 | + |=== TDS traffic inside TLS ======| |
| 32 | + |--- TDS Login7 (SQL Auth) ------>| |
| 33 | + |<-- TDS Login Response ----------| |
| 34 | +``` |
| 35 | + |
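| 36 | +This differs from the TDS 7.x modes (`Encrypt=Mandatory`, `Encrypt=Optional`), where the TDS pre-login exchange happens first in cleartext, the TLS handshake is carried inside TDS packets, and encryption then covers either the login packet only or the rest of the session, depending on what was negotiated. |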
| 37 | + |
| 38 | +### 2. TLS 1.3 Session Tickets |
| 39 | + |
| 40 | +TLS 1.3 introduced **post-handshake session tickets** (RFC 8446 §4.6.1). After the handshake completes, the server sends `NewSessionTicket` messages: |
| 41 | + |
| 42 | +``` |
| 43 | +Client SQL Server |
| 44 | + | | |
| 45 | + |=== Handshake complete ===========| |
| 46 | + |<-- NewSessionTicket (ticket 1) --| ← Server pushes tickets |
| 47 | + |<-- NewSessionTicket (ticket 2) --| ← Often 2+ tickets |
| 48 | + | | |
| 49 | +``` |
| 50 | + |
| 51 | +These tickets allow the client to perform **0-RTT or 1-RTT resumption** on future connections, skipping the full certificate-based handshake. SQL Server typically sends **2 session tickets** per connection. |
| 52 | + |
| 53 | +### 3. SChannel's Session Ticket Cache |
| 54 | + |
| 55 | +On Windows, the TLS implementation is **SChannel** (Secure Channel), a system DLL (`schannel.dll`). When SChannel receives `NewSessionTicket` messages, it: |
| 56 | + |
| 57 | +1. Deserializes the ticket (contains encrypted session state, PSK identity, expiry) |
| 58 | +2. Stores it in a **process-global hash table** keyed by server name + credential handle |
| 59 | +3. Keeps each ticket resident at a cost of ~20–50 KB (the PSK, ticket nonce, server certificate chain hash, etc.) |
| 60 | + |
| 61 | +**The critical problem:** SChannel has **no public API** to: |
| 62 | +- Limit the number of cached tickets |
| 63 | +- Evict specific tickets |
| 64 | +- Disable ticket acceptance per-context |
| 65 | +- Set a maximum cache size |
| 66 | + |
| 67 | +The cache grows unbounded as new connections produce new tickets. |
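
The behaviour described above can be illustrated with a small model. This is a sketch under stated assumptions, not SChannel's actual data structure: the type `TicketCacheModel`, its members, and the ticket sizes are all hypothetical and exist only to show what an append-only cache keyed by server name and credential looks like.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of an append-only ticket cache. NOT SChannel's real
// layout; names and sizes are illustrative assumptions.
struct TicketCacheModel {
    // Key: {server name, credential id}. Value: size in bytes of each cached ticket.
    std::map<std::pair<std::string, int>, std::vector<std::size_t>> buckets;

    // Record the tickets the server pushes after one completed handshake.
    void onHandshake(const std::string& server, int credentialId,
                     int ticketCount, std::size_t ticketBytes) {
        assert(ticketCount >= 0);
        auto& bucket = buckets[std::make_pair(server, credentialId)];
        for (int i = 0; i < ticketCount; ++i) {
            bucket.push_back(ticketBytes);
        }
    }

    // Closing a connection removes nothing: the model has no eviction path.
    void onConnectionClosed(const std::string&, int) {}

    // Total bytes retained across all buckets.
    std::size_t totalBytes() const {
        std::size_t total = 0;
        for (const auto& entry : buckets) {
            for (std::size_t size : entry.second) {
                total += size;
            }
        }
        return total;
    }
};
```

In this model, 1,000 open/close cycles against one server with 2 tickets of 25 KB each leave about 51 MB in a single bucket. Using a different credential id per connection only spreads the same bytes over more buckets.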
| 68 | + |
| 69 | +### 4. Why Only `Encrypt=Strict`? |
| 70 | + |
| 71 | +With `Encrypt=Mandatory` or `Encrypt=Optional`: |
| 72 | +- The TLS session is often **reused** across the connection pool because the pool keeps TCP connections alive |
| 73 | +- New TLS handshakes happen infrequently (only on pool misses or reconnects) |
| 74 | +- The ticket cache grows slowly |
| 75 | + |
| 76 | +With `Encrypt=Strict`: |
| 77 | +- In high-throughput scenarios or when connections are frequently created/destroyed, many new TLS sessions occur |
| 78 | +- Each new TLS 1.3 handshake → server sends 2 new tickets → SChannel caches them |
| 79 | +- **~50–100 KB per connection** leaked permanently |
| 80 | + |
| 81 | +### 5. Memory Growth Mechanics |
| 82 | + |
| 83 | +``` |
| 84 | +Connection 1: TLS handshake → 2 tickets cached → +50 KB |
| 85 | +Connection 2: TLS handshake → 2 tickets cached → +50 KB (old tickets NOT evicted) |
| 86 | +Connection 3: TLS handshake → 2 tickets cached → +50 KB |
| 87 | +... |
| 88 | +Connection N: TLS handshake → 2 tickets cached → +50 KB |
| 89 | +
|
| 90 | +Total leaked: N × ~50 KB (never freed) |
| 91 | +``` |
| 92 | + |
| 93 | +Even though connections are closed and disposed, the tickets remain in SChannel's process-global cache. The `DeleteSecurityContext` and `FreeCredentialsHandle` calls do NOT purge associated tickets. |
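
To turn the per-connection figure into a time-to-exhaustion estimate, the arithmetic can be sketched as below. The function names are hypothetical, and the memory budget and connection rate used in the note that follows are made-up inputs, not measurements from this investigation. The model assumes a fixed cost per new TLS session and a steady connection rate.

```cpp
#include <cassert>
#include <cstdint>

// Bytes retained after `connections` new TLS sessions at a fixed per-connection cost.
std::uint64_t leakedBytes(std::uint64_t connections, std::uint64_t bytesPerConnection) {
    return connections * bytesPerConnection;
}

// Number of whole connections that fit into `budgetBytes`.
std::uint64_t connectionsUntilBudget(std::uint64_t budgetBytes,
                                     std::uint64_t bytesPerConnection) {
    assert(bytesPerConnection > 0);
    return budgetBytes / bytesPerConnection;
}

// Hours until the budget is consumed at a steady rate of new connections per second.
double hoursUntilBudget(std::uint64_t budgetBytes, std::uint64_t bytesPerConnection,
                        double newConnectionsPerSecond) {
    assert(newConnectionsPerSecond > 0.0);
    const double connections =
        static_cast<double>(connectionsUntilBudget(budgetBytes, bytesPerConnection));
    return connections / newConnectionsPerSecond / 3600.0;
}
```

As an illustration, at 50 KB per connection a hypothetical 2 GiB budget is consumed after roughly 42,000 new connections, which at 10 new connections per second is a little over an hour.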
| 94 | + |
| 95 | +### 6. Why Managed SNI Doesn't Leak |
| 96 | + |
| 97 | +Managed SNI uses .NET's `SslStream` class, which: |
| 98 | +- Uses its own managed TLS implementation |
| 99 | +- Disposes cleanly, so the managed GC reclaims all associated buffers |
| 100 | +- Keeps a session cache that is bounded and properly evicted |
| 101 | + |
| 102 | +### 7. The Native SNI Code Path |
| 103 | + |
| 104 | +In `ssl.cpp`, the relevant flow is: |
| 105 | + |
| 106 | +```cpp |
| 107 | +// Credential acquisition |
| 108 | +AcquireCredentialsHandle(..., &schCredentials, ..., &credHandle); |
| 109 | + |
| 110 | +// TLS handshake |
| 111 | +InitializeSecurityContext(&credHandle, ..., &ctxtHandle, ...); |
| 112 | +// ↑ This is where SChannel receives and caches session tickets |
| 113 | + |
| 114 | +// Connection close |
| 115 | +DeleteSecurityContext(&ctxtHandle); // Does NOT purge ticket cache |
| 116 | +FreeCredentialsHandle(&credHandle); // Does NOT purge ticket cache |
| 117 | +``` |
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +## Fix Attempts (All Failed) |
| 122 | + |
| 123 | +| # | Approach | Implementation | Outcome | |
| 124 | +|---|----------|---------------|---------| |
| 125 | +| 1 | **`SCH_CRED_DISABLE_RECONNECTS`** | Set flag on `SCHANNEL_CRED` structure passed to `AcquireCredentialsHandle` | Only prevents client from *offering* tickets for resumption. Does NOT prevent server from *sending* tickets, and does NOT prevent SChannel from *caching* received tickets. **Still leaks.** | |
| 126 | +| 2 | **Per-connection unique credentials** | Create fresh `CredHandle` for each connection instead of sharing | Ticket cache is indexed by {server name, credential config}. Fresh creds just create new cache buckets — tickets still accumulate. **Still leaks.** | |
| 127 | +| 3 | **`dwSessionLifespan = 1`** | Set minimum session lifetime on credential | Controls how long SChannel will *reuse* a cached ticket for outbound reconnection. Does NOT control how long tickets are *stored* in memory. **Still leaks.** | |
| 128 | +| 4 | **`ApplyControlToken` + `SSL_SESSION_DISABLE`** | Applied post-handshake to disable caching on the security context | Only applies to future operations on that context — tickets already received and cached are not affected. **Still leaks.** | |
| 129 | +| 5 | **`SslEmptyCacheW(NULL)`** | Called periodically or per-connection to flush entire SChannel cache | Nuclear option: purges ALL cached sessions process-wide. Causes thundering-herd re-handshakes, race conditions, and performance collapse. Tickets re-accumulate immediately. **Not viable.** | |
| 130 | + |
| 131 | +**Benchmark results after all fix attempts:** ~68–108 KB/connection growth (unchanged from baseline). |
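
A per-connection growth figure like the one above can be derived from periodic memory samples by fitting a line through them. The sketch below uses an ordinary least-squares slope. It is an assumption about how such a number could be computed, not a description of the benchmark tool, and the `Sample` type and function name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One observation: cumulative connections opened and process private bytes.
struct Sample {
    double connections;
    double privateBytes;
};

// Ordinary least-squares slope: bytes of growth per additional connection.
double bytesPerConnection(const std::vector<Sample>& samples) {
    assert(samples.size() >= 2);
    const double n = static_cast<double>(samples.size());
    double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0;
    for (const Sample& s : samples) {
        sumX += s.connections;
        sumY += s.privateBytes;
        sumXY += s.connections * s.privateBytes;
        sumXX += s.connections * s.connections;
    }
    const double denom = n * sumXX - sumX * sumX;
    assert(denom != 0.0);  // requires at least two distinct connection counts
    return (n * sumXY - sumX * sumY) / denom;
}
```

Fitting a slope rather than dividing the final delta by the connection count keeps a one-time startup allocation from inflating the per-connection figure.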
| 132 | + |
| 133 | +--- |
| 134 | + |
| 135 | +## Platform Constraints |
| 136 | + |
| 137 | +| Platform | Managed SNI Available? | Native SNI Required? | Workaround Possible? | |
| 138 | +|----------|----------------------|---------------------|---------------------| |
| 139 | +| **.NET 8/9 (Windows)** | Yes (opt-in via the `Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows` AppContext switch) | Default but not required | Yes, use managed SNI |
| 140 | +| **.NET Framework 4.6.2+** | **No** | **Yes — only option** | **No managed fallback** | |
| 141 | + |
| 142 | +--- |
| 143 | + |
| 144 | +## Impact |
| 145 | + |
| 146 | +- **Affected:** Any Windows application using native SNI + `Encrypt=Strict` + TLS 1.3 (default for SQL Server 2022+) |
| 147 | +- **Severity:** Process eventually OOMs under sustained connection creation patterns |
| 148 | +- **Rate:** ~50–100 KB per unique TLS session |
| 149 | +- **.NET Framework:** Cannot use managed SNI — permanently affected unless native fix found |
| 150 | +- **.NET Core:** Can work around the leak by setting the `Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows` AppContext switch to `true` |
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +## Viable Paths Forward |
| 155 | + |
| 156 | +### For .NET Core/.NET 8+ (Short-term) |
| 157 | +- Auto-switch to managed SNI when `Encrypt=Strict` is used |
| 158 | +- Or document setting the `Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows` AppContext switch to `true` as the recommended workaround |
| 159 | + |
| 160 | +### For .NET Framework (Short-term) |
| 161 | +- **Cap TLS to 1.2** for Strict connections on native SNI (avoids session tickets entirely, but loses TLS 1.3 benefits) |
| 162 | +- **Accept and document** the limitation with guidance on connection pooling to minimize new TLS handshakes |
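
The effect of the pooling guidance can be estimated with a simple count of new TLS handshakes. This is a minimal sketch with hypothetical function names. It assumes each physical connection is reused for a fixed number of requests before it is replaced, which is a simplification of real pool behaviour.

```cpp
#include <cassert>
#include <cstdint>

// New TLS handshakes needed to serve `requests` when each physical connection
// is reused for up to `requestsPerConnection` requests before being replaced.
// requestsPerConnection == 1 models no effective pooling.
std::uint64_t handshakesNeeded(std::uint64_t requests,
                               std::uint64_t requestsPerConnection) {
    assert(requestsPerConnection > 0);
    return (requests + requestsPerConnection - 1) / requestsPerConnection;  // ceiling division
}

// Bytes retained in the ticket cache for that workload at a fixed cost per handshake.
std::uint64_t retainedBytes(std::uint64_t requests,
                            std::uint64_t requestsPerConnection,
                            std::uint64_t bytesPerHandshake) {
    return handshakesNeeded(requests, requestsPerConnection) * bytesPerHandshake;
}
```

With these assumptions, one million requests at 50 KB per handshake retain about 51 GB when every request opens a new physical connection, and about 5 MB when each physical connection serves 10,000 requests. Pooling slows the growth but does not bound it.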
| 163 | + |
| 164 | +### Long-term |
| 165 | +- **File a Windows/SChannel bug** requesting a cache eviction API or per-context opt-out |
| 166 | +- **Adopt that API in native SNI** once the Windows team provides control over session ticket caching per-credential or per-context |
| 167 | + |
| 168 | +--- |
| 169 | + |
| 170 | +## The Fundamental Problem |
| 171 | + |
| 172 | +This is a **design limitation in Windows SChannel**. The session ticket cache was designed for web browsers where: |
| 173 | +- You connect to a few hundred unique servers |
| 174 | +- Cache growth is bounded by the number of unique servers |
| 175 | +- The browser process restarts regularly |
| 176 | + |
| 177 | +For database drivers: |
| 178 | +- You connect to the SAME server thousands/millions of times |
| 179 | +- Each connection gets new tickets (SQL Server rotates tickets) |
| 180 | +- The process runs for months/years (service lifetime) |
| 181 | +- The cache therefore grows with the number of connections rather than the number of unique servers |
| 182 | + |
| 183 | +**There is no user-mode fix — it requires either a Windows/SChannel update to provide a cache control API, or avoiding TLS 1.3 on the native path.** |
| 184 | + |
| 185 | +--- |
| 186 | + |
| 187 | +## References |
| 188 | + |
| 189 | +- **ICM:** 792529661 |
| 190 | +- **Affected component:** `Microsoft.Data.SqlClient.SNI` (native SNI, `ssl.cpp`) |
| 191 | +- **Branch (SNI):** `dev/ad/oom-fix` in `Microsoft.Data.SqlClient.sni` repo |
| 192 | +- **Branch (SqlClient):** `dev/ad/strict-oom` in `dotnet/SqlClient` |
| 193 | +- **Benchmark tool:** `tools/StrictEncryptMemoryBenchmark/` |
| 194 | +- **RFC 8446 §4.6.1:** TLS 1.3 Post-Handshake Messages — NewSessionTicket |
| 195 | +- **MS-TDS 8.0:** Strict encryption mode specification |