Microsoft.Azure.Cosmos.Direct: TCP keepalive configuration silently fails on Linux - uint passed to SetSocketOption resolves to wrong overload #5761

> We are continuously addressing and improving the SDK. If possible, make sure the problem persists in the [latest SDK version](https://www.nuget.org/packages/Microsoft.Azure.Cosmos).

The issue appears to be present in both 3.55.0 and 3.58.0 (latest stable) based on IL inspection of `Microsoft.Azure.Cosmos.Direct.dll`. Since the source code for the Direct transport layer is not publicly available, this investigation was done by decompiling the binary - we may be missing context.

**Describe the bug**

TCP keepalive time and interval are never applied on Linux in Direct TCP mode. `Connection.SetKeepAliveSocketOptions` passes `uint` values to `Socket.SetSocketOption`. Because `uint` has no implicit conversion to `int`, the call binds to the `object` overload, which does not accept a boxed `uint` and throws `ArgumentException`. The exception is silently caught by `IsKeepAliveCustomizationSupported()`, which returns `false`, making the entire method a no-op.

On Windows, the `IOControl(KeepAliveValues)` code path uses a `byte[]` and returns early, so it works correctly and never hits the broken code.

Result: Linux connections use OS default keepalive (`tcp_keepalive_time=7200s`, 2 hours) instead of the intended 30 seconds. Dead idle RNTBD connections survive for 2+ hours instead of ~39 seconds.
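The ~39 second and 2+ hour figures follow from how TCP keepalive declares a peer dead: the idle time before the first probe, plus one probe interval per unanswered probe. A minimal sketch of that arithmetic follows. It assumes the probe count stays at the Linux default of 9 (`tcp_keepalive_probes`), the default probe interval of 75 s, and the 1 s interval described under Expected behavior. `DetectionTime` is an illustrative name, not an SDK function.

```csharp
using System;

// Worst-case time for the kernel to give up on a silent peer:
// idle time before the first probe + (probe interval x unanswered probes).
static TimeSpan DetectionTime(int idleSeconds, int intervalSeconds, int probeCount) =>
    TimeSpan.FromSeconds(idleSeconds + (long)intervalSeconds * probeCount);

// Assumption: the probe count is left at the Linux default (tcp_keepalive_probes = 9).
const int DefaultProbes = 9;

// Intended SDK settings: 30 + 1 * 9 = 39 seconds.
TimeSpan intended = DetectionTime(idleSeconds: 30, intervalSeconds: 1, probeCount: DefaultProbes);

// Linux defaults: 7200 + 75 * 9 = 7875 seconds (a little over 2 hours 11 minutes).
TimeSpan osDefault = DetectionTime(idleSeconds: 7200, intervalSeconds: 75, probeCount: DefaultProbes);

Console.WriteLine($"intended:   {intended.TotalSeconds} s");
Console.WriteLine($"OS default: {osDefault.TotalSeconds} s");
```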

**To Reproduce**

The core issue is a C# type system problem — boxed `uint` is not `int`:

```csharp
// This is what the SDK does on Linux (from IL of Microsoft.Azure.Cosmos.Direct.dll):
using System;
using System.Net.Sockets;

using var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
uint keepAliveTime = 30u;  // field type in Connection.cs

try {
    // Resolves to SetSocketOption(level, name, object), NOT the int overload.
    // That overload rejects a boxed uint and throws ArgumentException.
    socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, keepAliveTime);
} catch (ArgumentException ex) {
    Console.WriteLine($"uint call failed: {ex.Message}");
}

// This works (what it should do): the cast selects the int overload.
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, (int)keepAliveTime);
```

IL from `Connection.SetKeepAliveSocketOptions` in v3.58.0 confirming the `uint` → `box UInt32` → `object` overload path:
```
IL_0010: ldsfld uint32 ...Connection::SocketOptionTcpKeepAliveInterval
IL_0015: box [netstandard]System.UInt32
IL_001a: callvirt instance void Socket::SetSocketOption(SocketOptionLevel, SocketOptionName, object)
```

And `IsKeepAliveCustomizationSupported` silently swallows the exception:
```csharp
try {
    socket.SetSocketOption(..., socketOptionTcpKeepAliveInterval); // uint → throws
    return true;
} catch {
    return false; // keepalive customization "not supported"
}
```
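For illustration, here is one way such a probe could avoid hiding this class of bug. It is a hypothetical sketch, not the SDK's code. It assumes that a platform without support reports it as `SocketException`, and it lets any other exception, including `ArgumentException`, propagate.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical probe, not the SDK's implementation. It reports "unsupported"
// only when the OS rejects the option, and lets argument errors surface.
static bool IsKeepAliveCustomizationSupported(int probeIntervalSeconds)
{
    using var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    try
    {
        // int parameter, so this binds to SetSocketOption(level, name, int).
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, probeIntervalSeconds);
        return true;
    }
    catch (SocketException)
    {
        // Assumption: an unsupported option is reported as a SocketException.
        return false;
    }
}

Console.WriteLine($"keepalive customization supported: {IsKeepAliveCustomizationSupported(1)}");
```

Taking `int` in the signature also turns the original mistake into a compile error, since a `uint` argument has no implicit conversion to `int`.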

**Expected behavior**

On Linux, TCP keepalive should be configured to `tcp_keepalive_time=30s`, `tcp_keepalive_intvl=1s` — matching the Windows `IOControl(KeepAliveValues)` configuration. Dead idle connections should be detected within ~39 seconds.

**Actual behavior**

On Linux, `SetKeepAliveSocketOptions` is a no-op. Keepalive is enabled (`SO_KEEPALIVE=true`) but with OS defaults: `tcp_keepalive_time=7200s` (2 hours). Dead idle RNTBD connections remain in the connection pool for up to 2 hours.
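The effective values can be checked by reading the options back from a socket. The sketch below uses only `System.Net.Sockets` and does not touch the SDK. `ReadTcpOption` and `Describe` are illustrative helper names, and the printed defaults depend on the host's sysctl settings.

```csharp
using System;
using System.Net.Sockets;

// Reads an integer TCP-level option back from the socket.
static int ReadTcpOption(Socket s, SocketOptionName name) =>
    (int)s.GetSocketOption(SocketOptionLevel.Tcp, name)!;

static string Describe(Socket s) =>
    $"time={ReadTcpOption(s, SocketOptionName.TcpKeepAliveTime)}s, " +
    $"interval={ReadTcpOption(s, SocketOptionName.TcpKeepAliveInterval)}s, " +
    $"probes={ReadTcpOption(s, SocketOptionName.TcpKeepAliveRetryCount)}";

using var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

// Keepalive enabled but not customized: the state described above.
socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
Console.WriteLine($"OS defaults:        {Describe(socket)}");

// Applying the intended values through the int overload.
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 30);
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 1);
Console.WriteLine($"after int overload: {Describe(socket)}");
```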

We discovered this during a production incident where a transient network issue affected cross-region Direct TCP connections to the CosmosDB write endpoint. We run 5 identical partitions — 4 on Windows and 1 on Linux. All hit the same issue simultaneously. The 4 Windows partitions recovered in under 5 minutes. The Linux partition took ~55 minutes. From service logs, we identified dead RNTBD connections (status: Connected, `callsPendingReceive: 0`, `lastReceive` timestamps 20-57 minutes stale) that remained in the pool producing intermittent 408 (RequestTimeout) errors whenever randomly selected by the load balancer.

**Environment summary**

SDK Version: 3.55.0 (also confirmed in 3.58.0 IL)
OS Version: Microsoft Azure Linux 3.0 (AKS), .NET 9.0.14

**Additional context**

Suggested fix — cast to `int` at the call site or change the field types:

```csharp
// Option A: cast at call site
clientSocket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, (int)SocketOptionTcpKeepAliveInterval);
clientSocket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, (int)SocketOptionTcpKeepAliveTime);

// Option B: change field types from uint to int
private static readonly int SocketOptionTcpKeepAliveInterval = (int)GetUInt32FromEnvironmentVariableOrDefault(...);
private static readonly int SocketOptionTcpKeepAliveTime = (int)GetUInt32FromEnvironmentVariableOrDefault(...);
```
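One caveat for either option: a plain `(int)` cast on a runtime `uint` wraps values above `int.MaxValue` to negative numbers in C#'s default unchecked context. If the values can come from an environment variable, a range check may be worth adding. The helper below is a hypothetical sketch, and its name is not from the SDK.

```csharp
using System;

// Hypothetical helper: converts a configured keepalive value to the int that
// Socket.SetSocketOption expects, failing loudly if it does not fit.
static int ToSocketOptionValue(uint seconds)
{
    if (seconds > int.MaxValue)
    {
        throw new ArgumentOutOfRangeException(
            nameof(seconds), seconds, "Keepalive value does not fit in an int.");
    }

    return (int)seconds;
}

Console.WriteLine(ToSocketOptionValue(30u));
```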
