Background
PR #49095 introduces an HTTP/2 PING-based broken-connection health check for the Cosmos gateway transport (Http2PingHandler). In the current design, each parent H2 channel installs its own handler, and handlerAdded schedules a periodic check via ctx.executor().scheduleAtFixedRate(...). This means one ScheduledFuture is created per parent H2 connection.
This is functionally correct and mirrors Netty's own IdleStateHandler (which is also per-channel), so it is a fine starting point. This issue tracks a scalability follow-up, not a defect in the merged PR.
Primary trigger: HTTP/2 flipping to default-on
Installation of the handler is gated purely on HTTP/2 being effectively enabled: Http2PingHandler.isPingHealthEffectivelyEnabled requires that the kill-switch allows PING health checks, that the ping interval is > 0, and that Http2ConnectionConfig.isEffectivelyEnabled() returns true. There is no thin-client restriction; any connection on a client with H2 enabled installs the handler.
Today HTTP/2 is in preview and off by default (Configs.DEFAULT_HTTP2_ENABLED = false; Http2ConnectionConfig.setEnabled(...) javadoc: "the default value (false while in preview, true later) will be applied"). So the per-channel ScheduledFuture footprint is currently limited to clients that explicitly opt into H2.
When that default flips to true, the handler installs on every gateway H2 parent channel across every CosmosClient in the process — with no opt-in gate. That is the primary, near-certain amplifier of the scaling problem below; the multi-client scenario is a secondary case that stacks on top of it.
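The gating described above reduces to a three-way conjunction plus a fallback for the HTTP/2 flag. The following is a minimal sketch under assumptions: the class, method, and parameter names are hypothetical stand-ins chosen for illustration, not the SDK's actual signatures.

```java
// Hypothetical stand-in for the install gate. Names and signatures are
// illustrative assumptions, not the SDK's actual API.
final class PingGateSketch {

    /** A per-client override wins when set; otherwise the global default applies. */
    static boolean http2EffectivelyEnabled(Boolean perClientOverride, boolean globalDefault) {
        return perClientOverride != null ? perClientOverride : globalDefault;
    }

    /** The handler installs only when all three conditions hold. */
    static boolean pingHealthEnabled(boolean killSwitchAllowsPing,
                                     long pingIntervalMillis,
                                     boolean http2Enabled) {
        return killSwitchAllowsPing && pingIntervalMillis > 0 && http2Enabled;
    }
}
```

With the global default at false, only clients that set an explicit override reach the handler install. Once the default becomes true, the third operand is true for every client that has not opted out.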
Problem
The number of scheduled timer tasks scales with channels, while the number of EventLoops is fixed and bounded by CPU (reactor-netty's default LoopResources, ~max(cores, 4) workers; the gateway client does not call .runOn(...), so it inherits the shared default loop group).
Because EventLoop count is a fixed denominator, channels concentrate onto a small set of loops rather than spreading out:
H2 pool ceiling is DEFAULT_HTTP2_MAX_CONNECTION_POOL_SIZE = 1000 connections per client.
Channel count further multiplies by distinct endpoints and by the number of CosmosClient instances in the process.
Each scheduled task runs on the EventLoop I/O thread, so periodic checks and socket I/O for all channels on that loop share one thread.
At high fan-out (e.g. a multi-tenant host with hundreds of CosmosClient instances), this produces a large per-loop scheduled-task-queue depth and a per-channel object/ScheduledFutureTask footprint that grows without any ceiling tied to the EventLoop count.
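To make the scaling concrete, here is a back-of-envelope estimate. The inputs (200 clients, 2 endpoints, 50 open connections per pool) are hypothetical figures chosen for illustration, not measurements, and the class and method names are assumptions.

```java
// Back-of-envelope estimate only; all inputs are hypothetical, not measured.
final class PingTimerEstimate {

    /** Per-channel design: one scheduled task per parent H2 channel. */
    static long perChannelTimers(int clients, int endpoints, int connectionsPerPool) {
        return (long) clients * endpoints * connectionsPerPool;
    }

    /** Average number of scheduled tasks queued on each EventLoop, rounded up. */
    static long timersPerLoop(long totalTimers, int eventLoops) {
        return (totalTimers + eventLoops - 1) / eventLoops;
    }

    public static void main(String[] args) {
        // Mirrors the max(cores, 4) default worker count described above.
        int loops = Math.max(Runtime.getRuntime().availableProcessors(), 4);
        long timers = perChannelTimers(200, 2, 50);
        System.out.println(timers + " timers over " + loops + " loops, about "
            + timersPerLoop(timers, loops) + " per loop");
    }
}
```

Under these assumed inputs the per-channel design holds 20,000 scheduled tasks in total, which on an 8-loop host averages 2,500 per loop. A per-loop scanner would hold 8 regardless of the other inputs.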
Proposed design: one scanner per EventLoop
Move scheduling off the per-channel handler and onto a shared per-EventLoop scanner so the timer count tracks loops (bounded) instead of channels (unbounded):
Keep Http2PingHandler in each channel's pipeline — it must still intercept inbound PING-ACK frames and track lastReadNanos. The per-channel state object is unavoidable.
Remove the per-handler scheduleAtFixedRate.
Introduce a shared registry ConcurrentMap<EventExecutor, LoopPingState>, where each LoopPingState holds a Set<Http2PingHandler> plus a single ScheduledFuture.
handlerAdded → computeIfAbsent(ctx.executor()), add self to the loop's set, and schedule the single scanner only when the state is newly created.
Scanner tick → iterate the loop's handlers and run the existing per-channel check logic (maybeSendPing, extracted into a runPingCheck(now) method).
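handlerRemoved / channelInactive → deregister; the scanner self-cancels when it finds its set empty and removes its LoopPingState from the registry, so that a later handlerAdded on the same loop creates fresh state and reschedules.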
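The steps above can be sketched with JDK types only. This is an illustration under assumptions, not the proposed implementation: a single-threaded ScheduledExecutorService stands in for a Netty EventLoop, PingCheck stands in for Http2PingHandler, and the register / deregister method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the per-channel handler's check entry point. */
interface PingCheck {
    void runPingCheck(long nowNanos);
}

/** Sketch of the shared registry; a single-threaded executor stands in for an EventLoop. */
final class LoopPingRegistry {

    /** Per-loop state. Touched only from its own loop thread, so no locking. */
    static final class LoopPingState {
        final Set<PingCheck> handlers = new HashSet<>();
        ScheduledFuture<?> scanner;
    }

    private final ConcurrentMap<ScheduledExecutorService, LoopPingState> states =
        new ConcurrentHashMap<>();
    private final long intervalMillis;

    LoopPingRegistry(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** handlerAdded equivalent; must be called on the loop thread. */
    void register(ScheduledExecutorService loop, PingCheck handler) {
        LoopPingState state = states.computeIfAbsent(loop, l -> new LoopPingState());
        state.handlers.add(handler);
        if (state.scanner == null) {
            // Newly created state: schedule the single scanner for this loop.
            state.scanner = loop.scheduleAtFixedRate(
                () -> scan(loop, state), intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        }
    }

    /** handlerRemoved / channelInactive equivalent; must be called on the loop thread. */
    void deregister(ScheduledExecutorService loop, PingCheck handler) {
        LoopPingState state = states.get(loop);
        if (state != null) {
            state.handlers.remove(handler);
        }
    }

    /** Scanner tick; runs on the loop thread. */
    private void scan(ScheduledExecutorService loop, LoopPingState state) {
        if (state.handlers.isEmpty()) {
            // Self-cancel and drop the entry so a later register() starts fresh.
            states.remove(loop, state);
            state.scanner.cancel(false);
            return;
        }
        long now = System.nanoTime();
        // Iterate over a copy so a check that deregisters its handler cannot break iteration.
        for (PingCheck handler : new ArrayList<>(state.handlers)) {
            handler.runPingCheck(now);
        }
    }

    /** Number of loops that currently own a scanner. */
    int activeScanners() {
        return states.size();
    }
}
```

The sketch relies on register, deregister, and the scanner all running on the same thread for a given loop. It would not be safe if the executor standing in for the loop had more than one thread.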
Why it is safe
The scanner runs on the loop thread, and so do handlerAdded / handlerRemoved for every channel assigned to that loop. Therefore the per-loop set and the empty→cancel decision are single-threaded per loop and require no locks. Only the top-level map is cross-thread, which ConcurrentHashMap covers.
Result: ScheduledFuture count = active EventLoops = O(cores), flat across both channels and clients.
Alternatives considered
Status quo (per-channel ScheduledFuture) — matches Netty IdleStateHandler; correct, but the timer/object count scales with channels. This is the same pattern high-fanout deployments typically outgrow.
HashedWheelTimer — also collapses the timer count, but runs on its own thread, forcing an eventLoop().execute(...) hop back onto the channel's loop for every channel-state access. The per-EventLoop scanner keeps everything on the loop thread and avoids that hop.
Acceptance criteria
Http2PingHandler no longer schedules a per-channel ScheduledFuture; scheduling is owned by a per-EventLoop scanner.
Active scheduled tasks for PING health = number of EventLoops that currently host at least one H2 parent channel (at most the EventLoop count), independent of channel and client count (verifiable in a unit/diagnostic test).
PING send / ACK tracking, failure-threshold counting, and connection close-on-threshold behavior are unchanged.
Scanner is correctly cancelled when a loop has no remaining channels (no leak across client open/close cycles on the shared default loop group).
Existing PING health tests (including the network-fault lifecycle tests) continue to pass.
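One way the task-count criterion could be checked in a diagnostic test is to compare scheduled-queue depth under both designs. The sketch below is an assumption-laden illustration: it uses the JDK's ScheduledThreadPoolExecutor as a stand-in for an EventLoop and does not rely on any Netty or SDK accounting API; the class and method names are hypothetical.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical probe; a one-thread JDK scheduler stands in for an EventLoop. */
final class ScheduledTaskCountProbe {

    /** Per-channel design: one periodic task per channel on a single loop. */
    static int perChannelQueueDepth(int channels) {
        ScheduledThreadPoolExecutor loop = new ScheduledThreadPoolExecutor(1);
        try {
            for (int i = 0; i < channels; i++) {
                // Long delay keeps every task parked in the queue while we count.
                loop.scheduleAtFixedRate(() -> { }, 1, 1, TimeUnit.HOURS);
            }
            return loop.getQueue().size();
        } finally {
            loop.shutdownNow();
        }
    }

    /** Per-loop design: one periodic task that visits every channel's check. */
    static int perLoopQueueDepth(int channels) {
        ScheduledThreadPoolExecutor loop = new ScheduledThreadPoolExecutor(1);
        try {
            Runnable[] checks = new Runnable[channels];
            for (int i = 0; i < channels; i++) {
                checks[i] = () -> { };
            }
            loop.scheduleAtFixedRate(() -> {
                for (Runnable check : checks) {
                    check.run();
                }
            }, 1, 1, TimeUnit.HOURS);
            return loop.getQueue().size();
        } finally {
            loop.shutdownNow();
        }
    }
}
```

A real test would need an equivalent way to observe the scheduled tasks owned by the PING health feature on each EventLoop; how that is exposed is left open here.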
References
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2PingHandler.java: per-channel scheduling in handlerAdded; pingTask field; maybeSendPing.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java: handler install path.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java: DEFAULT_HTTP2_MAX_CONNECTION_POOL_SIZE, PING defaults, and DEFAULT_HTTP2_ENABLED = false (H2 preview default-off).
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/Http2ConnectionConfig.java: isEffectivelyEnabled() (per-client override falling back to the global flag); setEnabled(...) javadoc noting the default flips to true post-preview.