
Add closed connection tracking with close reasons to DIAG HTTP server #55

@joamag

Description


Problem (Why)

In the infra-bemisc deployment, client connections from omni-gateway (Deno) to ldj.frontdoorhd.com (behind haproxy + netius proxy_c) are killed every 5-10 minutes without warning, despite KEEPALIVE_TIMEOUT=3600. HAProxy's idle timeout is set to 2 hours, ruling it out as the cause. Diagnosing this is difficult because the DIAG HTTP server only exposes currently active connections: once a connection closes, all context is lost. Log-based diagnostics have proven inefficient given the high traffic volume. We need a way to inspect recently closed connections and their close reasons via the DIAG HTTP endpoint to identify the root cause of these disconnections.

Description (What)

Add a ring buffer of recently closed connections to the DIAG system, capturing close reason, timestamps, duration, last activity time, error details, and paired connection ID (for proxy correlation). Expose this via a new GET /connections/closed endpoint on DiagApp. Close reasons will be string constants (e.g., "timeout", "client_eof", "upstream_error", "error", "explicit"). The ring buffer defaults to 512 entries, configurable via DIAG_CLOSED_MAX. Tracking is active when running under DIAG mode. The endpoint returns the full buffer, most recent first.
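The ring buffer semantics described above can be sketched as follows. This is an illustrative standalone sketch, not netius's actual API: the class name `ClosedConnectionBuffer` and the environment-variable lookup are assumptions (the real implementation would read `DIAG_CLOSED_MAX` through the netius conf system).

```python
import collections
import os
import time

# Default ring size, matching the proposed DIAG_CLOSED_MAX default.
DIAG_CLOSED_MAX_DEFAULT = 512


class ClosedConnectionBuffer:
    """Illustrative ring buffer of closed-connection snapshots.
    Hypothetical name; netius would size this via its conf system."""

    def __init__(self, max_size=None):
        if max_size is None:
            max_size = int(os.environ.get("DIAG_CLOSED_MAX", DIAG_CLOSED_MAX_DEFAULT))
        # deque(maxlen=N) silently evicts the oldest entry on overflow
        self._entries = collections.deque(maxlen=max_size)

    def push(self, snapshot):
        # snapshot is a plain dict captured at close time (the connection's
        # info_dict() plus close metadata), so it outlives the connection object
        snapshot.setdefault("close_timestamp", time.time())
        self._entries.append(snapshot)

    def list(self):
        # full buffer contents, most recent first, as the endpoint would return
        return list(reversed(self._entries))
```

With `maxlen` set, overflow handling comes for free from `collections.deque`: appending to a full deque drops the entry at the opposite end, so the buffer always holds the N most recently closed connections.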

Implementation (How)

  1. Define close reason constants in src/netius/base/conn.py — add the string constants "timeout", "client_eof", "upstream_error", "error", "explicit", and "unknown" (plus others as needed)
  2. Add a close_reason field to BaseConnection — initialize it to None, set it before close() is called; include close_reason, close_timestamp, and last_activity_timestamp in the info_dict() output
  3. Implement a ring buffer for closed connections in src/netius/base/common.py (or a new utility) — use collections.deque(maxlen=N) sized from the DIAG_CLOSED_MAX conf (default 512); store a snapshot dict of the connection's info_dict() plus close metadata at close time
  4. Hook into Base.on_connection_d() — when DIAG is active, capture the closed connection's info dict (close reason, close timestamp, connection duration, last activity timestamp, error details) and append it to the ring buffer
  5. Propagate close reasons at all close call sites — audit BaseConnection.close(), the timeout handlers, and the EOF/error handlers in src/netius/base/common.py, ensuring each sets close_reason before closing
  6. Propagate close reasons in the proxy server — in src/netius/servers/proxy.py, set appropriate close reasons in _on_prx_close() (upstream error), _on_raw_close() (tunnel close), on_connection_d(), and on_stream_d(); include the paired/correlated connection ID in the close metadata
  7. Add the paired connection ID to proxy close records — when a proxy frontend or backend connection closes, include the paired connection's ID (from conn_map) in the close snapshot so frontend/backend closures can be correlated
  8. Add a GET /connections/closed endpoint to DiagApp in src/netius/base/diag.py — return the full ring buffer contents as JSON, most recent first
  9. Add DIAG_CLOSED_MAX conf support — read it from the netius conf system, defaulting to 512, and use it to size the deque
  10. Test — add tests for ring buffer behavior (overflow, ordering), close reason propagation, and the new DIAG endpoint
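Steps 1-2 and 4-7 above can be sketched together. The `Connection` class below is a hypothetical stand-in for BaseConnection showing only the proposed new fields and the close-time snapshot; it is not netius's real connection API, and the constant names are assumptions based on the string values listed in the issue.

```python
import time

# Proposed close reason constants (step 1); the string values come from the issue.
REASON_TIMEOUT = "timeout"
REASON_CLIENT_EOF = "client_eof"
REASON_UPSTREAM_ERROR = "upstream_error"
REASON_ERROR = "error"
REASON_EXPLICIT = "explicit"
REASON_UNKNOWN = "unknown"


class Connection:
    """Hypothetical stand-in for BaseConnection, illustrating only the
    new close-tracking fields, not the real netius class."""

    def __init__(self, id, paired_id=None):
        self.id = id
        self.paired_id = paired_id  # correlated proxy connection (from conn_map)
        self.open_timestamp = time.time()
        self.last_activity_timestamp = self.open_timestamp
        self.close_reason = None  # set by the close call site (step 5)
        self.close_timestamp = None
        self.error = None

    def close(self, reason=None, error=None):
        # call sites pass a reason; anything unattributed falls back to "unknown"
        self.close_reason = reason or self.close_reason or REASON_UNKNOWN
        self.close_timestamp = time.time()
        self.error = error

    def info_dict(self):
        # the snapshot dict appended to the ring buffer at close time (step 4)
        return dict(
            id=self.id,
            paired_id=self.paired_id,
            close_reason=self.close_reason,
            close_timestamp=self.close_timestamp,
            last_activity_timestamp=self.last_activity_timestamp,
            duration=(self.close_timestamp or time.time()) - self.open_timestamp,
            error=self.error,
        )
```

Because the snapshot is a plain dict taken at close time, the DIAG endpoint can serve it as JSON long after the connection object itself has been garbage collected, and the `paired_id` field lets a frontend closure be matched to its backend counterpart.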

Metadata


Assignees

Labels

enhancement (New feature or request), p-high (High priority issue)
