fix(dlt-receive): set SO_RCVTIMEO and SO_KEEPALIVE to enable -r reconnect on half-open TCP #845

Open
aki1770-del wants to merge 1 commit into COVESA:master from aki1770-del:fix/dlt-client-reconnect-halfopen-817

@aki1770-del

Contributor

fix(dlt-receive): set SO_RCVTIMEO and SO_KEEPALIVE in dlt_client_connect to enable -r reconnect on half-open TCP

Fixes #817.

Problem

dlt-receive -r <interval> is supposed to reconnect automatically when the
DLT daemon becomes unavailable. This works correctly for clean disconnects
(RST/FIN received): dlt_receiver_receive() returns ≤0, dlt_client_main_loop
exits, and the reconnect loop fires after the specified interval.

However, when the network path fails silently (half-open TCP, common in
VM/container restarts and network partitions), the kernel socket stays in
the ESTABLISHED state. dlt_receiver_receive() calls recv(), which blocks
forever. The reconnect interval is never consulted.

Fix

Three files changed:

include/dlt/dlt_client.h: Add recv_timeout_sec field to DltClient
struct (0 = no timeout, backwards-compatible default).

src/lib/dlt_client.c — in dlt_client_connect(), after blocking mode
is restored on a successful TCP connect:

  • If client->recv_timeout_sec > 0: call setsockopt(SO_RCVTIMEO) with the
    configured timeout. When recv() times out it returns -1 with errno set to
    EAGAIN or EWOULDBLOCK, which dlt_receiver_receive() maps to a ≤0 return,
    causing dlt_client_main_loop to exit and the reconnect loop to fire.
  • Always set SO_KEEPALIVE (+ TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT
    where available) as a second line of defence.
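The two socket options above can be sketched roughly as follows. This is a minimal illustration and not the code in the patch: the helper name apply_halfopen_guards and the keepalive values (10 s idle, 5 s interval, 3 probes) are assumptions made for the example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper (not part of dlt-daemon): apply a receive timeout
 * when timeout_sec > 0 and always enable TCP keepalive on fd.
 * Returns 0 on success, -1 if a required setsockopt() call fails. */
static int apply_halfopen_guards(int fd, int timeout_sec)
{
    int on = 1;

    if (timeout_sec > 0) {
        struct timeval tv;
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        /* After this, a blocked recv() fails with EAGAIN/EWOULDBLOCK
         * once timeout_sec elapses without data. */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
            return -1;
    }

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

#if defined(TCP_KEEPIDLE) && defined(TCP_KEEPINTVL) && defined(TCP_KEEPCNT)
    {
        /* Illustrative tuning values, assumed for this sketch only.
         * Failures are ignored because the tuning is best-effort. */
        int idle = 10, intvl = 5, cnt = 3;
        (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        (void)setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    }
#endif

    return 0;
}
```

Passing timeout_sec = 0 leaves recv() fully blocking, which mirrors the backwards-compatible default described for recv_timeout_sec.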

src/console/dlt-receive.c — before the reconnect loop: when -r is
active, set dltclient.recv_timeout_sec = rvalue / 1000 (minimum 5 s) so
the socket timeout matches the reconnect interval.
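As a small worked example of that conversion, assuming the -r value is given in milliseconds (as in "-r 5000 = 5 s"), the mapping might look like this. The function name is hypothetical and does not appear in the patch.

```c
/* Hypothetical helper: derive the socket receive timeout in seconds from
 * the -r reconnect interval in milliseconds, with a 5 s floor. */
static int recv_timeout_from_interval_ms(int interval_ms)
{
    int sec = interval_ms / 1000;   /* integer division truncates */
    return (sec < 5) ? 5 : sec;
}
```

With this mapping, -r 1000 and -r 5999 both yield a 5 s timeout, while -r 30000 yields 30 s.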

Behaviour

| Scenario | Before | After |
|---|---|---|
| Clean disconnect (RST/FIN) | Reconnect works | Unchanged |
| Half-open TCP (-r not set) | recv() blocks forever | recv() blocks forever (no change; timeout = 0) |
| Half-open TCP (-r 5000 = 5 s) | recv() blocks forever, -r ignored | recv() times out after 5 s, reconnect fires |

AI-assisted — authored with Claude, reviewed by Komada.

…nect on half-open TCP

Fixes COVESA#817. With half-open TCP (VM/container restart, network partition),
recv() blocks forever and the -r reconnect interval is never consulted.

Fix: add recv_timeout_sec field to DltClient (default 0 = no change).
In dlt_client_connect(): if recv_timeout_sec > 0, set SO_RCVTIMEO;
always set SO_KEEPALIVE + TCP_KEEPIDLE/INTVL/CNT where available.
In dlt-receive.c: wire recv_timeout_sec = rvalue/1000 (min 5s) before
reconnect loop when -r is active.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Akihiko Komada <aki1770@gmail.com>
@minminlittleshrimp

Collaborator

Hi @aki1770-del
Could you kindly test dlt-viewer with this patch? Is the component fetching logs normally?
Basically, dlt-viewer, dlt-receive, and the control clients (dlt-control, dlt-logstorage-ctrl, dlt-xxx-ctrl) might be affected, so please make sure some regression tests are covered here.
Thanks in advance

@aki1770-del

Contributor Author

Thanks @minminlittleshrimp — regression evidence below. It's from a clean local build on x64 Linux (CMake with the CI flags); CI on the PR is still pending, so this is the local run, not a green-CI signal.

Build: clean — cmake -B build -DCMAKE_BUILD_TYPE=Release -DWITH_DLT_COVERAGE=ON -DBUILD_GMOCK=OFF then build, exit 0, no warnings on the patched dlt_client.c / dlt-receive.c.

Existing test suite (regression): ctest passes 19/20 on this branch. The single failure, gtest_dlt_daemon_event_handler (t_dlt_connection_create.normal, a setsockopt on a test fd), reproduces identically on master without the patch (same test, same setsockopt assertion), so it is pre-existing and not introduced here. No new regression (19/20 → 19/20).

dlt-receive -r (the #817 fix): confirmed. Against a half-open server (accepts the connection, then never sends), dlt-receive -r 5000 now recovers — strace shows SO_RCVTIMEO = 5s set on connect, and the reconnect loop re-fires (~10s apart) instead of blocking on recv() forever.
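The timeout behavior itself can be reproduced in isolation with a peer that never sends. The sketch below is not the scripted simulation used for the result above; it substitutes an AF_UNIX socketpair for the half-open TCP server, and the function name is made up for the example.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if recv() on a silent peer fails with
 * EAGAIN/EWOULDBLOCK after the timeout, 0 if it returns any other way,
 * and -1 if the sockets could not be set up. */
static int silent_peer_recv_times_out(int timeout_sec)
{
    int sv[2];
    char buf[16];
    struct timeval tv;
    ssize_t n;
    int saved_errno;

    /* sv[1] stays open but never writes, standing in for a silent peer. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    if (setsockopt(sv[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    n = recv(sv[0], buf, sizeof(buf), 0);   /* blocks for up to timeout_sec */
    saved_errno = errno;                    /* save before close() can change it */

    close(sv[0]);
    close(sv[1]);

    return (n < 0 && (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK)) ? 1 : 0;
}
```

Without the SO_RCVTIMEO call, the same recv() would block indefinitely, which is the failure mode reported in #817.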

Control clients / backwards-compat (the now-unconditional SO_KEEPALIVE): dlt-control operates normally (set-all-log-level → exit 0, control message forwarded); strace confirms it gets SO_KEEPALIVE but not SO_RCVTIMEO (it never opts into the timeout), so its blocking-recv behavior is unchanged. Normal dlt-receive (no -r) connects and fetches traces fine, keepalive active on the socket.

dlt-viewer: it doesn't link dlt_client.c (it uses its own QTcpSocket), and the patch adds only setsockopt calls plus a struct field — no protocol/wire-byte changes — so it can't regress dlt-viewer's log-fetch path. I wasn't able to run a dlt-viewer smoke test on this machine (no Qt toolchain here); flagging that honestly rather than reporting a test I didn't run.

Caveats: local run, not CI-green yet; the half-open recovery is a scripted simulation + strace, not a real network partition; dlt-logstorage-ctrl wasn't built in this config (optional feature) — it shares the same dlt_client_connect() path as dlt-control, so the same keepalive-yes / rcvtimeo-no behavior applies, but I didn't exercise its binary.

On the SO_KEEPALIVE going unconditional: my reading is it's low-risk — control clients get keepalive but never opt into the SO_RCVTIMEO, so their blocking-recv behavior is unchanged (strace above). But you flagged it as a possible concern and you know the client matrix better than I do — if you'd rather it were opt-in, say the word and I'll gate it. Whichever you prefer.


Development

Successfully merging this pull request may close these issues.

[BUG] dlt-receive -r reconnect does not work when server connection hangs (no socket timeout / keepalive)