Skip to content

remote logger: Reconnect if writes have been stalled for too long#17528

Open
rgacogne wants to merge 3 commits into
PowerDNS:masterfrom
rgacogne:remote-logger-reconnect-stalled-writes
Open

remote logger: Reconnect if writes have been stalled for too long#17528
rgacogne wants to merge 3 commits into
PowerDNS:masterfrom
rgacogne:remote-logger-reconnect-stalled-writes

Conversation

@rgacogne

@rgacogne rgacogne commented Jun 5, 2026

Copy link
Copy Markdown
Member

Short description

In very limited testing this solves the issue reported in #17526

Checklist

I have:

  • read the CONTRIBUTING.md document
  • read and accepted the Developer Certificate of Origin document, including the AI Policy, and added a "Signed-off-by" to my commits
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)

Signed-off-by: Remi Gacogne <remi.gacogne@powerdns.com>
@coveralls

coveralls commented Jun 5, 2026

Copy link
Copy Markdown

Coverage Report for CI Build 27763802787

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage decreased (-0.009%) to 67.034%

Details

  • Coverage decreased (-0.009%) from the base build.
  • Patch coverage: 22 uncovered changes across 1 file (13 of 35 lines covered, 37.14%).
  • 2335 coverage regressions across 32 files.

Uncovered Changes

File Changed Covered %
pdns/remote_logger.cc 31 9 29.03%
Total (3 files) 35 13 37.14%

Coverage Regressions

2335 previously-covered lines in 32 files lost coverage.

Top 10 Files by Coverage Loss Lines Losing Coverage Coverage
pdns/dnsdistdist/dnsdist-lua.cc 418 52.0%
pdns/dnsdistdist/dnsdist-configuration-yaml.cc 367 55.56%
pdns/dnsdistdist/dnsdist-lua-ffi.cc 337 66.63%
modules/lmdbbackend/lmdbbackend.cc 241 72.0%
pdns/auth-main.cc 225 50.03%
pdns/tcpiohandler.cc 175 57.03%
pdns/dnsdistdist/dnsdist-nghttp2.cc 128 68.55%
pdns/dnsdistdist/dnsdist.cc 104 68.33%
pdns/libssl.cc 66 57.5%
pdns/dnsdistdist/dnsdist-web.cc 55 77.32%

Coverage Stats

Coverage Status
Relevant Lines: 171110
Covered Lines: 126323
Line Coverage: 73.83%
Relevant Branches: 81854
Covered Branches: 43250
Branch Coverage: 52.84%
Branches in Coverage %: Yes
Coverage Strength: 5495415.15 hits per line

💛 - Coveralls

@omoerbeek omoerbeek self-requested a review June 11, 2026 05:27
@omoerbeek

Copy link
Copy Markdown
Member

Did a few tests using Recursor, first using the master branch end then with the PR.

Test setup: a rec producing many protobuf messages and one or two instances of the python protobuf receiver from distrib.

When running master, after a ^Z on the running protobuf receiver and starting a second receiver script a switch to the new one does not happen. Rec keeps on writing to the first receiver, until the TCP buffer gets full. It then starts dropping messages.

When running the PR, after a while (you can see the TCP buffer getting filled using netstat) rec sees the buffer full and writes no longer succeeding and rec concludes the queue is stalled. Rec closes the connection and connects to the other instance of the protobuf receiver and continues writing protobuf messages.

So from a testing perspective this works as advertised.

@rgacogne rgacogne marked this pull request as ready for review June 11, 2026 11:15
@rgacogne

Copy link
Copy Markdown
Member Author

Thanks, Otto! Do you think the logic is correct? And if so, do you think we need to make the threshold configurable?

@omoerbeek

Copy link
Copy Markdown
Member

Thanks, Otto! Do you think the logic is correct? And if so, do you think we need to make the threshold configurable?

I did not look at the code yet in detail. The 5s used seems arbitrary, making it configurable is probably good.

@omoerbeek omoerbeek left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic looks good, but s_stalledTimeoutSeconds should be made configurable.

rgacogne added 2 commits June 18, 2026 15:42
Signed-off-by: Remi Gacogne <remi.gacogne@powerdns.com>
…imeout"

Signed-off-by: Remi Gacogne <remi.gacogne@powerdns.com>
@rgacogne

Copy link
Copy Markdown
Member Author

Logic looks good, but s_stalledTimeoutSeconds should be made configurable.

Done! I plugged that new setting into dnsdist's configuration, but no into the recursor's one, as I wasn't sure how you wanted to handle it.

@rgacogne rgacogne requested a review from omoerbeek June 18, 2026 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants