remote logger: Reconnect if writes have been stalled for too long#17528
rgacogne wants to merge 3 commits into
Signed-off-by: Remi Gacogne <remi.gacogne@powerdns.com>
Coverage Report for CI Build 27763802787
Warning: Build has drifted. This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Coverage decreased (-0.009%) to 67.034%.
Coverage Regressions: 2335 previously-covered lines in 32 files lost coverage.
Did a few tests using Recursor, first using the master branch and then with the PR. Test setup: a rec producing many protobuf messages and one or two instances of the Python protobuf receiver from
When running master, after a ^Z on the running protobuf receiver and starting a second receiver script, a switch to the new one does not happen. Rec keeps on writing to the first receiver until the TCP buffer gets full. It then starts dropping messages.
When running the PR, after a while (you can see the TCP buffer filling up using netstat) rec sees that the buffer is full and that writes no longer succeed, and concludes the queue is stalled. Rec closes the connection, connects to the other instance of the protobuf receiver and continues writing protobuf messages.
So from a testing perspective this works as advertised.
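The behaviour described in this test can be pictured with a minimal sketch of stall detection. This is not the code from the PR: the class name `StallDetector` and its methods are hypothetical, and the only value taken from the discussion is the 5 second threshold. The idea is to remember when writes first stopped making progress and to ask for a reconnect once that state has lasted at least the timeout.

```cpp
#include <chrono>
#include <optional>

// Hypothetical helper for illustration only, not the remote logger's actual code.
class StallDetector
{
public:
  using Clock = std::chrono::steady_clock;

  explicit StallDetector(std::chrono::seconds timeout) :
    d_timeout(timeout)
  {
  }

  // Record the outcome of one write attempt made at time `now`.
  // `progressed` is true if at least some queued data was written.
  // Returns true when the connection should be closed and re-established.
  bool onWriteAttempt(bool progressed, Clock::time_point now)
  {
    if (progressed) {
      // Any successful write clears the stalled state.
      d_stalledSince.reset();
      return false;
    }
    if (!d_stalledSince) {
      // First failed write: remember when the stall began.
      d_stalledSince = now;
      return false;
    }
    // Still stalled: reconnect once the stall has lasted long enough.
    return (now - *d_stalledSince) >= d_timeout;
  }

  // Call after a new connection has been established.
  void onReconnect()
  {
    d_stalledSince.reset();
  }

private:
  std::chrono::seconds d_timeout;
  std::optional<Clock::time_point> d_stalledSince;
};
```

A caller would construct it with the chosen threshold, for example `StallDetector detector(std::chrono::seconds(5));`, pass the result of each write attempt to `onWriteAttempt()`, and reconnect to the next receiver when it returns true. Taking the time point as a parameter keeps the logic testable without sleeping.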
Thanks, Otto! Do you think the logic is correct? And if so, do you think we need to make the threshold configurable?
I have not looked at the code in detail yet. The 5s used seems arbitrary, so making it configurable is probably a good idea.
omoerbeek left a comment
Logic looks good, but s_stalledTimeoutSeconds should be made configurable.
Signed-off-by: Remi Gacogne <remi.gacogne@powerdns.com>
…imeout" Signed-off-by: Remi Gacogne <remi.gacogne@powerdns.com>
Done! I plugged that new setting into dnsdist's configuration, but not into the recursor's, as I wasn't sure how you wanted to handle it.
Short description
In very limited testing this solves the issue reported in #17526
Checklist
I have: