Skip to content

Enable TCP keepalive during connreq processing, otherwise #11058

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 30, 2025

Conversation

wangdi1
Copy link
Contributor

@wangdi1 wangdi1 commented May 22, 2025

if the remote peer is restarted between getting connreq and replying, it may hang there to wait for the connreq reply. Since the ep state(req_done) will not allow sending new reqs, and socket state will not be reset, i.e. progress will not be able to detect the disconnection.

@wangdi1
Copy link
Contributor Author

wangdi1 commented May 22, 2025

@shefty @j-xiong @ooststep please check. thanks.

j-xiong
j-xiong previously approved these changes May 22, 2025
Copy link
Contributor

@j-xiong j-xiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

Copy link
Contributor

@ooststep ooststep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change looks good to me, aside from declaring the enable call so that it's accessible to the cm

@j-xiong
Copy link
Contributor

j-xiong commented May 23, 2025

Please squash the commits.

@wangdi1
Copy link
Contributor Author

wangdi1 commented May 23, 2025

Please squash the commits.

sure

@@ -203,9 +203,12 @@ void xnet_req_done(struct xnet_ep *ep)
ret = xnet_req_done_internal(ep);
if (ret)
goto disable;

xnet_disable_keepalive(ep);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this can move to near the top of the function, so it doesn't need a separate call in the disable path below (which probably isn't needed anyway).

if (ret) {
FI_WARN(&xnet_prov, FI_LOG_EP_CTRL, "%p set tcp keepalive failure:%d\n",
ep, ret);
return ofi_sockerr() ? -ofi_sockerr() : -FI_EINVAL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xnet_enable_keepalive() already converts to the correct errno. Just pass through.

@shefty
Copy link
Member

shefty commented May 27, 2025

For the commit message, please add a short subject line, such as: "prov/tcp: Enable keepalive during CM exchange". This should be the top line of the commit.

Enable TCP keepalive during connreq processing, otherwise
if the remote peer is restarted between getting connreq and
replying, it may hang there to wait for the connreq reply.
Since the ep state(req_done) will not allow sending new reqs,
and socket state will not be reset, i.e. progress will not be
able to detect the disconnection.

Signed-off-by: Di Wang <[email protected]>
@wangdi1 wangdi1 requested a review from shefty May 27, 2025 20:52
Comment on lines +178 to +182
int optval = 1;
int idle_time = 5;
int keep_intvl = 2;
int keep_cnt = 2;
int ret;
Copy link
Member

@sydidelot sydidelot May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to make these parameters configurable via environment variables? I implemented a similar feature for the socket provider about 8 years ago in this commit.

Also worth noting: the TCP provider now supports the FI_GET_FD flag with fi_control(), which enables applications to fine-tune the TCP socket for a given endpoint: #11003

Copy link
Contributor Author

@wangdi1 wangdi1 May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to make these parameters configurable via environment variables?

I thought about this, though this will only be used during CM exchange, maybe using fixed value is good enough for now. Though I may miss some use cases.

@j-xiong
Copy link
Contributor

j-xiong commented May 30, 2025

@ooststep Could you update your review?

@j-xiong j-xiong merged commit 1f236dc into ofiwg:main May 30, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants