Update descriptions for connection timeouts and tcp #6666
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0, 15 suites ±0, 6h 42m 8s ⏱️ + 3m 9s
For more details on these failures, see this check.
Results for commit c6210a4. ± Comparison against base commit 9b8c3e9.
Do you think the descriptions should include a note about being potentially useful when resolving deadlocks?
There is an effect that if the timeouts are configured too large, this could be perceived as a deadlock. Is this what you are referring to?
I think a warning about too large values is appropriate. Something like
Note: If values are chosen too large, a cluster may appear to be stuck if individual workers or the scheduler are waiting for timeouts to expire.
tcp:
  type: string
  description: |
    Timeout after which to error when creating a TCP/Socket connection
This is not accurate. This timeout sets a couple of kernel-level timeouts that take effect once a connection is established.
Specifically, it sets TCP_USER_TIMEOUT (see https://man7.org/linux/man-pages/man7/tcp.7.html) and configures a KEEPALIVE probe with appropriate intervals. I'm not sure if it is worth going into that much detail, though.
The combination of these two settings ensures that a TCP connection is automatically closed if the remote is dead, or rather, if the kernel hasn't seen any TCP packet acknowledged within the TCP_USER_TIMEOUT period, which very likely means the remote is dead.
We use this mechanism, for instance, to infer whether or not a worker died.
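To make the mechanism above concrete, here is a minimal sketch of setting these kernel options on a plain socket. It is not the project's actual implementation: the helper names `keepalive_plan` and `apply_tcp_timeout`, the default of 10 probes, and the way the budget is split between idle time and probe interval are all assumptions chosen for illustration. Per tcp(7), TCP_USER_TIMEOUT is given in milliseconds while the keepalive options are in seconds.

```python
import socket


def keepalive_plan(timeout_s, nprobes=10):
    """Split a total timeout (whole seconds) into (idle, interval, nprobes).

    Hypothetical helper: the split is chosen so that
    idle + interval * nprobes == timeout_s.
    """
    if timeout_s < nprobes + 1:
        raise ValueError("timeout too short for the requested number of probes")
    interval = max(1, timeout_s // (nprobes + 1))
    idle = timeout_s - interval * nprobes
    return idle, interval, nprobes


def apply_tcp_timeout(sock, timeout_s, nprobes=10):
    """Enable keepalive probes and TCP_USER_TIMEOUT where the platform has them."""
    idle, interval, nprobes = keepalive_plan(timeout_s, nprobes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    options = [
        ("TCP_KEEPIDLE", idle),                  # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),             # seconds between probes
        ("TCP_KEEPCNT", nprobes),                # unanswered probes before the connection is dropped
        ("TCP_USER_TIMEOUT", timeout_s * 1000),  # milliseconds; Linux only
    ]
    for name, value in options:
        opt = getattr(socket, name, None)  # not every platform defines every option
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return idle, interval, nprobes
```

With a 30 second budget and 10 probes, this sketch waits 10 seconds before the first probe and then probes every 2 seconds, so a dead remote is detected after roughly the configured timeout.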
Timeout after which to error when establishing a connection.
For example, when creating a connection between client and worker,
client and scheduler, etc.
Suggested change:
- Timeout after which to error when establishing a connection.
- For example, when creating a connection between client and worker,
- client and scheduler, etc.
+ All connection attempts are retried until this timeout expires before an
+ exception is raised.
+ For example, when creating a connection between client and worker,
+ client and scheduler, etc.
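The "retried until this timeout expires" behaviour can be illustrated with a small, self-contained sketch. This is not the library's code: `connect_with_deadline`, its `attempt` callable, and the backoff parameters are hypothetical names and values picked for the example. It only shows the general pattern of retrying against a fixed deadline and raising once the deadline has passed.

```python
import random
import time


def connect_with_deadline(attempt, timeout, base_delay=0.01, max_delay=1.0):
    """Call attempt() until it succeeds or `timeout` seconds have elapsed.

    `attempt` is any zero-argument callable that raises OSError on failure
    (hypothetical stand-in for a single connection attempt).
    """
    deadline = time.monotonic() + timeout
    delay = base_delay
    last_error = None
    while True:
        try:
            return attempt()
        except OSError as exc:
            last_error = exc  # remember why the latest attempt failed
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"could not connect within {timeout} s") from last_error
        # Exponential backoff with jitter, never sleeping past the deadline.
        time.sleep(min(delay * random.uniform(0.5, 1.5), remaining))
        delay = min(delay * 2, max_delay)
```

A consequence of this pattern, in line with the warning discussed earlier in the thread, is that a very large timeout keeps the caller retrying silently for a long time before any exception surfaces.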
Closes #6636
pre-commit run --all-files
cc @gjoseph92 @fjetter