Skip to content

Update descriptions for connection timeouts and tcp #6666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

quasiben
Copy link
Member

@quasiben quasiben commented Jul 1, 2022

Closes #6636

  • Tests added / passed
  • Passes pre-commit run --all-files

cc @gjoseph92 @fjetter

Do you think the descriptions should include a note about being potentially useful when resolving deadlocks ?

@github-actions
Copy link
Contributor

github-actions bot commented Jul 1, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 42m 8s ⏱️ + 3m 9s
  2 909 tests ±0    2 823 ✔️  - 1    82 💤 ±0  4 +2 
21 540 runs  ±0  20 582 ✔️  - 5  953 💤 +3  5 +3 

For more details on these failures, see this check.

Results for commit c6210a4. ± Comparison against base commit 9b8c3e9.

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the descriptions should include a note about being potentially useful when resolving deadlocks ?

There is an effect that if the timeouts are configured too large, this could be perceived as a deadlock. Is this what you are referring to?

I think a warning about too large values is appropriate. Something like

Note: If values are chosen too large, a cluster may appear to be stuck if individual workers or the scheduler are waiting for timeouts to expire.

tcp:
type: string
description: |
Timeout after which to error when creating a TCP/Socket connection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not accurate. This timeout sets a couple of kernel level timeouts that take effect once a connection is established.

Specifically, it sets TCP_USER_TIMEOUT (See https://man7.org/linux/man-pages/man7/tcp.7.html) and configures a KEEPALIVE probe with appropriate intervals. I'm not sure if it is worth it to go into that much detail, though.

The combination of these two settings ensures that a TCP connection is automatically closed if the remote is dead, or rather, the kernel hasn't acknowledged any TCP package in TCP_USER_TIMEOUTs which very likely means the remote is dead.

We use this mechanism, for instance, to infer whether or not a worker died.

Comment on lines +758 to +760
Timeout after which to error when estabilishing a connection.
For example, when creating a connection between client and worker,
client and scheduler, etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Timeout after which to error when estabilishing a connection.
For example, when creating a connection between client and worker,
client and scheduler, etc.
All connection attempts are retried until this timeout expires before an
exception is raised.
For example, when creating a connection between client and worker,
client and scheduler, etc.

@quasiben quasiben closed this Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document tcp and connect timeout paramters
2 participants