Update descriptions for connection timeouts and tcp #6666

quasiben · 2022-07-01T19:19:54Z

Tests added / passed
Passes pre-commit run --all-files

Do you think the descriptions should include a note about being potentially useful when resolving deadlocks?

github-actions · 2022-07-01T20:25:29Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    15 files  ±0      15 suites ±0   6h 42m 8s ⏱️ +3m 9s
 2 909 tests  ±0   2 823 ✔️ -1      82 💤 ±0   4 ❌ +2
21 540 runs   ±0  20 582 ✔️ -5     953 💤 +3   5 ❌ +3

For more details on these failures, see this check.

Results for commit c6210a4. ± Comparison against base commit 9b8c3e9.

fjetter

Do you think the descriptions should include a note about being potentially useful when resolving deadlocks?

If the timeouts are configured too large, the time spent waiting for them to expire can be perceived as a deadlock. Is this what you are referring to?

I think a warning about too large values is appropriate. Something like:

Note: If values are chosen too large, a cluster may appear to be stuck if individual workers or the scheduler are waiting for timeouts to expire.

fjetter · 2022-07-04T08:38:39Z

distributed/distributed-schema.yaml

              tcp:
                type: string
+                description: |
+                  Timeout after which to error when creating a TCP/Socket connection


This is not accurate. This timeout sets a couple of kernel-level timeouts that take effect once a connection is established.

Specifically, it sets TCP_USER_TIMEOUT (see https://man7.org/linux/man-pages/man7/tcp.7.html) and configures a KEEPALIVE probe with appropriate intervals. I'm not sure if it is worth it to go into that much detail, though.

The combination of these two settings ensures that a TCP connection is automatically closed if the remote is dead, or rather, if the kernel hasn't received an acknowledgement for any TCP packet within TCP_USER_TIMEOUT, which very likely means the remote is dead.

We use this mechanism, for instance, to infer whether or not a worker died.
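
For readers unfamiliar with these socket options, here is a minimal sketch of how a single timeout value could be translated into keepalive settings plus TCP_USER_TIMEOUT. It is an illustration only, not the code in distributed: the helper names `keepalive_plan` and `apply_tcp_timeout`, the default of 10 probes, and the way the timeout is split between idle time and probe interval are all assumptions. The option names are the Linux ones; options missing on a platform are skipped.

```python
import socket


def keepalive_plan(timeout_s: int, nprobes: int = 10) -> dict:
    """Hypothetical helper: split a total timeout (seconds) into keepalive settings."""
    if timeout_s < nprobes + 1:
        raise ValueError("timeout too low for the requested number of probes")
    # Seconds between two consecutive keepalive probes (assumed split).
    interval = max(1, timeout_s // (2 * nprobes))
    # Quiet time before the first probe, so that idle + probes == timeout_s.
    idle = timeout_s - interval * nprobes
    return {
        "idle": idle,
        "interval": interval,
        "nprobes": nprobes,
        "user_timeout_ms": timeout_s * 1000,  # TCP_USER_TIMEOUT is in milliseconds
    }


def apply_tcp_timeout(sock: socket.socket, timeout_s: int, nprobes: int = 10) -> dict:
    """Hypothetical helper: enable keepalive and set the timeouts on a TCP socket."""
    plan = keepalive_plan(timeout_s, nprobes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    options = (
        ("TCP_KEEPIDLE", plan["idle"]),
        ("TCP_KEEPINTVL", plan["interval"]),
        ("TCP_KEEPCNT", plan["nprobes"]),
        ("TCP_USER_TIMEOUT", plan["user_timeout_ms"]),
    )
    for name, value in options:
        opt = getattr(socket, name, None)  # not every platform defines every option
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return plan
```

With settings like these, the kernel closes the connection on its own once the remote stops acknowledging data, and the application sees an error on the socket instead of waiting indefinitely.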

fjetter · 2022-07-04T08:41:35Z

distributed/distributed-schema.yaml

+                  Timeout after which to error when establishing a connection.
+                  For example, when creating a connection between client and worker,
+                  client and scheduler, etc.


Suggested change

Timeout after which to error when establishing a connection.
All connection attempts are retried until this timeout expires before an
exception is raised.
For example, when creating a connection between client and worker,
client and scheduler, etc.
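
To make the "retried until this timeout expires" wording concrete, here is a rough sketch of a retry-until-deadline loop. It is not the connect logic in distributed: the function name `connect_with_retry`, the `attempt` callable, and the backoff values are assumptions chosen for illustration.

```python
import asyncio
import time


async def connect_with_retry(attempt, timeout: float, backoff: float = 0.01):
    """Hypothetical helper: retry `attempt()` until it succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            # Each attempt may use at most the time left before the deadline.
            return await asyncio.wait_for(attempt(), timeout=remaining)
        except (OSError, asyncio.TimeoutError) as exc:
            last_error = exc
            # Sleep briefly before retrying, never past the deadline.
            pause = min(backoff, max(deadline - time.monotonic(), 0))
            await asyncio.sleep(pause)
            backoff = min(backoff * 2, 1.0)
    # Only now, after the whole timeout has been used up, is an error raised.
    raise OSError(f"timed out after {timeout}s trying to connect") from last_error
```

The caller sees a single exception only after the full timeout has passed, which is also why a very large value can make a cluster look stuck while it is in fact still retrying.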

update descriptions for connection timeouts and tcp

c6210a4

fjetter reviewed Jul 4, 2022

quasiben closed this Jan 7, 2025
