@c98 c98 commented May 13, 2025

#824
This commit aims to fix a problem where Worker::poll_next_msg can block service.poll_ready. This matters especially when the service is a Balance service: when multiple pending endpoints become ready, the task stays blocked on Worker::poll_next_msg if no other message arrives, which causes these endpoints to be disconnected.

@seanmonstar
Collaborator

Thanks! Looks like CI failed...

Also, is there a way to update the tests to check this change?

Adjust the buffer tests.

Remove the `propagates_trace_spans` test, because the worker's service
`poll_ready` now runs outside the request processing.
@c98
Author

c98 commented May 14, 2025

Thanks for your attention. I have updated the tests to check this change.

@seanmonstar
Collaborator

@cratelyn do yall have a more extensive suite that includes balancing that can test this change out?

@cratelyn
Contributor

> @cratelyn do yall have a more extensive suite that includes balancing that can test this change out?

we do! i'll try running the linkerd2-proxy tests with this patch applied, pardon the wait.

Contributor

@cratelyn cratelyn left a comment

i left some notes below. my main questions and concerns about this change are about how many of the test suite's assertions (and a test case) are removed in this change.

if changes to the test suite are required here, is this a breaking change to this middleware's behavior?

Comment on lines -378 to -395
#[tokio::test(flavor = "current_thread")]
async fn propagates_trace_spans() {
    use tower::util::ServiceExt;
    use tracing::Instrument;

    let _t = support::trace_init();

    let span = tracing::info_span!("my_span");

    let service = support::AssertSpanSvc::new(span.clone());
    let (service, worker) = Buffer::pair(service, 5);
    let worker = tokio::spawn(worker);

    let result = tokio::spawn(service.oneshot(()).instrument(span));

    result.await.expect("service panicked").expect("failed");
    worker.await.expect("worker panicked");
}
Contributor

may i ask why this test was deleted?

Author

this commit changes the worker's execution order to:

  1. service.poll_ready
  2. poll_next_msg
  3. service.call

the span carried in a message can only be retrieved after step 2 (poll_next_msg), so service.call now runs inside the span scope but service.poll_ready does not. as a result, AssertSpanSvc no longer fits this case, so i deleted the test.
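
The span-scope point can be sketched with a std-only model. This is an illustration under stated assumptions, not the Buffer worker's code: `CURRENT_SPAN`, `in_span`, `Message`, `Observed`, `worker_old`, and `worker_new` are hypothetical names, and a thread-local string stands in for a real tracing span.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical stand-in for the "currently entered span".
    static CURRENT_SPAN: Cell<Option<&'static str>> = Cell::new(None);
}

/// Runs `f` with `span` entered, restoring the previous span afterwards.
fn in_span<T>(span: &'static str, f: impl FnOnce() -> T) -> T {
    let prev = CURRENT_SPAN.with(|s| s.replace(Some(span)));
    let out = f();
    CURRENT_SPAN.with(|s| s.set(prev));
    out
}

fn current_span() -> Option<&'static str> {
    CURRENT_SPAN.with(|s| s.get())
}

/// Hypothetical message: the request's span travels with it.
struct Message {
    span: &'static str,
}

/// Which span each step observed.
#[derive(Debug, PartialEq)]
struct Observed {
    poll_ready: Option<&'static str>,
    call: Option<&'static str>,
}

/// Old order: the message is received first, so both steps run in its span.
fn worker_old(msg: Message) -> Observed {
    in_span(msg.span, || Observed {
        poll_ready: current_span(),
        call: current_span(),
    })
}

/// New order: readiness is polled before any message (and its span) exists.
fn worker_new(msg: Message) -> Observed {
    let poll_ready = current_span();
    let call = in_span(msg.span, current_span);
    Observed { poll_ready, call }
}

fn main() {
    println!("old: {:?}", worker_old(Message { span: "my_span" }));
    println!("new: {:?}", worker_new(Message { span: "my_span" }));
}
```

In this model the new order observes no span during the readiness poll, which is the condition a span-asserting test service would reject.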

Comment on lines -348 to -353
let err = assert_ready_err!(response2.poll());
assert!(
    err.is::<error::ServiceError>(),
    "response should fail with a ServiceError, got: {:?}",
    err
);
Contributor

why is this no longer the case?

Author

as mentioned above, the old exec order is

  1. poll_next_msg
  2. service.poll_ready
  3. service.call

the new exec order is

  1. service.poll_ready
  2. poll_next_msg
  3. service.call

with the new order, the worker no longer holds a 'preloaded' message when service.poll_ready runs, so i deleted this assertion.
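
The two orders can be compared with a small synchronous model. This is a sketch under assumptions, not the Buffer implementation: `MockService`, `poll_old`, and `poll_new` are hypothetical names, `None` stands in for `Poll::Pending`, and pushing the message back onto the queue only approximates a worker holding a preloaded message.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the inner service; counts readiness polls.
struct MockService {
    ready: bool,
    ready_polls: usize,
}

impl MockService {
    fn poll_ready(&mut self) -> bool {
        self.ready_polls += 1;
        self.ready
    }

    fn call(&mut self, msg: &str) -> String {
        format!("handled {msg}")
    }
}

/// Old order: poll_next_msg, then service.poll_ready, then service.call.
fn poll_old(svc: &mut MockService, queue: &mut VecDeque<&'static str>) -> Option<String> {
    // 1. With an empty queue we return here, so the service is never polled.
    let msg = queue.pop_front()?;
    // 2. The message is already held ("preloaded") while readiness is checked.
    if !svc.poll_ready() {
        queue.push_front(msg);
        return None;
    }
    // 3. Dispatch the held message.
    Some(svc.call(msg))
}

/// New order: service.poll_ready, then poll_next_msg, then service.call.
fn poll_new(svc: &mut MockService, queue: &mut VecDeque<&'static str>) -> Option<String> {
    // 1. Readiness is driven even when no message is queued.
    if !svc.poll_ready() {
        return None;
    }
    // 2. Only now is a message taken, so nothing is preloaded before step 1.
    let msg = queue.pop_front()?;
    // 3. Dispatch the message.
    Some(svc.call(msg))
}

fn main() {
    let mut queue = VecDeque::new();

    let mut old_svc = MockService { ready: true, ready_polls: 0 };
    let _ = poll_old(&mut old_svc, &mut queue);
    println!("old order, empty queue: {} readiness polls", old_svc.ready_polls);

    let mut new_svc = MockService { ready: true, ready_polls: 0 };
    let _ = poll_new(&mut new_svc, &mut queue);
    println!("new order, empty queue: {} readiness polls", new_svc.ready_polls);

    queue.push_back("hello");
    println!("{:?}", poll_new(&mut new_svc, &mut queue));
}
```

With an empty queue, the old order in this model never reaches the service's readiness check, while the new order polls it once. That mirrors the motivation given in the PR description.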

Comment on lines -220 to -228
assert_pending!(worker.poll());
handle
    .next_request()
    .await
    .unwrap()
    .1
    .send_response("world");
assert_pending!(worker.poll());
assert_ready_ok!(response4.poll());
Contributor

why are these assertions no longer needed?

Author

because no further request will be issued in this case, so there is no need for these assertions.

@cratelyn
Contributor

if this change is focused on solving how this middleware interacts with a Balance service, i might suggest adding a test case that exercises the buffer with a Balance service.

@cratelyn
Contributor

> @cratelyn do yall have a more extensive suite that includes balancing that can test this change out?
>
> we do! i'll try running the linkerd2-proxy tests with this patch applied, pardon the wait.

@seanmonstar i wasn't able to get a patched version of the linkerd2-proxy building with this patch. we're on tonic v0.12 at the moment, which depends on tower 0.4, and thus builds fail if this branch is used.

i did leave some questions above, however. i'm concerned that trace contexts no longer seem to be propagated, and that this change seems to change the behavior of Buffer. having debugged changes in behavior introduced by other changes to Buffer, e.g. #635, i'm very hesitant about this pr.

@seanmonstar
Collaborator

@cratelyn you've since ported all of linkerd to use the newer hyper/tonic, I think, right? If it'd be easier to pop this branch and see if the test suite is still happy, that could help move this along. If you're busy, understandable!

@cratelyn
Contributor

> @cratelyn you've since ported all of linkerd to use the newer hyper/tonic, I think, right? If it'd be easier to pop this branch and see if the test suite is still happy, that could help move this along. If you're busy, understandable!

i ran the linkerd-proxy test suite against this branch, via a [patch.crates-io] directive.

tests did not pass, due to a new failure in this test: https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/stack/src/loadshed.rs#L133-L198

test loadshed::tests::buffer_load_shed ... FAILED

failures:

---- loadshed::tests::buffer_load_shed stdout ----
[     0.000764s] DEBUG worker{id=oneshot4}: linkerd_stack::loadshed: Service has become unavailable
[     0.000786s] DEBUG worker{id=oneshot4}: linkerd_stack::loadshed: Service shedding load
[     0.000896s] DEBUG worker{id=oneshot6}: linkerd_stack::loadshed: Service has become unavailable
[     0.000910s] DEBUG worker{id=oneshot6}: linkerd_stack::loadshed: Service shedding load

thread 'loadshed::tests::buffer_load_shed' panicked at linkerd/stack/src/loadshed.rs:181:9:
ready; value = Err(LoadShedError(()))
stack backtrace:
   0:     0x55b0c0df6ab2 - std::backtrace_rs::backtrace::libunwind::trace::h74680e970b6e0712
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/../../backtrace/src/backtrace/libunwind.rs:117:9
   1:     0x55b0c0df6ab2 - std::backtrace_rs::backtrace::trace_unsynchronized::ha3bf590e3565a312
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/../../backtrace/src/backtrace/mod.rs:66:14
   2:     0x55b0c0df6ab2 - std::sys::backtrace::_print_fmt::hcf16024cbdd6c458
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/sys/backtrace.rs:66:9
   3:     0x55b0c0df6ab2 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h46a716bba2450163
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/sys/backtrace.rs:39:26
   4:     0x55b0c0e1b5b3 - core::fmt::rt::Argument::fmt::ha695e732309707b7
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/fmt/rt.rs:181:76
   5:     0x55b0c0e1b5b3 - core::fmt::write::h275e5980d7008551
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/fmt/mod.rs:1446:25
   6:     0x55b0c0df3f63 - std::io::default_write_fmt::h31683a0a922ca2b7
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/io/mod.rs:639:11
   7:     0x55b0c0df3f63 - std::io::Write::write_fmt::hfb552b13b10253dc
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/io/mod.rs:1914:13
   8:     0x55b0c0df6902 - std::sys::backtrace::BacktraceLock::print::hafb9d5969adc39a0
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/sys/backtrace.rs:42:9
   9:     0x55b0c0df7fec - std::panicking::default_hook::{{closure}}::hae2e97a5c4b2b777
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:300:22
  10:     0x55b0c0df7e42 - std::panicking::default_hook::h3db1b505cfc4eb79
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:324:9
  11:     0x55b0c0cf0374 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hb81979808caba656
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/alloc/src/boxed.rs:1980:9
  12:     0x55b0c0cf0374 - test::test_main_with_exit_callback::{{closure}}::h830f77309e9e8595
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/test/src/lib.rs:145:21
  13:     0x55b0c0df8a63 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hd620b4648521795b
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/alloc/src/boxed.rs:1980:9
  14:     0x55b0c0df8a63 - std::panicking::rust_panic_with_hook::h409da73ddef13937
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:841:13
  15:     0x55b0c0df873a - std::panicking::begin_panic_handler::{{closure}}::h159b61b27f96a9c2
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:706:13
  16:     0x55b0c0df6fa9 - std::sys::backtrace::__rust_end_short_backtrace::h5b56844d75e766fc
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/sys/backtrace.rs:168:18
  17:     0x55b0c0df83cd - __rustc[4794b31dd7191200]::rust_begin_unwind
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:697:5
  18:     0x55b0c09c48f0 - core::panicking::panic_fmt::hc8737e8cca20a7c8
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/panicking.rs:75:14
  19:     0x55b0c0a20ecf - linkerd_stack::loadshed::tests::buffer_load_shed::{{closure}}::h4c89fe60d39fa5fe
                               at /home/katie/linkerd/linkerd2-proxy/linkerd/stack/src/loadshed.rs:181:9
  20:     0x55b0c09cf3d2 - <core::pin::Pin<P> as core::future::future::Future>::poll::h02712d90cefba1f7
                               at /home/katie/.rustup/toolchains/1.88.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/future.rs:124:9
  21:     0x55b0c09cf68d - <core::pin::Pin<P> as core::future::future::Future>::poll::hf04c6c1994405935
                               at /home/katie/.rustup/toolchains/1.88.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/future.rs:124:9
  22:     0x55b0c09ceedf - tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}::hab5da2eeab9b2e73
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/scheduler/current_thread/mod.rs:742:54
  23:     0x55b0c09cee35 - tokio::task::coop::with_budget::h3939f8a60371c392
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/task/coop/mod.rs:167:5
  24:     0x55b0c09cee35 - tokio::task::coop::budget::hec3d193970d2c2ec
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/task/coop/mod.rs:133:5
  25:     0x55b0c09cee35 - tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::hf8ca5f9c6c2994fc
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/scheduler/current_thread/mod.rs:742:25
  26:     0x55b0c09cc0f0 - tokio::runtime::scheduler::current_thread::Context::enter::hce05c34ff5e647db
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/scheduler/current_thread/mod.rs:432:19
  27:     0x55b0c09cd4ed - tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::h5ebc14e706b4f429
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/scheduler/current_thread/mod.rs:741:36
  28:     0x55b0c09cd1c4 - tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}::hebe09deb4ec8fbe0
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/scheduler/current_thread/mod.rs:829:68
  29:     0x55b0c09c66fb - tokio::runtime::context::scoped::Scoped<T>::set::h7422eebb8cd227b1
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/context/scoped.rs:40:9
  30:     0x55b0c09c5679 - tokio::runtime::context::set_scheduler::{{closure}}::haa4cf9f160db5b3b
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/context.rs:176:26
  31:     0x55b0c0a697b2 - std::thread::local::LocalKey<T>::try_with::h8a871093fe078251
                               at /home/katie/.rustup/toolchains/1.88.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:315:12
  32:     0x55b0c0a681de - std::thread::local::LocalKey<T>::with::h21f56d9dd6966204
                               at /home/katie/.rustup/toolchains/1.88.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:279:15
  33:     0x55b0c09c55ad - tokio::runtime::context::set_scheduler::h228efc4c12bfb4a4
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/context.rs:176:9
  34:     0x55b0c09ccf50 - tokio::runtime::scheduler::current_thread::CoreGuard::enter::h80625cc959d59981
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/scheduler/current_thread/mod.rs:829:27
  35:     0x55b0c09cd1e3 - tokio::runtime::scheduler::current_thread::CoreGuard::block_on::h59dec3f7a6b6328b
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/scheduler/current_thread/mod.rs:729:19
  36:     0x55b0c09c7ea9 - tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}}::hcf7b0d87b5a8e0a7
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/scheduler/current_thread/mod.rs:200:28
  37:     0x55b0c0a5b078 - tokio::runtime::context::runtime::enter_runtime::h876de5a43a6df25e
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/context/runtime.rs:65:16
  38:     0x55b0c09c77f1 - tokio::runtime::scheduler::current_thread::CurrentThread::block_on::hbbd1aa9f79ee7da5
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/scheduler/current_thread/mod.rs:188:9
  39:     0x55b0c0a22d69 - tokio::runtime::runtime::Runtime::block_on_inner::h502017307180bd8a
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/runtime.rs:356:47
  40:     0x55b0c0a2304f - tokio::runtime::runtime::Runtime::block_on::h99870153e47504f7
                               at /home/katie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.1/src/runtime/runtime.rs:330:13
  41:     0x55b0c0a64611 - linkerd_stack::loadshed::tests::buffer_load_shed::h1f49f63d00c78621
                               at /home/katie/linkerd/linkerd2-proxy/linkerd/stack/src/loadshed.rs:197:9
  42:     0x55b0c0a1fdf7 - linkerd_stack::loadshed::tests::buffer_load_shed::{{closure}}::h784ff89a30b6395b
                               at /home/katie/linkerd/linkerd2-proxy/linkerd/stack/src/loadshed.rs:134:32
  43:     0x55b0c09fee16 - core::ops::function::FnOnce::call_once::h18e57c5b72a2bce5
                               at /home/katie/.rustup/toolchains/1.88.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
  44:     0x55b0c0cf5b9b - core::ops::function::FnOnce::call_once::h6830b7b483df2d7b
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/ops/function.rs:250:5
  45:     0x55b0c0cf5b9b - test::__rust_begin_short_backtrace::h6ad576a367cba051
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/test/src/lib.rs:648:18
  46:     0x55b0c0cf4d82 - test::run_test_in_process::{{closure}}::h282029c456bdb1d0
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/test/src/lib.rs:671:60
  47:     0x55b0c0cf4d82 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h92c3da85f1d7f07a
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/panic/unwind_safe.rs:272:9
  48:     0x55b0c0cf4d82 - std::panicking::try::do_call::h569264ff5d41e944
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:589:40
  49:     0x55b0c0cf4d82 - std::panicking::try::h3253fc2f0f6f9e29
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:552:19
  50:     0x55b0c0cf4d82 - std::panic::catch_unwind::had653e4cb2e12066
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panic.rs:359:14
  51:     0x55b0c0cf4d82 - test::run_test_in_process::hd1ecf063ce636af0
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/test/src/lib.rs:671:27
  52:     0x55b0c0cf4d82 - test::run_test::{{closure}}::h2f9e350abac1b079
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/test/src/lib.rs:592:43
  53:     0x55b0c0cb93a4 - test::run_test::{{closure}}::hed24df14dd589f4e
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/test/src/lib.rs:622:41
  54:     0x55b0c0cb93a4 - std::sys::backtrace::__rust_begin_short_backtrace::he0330b8283c070fc
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/sys/backtrace.rs:152:18
  55:     0x55b0c0cbcc9a - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::hec3e9e5c0807d052
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/thread/mod.rs:559:17
  56:     0x55b0c0cbcc9a - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h1b183baa756e4c0a
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/panic/unwind_safe.rs:272:9
  57:     0x55b0c0cbcc9a - std::panicking::try::do_call::h72eba35930bfaae9
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:589:40
  58:     0x55b0c0cbcc9a - std::panicking::try::h31af7d64fc54fbca
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:552:19
  59:     0x55b0c0cbcc9a - std::panic::catch_unwind::h64fe4d1919153ee2
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panic.rs:359:14
  60:     0x55b0c0cbcc9a - std::thread::Builder::spawn_unchecked_::{{closure}}::h3e55c0af18b31fa4
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/thread/mod.rs:557:30
  61:     0x55b0c0cbcc9a - core::ops::function::FnOnce::call_once{{vtable.shim}}::hefd468255a79af59
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/ops/function.rs:250:5
  62:     0x55b0c0dfa3bb - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::he4962534b56a5929
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/alloc/src/boxed.rs:1966:9
  63:     0x55b0c0dfa3bb - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h95af12d5a868b9d0
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/alloc/src/boxed.rs:1966:9
  64:     0x55b0c0dfa3bb - std::sys::pal::unix::thread::Thread::new::thread_start::h1822d22fde68314f
                               at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/sys/pal/unix/thread.rs:97:17
  65:     0x7fadbd622272 - start_thread
  66:     0x7fadbd69ddec - clone3
  67:                0x0 - <unknown>


failures:
    loadshed::tests::buffer_load_shed

test result: FAILED. 25 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.02s

i have not dug further into the cause, but i hope this helps!
