Skip to content

Add source location to task and tasktrace object #2707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

radoslawcybulski
Copy link

@radoslawcybulski radoslawcybulski commented Mar 31, 2025

Adds source location of next resume point to task objects. When constructing tasktrace object, source location of resume points will be added to it. When printing, the source location of next resume point will be printed as well, hopefully improving debug experience.

Example:

0x2b3899 /home/y/work/scylladb-test/build/Debug/seastar/libseastar.so+0x454c364 ...
   --------
   test_app/main.cpp:15:15
   seastar::continuation<seastar::internal::promise_base_with_type<void>, ...

Partially fixes #2381

@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 02aa70a to c0149e3 Compare March 31, 2025 14:28
@@ -35,6 +36,8 @@ class task {
protected:
scheduling_group _sg;
private:
std::source_location resume_point = {};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is making a very heavily used object larger.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After adding std::source_location

coroutine_traits_base<>::promise_type (which inherits from task) is 64 bytes,
continuation<Promise, Func, Wrapper, T> is around 104-120, depending on exact version.

No difference except for that single unlucky value, who is "promoted" to the next 64 byte cache line in case of continuation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay.

@avikivity
Copy link
Member

Aren't those addresses derivable from the stack trace? Or are those points lost?

How large is std::source_location?

@radoslawcybulski
Copy link
Author

radoslawcybulski commented Mar 31, 2025

Aren't those addresses derivable from the stack trace? Or are those points lost?

Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.

How large is std::source_location?

8 bytes.

@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from c0149e3 to 08d3c8a Compare March 31, 2025 14:58
@avikivity
Copy link
Member

Aren't those addresses derivable from the stack trace? Or are those points lost?

Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.

How large is std::source_location?

8 bytes.

Maybe we can tolerate it. @tgrabiec what's your opinion?

Please run scylladb's perf-simple-query and report insn-per-op before and after (although we won't learn of the impact of the size change).

@radoslawcybulski
Copy link
Author

my patch

instructions_per_op: mean=48465.87 standard-deviation=33.03 median=48459.42 median-absolute-deviation=22.63 maximum=48520.81 minimum=48437.81
instructions_per_op: mean=48482.90 standard-deviation=21.16 median=48487.16 median-absolute-deviation=6.97 maximum=48504.13 minimum=48447.34
instructions_per_op: mean=48469.42 standard-deviation=18.34 median=48472.45 median-absolute-deviation=17.26 maximum=48488.53 minimum=48448.81

master

instructions_per_op: mean=48337.41 standard-deviation=21.29 median=48334.08 median-absolute-deviation=19.67 maximum=48364.48 minimum=48317.02
instructions_per_op: mean=48352.36 standard-deviation=27.87 median=48347.35 median-absolute-deviation=16.28 maximum=48395.26 minimum=48322.29
instructions_per_op: mean=48339.83 standard-deviation=26.48 median=48323.12 median-absolute-deviation=20.80 maximum=48372.64 minimum=48319.03

My patch will run more instructions, as there are additional stores to the task object (std::source_location needs to be written).

Why do we measure instructions per operation? We are not billed by instruction count, rather by time, we should measure total running time instead? Does it work as a sanity check or am i missing something?

@avikivity
Copy link
Member

Why do we measure instructions per operation? We are not billed by instruction count, rather by time, we should measure total running time instead? Does it work as a sanity check or am i missing something?

Time is very unstable (just try it) since it depends on temperature and cooling.

Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).

@avikivity
Copy link
Member

So we add 0.25% overhead. Hard to judge if it's worthwhile.

@radoslawcybulski
Copy link
Author

Time is very unstable (just try it) since it depends on temperature and cooling.

Agree.

Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).

It's absolutely stable, but we're counting steps, which - in my opinion - sort-of blinds us. It's like two different walks, the same step count, but step sizes differ, so is distance. Here for example if we're running at max IPC, then we've 0.25% overhead. If otherwise we're waiting for memory, then the write is pretty much free (we've already waited for this cache line, because other stuff is read / written there) and proc is waiting for other memory pieces, so it can do our write.

@avikivity
Copy link
Member

Time is very unstable (just try it) since it depends on temperature and cooling.

Agree.

Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).

It's absolutely stable, but we're counting steps, which - in my opinion - sort-of blinds us. It's like two different walks, the same step count, but step sizes differ, so is distance. Here for example if we're running at max IPC, then we've 0.25% overhead. If otherwise we're waiting for memory, then the write is pretty much free (we've already waited for this cache line, because other stuff is read / written there) and proc is waiting for other memory pieces, so it can do our write.

Writes are more-or-less free, except for the increase in instruction footprint (which are extra reads).

We aren't close to running at max IPC. Typical for mini-benchmarks like perf-simple-query is 2, and for production ~1. The main killer is waiting for instruction fetch (although maybe in production mispredicts would also contribute since the the data is less regular). I'm sure an M4 would give much higher IPC.

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the performance impact could be a concern, can we make it an optional feature which is enabled only in the debug builds?

@@ -30,6 +30,7 @@

#ifndef SEASTAR_MODULE
#include <coroutine>
#include <source_location>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it took me a while to check if it's safe to use <source_location> instead of util/std-compat.hh . the answer is yes. as we support the latest two major versions of the GCC and Clang compilers. in general, we use libstdc++, which added the support to source_location in gcc-mirror/gcc@57d76ee, which is included by gcc 12 and up. and the latest stable release of GCC is GCC 14.2.

@radoslawcybulski
Copy link
Author

radoslawcybulski commented Apr 14, 2025

if the performance impact could be a concern, can we make it an optional feature which is enabled only in the debug builds?

This has been discussed here (#2381) and consensus is that the release is where it's useful the most.

@radoslawcybulski radoslawcybulski self-assigned this Apr 14, 2025
@tgrabiec
Copy link
Contributor

Aren't those addresses derivable from the stack trace? Or are those points lost?

Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.

How large is std::source_location?

8 bytes.

Maybe we can tolerate it. @tgrabiec what's your opinion?

I think it's probably ok.

Copy link
Contributor

@tchaikov tchaikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -35,6 +36,8 @@ class task {
protected:
scheduling_group _sg;
private:
std::source_location resume_point = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put the resume_point first since it's larger than scheduling_group. that way the task object size is 4 bytes less.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really - if the class is not packed then the size includes the alignment, so order doesn't matter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's true. The object will be aligned to 8 bytes boundary. However in case we add something after these variables, less than 5 bytes, then there will be a difference, so having it ordered by size might be beneficial in the future.

@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch 2 times, most recently from 6fe697d to 80ce73a Compare April 22, 2025 09:46
#endif

namespace seastar {

SEASTAR_MODULE_EXPORT
class task {
std::source_location resume_point = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Should not we prepend class member variables with an underscore: resume_point->_resume_point?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Radosław Cybulski added 3 commits April 22, 2025 14:38
Add an empty, default constructed std::source_location object
to the task object and getter / setter.
Add calls to update `resume_point` variable with location of
next resume location to all `await_suspend` functions and `then`
functions.
Add resume point locations to `tasktrace` object.
Update `formatter::format` to print source location of next resume alone
with task type.
@radoslawcybulski radoslawcybulski force-pushed the pr2381-add-source-location-to-tasktrace branch from 80ce73a to 58aa04e Compare April 22, 2025 12:38
@radoslawcybulski
Copy link
Author

@scylladb/seastar-maint please merge.

@denesb
Copy link
Contributor

denesb commented Jun 27, 2025

So we add 0.25% overhead. Hard to judge if it's worthwhile.

@avikivity I think the gain is much higher than 0.25% here. We need to reduce the bar of entry into debugging seastar applications.

@radoslawcybulski
Copy link
Author

@denesb can you tell me, how did you figure out performance differences for my patch (or any patch)? Sub 1% precision looks amazing to me, how did you do that?

@denesb
Copy link
Contributor

denesb commented Jun 30, 2025

@denesb can you tell me, how did you figure out performance differences for my patch (or any patch)? Sub 1% precision looks amazing to me, how did you do that?

I didn't, I just reacted to Avi's comment. I didn't claim that the perf regression is something else, just that 0.25% that Avi claimed is worth it for the gain of seeing line numbers.

@michoecho
Copy link
Contributor

@denesb can you tell me, how did you figure out performance differences for my patch (or any patch)? Sub 1% precision looks amazing to me, how did you do that?

We're talking about the perf_simple_query results posted earlier in this thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Looking for stuck/deadlocked fibers could be easier
7 participants