
smp: add a function that barriers memory prefault work #2608


Open
tomershafir wants to merge 1 commit into master from smp-join-memory-preafult

Conversation

tomershafir
Contributor

@tomershafir tomershafir commented Jan 6, 2025

Currently, memory prefault logic is internal and seastar doesn't provide much control to users. To improve the situation, I suggest providing a barrier for the prefault threads. This allows users to:

  • Prefer predictable low latency and high throughput from the start of request serving, at the cost of a startup delay that depends on machine characteristics and application-specific requirements. For example, a fixed-capacity on-prem DB setup, where slower startup can be tolerated. From the users' perspective, they generally cannot tolerate inconsistency (like spikes in latency).
  • Similarly, improve user scheduling decisions, like running less critical tasks while the prefault work runs.
  • Reliably test the prefault logic, improving reliability and users' trust in seastar.

I tested locally. If you approve, next I will try to submit a prefault test.

@tomershafir tomershafir marked this pull request as ready for review January 6, 2025 14:33
@avikivity
Member

Did you observe latency impact from the prefault threads? It was written carefully not to have latency impact, but it's of course possible that some workloads suffer.

@tomershafir
Contributor Author

As you described in #1702, page faults can cause deviation, and following the example there, there can be 25 seconds where latency is variably higher.

@avikivity
Member

As you described in #1702, page faults can cause deviation, and following the example there, there can be 25 seconds where latency is variably higher.

I said nothing about latency being higher there.

We typically run large machines with a few vcpus not assigned to any shards, and the prefault threads run with low priority.

@tomershafir
Contributor Author

tomershafir commented Jan 7, 2025

There are 2 aspects:

  1. Page faults

In the previous comment, I meant page fault latency. The page faults can cause high latency unpredictably until the prefaulter finishes.

Regarding page fault measurement, it seems I cannot measure it reliably in my environment.

  2. Prefault threads competition

I tried to non-scientifically isolate the wall-time overhead of the prefault threads:

I have a test app that performs file I/O and processes memory buffers repeatedly. I used an Ubuntu OrbStack VM with 1 NUMA node, 10 cores, and --memory=14G (effectively a small NUMA node), and a small input to make the overhead most visible.

  • With --lock-memory=1 and without waiting, I see that the measured (chrono) time of the actual work is significantly higher than with --lock-memory=0 (~1800ms vs. ~600ms).
  • When waiting before doing the actual work, I see that the overhead is removed.
  • When building seastar without the prefault code and with --lock-memory=1, I don't see the overhead.

@tomershafir
Contributor Author

By default seastar uses all vcpus, which makes sense for resource efficiency.

Also, do you free specific vcpus? Like one per NUMA node, the granularity of the prefault threads.

@avikivity
Member

By default seastar uses all vcpus, which makes sense for resource efficiency.

Also, do you free specific vcpus? Like one per NUMA node, the granularity of the prefault threads.

1 in 8, with NUMA awareness. They're allocated for kernel network processing. See perftune.py.

@tomershafir
Contributor Author

Nice. Let me know if this change makes sense to you.

@tomershafir
Contributor Author

@avikivity ping

@tomershafir
Contributor Author

I also tried to simulate perftune with 1 free vcpu (--cpuset=0-8 given the above setup), and I still observe the overhead, even though it's lower (~1600ms).

@avikivity
Member

I don't understand what this 1600ms overhead is.

@tomershafir
Contributor Author

tomershafir commented Jan 21, 2025

I tried to non-scientifically isolate the wall-time overhead of the prefault threads:

I have a test app that performs file I/O and processes memory buffers repeatedly. I used an Ubuntu OrbStack VM with 1 NUMA node, 10 cores, and --memory=14G (effectively a small NUMA node), and a small input to make the overhead most visible.

  • With --lock-memory=1 and without waiting, I see that the measured (chrono) time of the actual work is significantly higher than with --lock-memory=0 (~1800ms vs. ~600ms).
  • When waiting before doing the actual work, I see that the overhead is removed.
  • When building seastar without the prefault code and with --lock-memory=1, I don't see the overhead.

I mean it's the wall time of the work that I observe, with 1 free vcpu (--cpuset=0-8 given the above setup).

@avikivity
Member

I tried to non-scientifically isolate the wall-time overhead of the prefault threads:
I have a test app that performs file I/O and processes memory buffers repeatedly. I used an Ubuntu OrbStack VM with 1 NUMA node, 10 cores, and --memory=14G (effectively a small NUMA node), and a small input to make the overhead most visible.

  • With --lock-memory=1 and without waiting, I see that the measured (chrono) time of the actual work is significantly higher than with --lock-memory=0 (~1800ms vs. ~600ms).
  • When waiting before doing the actual work, I see that the overhead is removed.
  • When building seastar without the prefault code and with --lock-memory=1, I don't see the overhead.

I mean it's the wall time of the work that I observe, with 1 free vcpu (--cpuset=0-8 given the above setup).

Okay. But what's the problem with that time?

Anyway, if we add future<> seastar::wait_for_background_initialization(), we can have application startup code elect to wait for it before opening ports. This way it can let its own initialization work overlap with memory initialization.
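
For illustration, a minimal sketch of that startup flow, assuming the proposed (not yet existing) seastar::wait_for_background_initialization(); initialize_application() and open_ports() are hypothetical placeholders for the application's own code:

#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Hypothetical application hooks, declared only to keep the sketch self-contained.
seastar::future<> initialize_application();
seastar::future<> open_ports();

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] () -> seastar::future<> {
        // The application's own setup overlaps with the background memory prefault.
        co_await initialize_application();
        // Proposed barrier (not an existing Seastar API): delays startup, not requests.
        co_await seastar::wait_for_background_initialization();
        // Only now start accepting requests, with memory already prefaulted.
        co_await open_ports();
    });
}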

@tomershafir
Contributor Author

The problem is that it is slower and not consistent/predictable. After memory initialization it is faster and consistent.

Regarding the implementation, the problem is that pthread_join blocks, so in the current implementation wouldn't using a future be misleading?

@avikivity
Member

How is pthread_join relevant?

@tomershafir
Contributor Author

tomershafir commented Jan 21, 2025

Currently, the logical barrier waits for pthread_join on all the threads that perform the prefault work. It will block the reactor thread.

@avikivity
Member

Ah, you're referring to the patch while I was referring to the current state. Don't use join then, instead figure out something else that can satisfy a seastar::promise. Maybe it's as simple as seastar::alien::submit_to(0, [&] { _reactor._prefault_complete.set_value(); }).
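
For reference, a minimal sketch of that pattern (not the patch itself): an alien, non-reactor thread hands completion over to shard 0, which fulfills a seastar::promise that reactor code can wait on. The alien instance reference and the promise are assumed to be wired up by the surrounding code (the patch passes them via the smp context):

#include <seastar/core/alien.hh>
#include <seastar/core/future.hh>

// Called from a non-reactor worker thread once its work is done.
// `alien` is the seastar::alien::instance of the running engine and `done`
// is a promise whose future reactor code is already waiting on.
void notify_completion(seastar::alien::instance& alien, seastar::promise<>& done) {
    seastar::alien::run_on(alien, 0, [&done] () noexcept {
        // Runs inside the shard-0 reactor, so completing the promise here is safe.
        done.set_value();
    });
}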

@tomershafir
Contributor Author

Ah, I see. So if I understand correctly, you have just restated the clarified motivation for the patch (please correct me if I'm wrong). I'll work on a non-blocking method next week.

@avikivity
Member

I don't completely see that it's useful but can't deny that it might be.

I'd be happier with an example of a real application requiring it.

@tomershafir
Contributor Author

I have only a test application. How about scylladb?

@avikivity
Member

I have only a test application. How about scylladb?

I'm not aware of reports of problems during the prefault stage. It takes some time for a node to join the cluster, and by that time enough memory has been prefaulted for it to work well.

@tomershafir tomershafir deleted the smp-join-memory-preafult branch March 21, 2025 15:54
@tomershafir tomershafir restored the smp-join-memory-preafult branch March 21, 2025 15:55
@tomershafir
Contributor Author

Sorry for the delay and the confusion. I saw you added a join method in #2679. I'll try to rebase my changes.

@tomershafir tomershafir reopened this Mar 21, 2025
@tomershafir tomershafir force-pushed the smp-join-memory-preafult branch 4 times, most recently from d681d1e to 8ece4fc Compare March 31, 2025 11:03
@tomershafir
Contributor Author

v2:

Rebase on master and rewrite seastar::join_memory_prefault to be non-blocking

@tomershafir
Contributor Author

@avikivity hey, please review

future<> join_memory_prefault() {
    auto& r = engine();
    if (!r._smp->memory_prefault_initialized()) {
        seastar_logger.warn("Memory prefaulter is not initialized but joined");
Member

This warning isn't helpful to users; what can they do?

Contributor Author

They should fix the configuration to make seastar actually prefault memory, or remove the redundant join if it's not intended. But if you prefer otherwise, I'll remove it.

Member

The application doesn't know if the user wants to prefault memory or not. It's common in local testing not to prefault, and in production to prefault.

src/core/smp.cc (outdated)
bool
smp::memory_prefault_initialized() {
    return _prefaulter != nullptr;
}
Member

Instead of this, you can promise::set_value() on the promise if you don't initialize the prefaulter.

Contributor Author

I agree, this way the promise is not left unresolved. I will call it in smp::configure.
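
Roughly the agreed-on shape, with illustrative names rather than the patch's exact members: each reactor holds a promise, and whichever path applies (prefault finished, or prefaulting never enabled) resolves it once, so the join simply returns the future:

#include <seastar/core/future.hh>

// Illustrative only; the real patch keeps this state on the reactor/smp classes.
struct prefault_state {
    seastar::promise<> done;

    seastar::future<> join() {
        return done.get_future();
    }
    void complete() {
        // Called either by the prefaulter on completion, or immediately at
        // configure time when no prefaulter is created.
        done.set_value();
    }
};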

src/core/smp.cc (outdated)
internal::memory_prefaulter::alien_on_complete(smp& smp_context) {
    run_on(smp_context._alien, 0, [this, &smp_context] () noexcept {
        join_threads();
        run_in_background(smp_context.broadcast_memory_prefault_completion());
Member

Alternatively, we can document that the join() should only be run on shard 0. I expect that most applications run initialization code on shard 0 and don't need it to be available anywhere else.

(I want to deprecate and remove run_in_background, I think it's dangerous)

Contributor Author

Why not let it work on any shard? We need the new promise member on the reactor class anyway.

Also, maybe make run_in_background internal instead? It's needed for such use cases.

Contributor Author

wdyt?

Contributor Author

@avikivity ping

Member

It seems like an unnecessary complication. Applications typically have a main thread running on shard 0 that coordinates the startup process.

@tomershafir tomershafir requested a review from avikivity April 6, 2025 06:54
@tomershafir tomershafir force-pushed the smp-join-memory-preafult branch from 8ece4fc to 8b3dc8c Compare April 9, 2025 16:24
Currently, memory prefault logic is internal and seastar doesn't provide much control to users. To improve the situation, I suggest providing a barrier for the prefault work. This allows users to:

* Prefer predictable low latency and high throughput from the start of request serving, at the cost of a startup delay that depends on machine characteristics and application-specific requirements. For example, a fixed-capacity on-prem DB setup, where slower startup can be tolerated. From the users' perspective, they generally cannot tolerate inconsistency (like spikes in latency during startup).
* Similarly, improve user scheduling decisions, like running less critical tasks while the prefault work runs.
* Reliably test the prefault logic, improving reliability and users' trust in seastar.

This patch adds the memory_prefaulter class as a friend of the smp class and passes an smp context to the prefaulter. The prefaulter calls back into the smp context upon completion using a new broadcast method, which sends a completion event to all the reactor threads. A new promise member on the reactor class enables returning a per-reactor future that represents the prefault completion state. This way, the mechanism is eventually consistent on all the reactors. The interface is a free function in the seastar namespace.
@tomershafir tomershafir force-pushed the smp-join-memory-preafult branch from 8b3dc8c to 9156e3f Compare April 9, 2025 17:25
@tomershafir
Contributor Author

@avikivity v3:

  • Allow seastar::join_memory_prefault only on shard 0, enforce via assertion, and document it. Keep run_in_background(), which is needed.
  • Remove the redundant warning.
  • If the memory prefaulter is not initialized, instead of checking inside seastar::join_memory_prefault, submit a dummy completion immediately in smp::configure.
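
A hedged usage sketch of the API this PR adds. The header declaring seastar::join_memory_prefault() is not shown in the excerpts above, so the includes are limited to standard app headers, and start_serving() is a hypothetical placeholder for the application's own port-opening code:

#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Hypothetical application hook, declared only to keep the sketch self-contained.
seastar::future<> start_serving();

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] () -> seastar::future<> {
        // app.run() invokes this lambda on shard 0, the only shard where
        // seastar::join_memory_prefault() is allowed as of v3.
        co_await seastar::join_memory_prefault();
        // Memory is prefaulted (or prefaulting was never enabled), so requests
        // can now be served with predictable latency.
        co_await start_serving();
    });
}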
