
Conversation

@alex (Contributor) commented Dec 4, 2025

The blocking pool's task queue was protected by a single mutex, causing severe contention when many threads spawn blocking tasks concurrently. This resulted in nearly linear degradation: 16 concurrent threads took ~18x longer than a single thread.

Replace the single-mutex queue with a sharded queue that distributes tasks across 16 lock-protected shards. The implementation adapts to concurrency levels by using fewer shards when thread count is low, maintaining cache locality while avoiding contention at scale.
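
For illustration, a minimal sketch of the sharded-queue shape (placeholder names such as `ShardedQueue` and `NUM_SHARDS`; not the actual implementation):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

const NUM_SHARDS: usize = 16;

// Each shard has its own lock, so concurrent spawners only contend when
// they happen to pick the same shard.
struct Shard<T> {
    queue: Mutex<VecDeque<T>>,
}

struct ShardedQueue<T> {
    shards: Vec<Shard<T>>,
}

impl<T> ShardedQueue<T> {
    fn new() -> Self {
        let shards = (0..NUM_SHARDS)
            .map(|_| Shard { queue: Mutex::new(VecDeque::new()) })
            .collect();
        ShardedQueue { shards }
    }

    // Push into the caller's preferred shard.
    fn push(&self, shard: usize, task: T) {
        self.shards[shard % NUM_SHARDS]
            .queue
            .lock()
            .unwrap()
            .push_back(task);
    }

    // Scan shards starting from the preferred one until a task is found.
    fn pop(&self, preferred: usize) -> Option<T> {
        for i in 0..NUM_SHARDS {
            let idx = (preferred + i) % NUM_SHARDS;
            if let Some(task) = self.shards[idx].queue.lock().unwrap().pop_front() {
                return Some(task);
            }
        }
        None
    }
}
```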

Benchmark results (spawning 100 batches of 16 tasks per thread):

| Concurrency | Before  | After  | Time change |
|-------------|---------|--------|-------------|
| 1 thread    | 13.3ms  | 17.8ms | +34%        |
| 2 threads   | 26.0ms  | 20.1ms | -23%        |
| 4 threads   | 45.4ms  | 27.5ms | -39%        |
| 8 threads   | 111.5ms | 20.3ms | -82%        |
| 16 threads  | 247.8ms | 22.4ms | -91%        |

The slight overhead at 1 thread is due to the sharded infrastructure, but this is acceptable given the dramatic improvement at higher concurrency where the original design suffered from lock contention.
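
Roughly the shape of the benchmark (a reconstruction from the description above, not the actual benchmark code; it assumes each thread drives spawn_blocking through a shared multi-threaded runtime):

```rust
use std::time::Instant;

// One measurement for a given level of concurrency: each thread spawns
// 100 batches of 16 blocking tasks and waits for every batch to finish.
fn bench(num_threads: usize) {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .build()
        .unwrap();

    let start = Instant::now();
    std::thread::scope(|s| {
        for _ in 0..num_threads {
            s.spawn(|| {
                for _ in 0..100 {
                    let handles: Vec<_> = (0..16)
                        .map(|_| rt.spawn_blocking(|| std::hint::black_box(())))
                        .collect();
                    for handle in handles {
                        rt.block_on(handle).unwrap();
                    }
                }
            });
        }
    });
    println!("{num_threads} threads: {:?}", start.elapsed());
}
```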

(Notwithstanding that this shows as a commit from claude, every line is human reviewed. If there's a mistake, it's Alex's fault.)

Closes #2528.

@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch 3 times, most recently from 9537dda to 016f6ca on December 4, 2025 00:19
@alex (Contributor, Author) commented Dec 4, 2025

(FreeBSD failures look unrelated.)

@ADD-SP added the A-tokio (Area: The main tokio crate), M-blocking (Module: tokio/task/blocking), and T-performance (Topic: performance and benchmarks) labels on Dec 4, 2025
@ADD-SP added and removed the S-waiting-on-author (Status: awaiting some action, such as code changes, from the PR or issue author) label on Dec 4, 2025
@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch 2 times, most recently from f4416fb to 21ff5ce on December 4, 2025 17:36
@martin-g (Member) commented Dec 4, 2025

Please rebase to latest master to get the fix for the FreeBSD failures.

@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from 21ff5ce to 694fa6b on December 4, 2025 17:39
@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from 694fa6b to edd5e10 on December 5, 2025 12:37
@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from edd5e10 to 126cb78 on December 6, 2025 02:06
@ADD-SP added and removed the S-waiting-on-author (Status: awaiting some action, such as code changes, from the PR or issue author) label on Dec 7, 2025
Comment on lines 117 to 144
```rust
// Update max_shard_pushed BEFORE pushing the task.
self.max_shard_pushed.fetch_max(index, Release);

self.shards[index].push(task);
```

@ADD-SP (Member) commented Dec 8, 2025

With the Release ordering, the compiler might reorder the self.shards[index].push(task) and the fetch_max, which means that the .push(task) could effectively happen before the fetch_max.

@alex (Contributor, Author):

I think you're right. (I hate atomic orderings :-/) AcqRel I think is what we want.
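
A tiny sketch of what that swap might look like, mirroring the snippet above with assumed names:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical helper mirroring the reviewed snippet, with the ordering
// suggested here. The Acquire half keeps the later push from being
// reordered before the update; the Release half makes the update (and
// prior writes) visible to threads that Acquire-load max_shard_pushed.
fn record_max_shard(max_shard_pushed: &AtomicUsize, index: usize) {
    max_shard_pushed.fetch_max(index, Ordering::AcqRel);
    // shards[index].push(task) would follow here in the real code.
}
```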

@ADD-SP (Member):

This is still wrong. Consider the following scenario:

```
Thread A                Thread B                Thread C
                                                preferred_shard = 0
                        preferred_shard = 1
max_shard_pushed = 0
shards[0].push(_)
condvar.notify_one()
                        wakes up ...
                        max_shard = 0
max_shard_pushed = 1
shards[1].push(_)
condvar.notify_one()
                                                wakes up ...
                                                max_shard = 1
                                                shards[0].pop() = Some
                        shards[0].pop() = None
```

In this case, Thread B does not check shards[1] because it read max_shard with the value of zero. This means that two tasks were spawned, but only one gets picked up.

Well, I guess in principle Thread C will see the second task after it finishes executing the first one.

@alex (Contributor, Author):

Hmm, I think it's always the case that if a push happens concurrently with a pop, the pop might miss it, and we'll have to "fall back" to catching it in the wait_for_task loop.

I think in principle we could address this one by reloading max_shard_pushed, but of course you can still have a race condition.

In the scenario you've got here, what would happen is that after thread B returns None from pop, it'll go wait_for_task and then the task will get picked up.

@ADD-SP (Member):

> Hmm, I think it's always the case that if a push happens concurrently with a pop, the pop might miss it, and we'll have to "fall back" to catching it in the wait_for_task loop.

Your notify_one() call ensures that for every push(), there will be a subsequent call to pop() that is not concurrent and hence guaranteed to see the pushed message. So there's at least one thread that's guaranteed to pick up each message.

If we imagine that the max_shard_pushed logic was removed, then Thread B would in fact be guaranteed to see the message in shards[1].

  • shards[1].push(_) on thread A happens-before shards[0].pop() = Some on thread C, because thread C is the thread woken up by the second notify_one() call.
  • shards[0].pop() = Some on thread C happens-before shards[0].pop() = None on thread B, since otherwise thread B would have gotten Some when calling pop().
  • After shards[0].pop() = None, thread B would attempt to call shards[1].pop()

So by this logic, the shards[1].pop() call would in fact happen after shards[1].push(_), and is hence guaranteed to see the message that was pushed.

@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from 855e269 to 046d5c6 on December 8, 2025 14:12
Comment on lines +181 to +210
```rust
// Acquire the condvar mutex before waiting
let guard = self.condvar_mutex.lock();

// Double-check shutdown and tasks after acquiring lock, as state may
// have changed while we were waiting for the lock
if self.is_shutdown() {
    return WaitResult::Shutdown;
}
if let Some(task) = self.pop(preferred_shard) {
    return WaitResult::Task(task);
}
```

@ADD-SP (Member):

Here, we attempt to acquire the shard lock while holding the condvar lock; this is the nested-locking pattern. In general, we should avoid this pattern as it is error-prone.

@alex (Contributor, Author):

Hmm, is there a preferred pattern to avoid the nested locking?

In this case we want to ensure that, when we wait for a notification, there isn't already a task that was made pending concurrently.

@ADD-SP (Member):

Nested locking is fine as long as locks are always taken in the same order.
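
For illustration, a minimal sketch of that rule with placeholder names (not tokio's code): every path that needs both locks takes the condvar mutex first, then the shard lock.

```rust
use std::sync::Mutex;

// Because both methods take condvar_mutex before the shard lock, no two
// threads can each hold one lock while waiting for the other.
struct Pool {
    condvar_mutex: Mutex<()>,
    shard: Mutex<Vec<u32>>,
}

impl Pool {
    fn check_before_waiting(&self) -> Option<u32> {
        let _cv = self.condvar_mutex.lock().unwrap();
        self.shard.lock().unwrap().pop()
    }

    fn push(&self, v: u32) {
        let _cv = self.condvar_mutex.lock().unwrap();
        self.shard.lock().unwrap().push(v);
    }
}
```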

@ADD-SP added the S-waiting-on-author (Status: awaiting some action, such as code changes, from the PR or issue author) label on Dec 11, 2025
@alex (Contributor, Author) commented Dec 15, 2025

(I don't think the Netlify failures are related.)

@ADD-SP removed the S-waiting-on-author (Status: awaiting some action, such as code changes, from the PR or issue author) label on Dec 15, 2025
@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from 81d7f25 to 54330c3 on December 30, 2025 13:12

@ADD-SP (Member) left a comment

Sorry for the late review. I'd like to take some time to think about the nested locking issue. We may have a better choice, or not.

@alex (Contributor, Author) commented Dec 30, 2025

No problem -- the rebase was to pick up a fix for the Netlify failures.

Let me know if there are other experiments it'd be useful for me to try.

Comment on lines 70 to 88
```rust
/// Calculate the effective number of shards to use based on thread count.
/// Uses fewer shards at low concurrency for better cache locality.
#[inline]
fn effective_shards(num_threads: usize) -> usize {
    match num_threads {
```

@ADD-SP (Member):

This logic seems error-prone and likely to lead to missed tasks. Does it actually matter for your benchmark?

@alex (Contributor, Author):

It matters at N_THREADS=1 -- I don't personally care about that case at all. If we're ok with a small pessimization there (10% iirc?), I'd be delighted to delete this max shard logic and just always use a fixed number.

@alex (Contributor, Author):

Does this sound ok to you? I'd love to delete it because it's responsible for a lot of the complexity.

@alex (Contributor, Author):

I've gone ahead and dropped this behavior.

@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from 08376f4 to 3fd9e25 on January 3, 2026 20:40
claude and others added 4 commits January 3, 2026 15:44
The blocking pool's task queue was protected by a single mutex, causing
severe contention when many threads spawn blocking tasks concurrently.
This resulted in nearly linear degradation: 16 concurrent threads took
~18x longer than a single thread.

Replace the single-mutex queue with a sharded queue that distributes
tasks across 16 lock-protected shards. The implementation adapts to
concurrency levels by using fewer shards when thread count is low,
maintaining cache locality while avoiding contention at scale.

Benchmark results (spawning 100 batches of 16 tasks per thread):

| Concurrency | Before   | After   | Time change |
|-------------|----------|---------|-------------|
| 1 thread    | 13.3ms   | 17.8ms  | +34%        |
| 2 threads   | 26.0ms   | 20.1ms  | -23%        |
| 4 threads   | 45.4ms   | 27.5ms  | -39%        |
| 8 threads   | 111.5ms  | 20.3ms  | -82%        |
| 16 threads  | 247.8ms  | 22.4ms  | -91%        |

The slight overhead at 1 thread is due to the sharded infrastructure,
but this is acceptable given the dramatic improvement at higher
concurrency where the original design suffered from lock contention.
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from 3fd9e25 to a7c341e on January 3, 2026 20:44
Use the same approach as sync::watch: prefer thread_rng_n() for shard
selection to reduce contention on the atomic counter, falling back to
round-robin when the RNG is not available (loom mode or missing features).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
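
A rough sketch of the round-robin fallback that commit message describes (placeholder names; the preferred path uses tokio's internal thread-local RNG, which isn't shown here):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const NUM_SHARDS: usize = 16;

// Fallback used when the thread-local RNG is unavailable: a Relaxed
// round-robin counter. Shard choice only affects contention, not
// correctness, so no stronger ordering is needed.
struct ShardSelector {
    next: AtomicUsize,
}

impl ShardSelector {
    fn next_shard(&self) -> usize {
        self.next.fetch_add(1, Ordering::Relaxed) % NUM_SHARDS
    }
}
```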
@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from a7c341e to fa4e0a6 on January 3, 2026 20:50
@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from 25439f8 to 98e0071 on January 4, 2026 16:06
@alex force-pushed the claude/improve-spawn-blocking-perf-01A5VqgjoFsxUcvmP6eAjdTf branch from e0107cc to 502a6d8 on January 13, 2026 12:55