
Conversation

@hodgesds hodgesds commented Dec 7, 2025

Implement a fast-path optimization for WAKE_SYNC wakeups that directly assigns wakees to the waker's CPU when the system has capacity. This provides zero-latency handoff for producer-consumer workloads while gracefully degrading at high utilization.

The optimization checks if:

  1. System is not saturated (!saturated && !overloaded)
  2. Waker CPU is in wakee's affinity mask
  3. Waker CPU has no queued work (both local and LLC DSQs empty)

When these conditions are met, the wakee inherits the waker's CPU immediately.
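
The three gating conditions above can be sketched as a single predicate. This is a simplified, plain-C model, not the actual BPF code: `handoff_ctx`, its field names, and `wake_sync_fast_path()` are illustrative assumptions (the real scheduler reads these from DSQ lengths and cpumask helpers).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative inputs to the WAKE_SYNC fast-path gate; in the real
 * scheduler these come from saturation tracking, the wakee's affinity
 * mask, and per-CPU/LLC DSQ queue lengths. */
struct handoff_ctx {
	bool saturated;         /* system-wide saturation flag */
	bool overloaded;        /* system-wide overload flag */
	bool waker_cpu_allowed; /* waker CPU is in wakee's affinity mask */
	int  local_dsq_len;     /* tasks queued in waker CPU's local DSQ */
	int  llc_dsq_len;       /* tasks queued in the waker's LLC DSQ */
};

/* Return true when the wakee may inherit the waker's CPU directly. */
static bool wake_sync_fast_path(const struct handoff_ctx *c)
{
	if (c->saturated || c->overloaded)
		return false;   /* gracefully disable at saturation */
	if (!c->waker_cpu_allowed)
		return false;   /* respect the wakee's affinity mask */
	/* Waker must have no queued work; it consumes from both DSQs. */
	return c->local_dsq_len == 0 && c->llc_dsq_len == 0;
}
```

If any condition fails, selection falls through to the normal idle-CPU search.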

Performance impact (schbench benchmark on 176 CPU system):

  • 50-70% load: 47-55x wakeup latency improvement (995μs → 18-21μs)
  • 80% load: 41x improvement (995μs → 24μs)
  • 90% load: 18x improvement (995μs → 55μs)
  • 100% load: No change (gracefully disabled)

Pipe workloads (producer-consumer pairs) trigger the fast path at much higher rates, up to 174,000 handoffs/sec at 50% load, compared to ~1,000/sec for request-response patterns.

The optimization is placed early in pick_idle_cpu() so it takes priority over the prev_cpu sticky path, and it only activates when beneficial. At saturation it disables itself to avoid overhead, allowing normal pick-2 load balancing to take over.
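
The ordering described above can be sketched as follows. All predicates are passed in as plain flags so the logic stands alone; the function and parameter names are assumptions for illustration, not the scheduler's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical skeleton of the CPU-selection ordering: the WAKE_SYNC
 * handoff is checked before the prev_cpu sticky path, and pick-2 load
 * balancing (stubbed here as a precomputed fallback CPU) runs last. */
static int pick_idle_cpu_sketch(int prev_cpu, int waker_cpu,
				bool is_wake_sync, bool handoff_ok,
				bool prev_cpu_idle, int fallback_cpu)
{
	/* 1. WAKE_SYNC fast path runs first so it beats the sticky path. */
	if (is_wake_sync && handoff_ok)
		return waker_cpu;
	/* 2. Otherwise prefer the cache-warm previous CPU if it's idle. */
	if (prev_cpu_idle)
		return prev_cpu;
	/* 3. Fall back to normal pick-2 load balancing (stubbed). */
	return fallback_cpu;
}
```

Placing the handoff first is what lets a synchronously woken wakee land on the waker's CPU even when prev_cpu is also idle.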

Changes:

  • Add P2DQ_STAT_WAKE_SYNC_WAKER counter to track handoffs
  • Check both local DSQ and LLC DSQ before handoff (waker consumes from both)
  • Gate optimization with saturation check
  • Expose counter in userspace stats
  • Remove unused idle_smtmask
  • Fix a bug in can_migrate() when checking LLC min runs
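
The counter tracking in the changes above can be illustrated with a per-CPU stats sketch; the enum slot mirrors P2DQ_STAT_WAKE_SYNC_WAKER, but the array layout, sizes, and function names here are assumptions, not the scheduler's actual stats code (which lives in BPF maps read by the Rust userspace side).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-CPU stat counters: each CPU increments its own row
 * lock-free, and userspace sums across CPUs when reporting stats. */
enum p2dq_stat_sketch { STAT_WAKE_SYNC_WAKER, NR_STATS };

#define NR_CPUS_SKETCH 4
static uint64_t stats[NR_CPUS_SKETCH][NR_STATS];

static void stat_inc(int cpu, enum p2dq_stat_sketch s)
{
	stats[cpu][s]++;   /* only this CPU writes its row: no locking */
}

static uint64_t stat_sum(enum p2dq_stat_sketch s)
{
	uint64_t total = 0;
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		total += stats[cpu][s];
	return total;
}
```

A high STAT_WAKE_SYNC_WAKER sum relative to total wakeups indicates the fast path is firing often, as in the pipe workloads above.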

Tested with schbench and stress-ng across load levels 50-100%.

@hodgesds hodgesds force-pushed the p2dq-wakee-optimize branch 3 times, most recently from de81335 to e48d9bd Compare December 11, 2025 07:19
Signed-off-by: Daniel Hodges <hodgesd@meta.com>
@hodgesds hodgesds force-pushed the p2dq-wakee-optimize branch from e48d9bd to 3172c6d Compare December 20, 2025 11:06
@hodgesds
Contributor Author

I did some more testing, trying to get the pipe-based producer-consumer case (single producer/consumer) working with logic to distinguish it from multi-consumer setups, and it seems to be hardware/workload dependent.
