Skip to content

Fix wait_on_address_64 on Linux when only upper 32 bits change (use futex2)#590

Open
parth21999 wants to merge 2 commits into
masterfrom
parth/fix-wait-on-address-64-upper-bits
Open

Fix wait_on_address_64 on Linux when only upper 32 bits change (use futex2)#590
parth21999 wants to merge 2 commits into
masterfrom
parth/fix-wait-on-address-64-upper-bits

Conversation

@parth21999
Copy link
Copy Markdown
Member

Summary

The Linux implementation of wait_on_address_64 used the legacy SYS_futex syscall, which fundamentally compares only 32 bits of the value at the address. When *address differed from compare_value only in the upper 32 bits (e.g. both lower 32 bits zero), the kernel's atomic check-before-sleep saw "equal" and slept until timeout, violating the wait_on_address_64 contract (SRS_SYNC_43_002).

In multi-threaded scenarios this also caused lost wakeups in InterlockedHL_WaitForNotValue64 (and the other 64-bit InterlockedHL waiters): if a writer changed the upper 32 bits of the value between the function's interlocked_add_64 read and its wait_on_address_64 call, the kernel's 32-bit check missed the change and the waiter blocked until timeout.

The zrpc multiplexer integration tests hit this race because SUBSTREAM_ID is a 64-bit value whose lower 32 bits are an index starting at 0 — see zrpc PR 15645356, which worked around the bug at the call site.

Fix

Switch the 64-bit wait/wake helpers in c_pal_ll/linux/src/sync_linux.c to the Linux 6.7+ futex2 syscalls (SYS_futex_wait, SYS_futex_wake) with FUTEX2_SIZE_U64 | FUTEX2_PRIVATE, which performs a true 64-bit atomic check-before-sleep. The 32-bit helpers are unchanged.

Notable details:

  • The futex2 wait deadline is absolute CLOCK_MONOTONIC time (per the futex2 ABI), not relative.
  • A NULL deadline is passed when timeout_ms == UINT32_MAX so the waiter sleeps indefinitely.
  • Both wait and wake use the same FUTEX2_SIZE_U64 flag so they hash to the same kernel futex queue.
  • #define fallbacks for SYS_futex_wake, SYS_futex_wait, FUTEX2_SIZE_U64, FUTEX2_PRIVATE are added so the code compiles on glibc versions that don't yet expose these constants.

Regression tests

Two deterministic tests were added that reproduce the bug on the legacy implementation (time out after 5 s) and pass on the new implementation:

Test File What it does
wait_on_address_64_returns_immediately_when_only_upper_32_bits_differ c_pal_ll/interfaces/tests/sync_int/sync_int.c Single-threaded. Sets *var = 0x100000000 then asserts wait_on_address_64(&var, 0, 5000) returns WAIT_ON_ADDRESS_OK promptly.
interlocked_hl_wait_for_not_value_64_with_only_upper_32_bits_change common/tests/interlocked_hl_int/interlocked_hl_int.c Replays the inner loop of InterlockedHL_WaitForNotValue64 deterministically: reads with interlocked_add_64, changes the upper 32 bits, then calls wait_on_address_64 with the original value as compare_value.

Verification

  • ✅ Compiles cleanly on WSL (Ubuntu 22.04, glibc 2.35).
  • strace confirms correct syscalls and arguments:
    • syscall_0x1c7(addr, val=0, mask=~0, flags=0x83, &deadline, clockid=1)SYS_futex_wait with FUTEX2_SIZE_U64 | FUTEX2_PRIVATE, CLOCK_MONOTONIC
    • syscall_0x1c6(addr, mask=~0, nr=1 or INT_MAX, flags=0x83)SYS_futex_wake with FUTEX2_SIZE_U64 | FUTEX2_PRIVATE
  • ⚠️ The regression tests can't be exercised on WSL2 today because Microsoft's stable WSL kernel is still 6.6.87.2 (futex2 wait/wake landed in 6.7). On older kernels the syscall returns ENOSYS and wait_on_address_64 now returns WAIT_ON_ADDRESS_ERROR, surfacing the kernel-version requirement explicitly instead of silently sleeping incorrectly. CI/Linux runners on kernel ≥ 6.7 will run the tests for real.

Behavior change

Scenario Before After
wait_on_address_64 with changed upper 32 bits, lower 32 bits unchanged sleeps to timeout (silently wrong) returns OK immediately (kernel ≥ 6.7) / returns ERROR with ENOSYS (kernel < 6.7)
wait_on_address_64 with full-64-bit difference works (lower bits already differ) works
Normal wake-after-sleep paths works works

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

The Linux implementation of wait_on_address_64 used the legacy SYS_futex
syscall, which fundamentally compares only 32 bits of the value at the
address. When *address differed from compare_value only in the upper
32 bits (e.g. lower 32 bits both zero), the kernel's atomic
check-before-sleep saw "equal" and slept until timeout, violating the
wait_on_address_64 contract (SRS_SYNC_43_002).

In multi-threaded scenarios this also caused lost wakeups in
InterlockedHL_WaitForNotValue64 and the other 64-bit InterlockedHL
waiters: if a writer changed the upper 32 bits of the value between the
function's interlocked_add_64 read and its wait_on_address_64 call, the
kernel's 32-bit check missed the change and the waiter blocked until
timeout. The zrpc multiplexer integration tests hit this race because
SUBSTREAM_IDs are 64-bit values whose lower 32 bits are an index that
starts at 0.

Fix by switching the 64-bit wait/wake helpers to the Linux 6.7+ futex2
syscalls (SYS_futex_wait, SYS_futex_wake) with FUTEX2_SIZE_U64 |
FUTEX2_PRIVATE, which performs a true 64-bit atomic check-before-sleep.
The 32-bit helpers are unchanged. The wait deadline is now an absolute
CLOCK_MONOTONIC time as required by the futex2 ABI; a NULL deadline is
passed when timeout_ms is UINT32_MAX so the waiter sleeps indefinitely.

Add two regression tests:

* sync_int : wait_on_address_64_returns_immediately_when_only_upper_32_bits_differ
  - Single-threaded, deterministic. Sets *var = 0x100000000 then calls
    wait_on_address_64(&var, 0, 5000) and asserts the call returns
    WAIT_ON_ADDRESS_OK promptly.

* interlocked_hl_int : interlocked_hl_wait_for_not_value_64_with_only_upper_32_bits_change
  - Simulates the multiplexer race deterministically by replaying the
    inner loop of InterlockedHL_WaitForNotValue64: read with
    interlocked_add_64, change the upper 32 bits, then call
    wait_on_address_64 with the original value as compare_value.

Both tests reproduce the bug on the legacy implementation (timing out
after 5 s) and pass on the new implementation.

Note: requires a Linux kernel with futex2 support (>= 6.7). On older
kernels the syscall returns ENOSYS and wait_on_address_64 returns
WAIT_ON_ADDRESS_ERROR, surfacing the requirement explicitly rather than
silently sleeping incorrectly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread c_pal_ll/linux/src/sync_linux.c Outdated
#endif
#ifndef FUTEX2_PRIVATE
#define FUTEX2_PRIVATE FUTEX_PRIVATE_FLAG /* 128 */
#endif
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these needed? where are they coming from?

Comment thread c_pal_ll/linux/src/sync_linux.c Outdated

/* Codes_SRS_SYNC_LINUX_05_002: [ wait_on_address_64 shall call syscall to wait on value at address to change to a value different than the one provided in compare_value. ] */
int syscall_result = syscall(SYS_futex, address, FUTEX_WAIT_PRIVATE, compare_value, &timeout, NULL, 0);
/* Codes_SRS_SYNC_LINUX_05_002: [ wait_on_address_64 shall call syscall(SYS_futex_wait) with FUTEX2_SIZE_U64 | FUTEX2_PRIVATE and a CLOCK_MONOTONIC absolute deadline, performing a true 64-bit atomic check-before-sleep. ] */
Copy link
Copy Markdown
Member Author

@parth21999 parth21999 May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_05_002

All specs that are changed should be retagged with my devid 43. Changes are needed in devdoc, code and tests.

make sure repo_validation passes

// Concrete example from the multiplexer integration tests: a SUBSTREAM_ID is a
// 64-bit value where the lower 32 bits are an index. The first substream has
// index = 0, so the lower 32 bits of its SUBSTREAM_ID match the initial value
// (0) used as compare_value, even though the upper 32 bits are non-zero.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this comment smaller.


//act
double start_time = timer_global_get_elapsed_ms();
WAIT_ON_ADDRESS_RESULT return_val = wait_on_address_64(&var, compare_value, timeout);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this timer. The result is enough

* sync_int.c: shrink the regression-test comment block; drop the
  redundant timer + tolerance assertion - the WAIT_ON_ADDRESS_OK check
  alone catches the bug (broken impl returns TIMEOUT).
* interlocked_hl_int.c: same simplification (drop timer block) per the
  "everywhere" intent of the timer comment, and drop the now-unused
  c_pal/timer.h include.
* sync_linux.c: drop the SYS_futex_wake / SYS_futex_wait /
  FUTEX2_SIZE_U64 / FUTEX2_PRIVATE fallback #defines and rely on the
  headers (sys/syscall.h from glibc 2.39+, linux/futex.h from
  linux-libc-dev kernel 6.7+); the build now requires those headers and
  fails compile cleanly otherwise. Add a comment documenting the
  requirement.
* Retag the four SRS items whose text changed with devid 43:
    SRS_SYNC_LINUX_05_001 -> SRS_SYNC_LINUX_43_007
    SRS_SYNC_LINUX_05_002 -> SRS_SYNC_LINUX_43_008
    SRS_SYNC_LINUX_05_007 -> SRS_SYNC_LINUX_43_009
    SRS_SYNC_LINUX_05_008 -> SRS_SYNC_LINUX_43_010
  Updates devdoc (sync_linux_requirements.md), source (sync_linux.c
  Codes_), and unit tests (sync_linux_ut.c Tests_); the four unchanged
  items (_05_003 through _05_006) keep their tags.
* sync_linux_ut.c / mock_sync.{c,h}: introduce mock_futex_wait and
  mock_futex_wake mocks for the futex2 syscalls (the legacy SYS_futex
  mock_syscall is kept for the 32-bit functions). A variadic
  mock_syscall_dispatcher routes calls based on syscall number. The 4
  wait_on_address_64 tests and 2 wake_by_address_*_64 tests are updated
  to expect the new mocks with FUTEX2_SIZE_U64 | FUTEX2_PRIVATE flags
  and CLOCK_MONOTONIC.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant