Skip to content

Conversation

@coreyostrove
Copy link
Contributor

This patches another shared memory bug in lsvec, this time a race condition. The bug is easiest explained by pointing at the relevant lines. Below is the implementation of lsvec for TimeIndependentMDCObjectiveFunction:

  lsvec = self.terms(paramvec, oob_check, "LS OBJECTIVE")
  if _np.any(lsvec < 0):
      bad_locs = _np.where(lsvec < 0)[0]
      msg = f"""
      lsvec is only defined when terms is elementwise nonnegative.
      We encountered negative values for terms[i] for indices i
      in {bad_locs}.
      """
      raise RuntimeError(msg)
  unit_ralloc = self.layout.resource_alloc('atom-processing')
  shared_mem_leader = unit_ralloc.is_host_leader
  if shared_mem_leader:
      lsvec **= 0.5
  if raw_objfn_lsvec_signs:
      if shared_mem_leader:
          raw_lsvec = self.raw_objfn.lsvec(self.probs, self.counts, self.total_counts, self.freqs)
          lsvec[:self.nelements][raw_lsvec < 0] *= -1
  unit_ralloc.host_comm_barrier()

The bug arises due to the interation of the positivity check if _np.any(lsvec < 0) and later on the update of lsvec using the raw_objfn signs lsvec[:self.nelements][raw_lsvec < 0] *= -1. Since the positivity check is happening on all ranks, but is with respect to shared memory, if a rank that is the shared memory leader hits and runs that sign update line before another rank looking at the same memory hits the positivity check this would raise an error.

After staring at this code for a bit I couldn't see any reason why this positivity check (or anything else in this function) should be happening on anything other the shared memory leader, so the patch is to just add a corresponding if statement which enforces that.

Thanks to @juangmendoza19 for reporting this error.

God I hate shared memory sometimes...

Patch a race condition bug triggered when one of the ranks hit a certain positivity check after a point where the shared memory leader had applied a transformation changing the signs of that quantity.
@coreyostrove coreyostrove added this to the 0.9.14.2 milestone Oct 24, 2025
@coreyostrove coreyostrove self-assigned this Oct 24, 2025
@coreyostrove coreyostrove requested review from a team and rileyjmurray as code owners October 24, 2025 01:04
Copy link
Contributor

@rileyjmurray rileyjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for fixing!

@coreyostrove coreyostrove merged commit 839884b into bugfix Oct 24, 2025
4 checks passed
@coreyostrove coreyostrove deleted the bugfix-sharedmemory-lsvec-race branch October 24, 2025 03:42
rileyjmurray added a commit that referenced this pull request Oct 24, 2025
The automatic ``develop <- bugfix`` merge from Corey's PR yesterday
(#674)
[failed](https://github.com/sandialabs/pyGSTi/actions/runs/18768908655/job/53549598808#step:3:20).

The current state of bugfix includes changes from #674 and the PR I just
merged, #672. There's no way that the automatic ``develop <- bugfix``
merge will work since it includes the changes from yesterday that failed
the auto-merge. So this PR catches develop up with collective changes in
#672 and #674.

Tests for the candidate ``develop <- bugfix`` merge workflow passed (all
that failed was the auto-merge), so I'm going to merge this PR without
waiting for tests to pass.

Notes:

It's not clear to me why the auto-merge for #674 failed. It might have
something to do with the weird commit history I call out in the
screenshot below.

<img width="1168" height="204" alt="image"
src="https://github.com/user-attachments/assets/d8f196f8-26f6-45d1-b77f-e74c63e0a0d1"
/>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants