
cwperks (Member) commented Aug 21, 2025

Description

This PR removes manual tracking of seqNo and priTerm for the LockModel. This repo doesn't use these values as intended and they serve no real purpose, so the LockModel can be simplified to just the attributes pertinent to a lock. seqNo and priTerm are internal metafields of documents in OpenSearch that are used for optimistic concurrency control: they make it possible to detect whether a document has been updated since it was read, or whether the primary shard has changed.

In the LockService, we always call findLock before updateLock, so we simply feed back the same values that we got from findLock. It is possible that an operator could use the admin certificate to update a lock document directly, which would cause a seqNo mismatch, but I think the LockService should still be allowed to acquire/release the lock in that case and run without checking seqNo and priTerm.
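To make the find-then-update flow and the mismatch case concrete, here is a minimal sketch assuming an in-memory store in place of the lock index. `VersionedLockStore`, `findLock`, `updateLock`, `put`, and `LockDoc` are hypothetical names, not the job-scheduler API; in OpenSearch itself, the conditional write corresponds to passing `if_seq_no` and `if_primary_term` on the index request.

```java
import java.util.HashMap;
import java.util.Map;

public class LockStoreSketch {

    // Hypothetical snapshot returned by findLock; seqNo/primaryTerm mirror the document metafields.
    record LockDoc(String jobId, boolean released, long lockTime, long seqNo, long primaryTerm) {}

    // Hypothetical stand-in for the lock index: every write gets a new, higher seqNo.
    static final class VersionedLockStore {
        private final Map<String, LockDoc> docs = new HashMap<>();
        private long nextSeqNo = 0;
        private final long primaryTerm = 1; // assumption: the primary shard never changes here

        synchronized LockDoc findLock(String jobId) {
            return docs.get(jobId);
        }

        // Unconditional write, e.g. an operator editing the lock document directly.
        synchronized LockDoc put(String jobId, boolean released, long lockTime) {
            LockDoc doc = new LockDoc(jobId, released, lockTime, nextSeqNo++, primaryTerm);
            docs.put(jobId, doc);
            return doc;
        }

        // Conditional write: succeeds only if the caller's seqNo/primaryTerm are still current.
        synchronized boolean updateLock(LockDoc expected, boolean released, long lockTime) {
            LockDoc current = docs.get(expected.jobId());
            if (current == null
                    || current.seqNo() != expected.seqNo()
                    || current.primaryTerm() != expected.primaryTerm()) {
                return false; // version conflict
            }
            put(expected.jobId(), released, lockTime);
            return true;
        }
    }

    public static void main(String[] args) {
        VersionedLockStore store = new VersionedLockStore();
        store.put("job-1", true, 0L);

        // Normal LockService flow: find, then feed the same seqNo/primaryTerm back on update.
        LockDoc found = store.findLock("job-1");
        System.out.println("acquire after find: " + store.updateLock(found, false, 100L));

        // A direct write between find and update makes the cached seqNo stale.
        LockDoc stale = store.findLock("job-1");
        store.put("job-1", true, 200L);
        System.out.println("update with stale seqNo: " + store.updateLock(stale, true, 300L));
    }
}
```

Running `main` prints `acquire after find: true` followed by `update with stale seqNo: false`; the second line is the operator-edit mismatch that dropping the checks would ignore.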

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Craig Perkins <[email protected]>

codecov bot commented Aug 21, 2025

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 69.94%. Comparing base (1a6dbf4) to head (80e4539).
⚠️ Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
...ch/jobscheduler/rest/action/RestGetLockAction.java | 0.00% | 1 Missing ⚠️

❌ Your project status has failed because the head coverage (69.94%) is below the target coverage (75.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #822      +/-   ##
==========================================
+ Coverage   69.87%   69.94%   +0.06%     
==========================================
  Files          38       38              
  Lines        1733     1717      -16     
  Branches      156      156              
==========================================
- Hits         1211     1201      -10     
+ Misses        431      426       -5     
+ Partials       91       90       -1     


cwperks (Member, Author) commented Aug 22, 2025

Closing this PR; I will open a separate one that makes this change on the History model alone.

Keeping track of seqNo and priTerm on the LockModel may be useful to resolve race conditions in rare (but not impossible) scenarios.

Consider a job-scheduler with no jitter:

  1. Multiple jobs start simultaneously.
  2. Each job first tries to find the lock to see if it's released.
  3. Each job finds the same released LockModel, including its seqNo and priTerm.
  4. One of the jobs "wins" the lock by updating the LockMetadata, setting released to false and updating the lockTime.
  5. The other jobs try to update the LockMetadata, but their seqNo (and priTerm) are now outdated, so their attempt to acquire the lock fails.

This is known as the thundering herd problem.
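The five steps of this race can be sketched as a small Java simulation. It is a hypothetical model, not job-scheduler code: `LockIndex`, `find`, `acquire`, and `race` are illustrative names, and a `synchronized` comparison of seqNo/primaryTerm stands in for OpenSearch's conditional write. A `CountDownLatch` forces every job to read the lock before any job writes, which models the no-jitter case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThunderingHerdSketch {

    // Hypothetical snapshot of the lock document; seqNo/primaryTerm mimic OpenSearch metafields.
    record Lock(boolean released, long lockTime, long seqNo, long primaryTerm) {}

    // Hypothetical single-document "index" holding the current lock state.
    static final class LockIndex {
        private Lock current = new Lock(true, 0L, 0L, 1L);

        synchronized Lock find() {
            return current;
        }

        // Conditional write: rejected if the caller's seqNo/primaryTerm are no longer current.
        synchronized boolean acquire(Lock found, long now) {
            if (!found.released()) {
                return false; // the job saw the lock held, so it does not try
            }
            boolean stale = current.seqNo() != found.seqNo()
                    || current.primaryTerm() != found.primaryTerm();
            if (stale) {
                return false; // version conflict: another job updated the lock first
            }
            current = new Lock(false, now, current.seqNo() + 1, current.primaryTerm());
            return true;
        }
    }

    // Starts `jobs` workers that all find the released lock before any of them tries to acquire it.
    static int race(int jobs) {
        LockIndex index = new LockIndex();
        CountDownLatch allFound = new CountDownLatch(jobs);
        ExecutorService pool = Executors.newFixedThreadPool(jobs);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < jobs; i++) {
                results.add(pool.submit(() -> {
                    Lock found = index.find(); // steps 2-3: every job reads released=true, seqNo 0
                    allFound.countDown();
                    allFound.await();          // no jitter: nobody writes until everyone has read
                    return index.acquire(found, System.currentTimeMillis()); // steps 4-5
                }));
            }
            int winners = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    winners++;
                }
            }
            return winners;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("winners: " + race(8)); // prints "winners: 1"
    }
}
```

Removing the `stale` check from `acquire` in this sketch lets every job report success, which is the failure mode the seqNo/priTerm check guards against.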


Labels

None yet

Projects

Status: ✅ Done

1 participant