Missing filesystem mutexes & mutex debugging #6536

Draft · wants to merge 2 commits into base: main

Conversation

LeeBrotherston

Background

I noticed a number of filesystem issues that appear consistent with data being written to disk incorrectly, potentially due to race conditions or similar problems. To investigate, I added some code to debug the mutexes that are in place to prevent such issues.

This pull request therefore includes several changes to the server package, primarily adding mutex debugging capabilities along with a new error type.

The resulting debug output revealed several locations where functions that require a lock were being called without one, so I have also added the missing mutex lock/unlock calls to address those cases.

Mutex Debugging Enhancements:

  • Added hasLock, preEmptLock, hasLockthing, and logUnlock functions to check and log the status of mutex locks, assisting in debugging lock-related issues (server/util.go); a hedged sketch of how these helpers might look follows this list. Specifically:
    • hasLock checks whether a lock is held on a particular mutex. I have added it to a number of functions in server/filestore.go which are labelled "must hold lock", so that the function in question returns an error of type ErrNoLockHeld if the lock is not held. This may not prove to be a long-term solution, but it is useful for diagnosing where mutex lock acquisition is missing.
    • preEmptLock checks whether a lock is held on a particular mutex and acquires one if it is not. I have added it to a number of functions in server/filestore.go which are labelled "must hold lock" but do not return an error type, and so cannot fail as gracefully as the hasLock approach. This way they will at least hold a lock for the duration of their operation. I should stress that this is not intended as a long-term solution, rather a band-aid to determine where mutex lock acquisition is missing.
    • hasLockthing is an internal helper that nothing else needs to call directly; it provides the shared functionality behind hasLock and preEmptLock.
    • logUnlock is the counterpart to these functions: it simply unlocks a mutex, with some added logging.
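
The following is a minimal, hedged sketch of what such helpers could look like; it is not the PR's actual code. It assumes the guarded lock is a plain sync.Mutex and uses TryLock (Go 1.18+) as a heuristic: Go cannot report which goroutine holds a mutex, but if TryLock succeeds, nobody was holding it, which is exactly the condition worth flagging. The logMissingLock call refers to the logging sketch shown after the next paragraph.

```go
// Illustrative only -- file and function layout are assumptions, not the PR's code.
package server

import (
	"errors"
	"sync"
)

// ErrNoLockHeld is returned when a "must hold lock" function is entered
// without the required mutex being held.
var ErrNoLockHeld = errors.New("required mutex lock is not held")

// hasLock reports whether mu appears to be held. If TryLock succeeds, the
// mutex was free, so we release it again, log a stack trace, and return
// ErrNoLockHeld so the caller can surface the problem.
func hasLock(mu *sync.Mutex) error {
	if mu.TryLock() {
		mu.Unlock()
		logMissingLock() // see the logging sketch below
		return ErrNoLockHeld
	}
	return nil
}

// preEmptLock acquires mu if it does not appear to be held and reports
// whether it did so, so the caller knows it must release the lock when done.
func preEmptLock(mu *sync.Mutex) bool {
	if mu.TryLock() {
		logMissingLock()
		return true // lock taken on the caller's behalf
	}
	return false // already locked upstream, nothing to do
}

// logUnlock releases mu; the extra logging described above is elided here.
func logUnlock(mu *sync.Mutex) {
	mu.Unlock()
}
```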

The above functions check the environment variable MUTEX_CHECK_DEBUG_FILE. If it is set, then each time hasLock or preEmptLock discovers that a mutex lock is not held (i.e. something that was supposed to acquire a lock before calling a function did not), a line is appended to the file named in the variable. Each line is a JSON representation of a stack trace, used to debug which chain of calls ended up reaching a lock-requiring function without the lock; the format makes quick analysis of the output very easy using jq or similar.
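
As a point of reference, here is a hedged sketch of how such a logging helper could work; the missingLockEvent type and the exact output fields are assumptions of this sketch, not the PR's code. It appends one JSON object per detected miss to the file named by MUTEX_CHECK_DEBUG_FILE, so the output can be streamed straight into jq.

```go
// Illustrative only; same assumed package as the sketch above.
package server

import (
	"encoding/json"
	"fmt"
	"os"
	"runtime"
)

// missingLockEvent is one line of debug output: the call stack that reached
// a lock-requiring function without holding the lock.
type missingLockEvent struct {
	Stack []string `json:"stack"`
}

// logMissingLock appends a JSON-encoded stack trace to the file named by
// MUTEX_CHECK_DEBUG_FILE. If the variable is unset, it does nothing.
func logMissingLock() {
	path := os.Getenv("MUTEX_CHECK_DEBUG_FILE")
	if path == "" {
		return // debugging disabled
	}

	// Capture the calling stack, skipping runtime.Callers and this function.
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs)
	frames := runtime.CallersFrames(pcs[:n])
	var ev missingLockEvent
	for {
		frame, more := frames.Next()
		ev.Stack = append(ev.Stack, fmt.Sprintf("%s %s:%d", frame.Function, frame.File, frame.Line))
		if !more {
			break
		}
	}

	line, err := json.Marshal(ev)
	if err != nil {
		return
	}

	// One JSON object per line keeps the file easy to filter with jq.
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return
	}
	defer f.Close()
	f.Write(append(line, '\n'))
}
```

With one object per line, something like jq '.stack[0]' on the output file quickly shows which call sites are tripping the check.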

New Error Types:

  • Introduced the ErrNoLockHeld error type to handle cases where an expected lock is not held.

Configuration Updates:

  • Added a placeholder comment for mutex debugging options in the Options struct (server/opts.go).

Signed-off-by: Lee Brotherston [email protected]

@LeeBrotherston LeeBrotherston changed the base branch from main to release/v2.10.26 February 21, 2025 15:49
@LeeBrotherston LeeBrotherston changed the base branch from release/v2.10.26 to main February 21, 2025 15:49
@neilalexander
Member

The instrumentation here is interesting, but given that it will likely have not-insignificant runtime overheads, I am cautious about bringing it all in at this stage.

That said, if you have found some genuine errors in the locking, is it possible to PR those separately? Thanks!
