Create deadlock detection plugin #124

@bbockelm

Create a new class that performs deadlock detection.

A deadlock is defined as a blocking call that does not return within a preset maximum time (which should be configurable via the xrootd config).

The goal is to create a RAII object that adds itself to a doubly linked list on creation and removes itself on deletion. The linked lists should be created on module load. There should be 15 statically allocated lists; when the object is created, it should take the ID of the CPU the current thread is running on, modulo 15, and use the resulting list for its lifetime. On non-Linux platforms, have a single list and always choose it.

Adding and removing objects from a list should require locking a mutex associated with the list. The array containing the mutex and list head element should be cache line aligned to avoid false sharing.
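A minimal sketch of the sharded lists and the RAII object described above. All names (`DeadlockGuard`, `Shard`, `kNumShards`, `CountLive`) are hypothetical, the 64-byte cache line size is an assumption, and `sched_getcpu()` is only one possible reading of "cpuid of the current thread":

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#if defined(__linux__)
#include <sched.h>  // sched_getcpu(); glibc exposes it when _GNU_SOURCE is set (g++ default)
#endif

// Hypothetical names throughout; not an existing xrootd API.
#if defined(__linux__)
constexpr std::size_t kNumShards = 15;
#else
constexpr std::size_t kNumShards = 1;  // single list on non-Linux
#endif

class DeadlockGuard;

// One list head plus its mutex, aligned to an assumed 64-byte cache line
// so that neighbouring shards do not share a line.
struct alignas(64) Shard {
    std::mutex mtx;
    DeadlockGuard *head = nullptr;
};

// Statically allocated; exists for the lifetime of the module.
static std::array<Shard, kNumShards> g_shards;

class DeadlockGuard {
public:
    DeadlockGuard()
        : m_start(std::chrono::steady_clock::now()), m_shard(g_shards[PickShard()]) {
        std::lock_guard<std::mutex> lk(m_shard.mtx);
        m_next = m_shard.head;  // push at the front of the list
        if (m_next) m_next->m_prev = this;
        m_shard.head = this;
    }
    ~DeadlockGuard() {
        std::lock_guard<std::mutex> lk(m_shard.mtx);
        assert(m_prev != nullptr || m_shard.head == this);
        if (m_prev) m_prev->m_next = m_next; else m_shard.head = m_next;
        if (m_next) m_next->m_prev = m_prev;
    }
    DeadlockGuard(const DeadlockGuard &) = delete;
    DeadlockGuard &operator=(const DeadlockGuard &) = delete;

    std::chrono::steady_clock::time_point Start() const { return m_start; }
    DeadlockGuard *Next() const { return m_next; }  // call with the shard mutex held

private:
    static std::size_t PickShard() {
#if defined(__linux__)
        int cpu = sched_getcpu();
        return cpu < 0 ? 0 : static_cast<std::size_t>(cpu) % kNumShards;
#else
        return 0;
#endif
    }
    std::chrono::steady_clock::time_point m_start;
    Shard &m_shard;  // fixed for the object's lifetime, even if the thread migrates
    DeadlockGuard *m_prev = nullptr;
    DeadlockGuard *m_next = nullptr;
};

// Helper for tests: number of guards currently registered across all shards.
static std::size_t CountLive() {
    std::size_t n = 0;
    for (auto &s : g_shards) {
        std::lock_guard<std::mutex> lk(s.mtx);
        for (DeadlockGuard *g = s.head; g; g = g->Next()) ++n;
    }
    return n;
}
```

The guard stores a reference to its shard at construction, so removal always locks the same mutex that was used for insertion, regardless of which CPU the thread is on by then.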

On module load, a background thread should be created to periodically walk through each list and see if any object meets the deadlock criterion (time elapsed since the object was created is over the threshold). If one does, an error should be logged and the process killed with SIGKILL. If configured to do so, atomically write the time of the event to a separate log file that can be monitored.
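The check and the reporting step could look roughly like the sketch below. The function names are hypothetical, the code is POSIX-only, and "atomically write" is interpreted here as writing a temporary file and renaming it over the target, which is only one possible reading of the requirement:

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <ctime>
#include <fstream>
#include <string>
#include <signal.h>
#include <unistd.h>

using Clock = std::chrono::steady_clock;

// Deadlock criterion from the description: elapsed time since creation
// exceeds the configured threshold.
bool IsDeadlocked(Clock::time_point created, Clock::time_point now,
                  std::chrono::seconds threshold) {
    return now - created > threshold;
}

// Assumption: write to "<path>.tmp" and rename() it into place, so a reader
// sees either the old file or the complete new one, never a partial write.
bool WriteEventTime(const std::string &path, std::time_t when) {
    assert(!path.empty());
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        if (!out) return false;
        out << static_cast<long long>(when) << '\n';
        out.flush();
        if (!out) return false;
    }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Log the error, optionally record the event time, then kill the process.
[[noreturn]] void ReportDeadlockAndDie(const std::string &event_log) {
    std::fprintf(stderr, "deadlock detected: blocking call exceeded threshold\n");
    if (!event_log.empty()) WriteEventTime(event_log, std::time(nullptr));
    kill(getpid(), SIGKILL);
    _exit(1);  // not reached; SIGKILL cannot be caught or ignored
}
```

Keeping `IsDeadlocked` a pure function of two time points makes the threshold logic testable without actually stalling a thread.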

On module unload, a condition variable should be notified that causes the thread to exit (join the background thread from the main one).
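The load/unload lifecycle might be sketched as follows. `ScanThread` is a hypothetical name and the scan callback stands in for the list walk; the condition variable doubles as the periodic timer, so notifying it wakes the thread immediately instead of waiting out the current period:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper: runs `scan` every `period` until stopped.
class ScanThread {
public:
    ScanThread(std::chrono::milliseconds period, std::function<void()> scan)
        : m_period(period), m_scan(std::move(scan)), m_thread([this] { Run(); }) {
        assert(period.count() > 0);  // a zero period would spin continuously
    }
    ~ScanThread() { Stop(); }

    // Called on module unload: notify the condition variable, then join.
    void Stop() {
        {
            std::lock_guard<std::mutex> lk(m_mtx);
            m_stop = true;
        }
        m_cv.notify_all();
        if (m_thread.joinable()) m_thread.join();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lk(m_mtx);
        // wait_for returns the predicate value: false on timeout (time to
        // scan), true once a stop has been requested.
        while (!m_cv.wait_for(lk, m_period, [this] { return m_stop; })) {
            lk.unlock();
            m_scan();
            lk.lock();
        }
    }

    std::chrono::milliseconds m_period;
    std::function<void()> m_scan;
    std::mutex m_mtx;
    std::condition_variable m_cv;
    bool m_stop = false;
    std::thread m_thread;  // declared last so it starts after the other members exist
};
```

Using the predicate overload of `wait_for` avoids both spurious wakeups and a missed notification if `Stop()` runs before the thread first waits.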

Once the deadlock detection object is designed, create an OSS and xrootd authorization plugin (see github.com/xrootd/xrootd) wrapper that creates the object on the stack and then forwards the call to the wrapped object.
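The wrapper pattern can be illustrated with a simplified stand-in. `Storage`, `GuardedStorage`, and `ProbeStorage` are hypothetical and do not reproduce the real OSS or authorization interfaces, which have many more methods; the `DeadlockGuard` here is a trivial counter rather than the list-based object:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Stand-in for the real guard: only counts live instances (not thread-safe).
struct DeadlockGuard {
    static int live;
    DeadlockGuard() { ++live; }
    ~DeadlockGuard() { --live; }
};
int DeadlockGuard::live = 0;

// Hypothetical, heavily simplified plugin interface.
class Storage {
public:
    virtual ~Storage() = default;
    virtual int Stat(const std::string &path) = 0;
};

// Wrapper: create the guard on the stack, then forward to the wrapped object.
class GuardedStorage : public Storage {
public:
    explicit GuardedStorage(std::unique_ptr<Storage> wrapped)
        : m_wrapped(std::move(wrapped)) {
        assert(m_wrapped);  // a wrapper without a wrapped object is a setup error
    }
    int Stat(const std::string &path) override {
        DeadlockGuard guard;  // registered for exactly the duration of the call
        return m_wrapped->Stat(path);
    }

private:
    std::unique_ptr<Storage> m_wrapped;
};

// Dummy backend that reports how many guards are alive while it runs.
class ProbeStorage : public Storage {
public:
    int Stat(const std::string &) override { return DeadlockGuard::live; }
};
```

Every forwarded method would follow the same two-line shape, so the guard is released on any exit path, including exceptions.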

Create unit tests. At least one should launch multiple threads, each of which rapidly creates and destroys deadlock detection objects. One test should be a "death test" that triggers the SIGKILL.

Create an integration test that wraps both the auth module and OSS. Wrap around dummy modules that simply stall out, triggering the SIGKILL. Ensure that the xrootd process is killed. Have a positive test as well, making sure nothing happens if the runtime is below threshold.

Update the README to indicate how to use the new deadlock detection.
