
Disable registry cleanup for count and list operations#517

Open
mccauleyp wants to merge 3 commits into Parallels:master from mccauleyp:get-job-count

Conversation


@mccauleyp mccauleyp commented Jan 14, 2026

Description

Fixes #486

This PR fixes an issue that occurs when the dashboard attempts to obtain job registry counts or list job IDs while it is (a) not running in the main thread and (b) those actions trigger a "cleanup" side effect that attempts to invoke job callbacks.

Using the .count property, .get_job_count method, or .get_job_ids method of the BaseRegistry class in RQ has a side effect of invoking the .cleanup method (see: https://github.com/rq/rq/blob/master/rq/registry.py#L236). For some registries, namely StartedJobRegistry and DeferredJobRegistry, this may lead to job callbacks being invoked. That fails when the dashboard is not running in the main thread, because it relies on signal handling that can only happen in the main thread, leading to the error reported in #486.

The fix here is simply to pass the cleanup=False option to disable the cleanup operation from the dashboard for all registries, on the principle that the dashboard should generally not be responsible for mutating the state of the job queue as a side effect of displaying it.
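To make the pattern concrete, here is a minimal, self-contained sketch (a mock, not RQ's actual implementation) of a registry whose count read triggers a cleanup side effect by default and which exposes a cleanup=False opt-out, mirroring what this PR relies on:

```python
class MockRegistry:
    """Illustrative stand-in for an RQ registry (hypothetical, not RQ code)."""

    def __init__(self, job_ids):
        self.job_ids = list(job_ids)
        self.cleanup_calls = 0

    def cleanup(self):
        # In real RQ, this step may invoke job callbacks via signal-based
        # timeouts, which is the failure mode described in #486.
        self.cleanup_calls += 1

    def get_job_count(self, cleanup=True):
        if cleanup:
            self.cleanup()
        return len(self.job_ids)


registry = MockRegistry(["job-1", "job-2"])
registry.get_job_count()               # default read: triggers cleanup
registry.get_job_count(cleanup=False)  # dashboard-style read: no side effect
print(registry.cleanup_calls)          # 1
```

The dashboard change is the same idea applied to RQ's real registries: every count/list call it makes passes cleanup=False.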

Why disable cleanup

Rather than disabling the cleanup universally, as this PR does, we could instead focus on the specific registries that are affected and/or give the dashboard different behavior depending on whether it's running in the main thread. I can revise this PR if that's the consensus, but I don't think the dashboard is the right place to trigger registry cleanup because:

  1. It runs in a web request context (potentially threaded, which triggers "Signal Error when UnixSignalDeathPenalty is triggered" #486)
  2. Custom exception handlers registered on workers won't be invoked if cleanup runs from the dashboard, because the dashboard doesn't have access to them
  3. Workers already handle cleanup reliably
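Point 1 comes down to a Python-level restriction: signal handlers can only be installed from the main thread of the main interpreter, and RQ's UnixSignalDeathPenalty installs a SIGALRM handler. A small standalone demonstration of that restriction (Unix-only, since it uses SIGALRM):

```python
import signal
import threading

# signal.signal() raises ValueError when called from a non-main thread,
# which is exactly what a threaded dashboard request hits when cleanup
# tries to set up a signal-based death penalty.
result = {}


def try_install_handler():
    try:
        signal.signal(signal.SIGALRM, lambda signum, frame: None)
        result["error"] = None
    except ValueError as exc:
        result["error"] = str(exc)


t = threading.Thread(target=try_install_handler)
t.start()
t.join()
print(result["error"])  # e.g. "signal only works in main thread ..."
```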

RQ's built-in cleanup mechanism

Workers run maintenance every 10 minutes (https://github.com/rq/rq/blob/master/rq/defaults.py#L66):

DEFAULT_MAINTENANCE_TASK_INTERVAL = 10 * 60

Workers check if maintenance should run (https://github.com/rq/rq/blob/master/rq/worker.py#L427-L433):

def should_run_maintenance_tasks(self):
    """Maintenance tasks should run on first startup or every 10 minutes."""
    if self.last_cleaned_at is None:
        return True
    if (now() - self.last_cleaned_at) > timedelta(seconds=self.maintenance_interval):
        return True
    return False
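Under these rules a worker runs maintenance on first startup and then again once the interval has elapsed. A standalone sketch of that logic (my own restatement, using the default 10-minute interval from rq/defaults.py):

```python
from datetime import datetime, timedelta, timezone

# Matches DEFAULT_MAINTENANCE_TASK_INTERVAL = 10 * 60 in rq/defaults.py.
MAINTENANCE_INTERVAL = 10 * 60


def should_run_maintenance(last_cleaned_at, now=None):
    # Mirrors the worker logic quoted above: run on first startup,
    # then again once the interval has elapsed.
    now = now or datetime.now(timezone.utc)
    if last_cleaned_at is None:
        return True
    return (now - last_cleaned_at) > timedelta(seconds=MAINTENANCE_INTERVAL)


now = datetime.now(timezone.utc)
print(should_run_maintenance(None))                                 # True
print(should_run_maintenance(now - timedelta(minutes=5), now=now))  # False
print(should_run_maintenance(now - timedelta(minutes=11), now=now)) # True
```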

Workers call clean_registries() with proper exception handlers (https://github.com/rq/rq/blob/master/rq/worker.py#L458-L469):

def clean_registries(self):
    """Runs maintenance jobs on each Queue's registries."""
    for queue in self.queues:
        if queue.acquire_maintenance_lock():
            self.log.info('Cleaning registries for queue: %s', queue.name)
            clean_registries(queue, self._exc_handlers)
            ...

clean_registries() cleans all registry types (https://github.com/rq/rq/blob/master/rq/registry.py#L551-L571):

def clean_registries(queue: 'Queue', exception_handlers: Optional[list] = None):
    FinishedJobRegistry(...).cleanup()
    StartedJobRegistry(...).cleanup(exception_handlers=exception_handlers)
    FailedJobRegistry(...).cleanup()
    DeferredJobRegistry(...).cleanup()

A distributed lock prevents redundant cleanup across workers (https://github.com/rq/rq/blob/master/rq/queue.py#L239-L249):

def acquire_maintenance_lock(self) -> bool:
    """A lock expires in 899 seconds (15 minutes - 1 second)"""
    lock_acquired = self.connection.set(self.registry_cleaning_key, 1, nx=True, ex=899)
    ...
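The SET with nx=True succeeds only when the key does not already exist, so among concurrent workers exactly one acquires the lock per expiry window. An illustrative dict-backed stand-in for Redis (hypothetical FakeRedis class, not a real client) showing that semantics:

```python
class FakeRedis:
    """Minimal stand-in for a Redis connection's SET NX behavior."""

    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False, ex=None):
        # Redis returns nil (None) when NX is set and the key already
        # exists; the expiry (ex) is ignored in this sketch.
        if nx and key in self.store:
            return None
        self.store[key] = value
        return True


conn = FakeRedis()
acquired = [bool(conn.set("rq:clean_registries:default", 1, nx=True, ex=899))
            for _ in range(3)]
print(acquired)  # [True, False, False]
```

Three workers racing for the same lock: only the first one wins, so registry cleanup still runs at most once per interval per queue.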

So this PR does not mean that the cleanup won't happen anymore, just that the dashboard won't be triggering it.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have run tests (pytest) that prove my fix is effective or that my feature works
  • I have updated the CHANGELOG.md file accordingly
  • I have added tests that prove my fix is effective or that my feature works

@mccauleyp (Author)

I can't seem to add reviewers; it looks like the CODEOWNERS didn't kick in as expected, so I'm tagging that list here: @cjlapao, @ducu, @eoranged, @joostdevries, @nvie, @snegov



Development

Successfully merging this pull request may close these issues.

Signal Error when UnixSignalDeathPenalty is triggered
