Skip to content

free_memory() race: cleanup_models finalizer mutates current_loaded_models mid-loop → IndexError / 'NoneType' has no attribute 'is_dynamic' #14365

@j2gg0s

Description

@j2gg0s

Summary

free_memory() in comfy/model_management.py caches absolute indices into the global, unlocked current_loaded_models list and dereferences them later in the same function. If the cleanup_models() weakref finalizer (registered via weakref.finalize(real_model, cleanup_models)) pops entries from current_loaded_models between capture and use, the cached index becomes stale, producing either:

  • IndexError: list index out of range, or
  • AttributeError: 'NoneType' object has no attribute 'is_dynamic'

Both originate from the same line. This is a latent race; it became frequent for us under heavy checkpoint-swapping with the threaded loader active. Line numbers below are against current master (comfyui_version 0.24.0).

Mechanism (line refs on master)

  1. free_memory (L801) — first loop appends (... , i) tuples to can_unload, capturing the absolute index i into current_loaded_models (L811).
  2. Second loop for x in can_unload_sorted: sets i = x[-1] (L816) and dereferences current_loaded_models[i]:
    if current_loaded_models[i].model.is_dynamic() and for_dynamic:   # L820
    ...
    if memory_to_free > 0 and current_loaded_models[i].model_unload(memory_to_free):  # L825
  3. model_unload() (and/or a background thread) drops the last strong reference to a model's real_model. When that real_model is collected, the finalizer weakref.finalize(real_model, cleanup_models) (L734) runs cleanup_models() (L988), which pops every dead entry from current_loaded_models (L995):
    def cleanup_models():
        to_delete = []
        for i in range(len(current_loaded_models)):
            if current_loaded_models[i].real_model() is None:
                to_delete = [i] + to_delete
        for i in to_delete:
            x = current_loaded_models.pop(i)   # L995 — mutates the list free_memory is iterating
            del x
  4. After the pop, the cached i in free_memory's loop is stale:
    • the list shrank → current_loaded_models[i] is out of range → IndexError at L820/L825; or
    • i now points at an entry whose ModelPatcher weakref was collected → current_loaded_models[i].model is NoneNone.is_dynamic()AttributeError: 'NoneType' object has no attribute 'is_dynamic' at L820.

current_loaded_models has no lock (the only lock in the module, interrupt_processing_mutex, is unrelated). The finalizer can fire either reentrantly on the same thread during GC inside the loop, or concurrently from a background thread — the threaded loader (#14116) and MultiGPUThreadPool workers (#7063, comfy/multigpu.py) both allocate/free and can trigger collection of a real_model while the main thread is inside free_memory's loop. Either way the unlocked list is mutated mid-iteration.

Representative traceback (mapped to master line numbers)

  File "comfy/sd.py", in load_state_dict_guess_config
    load_models_gpu([model_patcher], force_full_load=True)
  File "comfy/model_management.py", in load_models_gpu
    free_memory(total_memory_required[device] * 1.1 + extra_mem, device, ...)
  File "comfy/model_management.py", line 820, in free_memory
    if current_loaded_models[i].model.is_dynamic() and for_dynamic:
IndexError: list index out of range          # or: AttributeError: 'NoneType' object has no attribute 'is_dynamic'

A cleanup_models() frame (L988/L995) appears in the same log block, confirming the finalizer ran and popped the list during the free_memory loop (no "memory leak" warning is logged, i.e. it fired via ordinary GC reclamation, not the leak-detection path).

Why it is hard to hit on a stock single-load workload

On stock master, free_memory(for_dynamic=True) skips unloading dynamic models (memory_to_free = 0), so the in-loop model_unload calls — which are the most reliable in-loop trigger for the finalizer — only happen for non-dynamic entries. The race is therefore rare on stock, but reachable: any in-loop model_unload or any background-thread collection that fires cleanup_models while free_memory iterates can_unload_sorted is enough. We observe it deterministically on a downstream build (based on v0.24.0) that force-unloads in this loop, ~15 distinct task failures / 12h across many checkpoints and pods, all at L820, zero on a 0.21.1-based build.

Proposed fix

Make free_memory's unload loop independent of the live list's indexing, so any concurrent/reentrant mutation of current_loaded_models is harmless:

  • Carry the LoadedModel object in can_unload (keep i only as a sort tiebreaker), operate on that object, and take a strong local reference to its ModelPatcher for the iteration so its weakref can't go None mid-body.
  • Guard the already-collected case (model = lm.model; if model is None: continue).
  • Remove unloaded entries by identity after the loop (tolerating any the finalizer already popped) instead of pop(i) with stale indices — note a bounds check alone is insufficient, since a prior pop shifts every later index and would unload the wrong model.

Sketch:

# first loop: append the object too
can_unload.append((-lm.model_offloaded_memory(), sys.getrefcount(lm.model), lm.model_memory(), i, lm))

can_unload_sorted = sorted(can_unload, key=lambda x: x[:4])
for x in can_unload_sorted:
    lm = x[-1]
    model = lm.model
    if model is None:           # collected after capture; cleanup_models is tearing it down
        continue
    ...
    if memory_to_free > 0 and lm.model_unload(memory_to_free):
        unloaded.append(lm)

# remove by identity, tolerate entries cleanup_models already popped
ids = {id(m) for m in unloaded}
current_loaded_models[:] = [m for m in current_loaded_models if id(m) not in ids]

A heavier alternative is to guard all current_loaded_models mutations (free_memory, load_models_gpu, cleanup_models) with a threading.RLock (RLock because the finalizer can re-enter on the same thread). The index-independent rewrite above fixes the observed crash with much lower risk.

I'm happy to open a PR with the object-iteration fix if it looks reasonable.

System info / versions

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions