Summary
free_memory() in comfy/model_management.py caches absolute indices into the global, unlocked current_loaded_models list and dereferences them later in the same function. If the cleanup_models() weakref finalizer (registered via weakref.finalize(real_model, cleanup_models)) pops entries from current_loaded_models between capture and use, the cached index becomes stale, producing either:
IndexError: list index out of range, or
AttributeError: 'NoneType' object has no attribute 'is_dynamic'
Both originate from the same line. This is a latent race; it became frequent for us under heavy checkpoint-swapping with the threaded loader active. Line numbers below are against current master (comfyui_version 0.24.0).
Mechanism (line refs on master)
free_memory (L801) — first loop appends (... , i) tuples to can_unload, capturing the absolute index i into current_loaded_models (L811).
- Second loop
for x in can_unload_sorted: sets i = x[-1] (L816) and dereferences current_loaded_models[i]:
if current_loaded_models[i].model.is_dynamic() and for_dynamic: # L820
...
if memory_to_free > 0 and current_loaded_models[i].model_unload(memory_to_free): # L825
model_unload() (and/or a background thread) drops the last strong reference to a model's real_model. When that real_model is collected, the finalizer weakref.finalize(real_model, cleanup_models) (L734) runs cleanup_models() (L988), which pops every dead entry from current_loaded_models (L995):
def cleanup_models():
to_delete = []
for i in range(len(current_loaded_models)):
if current_loaded_models[i].real_model() is None:
to_delete = [i] + to_delete
for i in to_delete:
x = current_loaded_models.pop(i) # L995 — mutates the list free_memory is iterating
del x
- After the pop, the cached
i in free_memory's loop is stale:
- the list shrank →
current_loaded_models[i] is out of range → IndexError at L820/L825; or
i now points at an entry whose ModelPatcher weakref was collected → current_loaded_models[i].model is None → None.is_dynamic() → AttributeError: 'NoneType' object has no attribute 'is_dynamic' at L820.
current_loaded_models has no lock (the only lock in the module, interrupt_processing_mutex, is unrelated). The finalizer can fire either reentrantly on the same thread during GC inside the loop, or concurrently from a background thread — the threaded loader (#14116) and MultiGPUThreadPool workers (#7063, comfy/multigpu.py) both allocate/free and can trigger collection of a real_model while the main thread is inside free_memory's loop. Either way the unlocked list is mutated mid-iteration.
Representative traceback (mapped to master line numbers)
File "comfy/sd.py", in load_state_dict_guess_config
load_models_gpu([model_patcher], force_full_load=True)
File "comfy/model_management.py", in load_models_gpu
free_memory(total_memory_required[device] * 1.1 + extra_mem, device, ...)
File "comfy/model_management.py", line 820, in free_memory
if current_loaded_models[i].model.is_dynamic() and for_dynamic:
IndexError: list index out of range # or: AttributeError: 'NoneType' object has no attribute 'is_dynamic'
A cleanup_models() frame (L988/L995) appears in the same log block, confirming the finalizer ran and popped the list during the free_memory loop (no "memory leak" warning is logged, i.e. it fired via ordinary GC reclamation, not the leak-detection path).
Why it is hard to hit on a stock single-load workload
On stock master, free_memory(for_dynamic=True) skips unloading dynamic models (memory_to_free = 0), so the in-loop model_unload calls — which are the most reliable in-loop trigger for the finalizer — only happen for non-dynamic entries. The race is therefore rare on stock, but reachable: any in-loop model_unload or any background-thread collection that fires cleanup_models while free_memory iterates can_unload_sorted is enough. We observe it deterministically on a downstream build (based on v0.24.0) that force-unloads in this loop, ~15 distinct task failures / 12h across many checkpoints and pods, all at L820, zero on a 0.21.1-based build.
Proposed fix
Make free_memory's unload loop independent of the live list's indexing, so any concurrent/reentrant mutation of current_loaded_models is harmless:
- Carry the
LoadedModel object in can_unload (keep i only as a sort tiebreaker), operate on that object, and take a strong local reference to its ModelPatcher for the iteration so its weakref can't go None mid-body.
- Guard the already-collected case (
model = lm.model; if model is None: continue).
- Remove unloaded entries by identity after the loop (tolerating any the finalizer already popped) instead of
pop(i) with stale indices — note a bounds check alone is insufficient, since a prior pop shifts every later index and would unload the wrong model.
Sketch:
# first loop: append the object too
can_unload.append((-lm.model_offloaded_memory(), sys.getrefcount(lm.model), lm.model_memory(), i, lm))
can_unload_sorted = sorted(can_unload, key=lambda x: x[:4])
for x in can_unload_sorted:
lm = x[-1]
model = lm.model
if model is None: # collected after capture; cleanup_models is tearing it down
continue
...
if memory_to_free > 0 and lm.model_unload(memory_to_free):
unloaded.append(lm)
# remove by identity, tolerate entries cleanup_models already popped
ids = {id(m) for m in unloaded}
current_loaded_models[:] = [m for m in current_loaded_models if id(m) not in ids]
A heavier alternative is to guard all current_loaded_models mutations (free_memory, load_models_gpu, cleanup_models) with a threading.RLock (RLock because the finalizer can re-enter on the same thread). The index-independent rewrite above fixes the observed crash with much lower risk.
I'm happy to open a PR with the object-iteration fix if it looks reasonable.
System info / versions
Summary
free_memory()incomfy/model_management.pycaches absolute indices into the global, unlockedcurrent_loaded_modelslist and dereferences them later in the same function. If thecleanup_models()weakref finalizer (registered viaweakref.finalize(real_model, cleanup_models)) pops entries fromcurrent_loaded_modelsbetween capture and use, the cached index becomes stale, producing either:IndexError: list index out of range, orAttributeError: 'NoneType' object has no attribute 'is_dynamic'Both originate from the same line. This is a latent race; it became frequent for us under heavy checkpoint-swapping with the threaded loader active. Line numbers below are against current
master(comfyui_version0.24.0).Mechanism (line refs on master)
free_memory(L801) — first loop appends(... , i)tuples tocan_unload, capturing the absolute indexiintocurrent_loaded_models(L811).for x in can_unload_sorted:setsi = x[-1](L816) and dereferencescurrent_loaded_models[i]:model_unload()(and/or a background thread) drops the last strong reference to a model'sreal_model. When thatreal_modelis collected, the finalizerweakref.finalize(real_model, cleanup_models)(L734) runscleanup_models()(L988), which pops every dead entry fromcurrent_loaded_models(L995):iinfree_memory's loop is stale:current_loaded_models[i]is out of range →IndexErrorat L820/L825; orinow points at an entry whoseModelPatcherweakref was collected →current_loaded_models[i].modelisNone→None.is_dynamic()→AttributeError: 'NoneType' object has no attribute 'is_dynamic'at L820.current_loaded_modelshas no lock (the only lock in the module,interrupt_processing_mutex, is unrelated). The finalizer can fire either reentrantly on the same thread during GC inside the loop, or concurrently from a background thread — the threaded loader (#14116) andMultiGPUThreadPoolworkers (#7063,comfy/multigpu.py) both allocate/free and can trigger collection of areal_modelwhile the main thread is insidefree_memory's loop. Either way the unlocked list is mutated mid-iteration.Representative traceback (mapped to master line numbers)
A
cleanup_models()frame (L988/L995) appears in the same log block, confirming the finalizer ran and popped the list during thefree_memoryloop (no "memory leak" warning is logged, i.e. it fired via ordinary GC reclamation, not the leak-detection path).Why it is hard to hit on a stock single-load workload
On stock master,
free_memory(for_dynamic=True)skips unloading dynamic models (memory_to_free = 0), so the in-loopmodel_unloadcalls — which are the most reliable in-loop trigger for the finalizer — only happen for non-dynamic entries. The race is therefore rare on stock, but reachable: any in-loopmodel_unloador any background-thread collection that firescleanup_modelswhilefree_memoryiteratescan_unload_sortedis enough. We observe it deterministically on a downstream build (based on v0.24.0) that force-unloads in this loop, ~15 distinct task failures / 12h across many checkpoints and pods, all at L820, zero on a 0.21.1-based build.Proposed fix
Make
free_memory's unload loop independent of the live list's indexing, so any concurrent/reentrant mutation ofcurrent_loaded_modelsis harmless:LoadedModelobject incan_unload(keepionly as a sort tiebreaker), operate on that object, and take a strong local reference to itsModelPatcherfor the iteration so its weakref can't goNonemid-body.model = lm.model; if model is None: continue).pop(i)with stale indices — note a bounds check alone is insufficient, since a prior pop shifts every later index and would unload the wrong model.Sketch:
A heavier alternative is to guard all
current_loaded_modelsmutations (free_memory, load_models_gpu, cleanup_models) with athreading.RLock(RLock because the finalizer can re-enter on the same thread). The index-independent rewrite above fixes the observed crash with much lower risk.I'm happy to open a PR with the object-iteration fix if it looks reasonable.
System info / versions
0.24.0(master00b633f36has the identicalfree_memory, unchanged since v0.24.0).