free_memory() race: cleanup_models finalizer mutates current_loaded_models mid-loop → IndexError / 'NoneType' has no attribute 'is_dynamic'

### Summary

`free_memory()` in `comfy/model_management.py` caches absolute indices into the global, unlocked `current_loaded_models` list and dereferences them later in the same function. If the `cleanup_models()` weakref finalizer (registered via `weakref.finalize(real_model, cleanup_models)`) pops entries from `current_loaded_models` between capture and use, the cached index becomes stale, producing either:

- `IndexError: list index out of range`, or
- `AttributeError: 'NoneType' object has no attribute 'is_dynamic'`

Both originate from the same line. This is a latent race; it became frequent for us under heavy checkpoint-swapping with the threaded loader active. Line numbers below are against current `master` (`comfyui_version` 0.24.0).

### Mechanism (line refs on master)

1. `free_memory` (L801) — first loop appends `(... , i)` tuples to `can_unload`, capturing the **absolute index** `i` into `current_loaded_models` (L811).
2. Second loop `for x in can_unload_sorted:` sets `i = x[-1]` (L816) and dereferences `current_loaded_models[i]`:
   ```python
   if current_loaded_models[i].model.is_dynamic() and for_dynamic:   # L820
   ...
   if memory_to_free > 0 and current_loaded_models[i].model_unload(memory_to_free):  # L825
   ```
3. `model_unload()` (and/or a background thread) drops the last strong reference to a model's `real_model`. When that `real_model` is collected, the finalizer `weakref.finalize(real_model, cleanup_models)` (L734) runs `cleanup_models()` (L988), which **pops every dead entry from `current_loaded_models`** (L995):
   ```python
   def cleanup_models():
       to_delete = []
       for i in range(len(current_loaded_models)):
           if current_loaded_models[i].real_model() is None:
               to_delete = [i] + to_delete
       for i in to_delete:
           x = current_loaded_models.pop(i)   # L995 — mutates the list free_memory is iterating
           del x
   ```
4. After the pop, the cached `i` in `free_memory`'s loop is stale:
   - the list shrank → `current_loaded_models[i]` is out of range → **`IndexError`** at L820/L825; or
   - `i` now points at an entry whose `ModelPatcher` weakref was collected → `current_loaded_models[i].model` is `None` → `None.is_dynamic()` → **`AttributeError: 'NoneType' object has no attribute 'is_dynamic'`** at L820.

`current_loaded_models` has no lock (the only lock in the module, `interrupt_processing_mutex`, is unrelated). The finalizer can fire either **reentrantly** on the same thread during GC inside the loop, or **concurrently** from a background thread — the threaded loader (#14116) and `MultiGPUThreadPool` workers (#7063, `comfy/multigpu.py`) both allocate/free and can trigger collection of a `real_model` while the main thread is inside `free_memory`'s loop. Either way the unlocked list is mutated mid-iteration.

### Representative traceback (mapped to master line numbers)

```
  File "comfy/sd.py", in load_state_dict_guess_config
    load_models_gpu([model_patcher], force_full_load=True)
  File "comfy/model_management.py", in load_models_gpu
    free_memory(total_memory_required[device] * 1.1 + extra_mem, device, ...)
  File "comfy/model_management.py", line 820, in free_memory
    if current_loaded_models[i].model.is_dynamic() and for_dynamic:
IndexError: list index out of range          # or: AttributeError: 'NoneType' object has no attribute 'is_dynamic'
```
A `cleanup_models()` frame (L988/L995) appears in the same log block, confirming the finalizer ran and popped the list during the `free_memory` loop (no "memory leak" warning is logged, i.e. it fired via ordinary GC reclamation, not the leak-detection path).

### Why it is hard to hit on a stock single-load workload

On stock master, `free_memory(for_dynamic=True)` skips unloading dynamic models (`memory_to_free = 0`), so the in-loop `model_unload` calls — which are the most reliable in-loop trigger for the finalizer — only happen for non-dynamic entries. The race is therefore rare on stock, but reachable: any in-loop `model_unload` or any background-thread collection that fires `cleanup_models` while `free_memory` iterates `can_unload_sorted` is enough. We observe it deterministically on a downstream build (based on v0.24.0) that force-unloads in this loop, ~15 distinct task failures / 12h across many checkpoints and pods, all at L820, zero on a 0.21.1-based build.

### Proposed fix

Make `free_memory`'s unload loop independent of the live list's indexing, so any concurrent/reentrant mutation of `current_loaded_models` is harmless:

- Carry the `LoadedModel` **object** in `can_unload` (keep `i` only as a sort tiebreaker), operate on that object, and take a strong local reference to its `ModelPatcher` for the iteration so its weakref can't go `None` mid-body.
- Guard the already-collected case (`model = lm.model; if model is None: continue`).
- Remove unloaded entries by **identity** after the loop (tolerating any the finalizer already popped) instead of `pop(i)` with stale indices — note a bounds check alone is insufficient, since a prior pop shifts every later index and would unload the wrong model.

Sketch:
```python
# first loop: append the object too
can_unload.append((-lm.model_offloaded_memory(), sys.getrefcount(lm.model), lm.model_memory(), i, lm))

can_unload_sorted = sorted(can_unload, key=lambda x: x[:4])
for x in can_unload_sorted:
    lm = x[-1]
    model = lm.model
    if model is None:           # collected after capture; cleanup_models is tearing it down
        continue
    ...
    if memory_to_free > 0 and lm.model_unload(memory_to_free):
        unloaded.append(lm)

# remove by identity, tolerate entries cleanup_models already popped
ids = {id(m) for m in unloaded}
current_loaded_models[:] = [m for m in current_loaded_models if id(m) not in ids]
```

A heavier alternative is to guard all `current_loaded_models` mutations (free_memory, load_models_gpu, cleanup_models) with a `threading.RLock` (RLock because the finalizer can re-enter on the same thread). The index-independent rewrite above fixes the observed crash with much lower risk.

I'm happy to open a PR with the object-iteration fix if it looks reasonable.

### System info / versions

- ComfyUI `0.24.0` (master `00b633f36` has the identical `free_memory`, unchanged since v0.24.0).
- Threaded loader (#14116) and MultiGPU work units (#7063) present.
- Linux, CUDA, multi-pod production workload with frequent checkpoint swaps.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

free_memory() race: cleanup_models finalizer mutates current_loaded_models mid-loop → IndexError / 'NoneType' has no attribute 'is_dynamic' #14365

Summary

Mechanism (line refs on master)

Representative traceback (mapped to master line numbers)

Why it is hard to hit on a stock single-load workload

Proposed fix

System info / versions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

free_memory() race: cleanup_models finalizer mutates current_loaded_models mid-loop → IndexError / 'NoneType' has no attribute 'is_dynamic' #14365

Description

Summary

Mechanism (line refs on master)

Representative traceback (mapped to master line numbers)

Why it is hard to hit on a stock single-load workload

Proposed fix

System info / versions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions