Description
Follow-up to #5936
In a normal situation, a worker holds in memory some task results that are relatively at rest while it actively works on others.
Under memory pressure, the data at rest is spilled to disk using an LRU algorithm.
Under extreme memory pressure, however, a key may be spilled to disk while it is in use, either because it is an input to a task that is currently running or because it is being sent to another worker. When that happens, the data is written to disk but the memory is not released until the compute or send has finished; in the GUI, its RAM transitions from "managed" to the double effect of "unmanaged recent" plus "spilled".
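As a rough illustration in plain Python (not actual distributed code), spilling a value that is still referenced frees nothing, because the running task keeps the object alive on the heap:

```python
import sys

fast = {"x": bytearray(2**20)}  # ~1 MiB of "managed" memory on the worker
in_use = fast["x"]              # a running task still references the value

with open("/tmp/x.bin", "wb") as f:  # "spill": the bytes now also live on disk
    f.write(in_use)
del fast["x"]                   # key evicted from the fast (in-memory) buffer

# The 1 MiB is still resident: the task's reference keeps it alive, so it is
# now counted as "unmanaged recent" plus "spilled" instead of "managed".
assert sys.getrefcount(in_use) >= 2
```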
It would be valuable to separate this kind of memory usage from the opaque unmanaged blob.
This is straightforward after #5936:
```python
# Inside distributed.spill, where has_zict_220, Cache, and Slow are defined
from typing import cast

class SpillBuffer:
    @property
    def spilled_but_still_referenced(self) -> int:
        """Bytes of managed memory for keys that have been spilled to disk
        but whose in-memory value is still referenced by a compute or send.
        """
        if not has_zict_220:  # requires zict >= 2.2
            return 0
        cache = cast(Cache, self.slow)  # zict.Cache wrapping the Slow layer
        slow = cast(Slow, cache.data)
        # cache.cache holds the keys that are on disk *and* still in memory
        return sum(slow.weight_by_key[key].memory for key in cache.cache)
```
The above is O(n) in the number of active computations and transfers, so its cost is negligible most of the time.
The output could be sent to the scheduler with each heartbeat and contribute to `distributed.scheduler.MemoryState`, as already happens for `SpillBuffer.spilled_total`.
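A minimal sketch of the plumbing, assuming a hypothetical heartbeat payload shape (the `spilled_in_use` field name and the functions below are illustrative, not the actual worker or scheduler API):

```python
# Hypothetical sketch; names are illustrative, not distributed's real API.
def worker_heartbeat_metrics(worker) -> dict:
    buf = worker.data  # the worker's SpillBuffer
    return {
        "spilled": buf.spilled_total,                        # reported today
        "spilled_in_use": buf.spilled_but_still_referenced,  # new metric
    }

# On the scheduler side, a MemoryState-like view could then carve this
# figure out of the opaque unmanaged blob, e.g.:
# unmanaged_recent_adjusted = unmanaged_recent - spilled_in_use
```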
TODO
Come up with a good way to visualize this info in the GUI
OUT OF SCOPE
Use the new metric in algorithms (but feel free to discuss here)