
Monitor spilled data that is still referenced #6220

Open
@crusaderky

Description


Follow-up to #5936

Under normal operation, a worker holds some in-memory tasks that are relatively at rest while it actively works on others.
Under memory pressure, the tasks at rest are spilled to disk following an LRU policy.
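The LRU spill behaviour described above can be sketched in miniature. This is purely illustrative and is not the actual `SpillBuffer` implementation: the real one is built on zict mappings and weighs values with `dask.sizeof`, whereas this toy version assumes `bytes` values and uses a pickled dict as a stand-in for disk:

```python
import pickle
from collections import OrderedDict


class TinyLRUSpillBuffer:
    """Toy sketch of LRU spilling: keep values in memory up to a byte
    budget, spilling least-recently-used ones to a dict of pickled blobs
    (standing in for on-disk storage). Assumes bytes values."""

    def __init__(self, target_bytes: int):
        self.target = target_bytes
        self.fast: OrderedDict[str, bytes] = OrderedDict()  # in memory, LRU order
        self.slow: dict[str, bytes] = {}  # "disk": pickled blobs

    def __setitem__(self, key: str, value: bytes) -> None:
        self.fast[key] = value
        self.fast.move_to_end(key)  # mark as most recently used
        self._maybe_evict()

    def __getitem__(self, key: str) -> bytes:
        if key in self.slow:  # unspill on access
            self.fast[key] = pickle.loads(self.slow.pop(key))
        self.fast.move_to_end(key)
        self._maybe_evict()
        return self.fast[key]

    def _memory(self) -> int:
        # The real SpillBuffer uses dask.sizeof; len() suffices for bytes.
        return sum(len(v) for v in self.fast.values())

    def _maybe_evict(self) -> None:
        while len(self.fast) > 1 and self._memory() > self.target:
            key, value = self.fast.popitem(last=False)  # least recently used
            self.slow[key] = pickle.dumps(value)
```

Note that eviction happens strictly by recency of access, with no notion of whether a value is still in use; that blind spot is exactly what this issue is about.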

Under extreme memory pressure, however, a key may be spilled to disk while it is in use - either because it's an input to another task that's currently running, or because it's being sent to another worker. When that happens, the data is spilled but its memory is not released until the compute or transfer has finished; in the GUI, its RAM transitions from "managed" to the double effect of "unmanaged recent" plus "spilled".

It would be valuable to separate this kind of memory usage from the opaque unmanaged blob.
This is straightforward after #5936:

from typing import cast  # Cache, Slow, and has_zict_220 are from distributed.spill / zict

class SpillBuffer:
    @property
    def spilled_but_still_referenced(self) -> int:
        """Bytes spilled to disk whose in-memory copy is still referenced."""
        if not has_zict_220:  # requires zict >= 2.2.0 (see #5936)
            return 0
        cache = cast(Cache, self.slow)
        slow = cast(Slow, cache.data)
        # cache.cache holds the keys that are on disk but still held in memory
        return sum(slow.weight_by_key[key].memory for key in cache.cache)

The above is O(n) in the number of active computations and transfers - so negligible most of the time.
The output could be sent to the scheduler during heartbeat and contribute to distributed.scheduler.MemoryState, as already happens for SpillBuffer.spilled_total.
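A hedged sketch of how the worker side could bundle this into its heartbeat payload. The function name `heartbeat_metrics` and the metric keys are illustrative, not the actual `distributed` API; `spilled_total` and the proposed `spilled_but_still_referenced` are assumed to be attributes of the buffer, as discussed above:

```python
from types import SimpleNamespace


def heartbeat_metrics(buffer) -> dict:
    """Illustrative only: fold spill metrics into a heartbeat payload.

    ``buffer`` is assumed to expose ``spilled_total`` (bytes on disk) and
    the ``spilled_but_still_referenced`` property proposed in this issue.
    """
    return {
        "spilled_bytes": buffer.spilled_total,
        "spilled_but_still_referenced": buffer.spilled_but_still_referenced,
    }


# Usage with a stand-in buffer object:
demo = SimpleNamespace(spilled_total=1024, spilled_but_still_referenced=256)
```

On the scheduler side, the new metric could then be subtracted from the unmanaged blob when building MemoryState, rather than being lumped into "unmanaged recent".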

TODO

Come up with a good way to visualize this info in the GUI

OUT OF SCOPE

Use the new metric in algorithms (but feel free to discuss here)
