Description
Current design
- Every time a new key is inserted in `Worker.data`, if the managed memory (the output of `sizeof`) exceeds the `target` threshold, keys are spilled from the bottom of the LRU cache until the managed memory goes back below `target`. This is a synchronous process that does not release the event loop. This isn't great, but it is bounded, in the sense that it will never spill more bytes than the size of the key that has just been inserted.
- Every 100ms (`distributed.worker.memory.monitor-interval`), measure the process memory through psutil. If the process memory exceeds the `spill` threshold, start spilling keys until the process memory goes below the `target` threshold (hysteresis cycle). This process re-measures process memory, calls garbage collection, and releases the event loop multiple times, and can potentially take many seconds. (The relevant configuration keys are shown in the sketch after this list.)
The intent of this design is to have a very responsive, cheap, but inaccurate first threshold and a slow-to-notice, expensive, but accurate second one. The design, however, is problematic:

- When unmanaged memory (process minus managed) is very high, e.g. due to a leak, a high heap from the running user functions, or an underestimated output of `sizeof()`. In the extreme case of a memory leak, you're going to reach the `spill` threshold without ever having hit the `target` threshold, and then spill the whole contents of `Worker.data` all at once (see the worked example after this list).
- When unmanaged memory is negative, due to an overestimated output of `sizeof()`. This will cause `target` to start spilling too soon, when there's plenty of memory still available.
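A hypothetical worked example of the leak scenario, using a 16 GB `memory_limit` and the illustrative threshold values from the sketch above (all numbers are assumptions for illustration only):

```python
memory_limit = 16e9          # bytes; hypothetical worker memory_limit
target, spill = 0.60, 0.70   # illustrative threshold fractions

managed = 2e9                # sizeof() estimate of the contents of Worker.data
unmanaged = 10e9             # e.g. a leak or a large heap inside user functions
process = managed + unmanaged

# The target mechanism never fires: managed (2 GB) is well below 9.6 GB...
assert managed < target * memory_limit
# ...but the monitor sees process memory (12 GB) above the 11.2 GB spill
# threshold, and at that point it spills everything in Worker.data at once.
assert process > spill * memory_limit
```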
Proposed design
In zict:
- Add an `offset` property to `zict.LRU`. This property is added to `total_weights` for the purpose of eviction (a sketch follows below).
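A minimal toy sketch of the idea — not zict's actual implementation; the exact semantics of `offset` and the method names here are assumptions based on the description above:

```python
from collections import OrderedDict


class ToyLRU:
    """Toy LRU that evicts oldest items while weight + offset exceeds ``n``."""

    def __init__(self, n, slow, weight=len):
        self.n = n              # eviction threshold, e.g. the target threshold in bytes
        self.slow = slow        # spill destination, e.g. a disk-backed mapping
        self.weight = weight
        self.offset = 0         # externally-set extra weight, e.g. unmanaged memory
        self.d = OrderedDict()
        self.weights = {}
        self.total_weight = 0

    def __setitem__(self, key, value):
        if key in self.d:       # replacing an existing key: drop its old weight first
            self.total_weight -= self.weights.pop(key)
            del self.d[key]
        w = self.weight(value)
        self.d[key] = value
        self.weights[key] = w
        self.total_weight += w
        self.evict_until_below_target()

    def evict_until_below_target(self):
        # The offset shifts the effective fill level: a large amount of unmanaged
        # memory forces managed keys out even if their own weights are small.
        while self.d and self.total_weight + self.offset > self.n:
            key, value = self.d.popitem(last=False)   # least recently used
            self.total_weight -= self.weights.pop(key)
            self.slow[key] = value                    # "spill"
```

Note that `evict_until_below_target()` can also be called explicitly, which is what the "manually trigger spilling in zict" step below relies on.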
In `distributed.worker_memory`:
- Every 100ms, measure process memory and calculate unmanaged memory.
- If process memory is above the `spill` threshold and there is data in `Worker.fast`, garbage collect and re-measure it.
- Update `Worker.data.fast.offset` to the amount of unmanaged memory.
- Manually trigger spilling in zict (a rough sketch of this callback follows the list).
In `distributed.worker_state_machine._transition_to_memory`, `distributed.Worker.execute`, and `distributed.Worker.get_data`: no change, but now the offset is considered every time a key is inserted in `fast`.
Notes
- This could cause zict to synchronously spill many GiBs at once, without ever releasing the event loop. This change should be paired with Asynchronous Disk Access in Workers #4424.
- Leaving the current thresholds unchanged, you'll start spilling a lot earlier; effectively, `target` becomes the new `spill`. I think it's safe to bump both by 0.1, making `spill` the same as `pause` (see the sketch after this list).
- We should rename "spill" to "aggressive_gc" to clarify its new meaning.
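For example, assuming the current defaults of 0.6 / 0.7 / 0.8 for `target` / `spill` / `pause`, the suggested bump would look something like this (illustrative values, not a tested recommendation):

```python
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.70,   # was 0.60
    "distributed.worker.memory.spill": 0.80,    # was 0.70; now equal to pause
    "distributed.worker.memory.pause": 0.80,    # unchanged
})
```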