[WIP] Optionally use mmap-ing of files with spilling (#6516)
Conversation
This allows the OS to handle mapping pages between file and memory. As a result, if memory is getting full and some objects are not being used, the OS can free this memory up behind the scenes. Since the scratch directories used for spilling are meant to be at least local disks (not NFS), and are often solid state, IO between disk and memory can be quite fast, so reloading spilled objects is fairly reasonable.
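As a rough stdlib illustration of the idea (this is not the PR's implementation; the file layout and names here are mine), reading a spilled buffer back through `mmap` keeps its pages file-backed, so the OS can evict and transparently re-read them under memory pressure:

```python
import mmap
import os
import tempfile

import numpy as np

# Hypothetical sketch, not distributed's spill code: write an array's bytes
# to a scratch file, then map the file back instead of read()-ing it.
with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "spilled.bin")
    original = np.random.random(2**16)
    original.tofile(path)

    with open(path, "rb") as f:
        # length=0 maps the whole file. The pages are file-backed: under
        # memory pressure the OS can drop them and re-read from disk later.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Zero-copy view over the mapping; no heap allocation for the data.
    restored = np.frombuffer(mm, dtype=original.dtype)
    assert np.array_equal(restored, original)
    del restored  # release the exported buffer so the map can be closed
    mm.close()
```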
distributed/spill.py
-    def __init__(self, spill_directory: str, max_weight: int | Literal[False] = False):
+    def __init__(self, spill_directory: str, max_weight: int | Literal[False] = False, memmap: bool = False):
Probably we want this in the config as well
Should note this is incomplete, as we need the config value noted above
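If it does go in the config, it would presumably live in distributed's YAML defaults. The key name below is purely a guess for illustration, not something defined by this PR:

```yaml
distributed:
  worker:
    memory:
      # Hypothetical key: whether spilled files are read back via mmap
      spill-memmap: false
```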
     super().__init__(
         partial(serialize_bytelist, on_error="raise"),
         deserialize_bytes,
-        zict.File(spill_directory),
+        zict.File(spill_directory, **file_kwargs),  # type: ignore
Style nitpick: please check for has_zict_210 instead, like it already happens above
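The guard being suggested could look roughly like this (the `has_zict_210` flag name comes from the review; the parsing helper, the stand-in version string, and the `memmap` kwarg are my assumptions, not code from the PR):

```python
# Sketch of gating new zict.File kwargs on the installed zict version,
# rather than passing them unconditionally.
def version_tuple(version: str) -> tuple[int, ...]:
    # Crude parse for illustration; real code would use a proper
    # version-parsing helper.
    return tuple(int(part) for part in version.split(".")[:3])

# In distributed this flag would be computed once from zict.__version__;
# "2.1.0" here is a stand-in value.
has_zict_210 = version_tuple("2.1.0") >= (2, 1, 0)

# Only newer zict.File would accept the extra keyword (assumed name).
file_kwargs = {"memmap": True} if has_zict_210 else {}
```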
(Also, your branch is several months behind main.)
It doesn't seem to work:

```python
import dask
import numpy
from distributed.spill import Slow

dask.config.set({"distributed.comm.compression": False})
d = Slow("zictdump", memmap=True)
d["x"] = numpy.random.random(2**27)
x = d["x"]  # RSS memory blows up by 1GB
```

The memory is allocated by

Assuming the above is fixed, I can see several fundamental problems:

In conclusion: I think it would be very useful to take a step back and analyse what, exactly, we are trying to achieve here that the current explicit spill system doesn't already do, and come up with a design as a consequence.
As noted above, I'm not sure this is ready for review. That said, I appreciate the feedback.

Edit: This came out of the discussion in PR #6503
I am curious why compression wouldn't work, as compressors support the Python Buffer Protocol, as does
If you run
The 3 buffers are then written sequentially to a file in your local directory. When you load it back,
This is the theory; in practice, I showed above that there's a bug somewhere that prevents this. With compression, the second and third buffer are passed through
Right, decompression would force it into memory. I thought "not work" meant using the
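A minimal stdlib sketch of that point (zlib standing in for whatever compressor distributed is configured with): the compressed spill file can be fed to the decompressor as a zero-copy mmap, since both sides speak the buffer protocol, but the decompressed result is necessarily a fresh heap allocation, so the OS can no longer page it from the file:

```python
import mmap
import os
import tempfile
import zlib

payload = os.urandom(2**16)  # incompressible, but fine for illustration
with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "spilled.z")
    with open(path, "wb") as f:
        f.write(zlib.compress(payload))

    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # zlib reads straight from the mapping via the buffer protocol...
    restored = zlib.decompress(mm)
    mm.close()

# ...but what comes out is an ordinary in-memory bytes object, detached
# from the file, so mmap's paging benefit is lost for compressed frames.
assert restored == payload
assert isinstance(restored, bytes)
```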