This should help a lot of users running into No space left on device while using datasets. Normally the issue is that /tmp is too small and the user needs to use another path, which they would normally set with export TMPDIR=/some/big/storage
However, the tempfile facility that datasets and pyarrow use is somewhat broken. If the path doesn't exist, it silently ignores it and falls back to /tmp. Watch this:
$ export TMPDIR='/tmp/username'
$ python -c "\
import os
import tempfile
print(os.environ['TMPDIR'])
print(tempfile.gettempdir())"
/tmp/username
/tmp
Now let's ensure the path exists:
$ export TMPDIR='/tmp/username'
$ mkdir -p $TMPDIR
$ python -c "\
import os
import tempfile
print(os.environ['TMPDIR'])
print(tempfile.gettempdir())"
/tmp/username
/tmp/username
So I recommend datasets do either of the 2:
1. assert if the $TMPDIR dir doesn't exist, telling the user to create it
2. auto-create it
The reason for (1) is that I don't know why tempfile doesn't auto-create the dir - perhaps there is some security implication? I will let you guys make the decision, but the key is not to let things silently fall through, leaving the user puzzled as to why, no matter what they do, they can't get past No space left on device while using datasets.
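To make the two options concrete, here is a rough sketch of what such a check could look like. The helper name check_tmpdir and its create flag are made up for illustration and are not part of datasets or pyarrow; where such a check would live in datasets is left open.

```python
import os
import tempfile


def check_tmpdir(create=False):
    """Hypothetical helper: fail loudly (or create the dir) when $TMPDIR is missing."""
    tmpdir = os.environ.get("TMPDIR")
    if not tmpdir:
        # TMPDIR is not set, so there is nothing to validate.
        return tempfile.gettempdir()
    if not os.path.isdir(tmpdir):
        if not create:
            # Option 1: refuse to continue instead of silently using /tmp.
            raise FileNotFoundError(
                f"TMPDIR is set to {tmpdir!r} but that directory does not exist; "
                "create it or point TMPDIR at an existing directory"
            )
        # Option 2: auto-create the directory.
        os.makedirs(tmpdir, exist_ok=True)
    # gettempdir() caches its first answer in tempfile.tempdir, so clear the
    # cache in case it was already computed before the directory existed.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

Calling check_tmpdir() early would cover option 1, and check_tmpdir(create=True) would cover option 2. Resetting tempfile.tempdir matters because an earlier gettempdir() call may already have cached the /tmp fallback.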
Thank you.
I found this via https://stackoverflow.com/questions/37229398/python-tempfile-gettempdir-does-not-respect-tmpdir while trying to help a colleague to solve this exact issue.