work around tempfile silently ignoring TMPDIR if the dir doesn't exist #7877

@stas00

This should help a lot of users running into No space left on device while using datasets. Normally the issue is that /tmp is too small and the user needs to use another path, which they would normally set with export TMPDIR=/some/big/storage

However, the tempfile facility that datasets and pyarrow use is somewhat broken: if the path in TMPDIR doesn't exist, it silently ignores it and falls back to /tmp. Watch this:

$ export TMPDIR='/tmp/username' 

$ python -c "\
import os
import tempfile
print(os.environ['TMPDIR'])
print(tempfile.gettempdir())"
/tmp/username
/tmp

Now let's ensure the path exists:

$ export TMPDIR='/tmp/username' 
$ mkdir -p $TMPDIR
$ python -c "\
import os
import tempfile
print(os.environ['TMPDIR'])
print(tempfile.gettempdir())"
/tmp/username
/tmp/username
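
The same check can be done from inside Python. This is only a minimal sketch: tmpdir_was_ignored is a hypothetical helper name, not something datasets or tempfile provides. It compares what the user asked for with what tempfile actually picked. Note that tempfile.gettempdir() caches its answer after the first call, so the result reflects the state at that first call.

```python
import os
import tempfile


def tmpdir_was_ignored():
    """Hypothetical helper: True if $TMPDIR is set but tempfile chose another dir."""
    requested = os.environ.get("TMPDIR")
    if not requested:
        # Nothing was requested, so nothing could have been ignored.
        return False
    effective = tempfile.gettempdir()
    # Compare resolved paths so symlinks and trailing slashes don't matter.
    return os.path.realpath(requested) != os.path.realpath(effective)


if __name__ == "__main__":
    if tmpdir_was_ignored():
        print(
            f"TMPDIR={os.environ['TMPDIR']!r} was ignored; "
            f"tempfile is using {tempfile.gettempdir()!r}"
        )
```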

So I recommend datasets do one of these two things:

  1. assert if $TMPDIR dir doesn't exist, telling the user to create it
  2. auto-create it
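
Either option could look roughly like the sketch below. It is an illustration under assumptions, not a proposed patch: ensure_tmpdir and its create flag are hypothetical names, and the 0o700 mode is just one cautious choice for a user-private directory.

```python
import os
import tempfile


def ensure_tmpdir(create=False):
    """Hypothetical helper: validate $TMPDIR before tempfile silently falls back.

    Returns the directory tempfile will use afterwards.
    """
    tmpdir = os.environ.get("TMPDIR")
    if tmpdir and not os.path.isdir(tmpdir):
        if not create:
            # Option 1: fail loudly and tell the user what to do.
            raise FileNotFoundError(
                f"TMPDIR={tmpdir!r} does not exist; create it with "
                f"`mkdir -p {tmpdir}` or unset TMPDIR."
            )
        # Option 2: create it (assumed mode: private to the current user).
        os.makedirs(tmpdir, mode=0o700, exist_ok=True)
    # tempfile caches its choice in tempfile.tempdir; clear it so the
    # next gettempdir() call re-reads the environment.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

Calling ensure_tmpdir() once at import time would cover option 1, and ensure_tmpdir(create=True) would cover option 2.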

The reason for (1) is that I don't know why tempfile doesn't auto-create the dir; perhaps there is some security implication? I will let you guys make the decision, but the key is not to let things silently fall through, leaving the user puzzled as to why, no matter what they do, they can't get past No space left on device while using datasets.

Thank you.

I found this via https://stackoverflow.com/questions/37229398/python-tempfile-gettempdir-does-not-respect-tmpdir while trying to help a colleague solve this exact issue.
