Skip to content

Avoid global umask for setting file mode. #7547

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 7 additions & 3 deletions src/datasets/utils/file_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
import posixpath
import re
import shutil
import stat
import sys
import tarfile
import time
Expand Down Expand Up @@ -411,11 +412,14 @@ def temp_file_manager(mode="w+b"):
# GET file object
fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)

# get the permissions of the temp file
temp_file_mode = stat.S_IMODE(os.stat(temp_file.name).st_mode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't necessarily return the user's umask no ?

we need the user's umask since some people use datasets with shared disks and use group permissions (see #2157 #5799)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When looking more at the history of this code, I am not sure the umask block below is needed at all anymore and could just be deleted.

with FileLock(lock_path):
if resume_download:
incomplete_path = cache_path + ".incomplete"
@contextmanager
def _resumable_file_manager():
with open(incomplete_path, "a+b") as f:
yield f
temp_file_manager = _resumable_file_manager
if os.path.exists(incomplete_path):
resume_size = os.stat(incomplete_path).st_size
else:
resume_size = 0
else:
temp_file_manager = partial(tempfile.NamedTemporaryFile, dir=cache_dir, delete=False)
resume_size = 0
# Download to temporary file, then copy to cache dir once finished.
# Otherwise you get corrupt cache entries if the download gets interrupted.
with temp_file_manager() as temp_file:
logger.info(f"{url} not found in cache or force_download set to True, downloading to {temp_file.name}")
# GET file object
if scheme == "ftp":
ftp_get(url, temp_file)
elif scheme not in ("http", "https"):
fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc)
else:
http_get(
url,
temp_file,
proxies=proxies,
resume_size=resume_size,
headers=headers,
cookies=cookies,
max_retries=max_retries,
desc=download_desc,
)
logger.info(f"storing {url} in cache at {cache_path}")
shutil.move(temp_file.name, cache_path)
umask = os.umask(0o666)
os.umask(umask)
os.chmod(cache_path, 0o666 & ~umask)
is the code just after the umask was added and NamedTemporaryFile was used to create the files. NamedTemporaryFile always set 600 permissions and didn't respect umask so umask code was added to fix the permissions.

d536e37 removed the NamedTemporaryFile and moved to open(), which respects the umask.

We should just be able to remove this temp_file_mode + updated chmod lines and then remove:

umask = os.umask(0o666)
os.umask(umask)
os.chmod(cache_path, 0o666 & ~umask)

Thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhoestq - I updated the PR to remove this outdated code, seems to work as expected and umask is respected (see output below from looking at files produced from tests/test_file_utils.py).

If this looks good to you, can we get this merged for 3.5.2?

Default umask:

$ ls -la | grep a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0
-rw-r--r--@ 1 ryan  staff   39 May  5 13:14 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847
-rw-r--r--@ 1 ryan  staff   35 May  5 13:14 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847.json
-rw-r--r--@ 1 ryan  staff    0 May  5 13:14 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847.lock

Modified to umask 002:

$ ls -la | grep a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0
-rw-rw-r--@ 1 ryan  staff   39 May  5 13:15 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847
-rw-rw-r--@ 1 ryan  staff   35 May  5 13:15 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847.json
-rw-rw-r--@ 1 ryan  staff    0 May  5 13:15 a6ecb420baa10e46bf84ed0960dee4e9093fee1aa58dc0035db883c7050ae847.lock


logger.info(f"storing {url} in cache at {cache_path}")
shutil.move(temp_file.name, cache_path)
umask = os.umask(0o666)
os.umask(umask)
os.chmod(cache_path, 0o666 & ~umask)

# make sure make sure permissions on cache_path are the same as temp_file, shutil.move may not preserve them
os.chmod(cache_path, temp_file_mode)

logger.info(f"creating metadata file for {cache_path}")
meta = {"url": url, "etag": etag}
Expand Down