
Conversation

@liam-lloyd

In the rclone model's _ensure_container_exists method, rclone ls is used to check whether a container exists on an rclone remote. If not, rclone mkdir is used to create it. However, rclone mkdir returns success without making any changes if the directory already exists, so calling rclone ls and then rclone mkdir if the container doesn't exist is equivalent to just calling rclone mkdir. Further, rclone ls can take quite some time to respond if the remote has a large number of items in its root directory. This commit removes the ls call and updates _ensure_container_exists to simply call mkdir in all cases.
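
For illustration, here is a minimal sketch of the before/after shape of the change. The helper name _execute_rclone_subcommand appears in the traceback later in this thread; the attribute names and the exact ls check are assumptions, not the actual Storage Service code:

    # Hypothetical before: probe the remote, then create the container only if missing.
    def _ensure_container_exists(self):
        # rclone ls can be very slow on remotes with many items in the root.
        stdout = self._execute_rclone_subcommand(["ls", self.remote_prefix])
        if self.container not in stdout:
            self._execute_rclone_subcommand(["mkdir", f"{self.remote_prefix}{self.container}"])

    # Hypothetical after: rclone mkdir is a no-op for an existing directory,
    # so it can be called unconditionally.
    def _ensure_container_exists(self):
        self._execute_rclone_subcommand(["mkdir", f"{self.remote_prefix}{self.container}"])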

@liam-lloyd
Author

For reference, the rclone mkdir docs: https://rclone.org/commands/rclone_mkdir/

@liam-lloyd
Author

This addresses archivematica/Issues#1743

- subprocess.assert_called_with(["mkdir", "testremote:testcontainer"])
+ args, _ = subprocess.Popen.call_args
+ assert args[0] == ["rclone", "mkdir", "testremote:testcontainer"]
@liam-lloyd
Author

I moved this assertion out of the with block because pytest.raises() blocks stop executing at the point the exception is raised, so this assertion was never running. I also updated the assertion to validate the call I think we intend to validate here.
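
For context, code placed after the raising call inside a pytest.raises() block never executes, so such assertions pass vacuously. A minimal illustration with hypothetical names, not the actual test:

    import pytest

    def ensure_container_exists():
        raise RuntimeError("mkdir failed")

    def test_mkdir_failure():
        with pytest.raises(RuntimeError):
            ensure_container_exists()
            assert 1 == 2  # never runs: the raise above exits the with block
        # Assertions about the call belong here, after the with block.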

@replaceafill
Member

@liam-lloyd I noticed the linting job failed. If it's of any help, the Archivematica-related repositories have configuration files for pre-commit. Once you install pre-commit, you can cd into any Archivematica repository and run pre-commit install, which sets up a git hook that can help you detect this type of problem.

@liam-lloyd force-pushed the dev/issue-1743-remove-unnecessary-rclone-ls-call branch from ceaaec5 to c6ea564 on July 22, 2025 00:06
@sarah-mason added the Community label (pull requests contributed from community members outside Artefactual) on July 24, 2025
@liam-lloyd force-pushed the dev/issue-1743-remove-unnecessary-rclone-ls-call branch from c6ea564 to 03d3807 on July 24, 2025 16:46
@liam-lloyd force-pushed the dev/issue-1743-remove-unnecessary-rclone-ls-call branch from 03d3807 to 03f8ada on September 4, 2025 21:54
@replaceafill
Member

replaceafill commented Oct 18, 2025

@liam-lloyd During testing we found this change can break transfer source locations when the following conditions hold:

  • The RClone space of the location is set up with an S3 backend
  • The IAM role associated with the access key of the S3 backend has no write permissions on the buckets

I've created a new integration test using MinIO to demonstrate this scenario. Let me break it down:

  1. Unlike the rest of the integration tests, this one uses an access key with a very limited policy applied:
    monkeypatch.setenv("RCLONE_CONFIG_MYS3_ACCESS_KEY_ID", "minio-user")
    monkeypatch.setenv("RCLONE_CONFIG_MYS3_SECRET_ACCESS_KEY", "minio-password")
  2. It creates an RClone space and a transfer source location in it:
    # Create space.
    resp = client.add_space(
        {
            "access_protocol": Space.RCLONE,
            "path": "/",
            "staging_path": "/var/archivematica/storage_service",
            "remote_name": "mys3",
            "container": s3_browse_bucket,
        }
    )
    assert resp.status_code == 201
    space = json.loads(resp.content)
    # Create transfer source location.
    resp = client.add_location(
        {
            "relative_path": "ts",
            "purpose": Location.TRANSFER_SOURCE,
            "space": space["resource_uri"],
            "pipeline": [pipeline["resource_uri"]],
        }
    )
    assert resp.status_code == 201
    location = json.loads(resp.content)
  3. The container of the location is associated with a pre-populated bucket.
    objects = [
        "ts/file1.txt",
        "ts/file2.txt",
        "ts/dir1/file3.txt",
        "ts/dir2/file4.txt",
        "ts/dir2/file5.txt",
        "ts/dir2/subdir1/file6.txt",
        "ts/dir2/subdir2/file7.txt",
        "ts/dir2/subdir2/subdir3/file8.txt",
    ]
    for key in objects:
        s3_resource.Object(bucket_name, key).put(Body=b"content")
  4. The test browses a "subdirectory" in the transfer source location:
    resp = client.browse_location(
        uuid.UUID(location["uuid"]), {"path": base64.b64encode(b"/ts/dir2").decode()}
    )
    assert resp.status_code == 200
    browse_result = json.loads(resp.content)
    entries = {base64.b64decode(entry).decode() for entry in browse_result["entries"]}
    directories = {
        base64.b64decode(directory).decode()
        for directory in browse_result["directories"]
    }
    assert entries == {"file4.txt", "file5.txt", "subdir1", "subdir2"}
    assert directories == {"subdir1", "subdir2"}
  5. The rest simulates the beginning of the transfer process, where the Archivematica pipeline asks the Storage Service to copy a set of files from the transfer source location to an internal processing location directory, and ensures the file is successfully copied and readable:
    # File content is set in the s3_browse_bucket fixture.
    assert (transfer_dir / source_file_name).read_text() == "content"

The test passes in that pull request, and if I cherry-pick your commit on top of the branch, it fails with:

DEBUG    archivematica.storage_service.locations.models.rclone:rclone.py:117 rclone remote selected: mys3:
DEBUG    archivematica.storage_service.locations.models.rclone:rclone.py:74 rclone cmd: ['rclone', 'mkdir', 'mys3:storage-service-browse-935398cc88774cefa9b615dc1565a17f']
DEBUG    archivematica.storage_service.locations.models.rclone:rclone.py:75 rclone stdout: 
WARNING  archivematica.storage_service.locations.models.rclone:rclone.py:77 rclone stderr: 2025/10/17 19:43:15 NOTICE: Config file "/var/archivematica/.config/rclone/rclone.conf" not found - using defaults
2025/10/17 19:43:15 ERROR : Attempt 1/3 failed with 1 errors and: AccessDenied: Access Denied.
        status code: 403, request id: 186F75B4177B0870, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
2025/10/17 19:43:15 ERROR : Attempt 2/3 failed with 1 errors and: AccessDenied: Access Denied.
        status code: 403, request id: 186F75B4178FA58E, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
2025/10/17 19:43:15 ERROR : Attempt 3/3 failed with 1 errors and: AccessDenied: Access Denied.
        status code: 403, request id: 186F75B4179D7F65, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8
2025/10/17 19:43:15 Failed to mkdir: AccessDenied: Access Denied.
        status code: 403, request id: 186F75B4179D7F65, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8

ERROR    archivematica.storage_service.locations.models.rclone:rclone.py:106 Error running rclone command. Command called: ['rclone', 'mkdir', 'mys3:storage-service-browse-935398cc88774cefa9b615dc1565a17f']. Details: ('rclone returned non-zero return code: %s. stderr: %s', 1, '2025/10/17 19:43:15 NOTICE: Config file "/var/archivematica/.config/rclone/rclone.conf" not found - using defaults\n2025/10/17 19:43:15 ERROR : Attempt 1/3 failed with 1 errors and: AccessDenied: Access Denied.\n\tstatus code: 403, request id: 186F75B4177B0870, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8\n2025/10/17 19:43:15 ERROR : Attempt 2/3 failed with 1 errors and: AccessDenied: Access Denied.\n\tstatus code: 403, request id: 186F75B4178FA58E, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8\n2025/10/17 19:43:15 ERROR : Attempt 3/3 failed with 1 errors and: AccessDenied: Access Denied.\n\tstatus code: 403, request id: 186F75B4179D7F65, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8\n2025/10/17 19:43:15 Failed to mkdir: AccessDenied: Access Denied.\n\tstatus code: 403, request id: 186F75B4179D7F65, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8\n')
ERROR    archivematica.storage_service.locations.models.rclone:rclone.py:132 Unable to find or create container mys3:storage-service-browse-935398cc88774cefa9b615dc1565a17f
ERROR    django.request:log.py:246 Internal Server Error: /api/v2/location/a970ada2-1afc-49b2-994c-5146cf649f43/browse/
Traceback (most recent call last):
  File "/src/src/archivematica/storage_service/locations/models/rclone.py", line 94, in _execute_rclone_subcommand
    raise StorageException(
archivematica.storage_service.locations.models.StorageException: ('rclone returned non-zero return code: %s. stderr: %s', 1, '2025/10/17 19:43:15 NOTICE: Config file "/var/archivematica/.config/rclone/rclone.conf" not found - using defaults\n2025/10/17 19:43:15 ERROR : Attempt 1/3 failed with 1 errors and: AccessDenied: Access Denied.\n\tstatus code: 403, request id: 186F75B4177B0870, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8\n2025/10/17 19:43:15 ERROR : Attempt 2/3 failed with 1 errors and: AccessDenied: Access Denied.\n\tstatus code: 403, request id: 186F75B4178FA58E, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8\n2025/10/17 19:43:15 ERROR : Attempt 3/3 failed with 1 errors and: AccessDenied: Access Denied.\n\tstatus code: 403, request id: 186F75B4179D7F65, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8\n2025/10/17 19:43:15 Failed to mkdir: AccessDenied: Access Denied.\n\tstatus code: 403, request id: 186F75B4179D7F65, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8\n')

@mamedin and I have been testing other approaches to check whether the container exists, like the lsjson command with --max-depth 1 (a rough sketch of that idea is below), but I'm wondering if you can think of a better approach.
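
For what it's worth, a rough sketch of that lsjson probe, assuming the _execute_rclone_subcommand helper from the traceback above returns rclone's stdout (the method and attribute names here are illustrative, not the actual code):

    import json

    def _container_exists(self):
        # List only the top level of the remote; --dirs-only keeps the output small.
        stdout = self._execute_rclone_subcommand(
            ["lsjson", "--max-depth", "1", "--dirs-only", self.remote_prefix]
        )
        return any(entry["Name"] == self.container for entry in json.loads(stdout))

This still pays the cost of a listing call, but it is bounded to the remote's root entries rather than the full recursive ls the original code performed.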
