Make mirror_file fail if file object already exists (#7134) #7141

Merged

Conversation

nadove-ucsc
Contributor

@nadove-ucsc nadove-ucsc commented May 19, 2025

Connected issues: #7134

Checklist

Author

  • PR is a draft
  • Target branch is develop
  • Name of PR branch matches issues/<GitHub handle of author>/<issue#>-<slug>
  • On ZenHub, PR is connected to all issues it (partially) resolves
  • PR description links to connected issues
  • PR title matches [1] that of a connected issue or comment in PR explains why they're different
  • PR title references all connected issues
  • For each connected issue, there is at least one commit whose title references that issue

[1] When the issue title describes a problem, the corresponding PR title is Fix: followed by the issue title

Author (partiality)

  • Added p tag to titles of partial commits
  • This PR is labeled partial or completely resolves all connected issues
  • This PR partially resolves each of the connected issues or does not have the partial label

Author (chains)

  • This PR is blocked by previous PR in the chain or is not chained to another PR
  • The blocking PR is labeled base or this PR is not chained to another PR
  • This PR is labeled chained or is not chained to another PR

Author (reindex, API changes)

  • Added r tag to commit title or the changes introduced by this PR will not require reindexing of any deployment
  • This PR is labeled reindex:dev or the changes introduced by it will not require reindexing of dev
  • This PR is labeled reindex:anvildev or the changes introduced by it will not require reindexing of anvildev
  • This PR is labeled reindex:anvilprod or the changes introduced by it will not require reindexing of anvilprod
  • This PR is labeled reindex:prod or the changes introduced by it will not require reindexing of prod
  • This PR is labeled reindex:partial and its description documents the specific reindexing procedure for dev, anvildev, anvilprod and prod or requires a full reindex or carries none of the labels reindex:dev, reindex:anvildev, reindex:anvilprod and reindex:prod
  • This PR and its connected issues are labeled API or this PR does not modify a REST API
  • Added a (A) tag to commit title for backwards (in)compatible changes or this PR does not modify a REST API
  • Updated REST API version number in app.py or this PR does not modify a REST API

Author (upgrading deployments)

  • Ran make docker_images.json and committed the resulting changes or this PR does not modify azul_docker_images, or any other variables referenced in the definition of that variable
  • Documented upgrading of deployments in UPGRADING.rst or this PR does not require upgrading deployments
  • Added u tag to commit title or this PR does not require upgrading deployments
  • This PR is labeled upgrade or does not require upgrading deployments
  • This PR is labeled deploy:shared or does not modify docker_images.json, and does not require deploying the shared component for any other reason
  • This PR is labeled deploy:gitlab or does not require deploying the gitlab component
  • This PR is labeled deploy:runner or does not require deploying the runner image

Author (hotfixes)

  • Added F tag to main commit title or this PR does not include permanent fix for a temporary hotfix
  • Reverted the temporary hotfixes for any connected issues or none of the stable branches (anvilprod and prod) have temporary hotfixes for any of the issues connected to this PR

Author (before every review)

  • Rebased PR branch on develop, squashed old fixups
  • Ran make requirements_update or this PR does not modify requirements*.txt, common.mk, Makefile and Dockerfile
  • Added R tag to commit title or this PR does not modify requirements*.txt
  • This PR is labeled reqs or does not modify requirements*.txt
  • make integration_test passes in personal deployment or this PR does not modify functionality that could affect the IT outcome

Peer reviewer (after approval)

  • PR is marked as approved
  • PR is not a draft
  • Ticket is in Review requested column
  • PR is awaiting requested review from system administrator
  • PR is assigned to only the system administrator

System administrator (after approval)

  • Actually approved the PR
  • Labeled connected issues as demo or no demo
  • Commented on connected issues about demo expectations or all connected issues are labeled no demo
  • Decided if PR can be labeled no sandbox
  • A comment to this PR details the completed security design review
  • PR title is appropriate as title of merge commit
  • N reviews label is accurate
  • Moved connected issues to Approved column
  • PR is assigned to only the operator

Operator (before pushing the merge commit)

  • Checked reindex:… labels and r commit title tag
  • Checked that demo expectations are clear or all connected issues are labeled no demo
  • Squashed PR branch and rebased onto develop
  • Sanity-checked history
  • Pushed PR branch to GitHub
  • Ran _select dev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select dev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Ran _select anvildev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unused or this PR is not labeled deploy:shared
  • Ran _select anvildev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab apply or this PR is not labeled deploy:gitlab
  • Checked the items in the next section or this PR is labeled deploy:gitlab
  • PR is assigned to only the system administrator or this PR is not labeled deploy:gitlab

System administrator

  • Background migrations for dev.gitlab are complete or this PR is not labeled deploy:gitlab
  • Background migrations for anvildev.gitlab are complete or this PR is not labeled deploy:gitlab
  • PR is assigned to only the operator

Operator (before pushing the merge commit)

  • Ran _select dev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Ran _select anvildev.gitlab && make -C terraform/gitlab/runner or this PR is not labeled deploy:runner
  • Added sandbox label or PR is labeled no sandbox
  • Pushed PR branch to GitLab dev or PR is labeled no sandbox
  • Pushed PR branch to GitLab anvildev or PR is labeled no sandbox
  • Build passes in sandbox deployment or PR is labeled no sandbox
  • Build passes in anvilbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in sandbox deployment or PR is labeled no sandbox
  • Reviewed build logs for anomalies in anvilbox deployment or PR is labeled no sandbox
  • Deleted unreferenced indices in sandbox or this PR does not remove catalogs or otherwise cause unreferenced indices in dev
  • Deleted unreferenced indices in anvilbox or this PR does not remove catalogs or otherwise cause unreferenced indices in anvildev
  • Started reindex in sandbox or this PR is not labeled reindex:dev
  • Started reindex in anvilbox or this PR is not labeled reindex:anvildev
  • Checked for failures in sandbox or this PR is not labeled reindex:dev
  • Checked for failures in anvilbox or this PR is not labeled reindex:anvildev
  • Confirmed all checks in PR are OK and the PR is mergeable
  • The title of the merge commit starts with the title of this PR
  • Added PR # reference to merge commit title
  • Collected commit title tags in merge commit title but only included p if the PR is also labeled partial
  • Moved connected issues to Merged lower column in ZenHub
  • Moved blocked issues to Triage or no issues are blocked on the connected issues
  • Pushed merge commit to GitHub

Operator (chain shortening)

  • Changed the target branch of the blocked PR to develop or this PR is not labeled base
  • Removed the chained label from the blocked PR or this PR is not labeled base
  • Removed the blocking relationship from the blocked PR or this PR is not labeled base
  • Removed the base label from this PR or this PR is not labeled base

Operator (after pushing the merge commit)

  • Pushed merge commit to GitLab dev
  • Pushed merge commit to GitLab anvildev
  • Build passes on GitLab dev
  • Reviewed build logs for anomalies on GitLab dev
  • Build passes on GitLab anvildev
  • Reviewed build logs for anomalies on GitLab anvildev
  • Ran _select dev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Ran _select anvildev.shared && make -C terraform/shared apply or this PR is not labeled deploy:shared
  • Deleted PR branch from GitHub
  • Deleted PR branch from GitLab dev
  • Deleted PR branch from GitLab anvildev

Operator (reindex)

  • Deindexed all unreferenced catalogs in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed all unreferenced catalogs in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Deindexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Deindexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Indexed specific sources in dev or this PR is neither labeled reindex:partial nor reindex:dev
  • Indexed specific sources in anvildev or this PR is neither labeled reindex:partial nor reindex:anvildev
  • Started reindex in dev or this PR does not require reindexing dev
  • Started reindex in anvildev or this PR does not require reindexing anvildev
  • Checked for, triaged and possibly requeued messages in both fail queues in dev or this PR does not require reindexing dev
  • Checked for, triaged and possibly requeued messages in both fail queues in anvildev or this PR does not require reindexing anvildev
  • Emptied fail queues in dev or this PR does not require reindexing dev
  • Emptied fail queues in anvildev or this PR does not require reindexing anvildev
  • Restarted the Data Browser pipeline for the ucsc/hca/dev branch on GitLab in dev or this PR does not require reindexing dev
  • Restarted the Data Browser pipeline for the ucsc/lungmap/dev branch on GitLab in dev or this PR does not require reindexing dev
  • Restarted deploy_browser job in the GitLab pipeline for this PR in dev or this PR does not require reindexing dev
  • Restarted the Data Browser pipeline for the ucsc/anvil/anvildev branch on GitLab in anvildev or this PR does not require reindexing anvildev
  • Restarted deploy_browser job in the GitLab pipeline for this PR in anvildev or this PR does not require reindexing anvildev

Operator

  • Propagated the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels to the next promotion PRs or this PR carries none of these labels
  • Propagated any specific instructions related to the deploy:shared, deploy:gitlab, deploy:runner, API, reindex:partial, reindex:anvilprod and reindex:prod labels, from the description of this PR to that of the next promotion PRs or this PR carries none of these labels
  • PR is assigned to no one

Shorthand for review comments

  • L line is too long
  • W line wrapping is wrong
  • Q bad quotes
  • F other formatting problem

@github-actions github-actions bot added the orange [process] Done by the Azul team label May 19, 2025
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7134-mirror-file-fail-object-exists branch from 2aa836e to 5ea6493 Compare May 20, 2025 02:54

codecov bot commented May 20, 2025

Codecov Report

Attention: Patch coverage is 81.35593% with 11 lines in your changes missing coverage. Please review.

Project coverage is 85.22%. Comparing base (0088ec9) to head (f0d2f2b).
Report is 7 commits behind head on develop.

Files with missing lines                 Patch %   Lines
src/azul/service/storage_service.py      60.86%    9 Missing ⚠️
src/azul/indexer/mirror_controller.py    75.00%    1 Missing ⚠️
src/azul/indexer/mirror_service.py       92.85%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7141      +/-   ##
===========================================
- Coverage    85.24%   85.22%   -0.02%     
===========================================
  Files          152      152              
  Lines        22060    22099      +39     
===========================================
+ Hits         18804    18834      +30     
- Misses        3256     3265       +9     


@coveralls

coveralls commented May 20, 2025

Coverage Status

coverage: 85.402% (-0.02%) from 85.417%
when pulling f0d2f2b on issues/nadove-ucsc/7134-mirror-file-fail-object-exists
into 0088ec9 on develop.

Contributor

@dsotirho-ucsc dsotirho-ucsc left a comment

If I understand this correctly, an attempt to mirror a file that already exists in the destination will cause S3 to return a 412 (due to IfNoneMatch = *). However, there is no R assertion as described in the ticket description. Was it decided that the 412 was sufficient?

Also, the two parts of the reupload subtest both start by calling self._s3.delete_object(). Shouldn't this include a test where the file is not deleted and then reuploaded?
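
For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of what a write conditioned on If-None-Match: * does. It uses a toy in-memory bucket, not the S3 API; the class names, the `if_none_match` parameter and the `file/abc` key are all assumptions made up for illustration. Against real S3 the rejection surfaces as a client error with the code PreconditionFailed.

```python
class PreconditionFailed(Exception):
    """Hypothetical stand-in for the error a failed conditional write produces."""


class InMemoryBucket:
    """Toy model of a bucket, used only to illustrate the semantics."""

    def __init__(self):
        self._objects = {}

    def put_object(self, key, body, if_none_match=None):
        # With If-None-Match: *, the write succeeds only if no object
        # currently exists under the key.
        if if_none_match == '*' and key in self._objects:
            raise PreconditionFailed(key)
        self._objects[key] = body

    def get_object(self, key):
        return self._objects[key]


bucket = InMemoryBucket()
bucket.put_object('file/abc', b'first', if_none_match='*')
try:
    # Second conditional write to the same key must not overwrite
    bucket.put_object('file/abc', b'second', if_none_match='*')
except PreconditionFailed:
    outcome = 'rejected'
else:
    outcome = 'overwritten'
```

In this model an unconditional `put_object` still overwrites, which is why the condition has to be requested explicitly by the caller.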

@dsotirho-ucsc dsotirho-ucsc removed their assignment May 20, 2025
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7134-mirror-file-fail-object-exists branch 2 times, most recently from 12ff57e to e5cc410 Compare May 21, 2025 02:12
@nadove-ucsc
Contributor Author

> Also, the two parts of the reupload subtest both start by calling self._s3.delete_object(). Shouldn't this include a test where the file is not deleted and then reuploaded?

There are two objects, the file object and the info object. If the info object is present, the mirror service will skip trying to upload the file object, so all subtests after the first will skip the parts of the code we need coverage for. The two ways to circumvent this are to delete the info object or patch the method that checks for it. For some reason I had concluded the former was preferable but I can't remember my reasoning so I switched back to the latter.
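
The two workarounds described above can be sketched roughly as follows. `ToyMirrorService` is a hypothetical stand-in written for this illustration, not the project's `MirrorService`; only the control flow (skip the upload when the info object is present) is taken from the comment. The patching uses `unittest.mock.patch.object` from the standard library.

```python
from unittest.mock import patch


class ToyMirrorService:
    """Hypothetical stand-in; only the skip-if-info-exists flow matters."""

    def __init__(self):
        self.objects = set()  # keys present in the toy bucket
        self.uploads = 0

    def info_exists(self, file):
        return f'info/{file}' in self.objects

    def mirror_file(self, file):
        if self.info_exists(file):
            return 'skipped'
        self.uploads += 1
        self.objects.update({f'file/{file}', f'info/{file}'})
        return 'uploaded'


service = ToyMirrorService()
first = service.mirror_file('a')   # uploads the file and info objects
second = service.mirror_file('a')  # info object present, upload is skipped

# Option 1: patch the check so that the upload path runs again
with patch.object(ToyMirrorService, 'info_exists', return_value=False):
    patched = service.mirror_file('a')

# Option 2: delete the info object so that the real check returns False
service.objects.discard('info/a')
deleted = service.mirror_file('a')
```

Either option reaches the upload code on a repeated call. The toy deliberately leaves out what happens when the file object is already present, which is the behavior this PR is about.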

dsotirho-ucsc
dsotirho-ucsc previously approved these changes May 21, 2025
Contributor

@dsotirho-ucsc dsotirho-ucsc left a comment

Approved.

@dsotirho-ucsc dsotirho-ucsc marked this pull request as ready for review May 21, 2025 16:36
@@ -181,12 +198,16 @@ def upload(self,

     def _object_creation_kwargs(self, *,
                                 content_type: str | None = None,
-                                tagging: Tagging | None = None):
+                                tagging: Tagging | None = None,
+                                exists_okay: bool = True
Member

Suggested change
- exists_okay: bool = True
+ overwrite: bool = True

and everywhere else.

                     **kwargs)
         except botocore.exceptions.ClientError as e:
             error = e.response['Error']
             if error['Code'] == 'PreconditionFailed' and error['Condition'] == 'If-None-Match':
Member

L

@hannes-ucsc hannes-ucsc added the 1 review [process] Lead requested changes once label May 21, 2025
@hannes-ucsc hannes-ucsc removed their assignment May 21, 2025
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7134-mirror-file-fail-object-exists branch from e5cc410 to fc3d6c1 Compare May 21, 2025 19:03
@nadove-ucsc nadove-ucsc requested a review from hannes-ucsc May 21, 2025 20:51
         parts = [
             {
                 'PartNumber': index + 1,
                 'ETag': etag
             }
             for index, etag in enumerate(etags)
         ]
-        upload.complete(MultipartUpload={'Parts': parts})
+        upload.complete(MultipartUpload={'Parts': parts},
Member

That same exception needs to be handled here as well.

Additionally, we don't want to create a big MP upload only to then realize at the end that the object already exists. The proper way to deal with this for large files is to check explicitly with HeadObject. I don't know if PutObject with If-None-Match fails early before processing the entire request body or after. Unless you can find documentation about this, please add the HeadObject for small files as well.
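
A rough sketch of the check-first approach requested above, assuming a toy store in place of S3. `ToyStore`, its `head`, `upload_part` and `complete` methods, and `upload_without_overwrite` are hypothetical names invented for this illustration; `head` plays the role of HeadObject and `complete` the role of completing a multipart upload with an If-None-Match condition.

```python
class ObjectExists(Exception):
    """Hypothetical error raised when the destination object is present."""


class ToyStore:
    """Toy model of a bucket that counts how many parts were transferred."""

    def __init__(self):
        self.objects = {}
        self.parts_sent = 0

    def head(self, key):
        # Stand-in for an existence check such as HeadObject
        return key in self.objects

    def upload_part(self, part):
        self.parts_sent += 1
        return part

    def complete(self, key, parts, if_none_match=None):
        # Stand-in for a conditional multipart completion
        if if_none_match == '*' and key in self.objects:
            raise ObjectExists(key)
        self.objects[key] = b''.join(parts)


def upload_without_overwrite(store, key, parts):
    # Cheap early check: transfer nothing if the object already exists
    if store.head(key):
        raise ObjectExists(key)
    sent = [store.upload_part(part) for part in parts]
    # The conditional completion still covers an object that appeared
    # between the early check and the end of the upload
    store.complete(key, sent, if_none_match='*')


store = ToyStore()
upload_without_overwrite(store, 'file/abc', [b'lorem ', b'ipsum'])
try:
    upload_without_overwrite(store, 'file/abc', [b'dolor'])
except ObjectExists:
    second_attempt = 'rejected before any part was sent'
```

The early check saves the transfer cost in the common case, while the condition on completion remains the only guard that is free of a race between check and write.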

@hannes-ucsc hannes-ucsc removed their assignment May 21, 2025
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7134-mirror-file-fail-object-exists branch 2 times, most recently from f36a492 to d3318e2 Compare May 23, 2025 07:43
@nadove-ucsc nadove-ucsc requested a review from hannes-ucsc May 23, 2025 08:08
Member

@hannes-ucsc hannes-ucsc left a comment

Subject: [PATCH] Clean-up mirroring fixture duplication
---
Index: test/indexer/test_mirror_controller.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/test_mirror_controller.py b/test/indexer/test_mirror_controller.py
--- a/test/indexer/test_mirror_controller.py	(revision 1c0a00197edc274bb9804375a59b3ecf3a9dfe3a)
+++ b/test/indexer/test_mirror_controller.py	(date 1748042321516)
@@ -78,12 +78,14 @@
                     with self.subTest('mirror_file', corrupted=False, exists=False):
                         self._test_mirror_file(file, file_message)
 
-                    # Force reupload attempts even if the info object is present
-                    with patch.object(MirrorService, 'info_exists', return_value=False):
-                        with self.subTest('mirror_file', corrupted=True):
-                            self._test_corrupted_download(file_message)
-                        with self.subTest('mirror_file', corrupted=False, exists=True):
-                            self._test_reuploaded_file(file_message)
+                    self._s3.delete_object(Bucket=self.mirror_bucket,
+                                           Key=self.mirror_controller.service.info_object_key(file))
+
+                    with self.subTest('mirror_file', corrupted=True):
+                        self._test_corrupted_download(file_message)
+
+                    with self.subTest('mirror_file', corrupted=False, exists=True):
+                        self._test_reuploaded_file(file_message)
 
     _file_contents = b'lorem ipsum dolor sit\n'
 
Index: src/azul/indexer/mirror_controller.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/indexer/mirror_controller.py b/src/azul/indexer/mirror_controller.py
--- a/src/azul/indexer/mirror_controller.py	(revision 1c0a00197edc274bb9804375a59b3ecf3a9dfe3a)
+++ b/src/azul/indexer/mirror_controller.py	(date 1748042570717)
@@ -152,12 +152,12 @@
         deployment_is_stable = (config.deployment.is_stable
                                 and not config.deployment.is_unit_test
                                 and catalog not in config.integration_test_catalogs)
-        if self.service.info_exists(catalog, file):
+        if file_is_large and not deployment_is_stable:
+            log.info('Not mirroring file to save cost: %r', file)
+        elif self.service.info_exists(catalog, file):
             log.info('File is already mirrored, skipping upload: %r', file)
         elif self.service.file_exists(catalog, file):
             assert False, R('File object is already present', file)
-        elif file_is_large and not deployment_is_stable:
-            log.info('Not mirroring file to save cost: %r', file)
         else:
             # Ensure we test with multiple parts on lower deployments
             part_size = FilePart.default_size if deployment_is_stable else FilePart.min_size
Index: src/azul/service/storage_service.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/service/storage_service.py b/src/azul/service/storage_service.py
--- a/src/azul/service/storage_service.py	(revision 1c0a00197edc274bb9804375a59b3ecf3a9dfe3a)
+++ b/src/azul/service/storage_service.py	(date 1748043612861)
@@ -77,6 +77,10 @@
     pass
 
 
+class StorageObjectExists(Exception):
+    pass
+
+
 class StorageService:
 
     def __init__(self, bucket_name: str | None = None):
@@ -94,7 +98,8 @@
                                         Key=object_key)
         except self._s3.exceptions.ClientError as e:
             if int(e.response['Error']['Code']) == 404:
-                raise StorageObjectNotFound
+                # REVIEW: separate commit
+                raise StorageObjectNotFound(object_key)
             else:
                 raise e
 
@@ -103,7 +108,8 @@
             response = self._s3.get_object(Bucket=self.bucket_name,
                                            Key=object_key)
         except self._s3.exceptions.NoSuchKey:
-            raise StorageObjectNotFound
+            # REVIEW: same commit as above
+            raise StorageObjectNotFound(object_key)
         else:
             return response['Body'].read()
 
@@ -309,7 +315,7 @@
         error = exception.response['Error']
         code, condition = error['Code'], error['Condition']
         if code == 'PreconditionFailed' and condition == 'If-None-Match':
-            assert False, R('Object exists', object_key)
+            raise StorageObjectExists(object_key)
         else:
             raise exception
 

The last two commits should either be squashed or made into a split commit. Please change the commit title "StorageService supports IfNoneMatch param" to "Optionally prevent StorageService from overwriting objects".
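
The branch order that the patch above proposes for mirror_controller.py can be restated as a small pure function. This is a sketch under assumptions: `mirror_action` and its boolean parameters are hypothetical, the real checks are method calls on the service, and a plain `AssertionError` stands in for the `assert False, R(...)` statement in the diff.

```python
def mirror_action(*, file_is_large, deployment_is_stable, info_exists, file_exists):
    """Return the action for one file, using the patch's branch order."""
    if file_is_large and not deployment_is_stable:
        # Checked first, so the existence checks are not reached
        return 'skip: save cost'
    elif info_exists:
        return 'skip: already mirrored'
    elif file_exists:
        # A file object without an info object is treated as an error
        raise AssertionError('File object is already present')
    else:
        return 'upload'


# A large file on an unstable deployment is skipped before either
# existence check is evaluated
large = mirror_action(file_is_large=True,
                      deployment_is_stable=False,
                      info_exists=False,
                      file_exists=True)
```

With the cost check first, a large file on a lower deployment is skipped even when a stray file object is present, whereas the previous order would have hit the assertion in that case.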

@hannes-ucsc hannes-ucsc removed their assignment May 23, 2025
@nadove-ucsc nadove-ucsc force-pushed the issues/nadove-ucsc/7134-mirror-file-fail-object-exists branch from d3318e2 to 19410cd Compare May 24, 2025 03:26
@nadove-ucsc nadove-ucsc requested a review from hannes-ucsc May 24, 2025 03:54
@hannes-ucsc
Member

Security design review

  • Security design review completed; this PR does not
    • … affect authentication; for example:
      • OAuth 2.0 with the application (API or Swagger UI)
      • Authentication of developers with Google Cloud APIs
      • Authentication of developers with AWS APIs
      • Authentication with a GitLab instance in the system
      • Password and 2FA authentication with GitHub
      • API access token authentication with GitHub
      • Authentication with Terra
    • … affect the permissions of internal users like access to
      • Cloud resources on AWS and GCP
      • GitLab repositories, projects and groups, administration
      • an EC2 instance via SSH
      • GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
    • … affect the permissions of external users like access to
      • TDR snapshots
    • … affect permissions of service or bot accounts
      • Cloud resources on AWS and GCP
    • … affect audit logging in the system, like
      • adding, removing or changing a log message that represents an auditable event
      • changing the routing of log messages through the system
    • … affect monitoring of the system
    • … introduce a new software dependency like
      • Python packages on PYPI
      • Command-line utilities
      • Docker images
      • Terraform providers
    • … add an interface that exposes sensitive or confidential data at the security boundary
    • … affect the encryption of data at rest
    • … require persistence of sensitive or confidential data that might require encryption at rest
    • … require unencrypted transmission of data within the security boundary
    • … affect the network security layer; for example by
      • modifying, adding or removing firewall rules
      • modifying, adding or removing security groups
      • changing or adding a port a service, proxy or load balancer listens on
  • Documentation on any unchecked boxes is provided in comments below

@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/nadove-ucsc/7134-mirror-file-fail-object-exists branch from 19410cd to f0d2f2b Compare May 28, 2025 18:53
@dsotirho-ucsc dsotirho-ucsc added the sandbox [process] Resolution is being verified in sandbox deployment label May 28, 2025
@dsotirho-ucsc dsotirho-ucsc merged commit bac4526 into develop May 28, 2025
11 checks passed
@dsotirho-ucsc dsotirho-ucsc deleted the issues/nadove-ucsc/7134-mirror-file-fail-object-exists branch May 28, 2025 22:01
@dsotirho-ucsc dsotirho-ucsc removed their assignment May 28, 2025
Labels
1 review [process] Lead requested changes once orange [process] Done by the Azul team sandbox [process] Resolution is being verified in sandbox deployment
4 participants