
push: skip already existing files in the s3 bucket on push#158

Draft
mvo5 wants to merge 1 commit intoosbuild:mainfrom
mvo5:skip-already-uploaded

Conversation

Contributor

@mvo5 mvo5 commented Mar 11, 2026

[I'll leave this in draft for a bit; I'm using this code and will monitor its behavior. I tweaked it to use copy-to-self so that the timestamp gets updated. I suspect that is the reason we did the full upload before(?), i.e. to keep time-based expiration policies in S3 working.]

When we push data files (i.e. sha256-abcd) to the s3 bucket we
currently do this unconditionally. But if the file is already in
the bucket this is unnecessary. So try to "touch" it first and
only upload if the file is not already there. This saves some
time even on a fast connection (and bandwidth, of course).

Note that we use "touch" here because we want to make sure the
object metadata gets updated so that any time-based expiration
policies are still honored.
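The touch-then-upload flow described above can be sketched as follows. This is a minimal illustration, assuming a boto3 S3 client; the helper name `push_object` and its return values are my own, not from the PR, and the error handling is a plausible guess at how "object not there yet" would be detected.

```python
def push_object(s3c, bucket, key, local_path):
    """Upload local_path to s3://bucket/key unless the object exists.

    If the object already exists, copy it onto itself ("touch") so
    its metadata gets refreshed and any time-based expiration
    policies keep being honored. Returns "touched" or "uploaded".
    """
    try:
        # Copy-to-self acts as a "touch": it refreshes the object
        # without transferring the payload again.
        s3c.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            MetadataDirective="COPY",
        )
        return "touched"
    except s3c.exceptions.ClientError:
        # The copy failed, most likely because the object does not
        # exist yet: do the full upload.
        s3c.upload_file(local_path, bucket, key)
        return "uploaded"
```

The first push of a key pays the full upload cost; every later push of the same key is a single cheap copy request.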

@mvo5 mvo5 requested a review from a team as a code owner March 11, 2026 11:42
@mvo5 mvo5 requested review from achilleas-k, croissanne and ondrejbudai and removed request for a team March 11, 2026 11:42
achilleas-k previously approved these changes Mar 11, 2026
Member

@achilleas-k achilleas-k left a comment


There should be no case where a file with the same path/name needs to be updated since they're all content-addressable (IIRC), so this LGTM!

@mvo5 mvo5 marked this pull request as draft March 12, 2026 08:24
Contributor Author

mvo5 commented Mar 12, 2026

Actually - one reason why this might be desired is if an "older-than-n-days" policy is used to auto-delete objects. The constant rewrites ensure that the objects stay "fresh", I suppose. It's just a guess, but it might be the reason - if it is, we should add a comment :) I tweaked the code now to use s3c.copy_object() instead of head_object(), which should address this concern.

@mvo5 mvo5 force-pushed the skip-already-uploaded branch from b71426a to 19104b9 Compare March 13, 2026 09:16
@achilleas-k
Member

FTR, we don't have a retention policy, but this is a good idea regardless.

Comment on lines +53 to +59
s3c.copy_object(
Bucket="rpmrepo-storage",
Key=key,
CopySource={"Bucket": "rpmrepo-storage", "Key": key},
MetadataDirective="COPY",
)
print(f"[{i_total}/{n_total}] '{key}' (exists, touched)")
Member


I know that copying an object onto itself is a common way to touch files in S3, but a comment here would be helpful for the future.

In the images CI cache, I also set a metadata key touched=<timestamp>, because I think I read somewhere that without any metadata change the actual modification time of the object won't change. Not sure if I'm remembering that correctly.
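The variant suggested here can be sketched like this. It is a hedged illustration, not code from the PR or the images CI cache: the helper name `touch_object` and the `touched` metadata key are assumptions. The motivation is that S3 rejects a copy of an object onto itself when nothing about it changes, so supplying fresh metadata with MetadataDirective="REPLACE" makes the self-copy a real modification.

```python
import datetime


def touch_object(s3c, bucket, key):
    """Copy an S3 object onto itself with a fresh ``touched``
    metadata value, so the copy counts as a change and the object's
    last-modified time is updated. Returns the timestamp used."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    s3c.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata={"touched": stamp},
        # REPLACE tells S3 to use the Metadata given above instead
        # of carrying over the source object's metadata.
        MetadataDirective="REPLACE",
    )
    return stamp
```

Note that REPLACE discards any other user metadata on the object, which may or may not matter for content-addressed blobs like these.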
