push: skip already existing files in the s3 bucket on push #158
mvo5 wants to merge 1 commit into osbuild:main
Conversation
achilleas-k left a comment
There should be no case where a file with the same path/name needs to be updated since they're all content-addressable (IIRC), so this LGTM!
Force-pushed from e1c5570 to b71426a
When we push data files (i.e. sha256-abcd) to the s3 bucket we currently do this unconditionally. But if the file is already in the bucket this is unnecessary. So try to "touch" it first and only upload if the file is not already there. This saves some time even on a fast connection (and bandwidth, of course). Note that we use "touch" here because we want to make sure the object metadata gets updated so that any time-based expiration policies are still honored.
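The touch-then-upload flow described above can be sketched as follows. This is a sketch, not the PR's actual code: `push_file`, its return values, and the parameter names are illustrative, and `s3c` is assumed to be a boto3-style S3 client passed in by the caller.

```python
def push_file(s3c, bucket, key, path):
    """Upload path to s3://bucket/key, but skip the upload when the
    object already exists; existing objects are only "touched" so any
    time-based expiration policy sees a fresh timestamp.
    """
    try:
        # HEAD is cheap: no body transfer, just existence + metadata.
        s3c.head_object(Bucket=bucket, Key=key)
    except s3c.exceptions.ClientError:
        # Not in the bucket yet (HEAD reports 404 via ClientError):
        # do the full upload.
        s3c.upload_file(path, bucket, key)
        return "uploaded"
    # Copy-to-self "touch", mirroring the PR's diff; note that real S3
    # may reject an identity copy unless something (metadata, storage
    # class) actually changes -- see the review discussion below.
    s3c.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        MetadataDirective="COPY",
    )
    return "touched"
```

The HEAD request costs one round trip per object, which is far cheaper than re-uploading content that is already present.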
Force-pushed from b71426a to 19104b9
FTR, we don't have a retention policy, but this is a good idea regardless.
```python
s3c.copy_object(
    Bucket="rpmrepo-storage",
    Key=key,
    CopySource={"Bucket": "rpmrepo-storage", "Key": key},
    MetadataDirective="COPY",
)
print(f"[{i_total}/{n_total}] '{key}' (exists, touched)")
```
I know that copying an object onto itself is a common way to touch files in S3, but a comment here would be helpful for the future.
In the images CI cache, I also set a metadata key touched=<timestamp> because I think I read somewhere that without any metadata change, the actual modification time of the object won't change. Not sure if I'm remembering that correctly.
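A minimal sketch of that suggestion, assuming a boto3-style client; `touch_object` and the `touched` metadata key name are illustrative, not code from this PR:

```python
import datetime

def touch_object(s3c, bucket, key):
    # Copy the object onto itself with MetadataDirective="REPLACE" and a
    # fresh "touched" metadata value. Unlike the plain COPY directive,
    # this is a real change, so S3 accepts the identity copy and updates
    # the object's Last-Modified timestamp.
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    s3c.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        MetadataDirective="REPLACE",
        Metadata={"touched": stamp},
    )
    return stamp
```

One caveat with this approach: MetadataDirective="REPLACE" discards any existing user metadata on the object unless it is re-supplied in the Metadata argument.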
[I leave this in draft for a bit; I'm using this code and will monitor its behavior. I tweaked it to use copy-to-self so that the timestamp gets updated. I suspect that is why we did the full upload before(?), i.e. to keep time-based expiration policies in S3 working.]