Hi, before investing time in an implementation I wanted to check whether this direction aligns with the project's goals.
Problem
When running collectstatic with S3Storage, every file is re-uploaded on each run regardless of whether its content has changed. For projects with a non-trivial number of static assets, this adds 2–3+ minutes to every deployment.
I verified this by reading through the current _save() and exists() implementations — _save() always calls obj.upload_fileobj() unconditionally, and exists() only checks for object presence without retrieving or comparing the ETag.
Historical context
collectfast historically solved this by comparing local MD5 hashes against S3 ETags and skipping unchanged files. However, that project is now archived (last release 2020, repository archived May 2025), leaving a gap in the ecosystem.
Proposed direction
Add an opt-in behaviour to S3Storage that, before uploading a file, compares its MD5 hash with the ETag of the existing S3 object and skips the upload if they match. No new dependencies would be required since boto3 already exposes the ETag via head_object.
Rough sketch:
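import hashlib

from botocore.exceptions import ClientError

def _should_skip_upload(self, name, content):
    """Return True if the existing S3 object's ETag matches the local MD5."""
    try:
        obj = self.connection.meta.client.head_object(
            Bucket=self.bucket_name, Key=self._normalize_name(name)
        )
    except ClientError:
        # Object missing (or any other API error): fall back to uploading.
        return False
    etag = obj["ETag"].strip('"')
    content.seek(0)
    local_md5 = hashlib.md5(content.read()).hexdigest()
    content.seek(0)  # rewind so the upload still sees the full file
    return etag == local_md5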
Controlled by a new setting, e.g. AWS_S3_SKIP_UNCHANGED = False (opt-in, default off to preserve current behaviour).
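One detail the sketch above glosses over: content.read() pulls the whole file into memory before hashing. A chunked variant would avoid that for large assets. The following is only a rough illustration; md5_hexdigest and its chunk_size default are hypothetical names of mine, not existing django-storages API, and it assumes the file object is seekable and opened in binary mode.

```python
import hashlib
import io


def md5_hexdigest(fileobj, chunk_size=64 * 1024):
    """Hex MD5 of a seekable binary file object, read in fixed-size chunks.

    Hypothetical helper for illustration; not part of django-storages.
    """
    fileobj.seek(0)
    digest = hashlib.md5()
    # read() returns b"" at end of file, which ends the iteration.
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    fileobj.seek(0)  # rewind so a later upload sees the full content
    return digest.hexdigest()


if __name__ == "__main__":
    print(md5_hexdigest(io.BytesIO(b"example static asset")))
```

In the sketch, the hashlib.md5(content.read()).hexdigest() line and the two seek(0) calls around it could then collapse into a single call to this helper.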
Questions
- Is this the kind of optimisation that belongs in django-storages, or is it intentionally out of scope?
- If it fits, is AWS_S3_SKIP_UNCHANGED an acceptable API surface, or would you prefer a different approach (subclass, mixin, etc.)?
- Are there edge cases I should be aware of — e.g. multipart uploads where the ETag is not a plain MD5?
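On the multipart point, my current understanding is that an object uploaded in multiple parts gets an ETag of the form "<hex>-<part count>", and that objects encrypted with SSE-KMS or SSE-C also carry ETags that are not the MD5 of the content. A conservative implementation could compare hashes only when the ETag looks like a plain MD5, and upload as usual otherwise. A minimal sketch of that guard, where etag_is_plain_md5 is a hypothetical helper name rather than anything that exists in django-storages or boto3:

```python
import re

# A plain MD5 ETag is exactly 32 hex digits once the surrounding quotes are removed.
_PLAIN_MD5 = re.compile(r"[0-9a-f]{32}")


def etag_is_plain_md5(etag):
    """Return True if the ETag has the shape of a bare MD5 hex digest.

    Hypothetical helper for illustration. Multipart ETags ("<hex>-<parts>")
    return False, so the caller would upload rather than risk a wrong skip.
    Note this checks the shape only; it cannot prove the value is an MD5.
    """
    return _PLAIN_MD5.fullmatch(etag.strip('"').lower()) is not None


if __name__ == "__main__":
    print(etag_is_plain_md5('"d41d8cd98f00b204e9800998ecf8427e"'))
    print(etag_is_plain_md5('"d41d8cd98f00b204e9800998ecf8427e-3"'))
```

With this guard, files large enough for upload_fileobj to switch to multipart would simply be re-uploaded every time, which is no worse than the current behaviour.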
Happy to submit a PR if the direction makes sense to the maintainers. Raising this as a discussion first to avoid wasted effort on both sides.