[Discussion] Incremental S3 uploads in collectstatic via ETag comparison — is this in scope? #1561

@toracle

Hi, before investing time in an implementation I wanted to check whether this direction aligns with the project's goals.

Problem

When running collectstatic with S3Storage, every file is re-uploaded on each run regardless of whether its content has changed. For projects with a non-trivial number of static assets, this adds 2–3+ minutes to every deployment.

I verified this by reading the current _save() and exists() implementations: _save() calls obj.upload_fileobj() unconditionally, and exists() only checks whether the object is present, without retrieving or comparing its ETag.

Historical context

collectfast historically solved this by comparing local MD5 hashes against S3 ETags and skipping unchanged files. However, that project is now archived (last release 2020, repository archived May 2025), leaving a gap in the ecosystem.

Proposed direction

Add an opt-in behaviour to S3Storage that, before uploading a file, compares its MD5 hash with the ETag of the existing S3 object and skips the upload if they match. No new dependencies would be required since boto3 already exposes the ETag via head_object.

Rough sketch:

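import hashlib

from botocore.exceptions import ClientError

def _should_skip_upload(self, name, content):
    """Return True if the existing S3 object already has this content."""
    try:
        obj = self.connection.meta.client.head_object(
            Bucket=self.bucket_name, Key=self._normalize_name(name)
        )
    except ClientError:
        # Object missing or not readable: fall back to uploading.
        return False
    etag = obj["ETag"].strip('"')
    if "-" in etag:
        # Multipart ETags are not a plain MD5 of the content.
        return False
    content.seek(0)
    md5 = hashlib.md5()
    # Hash in chunks so large files are not read into memory at once.
    for chunk in iter(lambda: content.read(64 * 1024), b""):
        md5.update(chunk)
    content.seek(0)
    return etag == md5.hexdigest()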

Controlled by a new setting, e.g. AWS_S3_SKIP_UNCHANGED = False (opt-in, default off to preserve current behaviour).
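
To make the opt-in behaviour concrete, here is a minimal sketch of how the flag could gate an upload. It does not use django-storages or boto3: SkipUnchangedMixin, save_if_changed, _remote_etag, _upload and InMemoryStorage are all hypothetical names for illustration, and the skip_unchanged attribute stands in for the proposed AWS_S3_SKIP_UNCHANGED setting.

```python
import hashlib
import io


def md5_hexdigest(fileobj, chunk_size=64 * 1024):
    """Hash a binary file-like object in chunks, then rewind it."""
    fileobj.seek(0)
    digest = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    fileobj.seek(0)
    return digest.hexdigest()


class SkipUnchangedMixin:
    """Hypothetical mixin: skip uploads whose content matches the remote ETag."""

    # Stand-in for the proposed AWS_S3_SKIP_UNCHANGED setting (default off).
    skip_unchanged = False

    def _remote_etag(self, name):
        """Return the stored ETag without quotes, or None if the object is absent."""
        raise NotImplementedError

    def _upload(self, name, content):
        raise NotImplementedError

    def save_if_changed(self, name, content):
        """Upload unless the flag is on and the ETag matches; return True if uploaded."""
        if self.skip_unchanged:
            etag = self._remote_etag(name)
            # An ETag containing "-" is treated as multipart and never trusted.
            if etag is not None and "-" not in etag and etag == md5_hexdigest(content):
                return False
        self._upload(name, content)
        return True


class InMemoryStorage(SkipUnchangedMixin):
    """Dict-backed stand-in for S3 so the sketch runs without boto3."""

    def __init__(self, skip_unchanged=False):
        self.skip_unchanged = skip_unchanged
        self.objects = {}
        self.uploads = 0

    def _remote_etag(self, name):
        data = self.objects.get(name)
        return None if data is None else hashlib.md5(data).hexdigest()

    def _upload(self, name, content):
        content.seek(0)
        self.objects[name] = content.read()
        self.uploads += 1


storage = InMemoryStorage(skip_unchanged=True)
storage.save_if_changed("app.css", io.BytesIO(b"body{}"))  # first run: uploaded
storage.save_if_changed("app.css", io.BytesIO(b"body{}"))  # second run: skipped
```

With the flag left at its default of False, the comparison is never made and every call uploads, which matches the current behaviour described above.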

Questions

  1. Is this the kind of optimisation that belongs in django-storages, or is it intentionally out of scope?
  2. If it fits, is AWS_S3_SKIP_UNCHANGED an acceptable API surface, or would you prefer a different approach (subclass, mixin, etc.)?
  3. Are there edge cases I should be aware of — e.g. multipart uploads where the ETag is not a plain MD5?
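
On the multipart point in question 3, the following sketch shows why a plain MD5 comparison fails for such objects. It assumes the commonly observed S3 behaviour, which is not a documented guarantee: the multipart ETag is the MD5 of the concatenated binary MD5 digests of the parts, followed by a dash and the part count. The function names and the single-part fallback are illustrative; which uploads actually go multipart depends on the transfer configuration.

```python
import hashlib


def is_multipart_etag(etag: str) -> bool:
    """Heuristic: multipart ETags carry a '-<part count>' suffix."""
    return "-" in etag.strip('"')


def multipart_etag(data: bytes, part_size: int) -> str:
    """Estimate the ETag for data uploaded in parts of part_size bytes.

    Assumption: data that fits in a single part is sent as one plain upload,
    so its ETag is the ordinary MD5 hex digest.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        return hashlib.md5(data).hexdigest()
    # MD5 over the concatenated raw digests of each part, plus the part count.
    combined = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(combined).hexdigest()}-{len(parts)}"


payload = b"a" * 10
etag_multipart = multipart_etag(payload, part_size=4)  # split into 3 parts
etag_single = multipart_etag(payload, part_size=16)    # fits in one part
```

Because etag_multipart never equals hashlib.md5(payload).hexdigest(), a skip check has two options for such objects: always re-upload them, or recompute the ETag with the same part size that was used for the original upload.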

Happy to submit a PR if the direction makes sense to the maintainer. Raising this as a discussion first to avoid wasted effort on both sides.
