Deterministically normalize wheel ZIP metadata #2344

Open
@tabbyrobin

Description

It would be nice for cibuildwheel to include by default a post-processing step
normalizing wheels for determinism/reproducibility.

This could be a significant step toward widespread verifiably-reproducible
builds of PyPI-hosted wheels.

Background

When using cibuildwheel to build a straightforward Cython package, I found that
by default, the resulting wheels were never bit-for-bit reproducible, because of
ZIP metadata (timestamps, etc.).

That is, the wheels always came out with a different checksum after each run. An
inspection of the wheels showed that the files contained in the archive were in
fact bit-for-bit reproducible, and the differences were purely due to ZIP
metadata. In particular, the problem was with timestamps (and potentially also
the ordering of entries).
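For illustration, this kind of inspection can be sketched with Python's standard `zipfile` module. The helper names (`entry_digests`, `entry_metadata`, `compare_wheels`) are hypothetical and only cover timestamps and entry order, not every ZIP header field:

```python
import hashlib
import zipfile


def entry_digests(wheel_path):
    """Map each archive member name to the SHA-256 of its uncompressed bytes."""
    with zipfile.ZipFile(wheel_path) as zf:
        return {
            info.filename: hashlib.sha256(zf.read(info)).hexdigest()
            for info in zf.infolist()
        }


def entry_metadata(wheel_path):
    """List (name, timestamp) pairs in archive order."""
    with zipfile.ZipFile(wheel_path) as zf:
        return [(info.filename, info.date_time) for info in zf.infolist()]


def compare_wheels(path_a, path_b):
    """Report separately whether file contents and ZIP metadata match."""
    return {
        "same_contents": entry_digests(path_a) == entry_digests(path_b),
        "same_metadata": entry_metadata(path_a) == entry_metadata(path_b),
    }
```

Two wheels from separate runs that give `same_contents: True` but `same_metadata: False` match the situation described above: identical files, differing archive metadata.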

When I added a post-processing step using either python-stripzip or Debian's
strip-nondeterminism, wheels were bit-for-bit reproducible.

The cibuildwheel docs mention: "Because the builds are happening in manylinux
Docker containers, they're perfectly reproducible." This is generally true for
the build itself, but is not true for the final artifacts, because of the ZIP
timestamps.

Considerations

There are a number of tools for this.

To my knowledge, Debian's strip-nondeterminism is the most mature and featureful one.

Also of particular interest is python-stripzip.

Other tools include:

There are some issues which arise, notably:

  • Ordering of ZIP entries.
  • What timestamp(s) should be used in the ZIP metadata.
  • Whether to respect SOURCE_DATE_EPOCH for this usage.
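To make the timestamp question concrete, here is one possible policy, sketched in Python. It is an assumption for discussion, not what any of the tools above necessarily do: use `SOURCE_DATE_EPOCH` when it is set, fall back to a fixed date otherwise, and clamp to 1980-01-01 because the ZIP format cannot store earlier dates. The function name `zip_date_time` is hypothetical.

```python
import os
import time

# Earliest timestamp the ZIP (MS-DOS) date format can represent.
ZIP_EPOCH = (1980, 1, 1, 0, 0, 0)


def zip_date_time(environ=None):
    """Pick a deterministic (Y, M, D, h, m, s) tuple for ZIP entries.

    Assumed policy: honor SOURCE_DATE_EPOCH if present, else use ZIP_EPOCH.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("SOURCE_DATE_EPOCH")
    if raw is None:
        return ZIP_EPOCH
    # Interpret the value as seconds since the Unix epoch, in UTC.
    stamp = time.gmtime(int(raw))[:6]
    # Clamp values before 1980, which ZIP cannot store.
    return max(stamp, ZIP_EPOCH)
```

Note that ZIP timestamps carry no time zone and only have two-second resolution, so whichever policy is chosen, the stored value is an approximation of the requested instant.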

Note that if you only want reproducible builds for a specific project, you can
just pick a strategy, stick with it, and be done. But a blanket solution in a
centralized tool probably warrants investigating the details.

I might tentatively suggest python-stripzip for use in cibuildwheel, because
it makes fewer modifications than strip-nondeterminism. In particular, it
doesn't change the order of entries, so if .dist-info files are placed at the
end of the ZIP (as is best practice), it will leave them there.
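To illustrate the order-preserving approach, here is a minimal sketch using only Python's standard `zipfile` module. It is not python-stripzip's implementation (that tool may well patch headers in place rather than rewrite the archive); `normalize_wheel` and `FIXED_DATE_TIME` are hypothetical names, and the fixed timestamp is an arbitrary placeholder.

```python
import zipfile

# Assumed placeholder timestamp; any fixed value the ZIP format accepts works.
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def normalize_wheel(src_path, dst_path, date_time=FIXED_DATE_TIME):
    """Rewrite a wheel with fixed timestamps, keeping the entry order."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        # infolist() yields entries in archive order, which is left unchanged,
        # so .dist-info files stay wherever the build placed them.
        for info in src.infolist():
            data = src.read(info)
            # A fresh ZipInfo drops extra fields and comments, which can
            # carry additional timestamps.
            clean = zipfile.ZipInfo(info.filename, date_time=date_time)
            clean.compress_type = info.compress_type
            clean.external_attr = info.external_attr  # keep permission bits
            clean.create_system = 3  # always record "Unix" as the host system
            dst.writestr(clean, data)
```

Because this sketch recompresses each entry, byte-identical output also assumes the same zlib version and settings on every run; a tool that edits the metadata in place avoids that dependency.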

Here is a script using cibuildwheel and python-stripzip which demonstrates
successfully generating bit-for-bit reproducible wheels:
https://gist.github.com/tabbyrobin/d6c5cf5323fe54a50004c1291da39315#file-build-wheels-sh

Build log

No response

CI config

No response
