-
Notifications
You must be signed in to change notification settings - Fork 341
Description
We use git lfs to track various binary files in the repository. However, this is not a fool-proof way to prevent unwanted (e.g., large) binary files from being added, because:
- Someone could git commit a binary file without git lfs installed
- Even if someone has git lfs installed, they could commit a binary file whose extension isn't currently tracked by git lfs (according to the .gitattributes file in the top level of the repository)
We currently rely on a PR review to pick up any such issues, but this also isn't fool-proof because:
- The reviewer could overlook an added binary file hidden in a large PR
- If the PR author committed but later deleted a binary file, this would not show up in the PR review
So I feel it would be helpful to add a check to catch these binary files and prevent them from being added to the repository. Even a single accidental addition of a large binary file could bloat the repository forever, and it's very difficult to recover from this situation once an offending commit has been merged to the main (master) branch.
I can imagine two ways of doing this:
- Via a GitHub check that runs on every Pull Request, as long as this can be set up to check all commits in the PR, not just the final diff. This is probably the ideal solution.
- Advantages
- Easy, once it's installed
- Catches issues early
- Disadvantages
- Wouldn't catch problems in commits that don't go through PRs
- May be harder to set up
- Via a pre-commit hook, if this can be set up to check all merge commits to master (so that problems are caught when merging a problematic branch to master)
- Advantages
- May be easier to create a check this way
- Would catch problems that don't go through PRs
- Disadvantages
- Requires all integrators (those with push permission to our main branches) to install the necessary pre-commit hook in whatever repository they use for doing merges to master.
- I'm not sure if we could set this up to catch binaries that were added then later deleted, during a merge commit. i.e., if a collaborator added then deleted a binary file on their branch, then we merge it to master, then a pre-commit hook installed in our repository might not catch it in that merge. (Catching this might require that the original collaborator had the pre-commit hook installed, which I don't see as a reasonable solution.)
The closest off-the-shelf solution I found for (1) from a bit of searching is this GitHub action to catch large files https://github.com/marketplace/actions/lfs-warning. Ideally I would like the check to be based on whether it's a binary file, and not its file size, because we have some large source code files and I'd like to ensure that pretty much all binary files are tracked via git lfs. But this could be a reasonable fall-back solution, or maybe we could adapt that to check for whether a file is a binary file instead of or in addition to the size check.
Here are some other (partial?) solutions I found that seem at least close to one of these solutions:
- https://medium.com/@lorenzen.jacob/checking-git-repository-for-binaries-e18d912f9fd7
- https://github.com/michaelgwelch/lfs-check
- https://stackoverflow.com/questions/20226132/is-there-a-git-hook-which-can-prevent-binary-check-ins
- https://github.com/avar/pre-receive-reject-binaries
- https://github.com/patrickvacek/git-reject-binaries-and-large-files/blob/master/pre-receive
Here's some discussion of a similar idea, making sure that all files that should be tracked by git lfs are: git-lfs/git-lfs#2824
I should note that manage_externals has some small binary files (git repos used for its testing), so we'd either want to (a) ignore very small binary files, and/or (b) implement this check in a way that it's possibly to bypass it.