Remove big binary files #622

Open
@rcannood

If we use BFG to remove all blobs larger than 1M, we can reduce the openpipeline repo from 200 MiB to around 44 MiB. We can probably reduce it further by lowering the threshold. @DriesSchaumont WDYT?
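To judge what a lower threshold would catch before rewriting anything, it may help to list the blobs in history above a given size. The following is a minimal sketch, not part of the original run: the function name `list_large_blobs` is hypothetical, and it reports uncompressed blob sizes, which do not translate one-to-one into packed repo size.

```shell
# List blobs reachable from any ref whose uncompressed size is at least
# <min-bytes>, largest first. Output: "<size-in-bytes><TAB><path>".
# Usage: list_large_blobs <repo-dir> <min-bytes>
# (hypothetical helper name, for illustration only)
list_large_blobs() {
  repo="$1"
  min_bytes="$2"
  # rev-list emits "<sha> <path>"; cat-file resolves type and size and
  # passes the path through via %(rest).
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk -v min="$min_bytes" '
      $1 == "blob" && $2 >= min {
        size = $2
        sub(/^[^ ]+ [^ ]+ /, "")   # drop type and size, keep the full path
        print size "\t" $0
      }' |
    sort -rn
}
```

For example, `list_large_blobs lfs_test.git 524288` would show every blob of 512 KiB or more in the mirror clone, which gives a rough idea of what a lower threshold would remove.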

$ git clone --mirror git@github.com:openpipelines-bio/openpipeline.git lfs_test.git
Cloning into bare repository 'lfs_test.git'...
remote: Enumerating objects: 397073, done.
remote: Counting objects: 100% (6019/6019), done.
remote: Compressing objects: 100% (2307/2307), done.
remote: Total 397073 (delta 3644), reused 5873 (delta 3512), pack-reused 391054
Receiving objects: 100% (397073/397073), 200.99 MiB | 5.97 MiB/s, done.
Resolving deltas: 100% (269042/269042), done.

$ java -jar ~/Downloads/bfg-1.14.0.jar --strip-blobs-bigger-than 1M lfs_test.git

Using repo : /home/rcannood/workspace/openpipelines-bio/lfs_test.git


This repo has been processed by The BFG before! Will prune repo before proceeding - to avoid unnecessary cleaning work on unused objects...
Completed prune of old objects - will now proceed with the main job!

Scanning packfile for large blobs: 1588292
Scanning packfile for large blobs completed in 6,443 ms.
Found 6 blob ids for large blobs - biggest=14395908 smallest=1521437
Total size (unpacked)=47515450
Found 443 objects to protect
Found 512 commit-pointing refs : HEAD, refs/heads/481-add-leiden-clustering-to-scvi-pipeline, refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references, ...
Found 4 tag-pointing refs : refs/tags/0.3.0, refs/tags/0.3.1, refs/tags/0.4.0, refs/tags/0.4.1

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 5fb2a9e0 (protected by 'HEAD')

Cleaning
--------

Found 4459 commits
Cleaning commits:       100% (4459/4459)
Cleaning commits completed in 3,003 ms.

Updating 156 Refs
-----------------

	Ref                                                                          Before     After   
	------------------------------------------------------------------------------------------------
	refs/heads/481-add-leiden-clustering-to-scvi-pipeline                      | 700bffd6 | 6d0b9eec
	refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references | 772769ee | 7abac021
	refs/heads/604-use-the-viash-dependencies-config-value-for-workflows       | 843009e8 | 8b7b78ba
	refs/heads/concat_dtypes                                                   | c8f1e5f8 | e92cbea4
	refs/heads/feature/ataq-demux                                              | 5dcebba7 | 1666af0f
	refs/heads/feature/ataq-qc                                                 | dde357ff | 98d64cbd
	refs/heads/feature/scpoli_implementation                                   | b17c3a84 | 3ee6bc23
	refs/heads/increase_ci_memory                                              | 1464e7aa | 9b6af876
	refs/heads/integration_build                                               | b225d951 | d1eaab7b
	refs/heads/main                                                            | 5fb2a9e0 | 56ac0431
	refs/heads/main_build                                                      | 8a9894a6 | cc0001cd
	refs/heads/main_build_datasets_schema                                      | 5022c403 | 901839ca
	refs/heads/more_memory_tests                                               | fe5188fa | 7608da95
	refs/heads/release                                                         | 98678513 | 0594ac36
	refs/heads/review_cellxgene                                                | f881710c | 475cecfc
	...

Updating references:    100% (156/156)
...Ref update completed in 38 ms.

Commit Tree-Dirt History
------------------------

	Earliest                                              Latest
	|                                                          |
	..............................................DDDDDDDDDmmDmm

	D = dirty commits (file tree fixed)
	m = modified commits (commit message or parents changed)
	. = clean commits (no changes to file tree)

	                        Before     After   
	-------------------------------------------
	First modified commit | 6455c1d6 | fae7b4ab
	Last dirty commit     | e27f9172 | 3ffb155c

Deleted files
-------------

	Filename                                                                  Git id            
	--------------------------------------------------------------------------------------------
	cellranger-tiny-bcl-1.2.0.tar.gz                                        | 4b3e7995 (13.4 MB)
	cl-base.obo                                                             | af96cc47 (1.5 MB) 
	matrix.mtx.gz                                                           | 9e469be2 (4.0 MB) 
	pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5                        | eade8772 (5.2 MB) 
	pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5ad                      | 145b611c (13.7 MB)
	pbmc_1k_protein_v3_filtered_feature_bc_matrix.norm.hvg.pca.nn.umap.h5ad | de2901dd (7.6 MB) 


In total, 22327 object ids were changed. Full details are logged here:

	/home/rcannood/workspace/openpipelines-bio/lfs_test.git.bfg-report/2023-11-24/14-40-05

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

$ cd lfs_test.git

$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 397073, done.
Counting objects: 100% (397073/397073), done.
Delta compression using up to 32 threads
Compressing objects: 100% (379869/379869), done.
Writing objects: 100% (397073/397073), done.
Selecting bitmap commits: 4368, done.
Building bitmaps: 100% (148/148), done.
Total 397073 (delta 268875), reused 124073 (delta 0), pack-reused 0

$ git push
Enumerating objects: 397073, done.
Writing objects: 100% (397073/397073), 44.70 MiB | 3.69 MiB/s, done.
Total 397073 (delta 0), reused 0 (delta 0), pack-reused 397073
remote: Resolving deltas: 100% (268875/268875), done.
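
The sizes quoted above come from the clone and push transfer output. As a rough cross-check, the packed size can also be read locally before and after the cleanup. This is a sketch under the assumption that the mirror clone is the `lfs_test.git` directory from the log; the helper name `pack_size_kib` is hypothetical.

```shell
# Print the on-disk size of a repository's packfiles in KiB,
# as reported by `git count-objects -v` (field "size-pack").
# Usage: pack_size_kib <repo-dir>
# (hypothetical helper name, for illustration only)
pack_size_kib() {
  git -C "$1" count-objects -v | awk '/^size-pack:/ { print $2 }'
}

# Possible usage around a cleanup run:
#   before=$(pack_size_kib lfs_test.git)
#   ... run BFG, then git reflog expire and git gc ...
#   after=$(pack_size_kib lfs_test.git)
#   echo "pack size: ${before} KiB -> ${after} KiB"
```

The local pack size will not match the transfer size exactly, but the two should be in the same range.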
