PATH WALK III: Add 'git backfill' command #1820

derrickstolee · 2024-10-29T02:37:47Z

This is based on 'master' now that ds/path-walk [1] is merged. An earlier version was part of my initial path-walk RFC [2].

[1] https://lore.kernel.org/git/[email protected]/

[2] https://lore.kernel.org/git/[email protected]/

This series adds a new 'git backfill' command that uses the path-walk API to download missing blobs in a blobless partial clone. Users can specify interaction with the sparse-checkout using '--[no-]sparse' but the '--sparse' option is implied by the existence of a sparse-checkout.

The reason to use the path-walk API is to make sure that the missing objects are grouped by a common path, giving a reasonable process for batching requests and expecting the server to compress the resulting packfile nicely together.

I first prototyped this feature in June 2024 as an exploration and created the path-walk algorithm for this purpose. It was only my intuition that led me to believe that batching by path would lead to better packfiles. This has been proven out as a very important feature due to recent investigations to compressing full repositories by doing a better job of grouping objects by path. See the --name-hash-version series [3] or the 'git pack-objects --path-walk' series [4] (currently on hold as it conflicts with the --name-hash-version series).

[3] https://lore.kernel.org/git/[email protected]/

[4] https://lore.kernel.org/git/[email protected]/

This idea can be further demonstrated by the evidence in testing this feature: by downloading objects in small batch sizes, the client can force the server to repack things more efficiently than a full repack.

The example repository I have used in multiple places is the microsoft/fluentui repo [5] as it has many CHANGELOG.md files that cause name hash collisions that make the full repack inefficient.

[5] https://github.com/microsoft/fluentui

If we create a blobless clone of the fluentui repo, then this downloads 105 MB across two packfiles (the commits and trees pack, followed by the blobs needed for an initial checkout). Running 'git backfill --batch-size=' for different sizes leads to some interesting results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |

The smaller batches cause the server to realize that the existing deltas cannot be reused and it finds better deltas. This takes some extra time for the small batches, but halves the size of the repo. Even in the 500K batch size, we get less data than the 738 MB of a full clone.

Implementing the --sparse feature is best done by augmenting the path-walk API to be aware of a pattern list. This works for both cone and non-cone mode sparse-checkouts.

There are future directions we could take this command, especially to run the command with a user-specified pathspec. The tricky case for that additional feature is trying to make the path-walk more efficient by skipping tree paths that would not lead to a match of the pathspec. It would likely need optimization in a small subset of pathspec features (such as prefix matches) to work as efficiently as possible. I did prototype a version that puts the pathspec match in the callback function within builtin/backfill.c, but I found that uninspiring and unnecessary for now.

Updates in v2

Thanks for the review on v1.

Memory leaks should be plugged now due to the introduction of a
destructor in v4 of the path-walk API and its extension here to clear
the pattern list.
Documentation is expanded to demonstrate that 'git backfill' can also
approximate incremental/resumable clones of repositories.
Method renames to better match style.
--batch-size is renamed to --min-batch-size for clarity and future
extensibility.

Updates in v3

Rebased onto 'master' now that the path-walk API is merged.
New builtin boilerplate is updated with new standards, including:

Doc formatting uses [synopsis] formatting.
Add builtin/backfill.c to meson.build.
Add Documentation/git-backfill.txt to Documentation/meson.build.
Add t/t5620-backfill.sh to t/meson.build.
Update handling of -h due to f66d142 (builtin: send usage() help text to standard output, 2025-01-16).

Doc formatting is updated to use back-ticks on options and mark the builtin as experimental.
The batch_size member of 'struct backfill_context' is now named 'min_batch_size' in all patches.
Some mentions of '--batch-size' are updated to '--min-batch-size'.
An additional test is included for non-cone-mode sparse-checkout patterns to further check the return values of path_matches_pattern_list() within the path-walk API with sparse mode.
A use of oid_object_info_extended() is replaced with has_object().
The backfill_context_clear() method is called by the proper owner of the struct.

Thanks, -Stolee

cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: Jean-Noël AVILA [email protected]

derrickstolee · 2024-12-06T20:06:12Z

/submit

gitgitgadget · 2024-12-06T20:07:34Z

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1820/derrickstolee/backfill-upstream-v1

To fetch this version to local tag pr-1820/derrickstolee/backfill-upstream-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1820/derrickstolee/backfill-upstream-v1

gitgitgadget · 2024-12-07T12:31:26Z

This patch series was integrated into seen via git@e1db44d.

gitgitgadget · 2024-12-08T03:28:25Z

This patch series was integrated into seen via git@e1dd1e4.

gitgitgadget · 2024-12-08T10:58:49Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

It seems that the result of calling init_revisions() from backfill
is leaked?

https://github.com/git/git/actions/runs/12218430154/job/34083929479 

I did not dig further but the below is from my local leaksanitizer
run.

Thanks.

=================================================================
==git==182342==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 576 byte(s) in 1 object(s) allocated from:
    #0 0x55d4b0e42915 in __interceptor_realloc (git+0x84915) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)
    #1 0x55d4b119ce6d in xrealloc wrapper.c:140:8
    #2 0x55d4b11204d4 in add_rev_cmdline revision.c:1563:2
    #3 0x55d4b11187a1 in handle_revision_arg_1 revision.c:2263:2
    #4 0x55d4b1118398 in handle_revision_arg revision.c:2275:12
    #5 0x55d4b0e5233b in do_backfill builtin/backfill.c:100:2
    #6 0x55d4b0e52253 in cmd_backfill builtin/backfill.c:146:9
    #7 0x55d4b0e46b80 in run_builtin git.c:480:11
    #8 0x55d4b0e45502 in handle_builtin git.c:741:9
    #9 0x55d4b0e4649c in run_argv git.c:808:4
    #10 0x55d4b0e45274 in cmd_main git.c:948:19
    #11 0x55d4b0f6da8a in main common-main.c:9:11
    #12 0x7ff25f97ac89 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #13 0x7ff25f97ad44 in __libc_start_main csu/../csu/libc-start.c:360:3
    #14 0x55d4b0e171e0 in _start (git+0x591e0) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)

DEDUP_TOKEN: __interceptor_realloc--xrealloc--add_rev_cmdline--handle_revision_arg_1--handle_revision_arg--do_backfill--cmd_backfill--run_builtin--handle_builtin--run_argv--cmd_main--main--__libc_start_call_main--__libc_start_main--_start
Indirect leak of 5 byte(s) in 1 object(s) allocated from:
    #0 0x55d4b0e424b6 in __interceptor_malloc (git+0x844b6) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)
    #1 0x7ff25f9f34f9 in strdup string/strdup.c:42:15
    #2 0x55d4b119cb14 in xstrdup wrapper.c:43:14
    #3 0x55d4b1120506 in add_rev_cmdline revision.c:1565:23
    #4 0x55d4b11187a1 in handle_revision_arg_1 revision.c:2263:2
    #5 0x55d4b1118398 in handle_revision_arg revision.c:2275:12
    #6 0x55d4b0e5233b in do_backfill builtin/backfill.c:100:2
    #7 0x55d4b0e52253 in cmd_backfill builtin/backfill.c:146:9
    #8 0x55d4b0e46b80 in run_builtin git.c:480:11
    #9 0x55d4b0e45502 in handle_builtin git.c:741:9
    #10 0x55d4b0e4649c in run_argv git.c:808:4
    #11 0x55d4b0e45274 in cmd_main git.c:948:19
    #12 0x55d4b0f6da8a in main common-main.c:9:11
    #13 0x7ff25f97ac89 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #14 0x7ff25f97ad44 in __libc_start_main csu/../csu/libc-start.c:360:3
    #15 0x55d4b0e171e0 in _start (git+0x591e0) (BuildId: c861e65ec43b0a3ef46b9555a81d6ddfc2358e8e)

DEDUP_TOKEN: __interceptor_malloc--strdup--xstrdup--add_rev_cmdline--handle_revision_arg_1--handle_revision_arg--do_backfill--cmd_backfill--run_builtin--handle_builtin--run_argv--cmd_main--main--__libc_start_call_main--__libc_start_main--_start
SUMMARY: LeakSanitizer: 581 byte(s) leaked in 2 allocation(s).
Our logs revealed a memory leak...
++ rmdir /home/gitster/w/git.git/t/test-results/t5620-backfill.leak
++ :
++ exit 134
++ eval_ret=134

gitgitgadget · 2024-12-09T00:37:01Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <[email protected]> writes:

> It seems that the result of calling init_revisions() from backfill
> is leaked?
>
> https://github.com/git/git/actions/runs/12218430154/job/34083929479 
>
> I did not dig further but the below is from my local leaksanitizer
> run.

The attached patch seems to plug the leaks observed by your backfill
tests.  If you agree with the implementation of the change, you are
welcome to squash it in.  I may be missing a better mechanism in the
path-walk API that I could use to plug the leaks, in which case, of
course a fix using such a better mechanism is very much welcomed.

A few things I noticed about path-walk API during my poking are:

 - struct path_walk_info has PATH_WALK_INFO_INIT() macro, but not a
   matching path_walk_info_init() helper function to help initialize
   any heap-allocated instances.  In general, anything that requires
   _INIT() macro by definition wants to be initialized with more
   than nul-initialized (otherwise, it would be fine to leave it
   uninitialized in the BSS, clear with = {0} initialization on the
   stack), so lack of matching path_walk_info_init() raises an
   eyebrow.

 - struct path_walk_info seems to lack a destructor.  In general,
   anything complex enough to require initializer should have one.

 - lack of destructor for path_walk_info requires users to release
   resources held by "struct pattern_list" instance at pwi.pl; if we
   had a destructor, we wouldn't have 2 of the three leaks we had to
   fix here.

Thanks.


diff --git c/builtin/backfill.c w/builtin/backfill.c
index 225764f17e..1cf2df9303 100644
--- c/builtin/backfill.c
+++ w/builtin/backfill.c
@@ -92,8 +92,11 @@ static int do_backfill(struct backfill_context *ctx)
 
 	if (ctx->sparse) {
 		CALLOC_ARRAY(info.pl, 1);
-		if (get_sparse_checkout_patterns(info.pl))
+		if (get_sparse_checkout_patterns(info.pl)) {
+			clear_pattern_list(info.pl);
+			free(info.pl);
 			return error(_("problem loading sparse-checkout"));
+		}
 	}
 
 	repo_init_revisions(ctx->repo, &revs, "");
@@ -113,6 +116,11 @@ static int do_backfill(struct backfill_context *ctx)
 		download_batch(ctx);
 
 	clear_backfill_context(ctx);
+	release_revisions(&revs);
+	if (info.pl) {
+		clear_pattern_list(info.pl);
+		free(info.pl);
+	}
 	return ret;
 }

gitgitgadget · 2024-12-09T01:00:52Z

This patch series was integrated into seen via git@ef760f5.

gitgitgadget · 2024-12-09T05:56:22Z

This patch series was integrated into seen via git@6eef8e2.

gitgitgadget · 2024-12-10T11:09:18Z

This patch series was integrated into seen via git@7668c77.

gitgitgadget · 2024-12-11T06:24:15Z

This patch series was integrated into seen via git@8a5ce5d.

gitgitgadget · 2024-12-12T08:41:56Z

This patch series was integrated into seen via git@28b8993.

gitgitgadget · 2024-12-13T23:55:19Z

This patch series was integrated into seen via git@3c075bc.

gitgitgadget · 2024-12-15T05:31:12Z

This patch series was integrated into seen via git@015ab1a.

gitgitgadget · 2024-12-16T01:04:08Z

This patch series was integrated into seen via git@8def160.

gitgitgadget · 2024-12-16T08:05:12Z

Documentation/git-backfill.txt

+git-backfill - Download missing objects in a partial clone
+
+
+SYNOPSIS


On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 06, 2024 at 08:07:16PM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <[email protected]> > > Users may want to specify a minimum batch size for their needs. This is only > a minimum: the path-walk API provides a list of OIDs that correspond to the > same path, and thus it is optimal to allow delta compression across those > objects in a single server request. Okay, here you explicitly say that this is a minimum batch size, so this is by design and with a proper reason. Good. > We could consider limiting the request to have a maximum batch size in the > future. For now, we let the path-walk API batches determine the > boundaries. Should we maybe rename `--batch-size` to `--min-batch-size` so that it does not become awkward if we ever want to have a maximum batch size, as well? Also helps to set expectations with the user. [snip] > Based on these experiments, a batch size of 50,000 was chosen as the > default value. Thanks for all the data, this is really helpful! > diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt > index 0e10f066fef..9b0bae04e9d 100644 > --- a/Documentation/git-backfill.txt > +++ b/Documentation/git-backfill.txt > @@ -38,6 +38,14 @@ delta compression in the packfile sent by the server. > By default, `git backfill` downloads all blobs reachable from the `HEAD` > commit. This set can be restricted or expanded using various options. > > +OPTIONS > +------- > + > +--batch-size=<n>:: > + Specify a minimum size for a batch of missing objects to request > + from the server. This size may be exceeded by the last set of > + blobs seen at a given path. Default batch size is 16,000. This is stale: s/16,000/50,000/ > diff --git a/builtin/backfill.c b/builtin/backfill.c > index e5f2000d5e0..127333daef8 100644 > --- a/builtin/backfill.c > +++ b/builtin/backfill.c > @@ -112,6 +112,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, > struct reposit > .batch_size = 50000, > }; > struct option options[] = { > + OPT_INTEGER(0, "batch-size", &ctx.batch_size, > + N_("Minimun number of objects to request at a time")), s/Minimun/Minimum Patrick

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 12/16/24 3:01 AM, Patrick Steinhardt wrote: > On Fri, Dec 06, 2024 at 08:07:16PM +0000, Derrick Stolee via GitGitGadget wrote: >> From: Derrick Stolee <[email protected]> >> >> Users may want to specify a minimum batch size for their needs. This is only >> a minimum: the path-walk API provides a list of OIDs that correspond to the >> same path, and thus it is optimal to allow delta compression across those >> objects in a single server request. > > Okay, here you explicitly say that this is a minimum batch size, so this > is by design and with a proper reason. Good. > >> We could consider limiting the request to have a maximum batch size in the >> future. For now, we let the path-walk API batches determine the >> boundaries. > > Should we maybe rename `--batch-size` to `--min-batch-size` so that it > does not become awkward if we ever want to have a maximum batch size, as > well? Also helps to set expectations with the user. Good idea. Will do. >> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt >> index 0e10f066fef..9b0bae04e9d 100644 >> --- a/Documentation/git-backfill.txt >> +++ b/Documentation/git-backfill.txt >> @@ -38,6 +38,14 @@ delta compression in the packfile sent by the server. >> By default, `git backfill` downloads all blobs reachable from the `HEAD` >> commit. This set can be restricted or expanded using various options. >> >> +OPTIONS >> +------- >> + >> +--batch-size=<n>:: >> + Specify a minimum size for a batch of missing objects to request >> + from the server. This size may be exceeded by the last set of >> + blobs seen at a given path. Default batch size is 16,000. > > This is stale: s/16,000/50,000/ Thanks! >> diff --git a/builtin/backfill.c b/builtin/backfill.c >> index e5f2000d5e0..127333daef8 100644 >> --- a/builtin/backfill.c >> +++ b/builtin/backfill.c >> @@ -112,6 +112,8 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, >> struct reposit >> .batch_size = 50000, >> }; >> struct option options[] = { >> + OPT_INTEGER(0, "batch-size", &ctx.batch_size, >> + N_("Minimun number of objects to request at a time")), > > s/Minimun/Minimum Thanks for the careful eye for detail. -Stolee

gitgitgadget · 2024-12-16T08:05:14Z

Documentation/git-backfill.txt

+[verse]
+'git backfill' [<options>]
+
+DESCRIPTION


On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote: > diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt > index 640144187d3..0e10f066fef 100644 > --- a/Documentation/git-backfill.txt > +++ b/Documentation/git-backfill.txt > @@ -14,6 +14,30 @@ SYNOPSIS > DESCRIPTION > ----------- > > +Blobless partial clones are created using `git clone --filter=blob:none` > +and then configure the local repository such that the Git client avoids > +downloading blob objects unless they are required for a local operation. > +This initially means that the clone and later fetches download reachable > +commits and trees but no blobs. Later operations that change the `HEAD` > +pointer, such as `git checkout` or `git merge`, may need to download > +missing blobs in order to complete their operation. Okay. > +In the worst cases, commands that compute blob diffs, such as `git blame`, > +become very slow as they download the missing blobs in single-blob > +requests to satisfy the missing object as the Git command needs it. This > +leads to multiple download requests and no ability for the Git server to > +provide delta compression across those objects. > + > +The `git backfill` command provides a way for the user to request that > +Git downloads the missing blobs (with optional filters) such that the > +missing blobs representing historical versions of files can be downloaded > +in batches. The `backfill` command attempts to optimize the request by > +grouping blobs that appear at the same path, hopefully leading to good > +delta compression in the packfile sent by the server. Hm. So we're asking the user to fix a usability issue of git-blame(1), don't we? Ideally, git-blame(1) itself should know to transparently batch the blobs it requires to compute its output, shouldn't it? That usecase alone doesn't yet convince me that git-backfill(1) is a good idea as I'd think we should rather fix the underlying issue. So are there other usecases for git-backfill(1)? I can imagine that it might be helpful in the context of scripts that know they'll operate on a bunch of blobs. > diff --git a/builtin/backfill.c b/builtin/backfill.c > index 38e6aaeaa03..e5f2000d5e0 100644 > --- a/builtin/backfill.c > +++ b/builtin/backfill.c > @@ -1,16 +1,116 @@ > #include "builtin.h" > +#include "git-compat-util.h" > #include "config.h" > #include "parse-options.h" > #include "repository.h" > +#include "commit.h" > +#include "hex.h" > +#include "tree.h" > +#include "tree-walk.h" > #include "object.h" > +#include "object-store-ll.h" > +#include "oid-array.h" > +#include "oidset.h" > +#include "promisor-remote.h" > +#include "strmap.h" > +#include "string-list.h" > +#include "revision.h" > +#include "trace2.h" > +#include "progress.h" > +#include "packfile.h" > +#include "path-walk.h" > > static const char * const builtin_backfill_usage[] = { > N_("git backfill [<options>]"), > NULL > }; > > +struct backfill_context { > + struct repository *repo; > + struct oid_array current_batch; > + size_t batch_size; > +}; > + > +static void clear_backfill_context(struct backfill_context *ctx) > +{ > + oid_array_clear(&ctx->current_batch); > +} Nit: our style guide says that this should rather be `backfill_context_clear()`. > +static void download_batch(struct backfill_context *ctx) > +{ > + promisor_remote_get_direct(ctx->repo, > + ctx->current_batch.oid, > + ctx->current_batch.nr); > + oid_array_clear(&ctx->current_batch); > + > + /* > + * We likely have a new packfile. Add it to the packed list to > + * avoid possible duplicate downloads of the same objects. > + */ > + reprepare_packed_git(ctx->repo); > +} > + > +static int fill_missing_blobs(const char *path UNUSED, > + struct oid_array *list, > + enum object_type type, > + void *data) > +{ > + struct backfill_context *ctx = data; > + > + if (type != OBJ_BLOB) > + return 0; > + > + for (size_t i = 0; i < list->nr; i++) { > + off_t size = 0; > + struct object_info info = OBJECT_INFO_INIT; > + info.disk_sizep = &size; > + if (oid_object_info_extended(ctx->repo, > + &list->oid[i], > + &info, > + OBJECT_INFO_FOR_PREFETCH) || > + !size) > + oid_array_append(&ctx->current_batch, &list->oid[i]); > + } > + > + if (ctx->current_batch.nr >= ctx->batch_size) > + download_batch(ctx); Okay, so the batch size is just "best effort". If we walk a tree that makes us exceed the batch size then we wouldn't issue a fetch during the tree walk. Is there any specific reason for this behaviour? In any case, as long as this is properly documented I think this should be fine in general. > + return 0; > +} > + > +static int do_backfill(struct backfill_context *ctx) > +{ > + struct rev_info revs; > + struct path_walk_info info = PATH_WALK_INFO_INIT; > + int ret; > + > + repo_init_revisions(ctx->repo, &revs, ""); > + handle_revision_arg("HEAD", &revs, 0, 0); > + > + info.blobs = 1; > + info.tags = info.commits = info.trees = 0; > + > + info.revs = &revs; > + info.path_fn = fill_missing_blobs; > + info.path_fn_data = ctx; > + > + ret = walk_objects_by_path(&info); > + > + /* Download the objects that did not fill a batch. */ > + if (!ret) > + download_batch(ctx); > + > + clear_backfill_context(ctx); Are we leaking `revs` and `info`? > + return ret; > +} > + > int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo) > { > + struct backfill_context ctx = { > + .repo = repo, > + .current_batch = OID_ARRAY_INIT, > + .batch_size = 50000, > + }; > struct option options[] = { > OPT_END(), > }; > @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit > > repo_config(repo, git_default_config, NULL); > > - die(_("not implemented")); > - > - return 0; > + return do_backfill(&ctx); > } The current iteration only backfills blobs as far as I can see. Do we maybe want to keep the door open for future changes in git-backfill(1) by implementing this via a "blob" subcommand? Patrick

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 12/16/24 3:01 AM, Patrick Steinhardt wrote: > On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote: >> +The `git backfill` command provides a way for the user to request that >> +Git downloads the missing blobs (with optional filters) such that the >> +missing blobs representing historical versions of files can be downloaded >> +in batches. The `backfill` command attempts to optimize the request by >> +grouping blobs that appear at the same path, hopefully leading to good >> +delta compression in the packfile sent by the server. > > Hm. So we're asking the user to fix a usability issue of git-blame(1), > don't we? Ideally, git-blame(1) itself should know to transparently > batch the blobs it requires to compute its output, shouldn't it? That > usecase alone doesn't yet convince me that git-backfill(1) is a good > idea as I'd think we should rather fix the underlying issue. I've looked into making this change for 'git blame' and it is a nontrivial change. I'm not sure on the timeline that we could expect 'git blame' to be improved. But you're right that this is not enough justification on its own. It's an example of a kind of command that has these problems, including 'git log [-p|-L]'. > So are there other usecases for git-backfill(1)? I can imagine that it > might be helpful in the context of scripts that know they'll operate on > a bunch of blobs. One major motivation is that it can essentially provide a way to do resumable clones: get the commits and trees in one go with a blobless clone, then download the blobs in batches. If something interrupts the 'git backfill' command, then restarting it will only repeat the most recent batch. Finally, in a later patch we can see that the --sparse option allows a user to operate as if they have a full clone but where they only include blob data within their sparse-checkout, providing reduced disk space and network time to get to that state. >> + if (ctx->current_batch.nr >= ctx->batch_size) >> + download_batch(ctx); > > Okay, so the batch size is just "best effort". If we walk a tree that > makes us exceed the batch size then we wouldn't issue a fetch during the > tree walk. Is there any specific reason for this behaviour? > > In any case, as long as this is properly documented I think this should > be fine in general. The main reason is that we expect the server to return a packfile that has many delta relationships within the objects at a given path. If we split the batch in the middle of a path batch, then we are artificially introducing breaks in the delta chains that could be wasteful. However, this batching pattern could be problematic if there are a million versions of a single file and the batch is too large to download efficiently. This "absolute max batch size" is currently left as a future extension. >> + clear_backfill_context(ctx); > > Are we leaking `revs` and `info`? At the moment. Will fix. >> + return ret; >> +} >> + >> int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo) >> { >> + struct backfill_context ctx = { >> + .repo = repo, >> + .current_batch = OID_ARRAY_INIT, >> + .batch_size = 50000, >> + }; >> struct option options[] = { >> OPT_END(), >> }; >> @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit >> >> repo_config(repo, git_default_config, NULL); >> >> - die(_("not implemented")); >> - >> - return 0; >> + return do_backfill(&ctx); >> } > > The current iteration only backfills blobs as far as I can see. Do we > maybe want to keep the door open for future changes in git-backfill(1) > by implementing this via a "blob" subcommand? I think that one tricky part is to ask "what could be missing?". With Git's partial clone, it seems that we could have treeless or depth- based tree restrictions. Technically, there could also be clones that restrict to a set of sparse-checkout patterns, but I'm not aware of any server that will respect those kinds of clones. At the moment, the tree walk would fault in any missing trees as they are seen, but this is extremely inefficient. I think that the path-walk API could be adjusted to be more careful to check for the existence of an object before automatically loading it. That would allow for batched downloads of missing trees, though a second scan would be required to get the next layer of objects. I'm not sure a subcommand is the right way to solve for this potential future, but instead later we could adjust the logic to be better for these treeless or tree-restricted clones. Thanks, -Stolee

gitgitgadget · 2024-12-16T08:05:16Z

Documentation/git-backfill.txt

+git-backfill - Download missing objects in a partial clone
+
+
+SYNOPSIS


On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Fri, Dec 06, 2024 at 08:07:17PM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <[email protected]> > > One way to significantly reduce the cost of a Git clone and later fetches is > to use a blobless partial clone and combine that with a sparse-checkout that > reduces the paths that need to be populated in the working directory. Not > only does this reduce the cost of clones and fetches, the sparse-checkout > reduces the number of objects needed to download from a promisor remote. > > However, history investigations can be expensie as computing blob diffs will s/expensie/expensive Patrick

gitgitgadget · 2024-12-16T22:15:12Z

This patch series was integrated into seen via git@311b2d3.

gitgitgadget · 2024-12-17T22:08:44Z

This patch series was integrated into seen via git@b26b074.

One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote. However, history investigations can be expensive as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost. Note that this is distinctly different from the '--filter=sparse:<oid>' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. This avoids the server-side cost of walking trees while also achieving a similar goal. It also downloads in batches based on similar path names, presenting a resumable download if things are interrupted. This augments the path-walk API to have a possibly-NULL 'pl' member that may point to a 'struct pattern_list'. This could be more general than the sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently the only consumer. Be sure to test this in both cone mode and not cone mode. Cone mode has the benefit that the path-walk can skip certain paths once they would expand beyond the sparse-checkout. Non-cone mode can describe the included files using both positive and negative patterns, which changes the possible return values of path_matches_pattern_list(). Test both kinds of matches for increased coverage. To test this, we can create a blobless sparse clone, expand the sparse-checkout slightly, and then run 'git backfill --sparse' to see how much data is downloaded. The general steps are 1. git clone --filter=blob:none --sparse <url> 2. git sparse-checkout set <dir1> ... <dirN> 3. git backfill --sparse For the Git repository with the 'builtin' directory in the sparse-checkout, we get these results for various batch sizes: | Batch Size | Pack Count | Pack Size | Time | |-----------------|------------|-----------|-------| | (Initial clone) | 3 | 110 MB | | | 10K | 12 | 192 MB | 17.2s | | 15K | 9 | 192 MB | 15.5s | | 20K | 8 | 192 MB | 15.5s | | 25K | 7 | 192 MB | 14.7s | This case matters less because a full clone of the Git repository from GitHub is currently at 277 MB. Using a copy of the Linux repository with the 'kernel/' directory in the sparse-checkout, we get these results: | Batch Size | Pack Count | Pack Size | Time | |-----------------|------------|-----------|------| | (Initial clone) | 2 | 1,876 MB | | | 10K | 11 | 2,187 MB | 46s | | 25K | 7 | 2,188 MB | 43s | | 50K | 5 | 2,194 MB | 44s | | 100K | 4 | 2,194 MB | 48s | This case is more meaningful because a full clone of the Linux repository is currently over 6 GB, so this is a valuable way to download a fraction of the repository and no longer need network access for all reachable objects within the sparse-checkout. Choosing a batch size will depend on a lot of factors, including the user's network speed or reliability, the repository's file structure, and how many versions there are of the file within the sparse-checkout scope. There will not be a one-size-fits-all solution. Signed-off-by: Derrick Stolee <[email protected]>

The previous change introduced the '--[no-]sparse' option for the 'git backfill' command, but did not assume it as enabled by default. However, this is likely the behavior that users will most often want to happen. Without this default, users with a small sparse-checkout may be confused when 'git backfill' downloads every version of every object in the full history. However, this is left as a separate change so this decision can be reviewed independently of the value of the '--[no-]sparse' option. Add a test of adding the '--sparse' option to a repo without sparse-checkout to make it clear that supplying it without a sparse-checkout is an error. Signed-off-by: Derrick Stolee <[email protected]>

derrickstolee · 2025-02-03T17:09:56Z

/submit

gitgitgadget · 2025-02-03T17:11:23Z

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1820/derrickstolee/backfill-upstream-v3

To fetch this version to local tag pr-1820/derrickstolee/backfill-upstream-v3:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1820/derrickstolee/backfill-upstream-v3

gitgitgadget · 2025-02-03T20:14:08Z

This patch series was integrated into seen via git@97c37fc.

gitgitgadget · 2025-02-04T00:22:30Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> Updates in v3
> =============
>
>  * Rebased onto 'master' now that the path-walk API is merged.

I was going to object to this rebase, as the same path-walk was
contained already while building the base of the series for the
previous rounds.  IOW, "now that the path-walk API is merged" is not
a good excuse to rebase onto 'master'.

But then I forgot that there are other topics, like 'meson based
build' and 'synopsis formatting', that were in flight at the same
time that have been merged to 'master'.  They are good reasons why
we may want to rebase the updated version to 'master'.

IOW ...

>  * New builtin boilerplate is updated with new standards, including:
>
>  * Doc formatting uses [synopsis] formatting.
>  * Add builtin/backfill.c to meson.build.
>  * Add Documentation/git-backfill.txt to Documentation/meson.build.
>  * Add t/t5620-backfill.sh to t/meson.build.
>  * Update handling of -h due to f66d1423f5 (builtin: send usage() help text
>    to standard output, 2025-01-16).

... these are all good reasons, even if path-walk were still cooking
in 'next' (in which case, we'd prepare a custom base by merging path-walk
into 'master' and then apply these patches).

>  * Doc formatting is updated to use back-ticks on options and mark the
>    builtin as experimental.
>
>  * The batch_size member of 'struct backfill_context' is now named
>    'min_batch_size' in all patches.
>
>  * Some mentions of '--batch-size' are updated to '--min-batch-size'.
>
>  * An additional test is included for non-cone-mode sparse-checkout patterns
>    to further check the return values of path_matches_pattern_list() within
>    the path-walk API with sparse mode.
>
>  * A use of oid_object_info_extended() is replaced with has_object().
>
>  * The backfill_context_clear() method is called by the proper owner of the
>    struct.
>
> Thanks, -Stolee

Everything looked great from a quick look.  I'll have a more
detailed look later, but this round looks quite promising.

Thanks.

gitgitgadget · 2025-02-04T20:27:56Z

This patch series was integrated into seen via git@769bc7a.

gitgitgadget · 2025-02-05T02:12:35Z

There was a status update in the "Cooking" section about the branch ds/backfill on the Git mailing list:

Lazy-loading missing files in a blobless clone on demand is costly
as it tends to be one-blob-at-a-time.  "git backfill" is introduced
to help bulk-download necessary files beforehand.

Will merge to 'next'?
source: <[email protected]>

gitgitgadget · 2025-02-05T07:18:28Z

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Mon, Feb 03, 2025 at 04:18:44PM -0800, Junio C Hamano wrote:
> Everything looked great from a quick look.  I'll have a more
> detailed look later, but this round looks quite promising.

I didn't do another deep dive, but mostly had a look at my last comments
and the range-diff. As far as I can see my comments have been addressed,
so this version looks good to me.

Thanks!

Patrick

gitgitgadget · 2025-02-05T17:10:00Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

Patrick Steinhardt <[email protected]> writes:

> On Mon, Feb 03, 2025 at 04:18:44PM -0800, Junio C Hamano wrote:
>> Everything looked great from a quick look.  I'll have a more
>> detailed look later, but this round looks quite promising.
>
> I didn't do another deep dive, but mostly had a look at my last comments
> and the range-diff. As far as I can see my comments have been addressed,
> so this version looks good to me.

Thanks.

gitgitgadget · 2025-02-05T23:05:42Z

This patch series was integrated into seen via git@0968664.

gitgitgadget · 2025-02-06T22:21:59Z

This patch series was integrated into seen via git@c9e5cf4.

gitgitgadget · 2025-02-07T19:35:19Z

This patch series was integrated into seen via git@7f1600c.

gitgitgadget · 2025-02-07T21:33:32Z

There was a status update in the "Cooking" section about the branch ds/backfill on the Git mailing list:

Lazy-loading missing files in a blobless clone on demand is costly
as it tends to be one-blob-at-a-time.  "git backfill" is introduced
to help bulk-download necessary files beforehand.

Will merge to 'next'.
cf. <[email protected]>
source: <[email protected]>

gitgitgadget · 2025-02-10T20:02:32Z

This patch series was integrated into seen via git@31fe539.

gitgitgadget · 2025-02-10T20:02:33Z

This patch series was integrated into next via git@d6348c9.

gitgitgadget · 2025-02-11T02:41:58Z

There was a status update in the "Cooking" section about the branch ds/backfill on the Git mailing list:

Lazy-loading missing files in a blobless clone on demand is costly
as it tends to be one-blob-at-a-time.  "git backfill" is introduced
to help bulk-download necessary files beforehand.

Will merge to 'master'.
cf. <[email protected]>
source: <[email protected]>

gitgitgadget · 2025-02-12T00:11:48Z

This patch series was integrated into seen via git@e055b4d.

gitgitgadget · 2025-02-12T18:43:11Z

This patch series was integrated into seen via git@81dd73c.

gitgitgadget · 2025-02-13T22:37:38Z

This patch series was integrated into seen via git@d5941b7.

gitgitgadget · 2025-02-15T18:17:57Z

This patch series was integrated into seen via git@53a8157.

gitgitgadget · 2025-02-15T18:51:25Z

There was a status update in the "Cooking" section about the branch ds/backfill on the Git mailing list:

Lazy-loading missing files in a blobless clone on demand is costly
as it tends to be one-blob-at-a-time.  "git backfill" is introduced
to help bulk-download necessary files beforehand.

Will merge to 'master'.
cf. <[email protected]>
source: <[email protected]>

gitgitgadget · 2025-02-19T00:13:16Z

This patch series was integrated into seen via git@e565f37.

gitgitgadget · 2025-02-19T00:13:17Z

This patch series was integrated into master via git@e565f37.

gitgitgadget · 2025-02-19T00:13:17Z

This patch series was integrated into next via git@e565f37.

gitgitgadget · 2025-02-19T00:13:20Z

Closed via e565f37.

derrickstolee self-assigned this Oct 29, 2024

derrickstolee force-pushed the backfill-upstream branch from 8d964b9 to 52ef44c Compare October 30, 2024 20:06

derrickstolee force-pushed the api-upstream branch from 97d669a to 5252076 Compare October 30, 2024 20:07

derrickstolee force-pushed the backfill-upstream branch from 52ef44c to cd88221 Compare October 30, 2024 22:20

derrickstolee force-pushed the api-upstream branch from 5252076 to 0bb607e Compare October 30, 2024 22:20

derrickstolee mentioned this pull request Oct 31, 2024

PATH WALK I: The path-walk API #1818

Closed

derrickstolee force-pushed the backfill-upstream branch from cd88221 to c3a46e3 Compare November 8, 2024 16:01

derrickstolee force-pushed the backfill-upstream branch from c3a46e3 to b5f4390 Compare December 6, 2024 19:41

derrickstolee force-pushed the api-upstream branch from 0bb607e to e716672 Compare December 6, 2024 19:42

derrickstolee force-pushed the backfill-upstream branch from b5f4390 to 35a7e4e Compare December 6, 2024 20:05

gitgitgadget bot added the seen label Dec 7, 2024

gitgitgadget bot reviewed Dec 16, 2024

View reviewed changes

derrickstolee added 2 commits February 3, 2025 11:16

derrickstolee force-pushed the backfill-upstream branch from ce7360d to f22cf8b Compare February 3, 2025 16:16

gitgitgadget bot added the next label Feb 10, 2025

gitgitgadget bot added the master label Feb 19, 2025

gitgitgadget bot closed this Feb 19, 2025

		git-backfill - Download missing objects in a partial clone


		SYNOPSIS

PATH WALK III: Add 'git backfill' command #1820

PATH WALK III: Add 'git backfill' command #1820

Uh oh!

Conversation

derrickstolee commented Oct 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Updates in v2

Updates in v3

Uh oh!

derrickstolee commented Dec 6, 2024

Uh oh!

gitgitgadget bot commented Dec 6, 2024

Uh oh!

gitgitgadget bot commented Dec 7, 2024

Uh oh!

gitgitgadget bot commented Dec 8, 2024

Uh oh!

gitgitgadget bot commented Dec 8, 2024

Uh oh!

gitgitgadget bot commented Dec 9, 2024

Uh oh!

gitgitgadget bot commented Dec 9, 2024

Uh oh!

gitgitgadget bot commented Dec 9, 2024

Uh oh!

gitgitgadget bot commented Dec 10, 2024

Uh oh!

gitgitgadget bot commented Dec 11, 2024

Uh oh!

gitgitgadget bot commented Dec 12, 2024

Uh oh!

gitgitgadget bot commented Dec 13, 2024

Uh oh!

gitgitgadget bot commented Dec 15, 2024

Uh oh!

gitgitgadget bot commented Dec 16, 2024

Uh oh!

gitgitgadget bot Dec 16, 2024

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Dec 18, 2024

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Dec 16, 2024

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Dec 18, 2024

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Dec 16, 2024

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot commented Dec 16, 2024

Uh oh!

gitgitgadget bot commented Dec 17, 2024

Uh oh!

derrickstolee commented Feb 3, 2025

Uh oh!

gitgitgadget bot commented Feb 3, 2025

Uh oh!

gitgitgadget bot commented Feb 3, 2025

Uh oh!

gitgitgadget bot commented Feb 4, 2025

Uh oh!

gitgitgadget bot commented Feb 4, 2025

Uh oh!

gitgitgadget bot commented Feb 5, 2025

Uh oh!

gitgitgadget bot commented Feb 5, 2025

Uh oh!

gitgitgadget bot commented Feb 5, 2025

Uh oh!

gitgitgadget bot commented Feb 5, 2025

Uh oh!

gitgitgadget bot commented Feb 6, 2025

Uh oh!

gitgitgadget bot commented Feb 7, 2025

Uh oh!

gitgitgadget bot commented Feb 7, 2025

Uh oh!

derrickstolee commented Oct 29, 2024 •

edited

Loading