Skip to content

PATH WALK II: Add --path-walk option to 'git pack-objects' #1819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions Documentation/config/feature.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,10 @@ walking fewer objects.
+
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Mon, Mar 24, 2025 at 03:22:44PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
>
> Users may want to enable the --path-walk option for 'git pack-objects' by
> default, especially underneath commands like 'git push' or 'git repack'.
>
> This should be limited to client repositories, since the --path-walk option
> disables bitmap walks, so would be bad to include in Git servers when
> serving fetches and clones. There is potential that it may be helpful to
> consider when repacking the repository, to take advantage of improved deltas
> across historical versions of the same files.
>
> Much like how "pack.useSparse" was introduced and included in
> "feature.experimental" before being enabled by default, use the repository
> settings infrastructure to make the new "pack.usePathWalk" config enabled by
> "feature.experimental" and "feature.manyFiles".
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  Documentation/config/feature.adoc | 4 ++++
>  Documentation/config/pack.adoc    | 8 ++++++++
>  builtin/pack-objects.c            | 3 +++
>  repo-settings.c                   | 3 +++
>  repo-settings.h                   | 1 +
>  5 files changed, 19 insertions(+)
>
> diff --git a/Documentation/config/feature.adoc b/Documentation/config/feature.adoc
> index f061b64b748..cb49ff2604a 100644
> --- a/Documentation/config/feature.adoc
> +++ b/Documentation/config/feature.adoc
> @@ -20,6 +20,10 @@ walking fewer objects.
>  +
>  * `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
>  reusing objects from multiple packs instead of just one.
> ++
> +* `pack.usePathWalk` may speed up packfile creation and make the packfiles be
> +significantly smaller in the presence of certain filename collisions with Git's
> +default name-hash.
>
>  feature.manyFiles::
>  	Enable config options that optimize for repos with many files in the
> diff --git a/Documentation/config/pack.adoc b/Documentation/config/pack.adoc
> index da527377faf..08d06271177 100644
> --- a/Documentation/config/pack.adoc
> +++ b/Documentation/config/pack.adoc
> @@ -155,6 +155,14 @@ pack.useSparse::
>  	commits contain certain types of direct renames. Default is
>  	`true`.
>
> +pack.usePathWalk::
> +	When true, git will default to using the '--path-walk' option in
> +	'git pack-objects' when the '--revs' option is present. This
> +	algorithm groups objects by path to maximize the ability to
> +	compute delta chains across historical versions of the same
> +	object. This may disable other options, such as using bitmaps to
> +	enumerate objects.
> +

Same note here as in the previous patch, I think it's fine to refer
readers to the git-pack-objects[1] documentation instead of repeating
yourself here. (And it removes the risk of these three descriptions
falling out of sync with one another).

>  pack.preferBitmapTips::
>  	When selecting which commits will receive bitmaps, prefer a
>  	commit at the tip of any reference that is a suffix of any value
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index a6b8a78d42a..0ea85754c52 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -4652,6 +4652,9 @@ int cmd_pack_objects(int argc,
>  		if (use_bitmap_index > 0 ||
>  		    !use_internal_rev_list)
>  			path_walk = 0;
> +		else if (the_repository->gitdir &&
> +			 the_repository->settings.pack_use_path_walk)
> +			path_walk = 1;
>  		else
>  			path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
>  	}

The limited diff context makes it hard for me to tell for sure, but this
takes place after git_config(), right? If so, I think we can avoid using
the repository settings machinery here and just use the config API
directly.

(FWIW, I typically think of repository settings as a way to expose
config information to some part of the codebase that doesn't otherwise
have easy access to, e.g., a static field that was set by a git_config()
callback).

Separate from the above: should this have a sanity test to makes sure
that we read the config setting correctly?

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 5/2/25 7:42 PM, Taylor Blau wrote:
> On Mon, Mar 24, 2025 at 03:22:44PM +0000, Derrick Stolee via GitGitGadget wrote:

>> +pack.usePathWalk::
>> +	When true, git will default to using the '--path-walk' option in
>> +	'git pack-objects' when the '--revs' option is present. This
>> +	algorithm groups objects by path to maximize the ability to
>> +	compute delta chains across historical versions of the same
>> +	object. This may disable other options, such as using bitmaps to
>> +	enumerate objects.
>> +
> > Same note here as in the previous patch, I think it's fine to refer
> readers to the git-pack-objects[1] documentation instead of repeating
> yourself here. (And it removes the risk of these three descriptions
> falling out of sync with one another).

Sure.

>>   pack.preferBitmapTips::
>>   	When selecting which commits will receive bitmaps, prefer a
>>   	commit at the tip of any reference that is a suffix of any value
>> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>> index a6b8a78d42a..0ea85754c52 100644
>> --- a/builtin/pack-objects.c
>> +++ b/builtin/pack-objects.c
>> @@ -4652,6 +4652,9 @@ int cmd_pack_objects(int argc,
>>   		if (use_bitmap_index > 0 ||
>>   		    !use_internal_rev_list)
>>   			path_walk = 0;
>> +		else if (the_repository->gitdir &&
>> +			 the_repository->settings.pack_use_path_walk)
>> +			path_walk = 1;
>>   		else
>>   			path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
>>   	}
> > The limited diff context makes it hard for me to tell for sure, but this
> takes place after git_config(), right? If so, I think we can avoid using
> the repository settings machinery here and just use the config API
> directly.

As I'm paging this thread back into my memory, I internally thought
"obviously the test demonstrates that it works..."

> (FWIW, I typically think of repository settings as a way to expose
> config information to some part of the codebase that doesn't otherwise
> have easy access to, e.g., a static field that was set by a git_config()
> callback).
> > Separate from the above: should this have a sanity test to makes sure
> that we read the config setting correctly?
...but of course I neglected to create such a test. Will fix.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 5/6/25 3:46 PM, Derrick Stolee wrote:
> On 5/2/25 7:42 PM, Taylor Blau wrote:
>>>   pack.preferBitmapTips::
>>>       When selecting which commits will receive bitmaps, prefer a
>>>       commit at the tip of any reference that is a suffix of any value
>>> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>>> index a6b8a78d42a..0ea85754c52 100644
>>> --- a/builtin/pack-objects.c
>>> +++ b/builtin/pack-objects.c
>>> @@ -4652,6 +4652,9 @@ int cmd_pack_objects(int argc,
>>>           if (use_bitmap_index > 0 ||
>>>               !use_internal_rev_list)
>>>               path_walk = 0;
>>> +        else if (the_repository->gitdir &&
>>> +             the_repository->settings.pack_use_path_walk)
>>> +            path_walk = 1;
>>>           else
>>>               path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
>>>       }
>>
>> The limited diff context makes it hard for me to tell for sure, but this
>> takes place after git_config(), right? If so, I think we can avoid using
>> the repository settings machinery here and just use the config API
>> directly.
>>
>> (FWIW, I typically think of repository settings as a way to expose
>> config information to some part of the codebase that doesn't otherwise
>> have easy access to, e.g., a static field that was set by a git_config()
>> callback).

Rereading these comments, I didn't adequately reply to "why use the
repo settings?" and the reason comes due to the use of assigning the
value in feature.experimental=true. This provides a common place for
the logic around both path.usePathWalk and feature.experimental.

A similar behavior is already present for pack.useSparse, which was
in feature.experimental for a while before it became enabled by
default.

Thanks,
-Stolee

* `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
reusing objects from multiple packs instead of just one.
+
* `pack.usePathWalk` may speed up packfile creation and make the packfiles be
significantly smaller in the presence of certain filename collisions with Git's
default name-hash.

feature.manyFiles::
Enable config options that optimize for repos with many files in the
Expand Down
4 changes: 4 additions & 0 deletions Documentation/config/pack.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -155,6 +155,10 @@ pack.useSparse::
commits contain certain types of direct renames. Default is
`true`.

pack.usePathWalk::
Enable the `--path-walk` option by default for `git pack-objects`
processes. See linkgit:git-pack-objects[1] for full details.

pack.preferBitmapTips::
When selecting which commits will receive bitmaps, prefer a
commit at the tip of any reference that is a suffix of any value
Expand Down
25 changes: 18 additions & 7 deletions Documentation/git-pack-objects.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -10,13 +10,13 @@ SYNOPSIS
--------
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Mon, Mar 10, 2025 at 01:50:45AM +0000, Derrick Stolee via GitGitGadget wrote:
> ---
>  Documentation/git-pack-objects.adoc | 14 +++++++-------
>  builtin/pack-objects.c              | 10 ++++++++--
>  t/t0450/adoc-help-mismatches        |  1 -
>  3 files changed, 15 insertions(+), 10 deletions(-)

Thanks for cleaning these up.

Thanks,
Taylor

[verse]
'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
[--no-reuse-delta] [--delta-base-offset] [--non-empty]
[--local] [--incremental] [--window=<n>] [--depth=<n>]
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
[--cruft] [--cruft-expiration=<time>]
[--stdout [--filter=<filter-spec>] | <base-name>]
[--shallow] [--keep-true-parents] [--[no-]sparse]
[--name-hash-version=<n>] < <object-list>
[--no-reuse-delta] [--delta-base-offset] [--non-empty]
[--local] [--incremental] [--window=<n>] [--depth=<n>]
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
[--cruft] [--cruft-expiration=<time>]
[--stdout [--filter=<filter-spec>] | <base-name>]
[--shallow] [--keep-true-parents] [--[no-]sparse]
[--name-hash-version=<n>] [--path-walk] < <object-list>


DESCRIPTION
Expand Down Expand Up @@ -375,6 +375,17 @@ many different directories. At the moment, this version is not allowed
when writing reachability bitmap files with `--write-bitmap-index` and it
will be automatically changed to version `1`.

--path-walk::
Perform compression by first organizing objects by path, then a
second pass that compresses across paths as normal. This has the
potential to improve delta compression especially in the presence
of filenames that cause collisions in Git's default name-hash
algorithm.
+
Incompatible with `--delta-islands`, `--shallow`, or `--filter`. The
`--use-bitmap-index` option will be ignored in the presence of
`--path-walk.`


DELTA ISLANDS
-------------
Expand Down
5 changes: 4 additions & 1 deletion Documentation/git-repack.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ SYNOPSIS
[verse]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Mon, Mar 24, 2025 at 03:22:43PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/git-repack.adoc b/Documentation/git-repack.adoc
> index 5852a5c9736..2199eb3b94f 100644
> --- a/Documentation/git-repack.adoc
> +++ b/Documentation/git-repack.adoc
> @@ -11,7 +11,7 @@ SYNOPSIS
>  [verse]
>  'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
>  	[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
> -	[--write-midx] [--name-hash-version=<n>]
> +	[--write-midx] [--name-hash-version=<n>] [--path-walk]
>
>  DESCRIPTION
>  -----------
> @@ -255,6 +255,18 @@ linkgit:git-multi-pack-index[1]).
>  	Provide this argument to the underlying `git pack-objects` process.
>  	See linkgit:git-pack-objects[1] for full details.
>
> +--path-walk::
> +	This option passes the `--path-walk` option to the underlying
> +	`git pack-options` process (see linkgit:git-pack-objects[1]).
> +	By default, `git pack-objects` walks objects in an order that
> +	presents trees and blobs in an order unrelated to the path they
> +	appear relative to a commit's root tree. The `--path-walk` option
> +	enables a different walking algorithm that organizes trees and
> +	blobs by path. This has the potential to improve delta compression
> +	especially in the presence of filenames that cause collisions in
> +	Git's default name-hash algorithm. Due to changing how the objects
> +	are walked, this option is not compatible with `--delta-islands`
> +	or `--filter`.

I was going to make a similar comment here as I did in an earlier commit
message about describing the direct effect of a command-line flag
instead of describing it in contrast to the default behavior. But I
think here that I would recommend instead abbreviating the description
to just:

    --path-walk::
    Pass the `--path-walk` option to `git pack-objects`, see
    linkgit:git-pack-objects[1].

, which is consistent with options like `-l`, `-f`, -F`, etc.

Does it makes sense to have a separate `--[no-]cruft-path-walk` here to
let callers determine separate behavior when repack is tasked with
generating cruft packs?

Thanks,
Taylor

'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
[--write-midx] [--name-hash-version=<n>]
[--write-midx] [--name-hash-version=<n>] [--path-walk]

DESCRIPTION
-----------
Expand Down Expand Up @@ -255,6 +255,9 @@ linkgit:git-multi-pack-index[1]).
Provide this argument to the underlying `git pack-objects` process.
See linkgit:git-pack-objects[1] for full details.

--path-walk::
Pass the `--path-walk` option to the underlying `git pack-objects`
process. See linkgit:git-pack-objects[1] for full details.

CONFIGURATION
-------------
Expand Down
9 changes: 9 additions & 0 deletions Documentation/technical/api-path-walk.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,14 @@ better off using the revision walk API instead.
the revision walk so that the walk emits commits marked with the
`UNINTERESTING` flag.

`edge_aggressive`::
For performance reasons, usually only the boundary commits are
explored to find UNINTERESTING objects. However, in the case of
shallow clones it can be helpful to mark all trees and blobs
reachable from UNINTERESTING tip commits as UNINTERESTING. This
matches the behavior of `--objects-edge-aggressive` in the
revision API.

`pl`::
This pattern list pointer allows focusing the path-walk search to
a set of patterns, only emitting paths that match the given
Expand All @@ -69,4 +77,5 @@ Examples

See example usages in:
`t/helper/test-path-walk.c`,
`builtin/pack-objects.c`,
`builtin/backfill.c`
Loading
Loading