PATH WALK II: Add --path-walk option to 'git pack-objects' #1819
This will be helpful in a future change, which will reuse this logic.

Signed-off-by: Derrick Stolee <[email protected]>
In order to more easily compute delta bases among objects that appear at
the exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given
by the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.

This new walk is incompatible with some features and is ignored by
others:

* Object filters are not currently integrated with the path-walk API,
  such as sparse-checkout or tree depth. A blobless packfile could be
  integrated easily, but that is deferred for later.

* Server-focused features such as delta islands, shallow packs, and
  using a bitmap index are incompatible with the path-walk API.

* The path walk API is only compatible with the --revs option, not
  taking object lists or pack lists over stdin. These alternative ways
  to specify the objects currently ignore the --path-walk option without
  even a warning.

Future changes will create performance tests that demonstrate the power
of this approach.

Signed-off-by: Derrick Stolee <[email protected]>
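As a rough illustration of why batching by full path avoids collisions, here is a minimal C sketch. It is not the pack-objects code: toy_entry, toy_basename() and count_groups() are hypothetical names, and grouping by basename is only a stand-in (an assumption) for a name-hash that cannot tell apart paths sharing the same trailing characters.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a packed object: only the path matters here. */
struct toy_entry {
	const char *path;
};

/* Order entries by their full path, as a path-based batch would. */
static int cmp_full_path(const void *a, const void *b)
{
	const struct toy_entry *x = a, *y = b;
	return strcmp(x->path, y->path);
}

/* Return the part of the path after the last '/', or the whole path. */
static const char *toy_basename(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/* Order entries by basename only (assumed stand-in for a colliding hash). */
static int cmp_basename(const void *a, const void *b)
{
	const struct toy_entry *x = a, *y = b;
	return strcmp(toy_basename(x->path), toy_basename(y->path));
}

/* Sort entries with cmp and count runs of entries that compare equal. */
static size_t count_groups(struct toy_entry *e, size_t n,
			   int (*cmp)(const void *, const void *))
{
	size_t i, groups = 0;

	assert(e || !n);
	qsort(e, n, sizeof(*e), cmp);
	for (i = 0; i < n; i++)
		if (!i || cmp(&e[i - 1], &e[i]))
			groups++;
	return groups;
}
```

With two versions each of pkg/a/CHANGELOG.md and pkg/b/CHANGELOG.md, grouping by full path keeps the two files apart, while the basename grouping mixes all four versions into one candidate set.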
The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.

Signed-off-by: Derrick Stolee <[email protected]>
The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.

In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.

Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.

Running on my copy of the Git repository results in this data (removing
the repack tests for --name-hash-version):

Test                                                    this tree
------------------------------------------------------------------------
5313.2: thin pack with --name-hash-version=1            0.02(0.01+0.01)
5313.3: thin pack size with --name-hash-version=1       1.6K
5313.4: big pack with --name-hash-version=1             2.55(4.20+0.26)
5313.5: big pack size with --name-hash-version=1        16.4M
5313.6: shallow fetch pack with --name-hash-version=1   1.24(2.03+0.08)
5313.7: shallow pack size with --name-hash-version=1    12.2M
5313.10: thin pack with --name-hash-version=2           0.03(0.01+0.01)
5313.11: thin pack size with --name-hash-version=2      1.6K
5313.12: big pack with --name-hash-version=2            1.91(3.23+0.20)
5313.13: big pack size with --name-hash-version=2       16.4M
5313.14: shallow fetch pack with --name-hash-version=2  1.06(1.57+0.10)
5313.15: shallow pack size with --name-hash-version=2   12.5M
5313.18: thin pack with --path-walk                     0.03(0.01+0.01)
5313.19: thin pack size with --path-walk                1.6K
5313.20: big pack with --path-walk                      2.05(3.24+0.27)
5313.21: big pack size with --path-walk                 16.3M
5313.22: shallow fetch pack with --path-walk            1.08(1.66+0.07)
5313.23: shallow pack size with --path-walk             12.4M

This can be reformatted as follows:

Pack Type              Hash v1    Hash v2    Path Walk
------------------------------------------------------
thin pack    (time)      0.02s      0.03s        0.03s
             (size)       1.6K       1.6K         1.6K
big pack     (time)      2.55s      1.91s        2.05s
             (size)      16.4M      16.4M        16.3M
shallow pack (time)      1.24s      1.06s        1.08s
             (size)      12.2M      12.5M        12.4M

Note that the timing is slower because there is no threading in the
--path-walk case (yet). Also, the shallow pack cases are really not
using the --path-walk logic right now because it is disabled until some
additions are made to the path walk API.

The cases where the --path-walk option really shines are when the
default name-hash is overwhelmed with collisions. An open source example
can be found in the microsoft/fluentui repo [1] at a certain commit [2].

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Running the tests on this repo results in the following comparison
table:

Pack Type              Hash v1    Hash v2    Path Walk
------------------------------------------------------
thin pack    (time)      0.36s      0.12s        0.08s
             (size)       1.2M      22.0K        18.4K
big pack     (time)      2.00s      2.90s        2.21s
             (size)      20.4M      25.9M        19.5M
shallow pack (time)      1.41s      1.80s        1.65s
             (size)      34.4M      33.7M        33.6M

Notice in particular that in the small thin pack, the time performance
has improved from 0.36s for --name-hash-version=1 to 0.08s and this is
likely due to the improved size of the resulting pack: 18.4K instead of
1.2M. The relatively new --name-hash-version=2 is competitive with
--path-walk (0.12s and 22.0K) but not quite as successful.

Finally, running this on a copy of the Linux kernel repository results
in these data points:

Pack Type              Hash v1    Hash v2    Path Walk
------------------------------------------------------
thin pack    (time)      0.03s      0.13s        0.03s
             (size)       4.6K       4.6K         4.6K
big pack     (time)     15.29s     12.32s       13.92s
             (size)     201.1M     159.1M       158.5M
shallow pack (time)     10.88s     22.93s       22.74s
             (size)     269.2M     273.8M       267.7M

Signed-off-by: Derrick Stolee <[email protected]>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the --path-walk implementation, especially in
conjunction with tests such as:

- t0411-clone-from-partial.sh : One test fetches from a repo that does
  not have the boundary objects. This causes the path-based walk to
  fail. Disable the variable for this test.

- t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
  without a boundary object.

- t5310-pack-bitmaps.sh : One test compares the case when packing with
  bitmaps to the case when packing without them. Since we disable the
  test variable when writing bitmaps, this causes a difference in the
  object list (the --path-walk option adds an extra object). Specify
  --no-path-walk in both processes for the comparison. Another test
  checks for a specific delta base, but when computing dynamically
  without using bitmaps, the base object is too small to be considered
  in the delta calculations so no base is used.

- t5316-pack-delta-depth.sh : This script cares about certain delta
  choices and their chain lengths. The --path-walk option changes how
  these chains are selected, and thus changes the results of this test.

- t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
  the --sparse option and how it combines with --path-walk.

- t5332-multi-pack-reuse.sh : This test verifies that the preferred pack
  is used for delta reuse when possible. The --path-walk option is not
  currently aware of the preferred pack at all, so finds a different
  delta base.

- t7406-submodule-update.sh : When using the variable, the --depth
  option collides with the --path-walk feature, resulting in a warning
  message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

- t5530-upload-pack-error.sh : One test cares specifically about an
  "unable to read" error message. Since the current implementation
  performs delta calculations within the path-walk API callback, a
  different "unable to get size" error message appears. When this is
  changed in a future refactoring, this test change can be reverted.

Similar to GIT_TEST_NAME_HASH_VERSION, we do not add this option to the
linux-TEST-vars CI build as that's already an overloaded build.

Signed-off-by: Derrick Stolee <[email protected]>
It can be notoriously difficult to detect if delta bases are being
computed properly during 'git push'. Construct an example where it will
make a kilobyte worth of difference when a delta base is not found. We
can then use the progress indicators to distinguish between bytes and
KiB depending on whether the delta base is found and used.

Signed-off-by: Derrick Stolee <[email protected]>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities
for comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit
[2], the --path-walk tests in p5313 look like this:

Test                                          this tree
-------------------------------------------------------------------------
5313.18: thin pack with --path-walk           0.08(0.06+0.02)
5313.19: thin pack size with --path-walk      18.4K
5313.20: big pack with --path-walk            2.10(7.80+0.26)
5313.21: big pack size with --path-walk       19.8M
5313.22: shallow fetch pack with --path-walk  1.62(3.38+0.17)
5313.23: shallow pack size with --path-walk   33.6M
5313.24: repack with --path-walk              81.29(96.08+0.71)
5313.25: repack size with --path-walk         142.5M

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Along with the earlier tests in p5313, I'll instead reformat the
comparison as follows:

Repack Method    Pack Size       Time
-------------------------------------
Hash v1             439.4M     87.24s
Hash v2             161.7M     21.51s
Path Walk           142.5M     81.29s

There are a few things to notice here:

 1. The benefits of --name-hash-version=2 over --name-hash-version=1 are
    significant, but --path-walk still compresses better than that
    option.

 2. The --path-walk command is still using --name-hash-version=1 for the
    second pass of delta computation, using the increased name hash
    collisions as a potential method for opportunistic compression on
    top of the path-focused compression.

 3. The --path-walk algorithm is currently sequential and does not use
    multiple threads for delta compression. Threading will be
    implemented in a future change so the computation time will improve
    to better compete in this metric.

There are small benefits in size for my copy of the Git repository:

Repack Method    Pack Size       Time
-------------------------------------
Hash v1             248.8M     30.44s
Hash v2             249.0M     30.15s
Path Walk           213.2M    142.50s

As well as in the nodejs/node repository [3]:

Repack Method    Pack Size       Time
-------------------------------------
Hash v1             739.9M     71.18s
Hash v2             764.6M     67.82s
Path Walk           698.1M    208.10s

[3] https://github.com/nodejs/node

This benefit also repeats in my copy of the Linux kernel repository:

Repack Method    Pack Size       Time
-------------------------------------
Hash v1               2.5G    554.41s
Hash v2               2.5G    549.62s
Path Walk             2.2G   1562.36s

It is important to see that even when the repository shape does not have
many name-hash collisions, there is a slight space boost to be found
using this method.

As this repacking strategy was released in Git for Windows 2.47.0, some
users have reported cases where the --path-walk compression is slightly
worse than the --name-hash-version=2 option. In those cases, it may be
beneficial to combine the two options. However, there has not been a
released version of Git that has both options and I don't have access to
these repos for testing.

Signed-off-by: Derrick Stolee <[email protected]>
Users may want to enable the --path-walk option for 'git pack-objects'
by default, especially underneath commands like 'git push' or 'git
repack'.

This should be limited to client repositories, since the --path-walk
option disables bitmap walks, so would be bad to include in Git servers
when serving fetches and clones. There is potential that it may be
helpful to consider when repacking the repository, to take advantage of
improved deltas across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the
repository settings infrastructure to make the new "pack.usePathWalk"
config enabled by "feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <[email protected]>
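The resulting precedence can be sketched in C as follows. This is an assumption about how such settings usually resolve in Git (an explicit "pack.usePathWalk" value wins, otherwise the feature.* defaults apply), not a copy of the repository settings code; toy_cfg and resolve_use_path_walk() are hypothetical names.

```c
#include <assert.h>

/* Tri-state for a boolean config value that may be absent. */
enum toy_cfg { TOY_UNSET = -1, TOY_FALSE = 0, TOY_TRUE = 1 };

/*
 * Decide whether the path walk is used. An explicit pack.usePathWalk
 * value (assumed to take priority) is returned as-is; otherwise either
 * feature.experimental or feature.manyFiles turns it on.
 */
static int resolve_use_path_walk(enum toy_cfg use_path_walk,
				 int feature_experimental,
				 int feature_many_files)
{
	assert(use_path_walk >= TOY_UNSET && use_path_walk <= TOY_TRUE);
	if (use_path_walk != TOY_UNSET)
		return use_path_walk == TOY_TRUE;
	return feature_experimental || feature_many_files;
}
```

Under this sketch, a repository with "feature.manyFiles" enabled but "pack.usePathWalk" explicitly set to false would not use the path walk.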
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely
to be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this
config in Scalar repositories.

Signed-off-by: Derrick Stolee <[email protected]>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly compute deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <[email protected]>
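The "store now, compress later" idea can be sketched in C like this. It is a minimal illustration under assumptions, not the actual change: toy_region, toy_region_list, region_list_append() and region_list_for_each() are hypothetical names, and the callback merely stands in for the per-path delta search.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical region: a contiguous run of objects sharing one path. */
struct toy_region {
	size_t start; /* index of the first object in the run */
	size_t nr;    /* number of objects in the run */
};

struct toy_region_list {
	struct toy_region *items;
	size_t nr, alloc;
};

/* Called during the walk: remember the batch instead of compressing it. */
static int region_list_append(struct toy_region_list *list,
			      size_t start, size_t nr)
{
	assert(list);
	if (list->nr == list->alloc) {
		size_t alloc = list->alloc ? 2 * list->alloc : 16;
		struct toy_region *items =
			realloc(list->items, alloc * sizeof(*items));
		if (!items)
			return -1;
		list->items = items;
		list->alloc = alloc;
	}
	list->items[list->nr].start = start;
	list->items[list->nr].nr = nr;
	list->nr++;
	return 0;
}

/*
 * Called after the walk: visit each stored region in order. The callback
 * stands in for the per-path delta search; returns the objects visited.
 */
static size_t region_list_for_each(const struct toy_region_list *list,
				   void (*fn)(size_t start, size_t nr))
{
	size_t i, total = 0;

	assert(list);
	for (i = 0; i < list->nr; i++) {
		if (fn)
			fn(list->items[i].start, list->items[i].nr);
		total += list->items[i].nr;
	}
	return total;
}
```

Separating the two phases is what lets the walk finish quickly and gives the compression step its own progress indicator; the caller is expected to free list->items when done.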
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                     HEAD~1    HEAD
-----------------------------------------------
5313.20: big pack          2.38    1.99  -16.4%
5313.21: big pack size    16.1M   16.0M   -0.2%
5313.24: repack          107.32   45.41  -57.7%
5313.25: repack size     213.3M  213.2M   -0.0%

(Test output is formatted to better fit in message.)

This ~60% reduction in 'git repack --path-walk' time is typical across
all repos I used for testing. What is interesting is to compare when the
overall time improves enough to outperform the --name-hash-version=1
case. These time improvements correlate with repositories with data
shapes that significantly improve their data size as well. The
--path-walk feature frequently takes longer than --name-hash-version=2,
trading some extra computation for some additional compression. The
natural place where this additional computation comes from is the two
compression passes that --path-walk takes, though the first pass is
naturally faster due to the path boundaries avoiding a number of delta
compression attempts.

For example, the microsoft/fluentui repo has significant size reduction
from --name-hash-version=1 to --name-hash-version=2 followed by further
improvements with --path-walk. The threaded computation makes
--path-walk more competitive in time compared to --name-hash-version=2,
though still ~31% more expensive in that metric.

Repack Method       Pack Size       Time
----------------------------------------
Hash v1                439.4M     87.24s
Hash v2                161.7M     21.51s
Path Walk (Before)     142.5M     81.29s
Path Walk (After)      142.5M     28.16s

Similar results hold for the Git repository:

Repack Method       Pack Size       Time
----------------------------------------
Hash v1                248.8M     30.44s
Hash v2                249.0M     30.15s
Path Walk (Before)     213.2M    142.50s
Path Walk (After)      213.3M     45.41s

...as well as the nodejs/node repository:

Repack Method       Pack Size       Time
----------------------------------------
Hash v1                739.9M     71.18s
Hash v2                764.6M     67.82s
Path Walk (Before)     698.1M    208.10s
Path Walk (After)      698.0M     75.10s

Finally, the Linux kernel repository is a good test for this repacking
time change, even though the space savings is more subtle:

Repack Method       Pack Size       Time
----------------------------------------
Hash v1                  2.5G    554.41s
Hash v2                  2.5G    549.62s
Path Walk (Before)       2.2G   1562.36s
Path Walk (After)        2.2G    559.00s

Signed-off-by: Derrick Stolee <[email protected]>
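The split-then-borrow scheme described in this message can be sketched in C as below. This is a single-threaded simulation under assumptions: toy_thread, split_regions() and steal_half() are hypothetical, the "take half from the busiest thread" policy is assumed rather than taken from the patch, and real code would need locking around the shared state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread state, loosely modeled on the fields named above. */
struct toy_thread {
	size_t first;     /* index of the next region this thread will process */
	size_t remaining; /* regions still owned by this thread */
};

/* Give each of 'nr_threads' threads a contiguous, near-equal share. */
static void split_regions(struct toy_thread *t, size_t nr_threads,
			  size_t nr_regions)
{
	size_t i, start = 0;

	assert(t && nr_threads > 0);
	for (i = 0; i < nr_threads; i++) {
		size_t share = nr_regions / (nr_threads - i);
		t[i].first = start;
		t[i].remaining = share;
		start += share;
		nr_regions -= share;
	}
}

/*
 * An idle thread (remaining == 0) takes the back half of the busiest
 * thread's remaining regions. Returns the number of regions moved, or 0
 * if no other thread has at least two regions left.
 */
static size_t steal_half(struct toy_thread *t, size_t nr_threads, size_t idle)
{
	size_t i, victim = idle, take;

	assert(t && idle < nr_threads);
	for (i = 0; i < nr_threads; i++)
		if (i != idle && t[i].remaining > t[victim].remaining)
			victim = i;
	if (victim == idle || t[victim].remaining < 2)
		return 0;

	take = t[victim].remaining / 2;
	t[victim].remaining -= take;
	t[idle].first = t[victim].first + t[victim].remaining;
	t[idle].remaining = take;
	return take;
}
```

Because each region is already a complete per-path batch, the split can fall on any region boundary, which is the simplification the message refers to.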
In preparation for allowing both the --shallow and --path-walk options
in the 'git pack-objects' builtin, create a new 'edge_aggressive' option
in the path-walk API. This option will help walk the boundary more
thoroughly and help avoid sending extra objects during fetches and
pushes.

The only use of the 'edge_hint_aggressive' option in the revision API is
within mark_edges_uninteresting(), which is usually called after
prepare_revision_walk() and before visiting commits with get_revision().
In prepare_revision_walk(), the UNINTERESTING commits are walked until a
boundary is found.

Signed-off-by: Derrick Stolee <[email protected]>
There does not appear to be anything particularly incompatible about the
--shallow and --path-walk options of 'git pack-objects'. If shallow
commits are to be handled differently, then it is by the revision walk
that defines the commit set and which are interesting or uninteresting.

However, before the previous change, a trivial removal of the warning
would cause a failure in t5500-fetch-pack.sh when
GIT_TEST_PACK_PATH_WALK is enabled. The shallow fetch would provide more
objects than we desired, due to some incorrect behavior of the path-walk
API, especially around walking uninteresting objects.

The recently-added tests in t5538-push-shallow.sh help to confirm this
behavior is working with the --path-walk option if
GIT_TEST_PACK_PATH_WALK is enabled. These tests passed previously due to
the --path-walk feature being disabled in the presence of a shallow
clone.

Signed-off-by: Derrick Stolee <[email protected]>
/submit
Submitted as [email protected] To fetch this version into
To fetch this version to local tag
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Derrick Stolee via GitGitGadget" <[email protected]> writes:
> ... deltas across path boundaries. This second pass is much faster than a fresh
> pass since the existing deltas are used as a limit for the size of
> potentially new deltas, short-circuiting the checks when the delta size
> exceeds the current-best.
Very nice.
> The microsoft/fluentui is a public Javascript repo that suffers from many of
> the name hash collisions as internal repositories I've worked with. Here is
> a comparison of the compressed size and end-to-end time of the repack:
>
> Repack Method    Pack Size       Time
> ---------------------------------------
> Hash v1             439.4M     87.24s
> Hash v2             161.7M     21.51s
> Path Walk           142.5M     28.16s
>
>
> Less dramatic, but perhaps more standardly structured is the nodejs/node
> repository, with these stats:
>
> Repack Method    Pack Size       Time
> ------------------------------------------
> Hash v1             739.9M     71.18s
> Hash v2             764.6M     67.82s
> Path Walk           698.0M     75.10s
>
>
> Even the Linux kernel repository gains some benefits, even though the number
> of hash collisions is relatively low due to a preference for short
> filenames:
>
> Repack Method    Pack Size       Time
> ------------------------------------------
> Hash v1               2.5G    554.41s
> Hash v2               2.5G    549.62s
> Path Walk             2.2G    559.00s
This third one, v2 not performing much better than v1, is quite
surprising.
> The drawbacks of the --path-walk feature is that it will be harder to
> integrate it with bitmap features, specifically delta islands. This is not
> insurmountable, but would require more work, such as a revision walk to
> paint objects with reachability information before using that during delta
> computations.
>
> However, there should still be significant benefits to Git clients trying to
> save space and improve local performance.
Sure. More experiments and more approaches will eventually give us
overall improvement. I am hoping that we will be able to condense
the result of these different approaches and their combinations into
easy-to-choose-from canned choices (as opposed to a myriad of little
knobs the users need to futz with without really understanding what
they are tweaking).
> This feature was shipped with similar features in microsoft/git as of
> v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo
> that had significant repository growth due to constructing a batch of
> beachball [5] CHANGELOG.[md|json] files and pushing them to a release
> branch. These pushes were frequently 70-200 MB due to poor delta
> compression. Using the 'pack.usePathWalk=true' config, these pushes dropped
> in size by 100x while improving performance. Since these CI machines were
> working with a shallow clone, the 'edge_aggressive' changes were required to
> enable the path-walk option.
Nice, thanks.
This patch series was integrated into seen via git@e51880c.
This branch is now known as
This patch series was integrated into seen via git@28416f0.
This patch series was integrated into seen via git@4fc875f.
On the Git mailing list, Taylor Blau wrote (reply to this):
On Mon, Mar 10, 2025 at 10:28:22AM -0700, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
>
> > ... deltas across path boundaries. This second pass is much faster than a fresh
> > pass since the existing deltas are used as a limit for the size of
> > potentially new deltas, short-circuiting the checks when the delta size
> > exceeds the current-best.
>
> Very nice.
>
> > The microsoft/fluentui is a public Javascript repo that suffers from many of
> > the name hash collisions as internal repositories I've worked with. Here is
> > a comparison of the compressed size and end-to-end time of the repack:
> >
> > Repack Method    Pack Size       Time
> > ---------------------------------------
> > Hash v1             439.4M     87.24s
> > Hash v2             161.7M     21.51s
> > Path Walk           142.5M     28.16s
OK, so microsoft/fluentui benefits from the path-walk approach in the
size of the resulting pack, but at the cost of additional time to
generate it.
> > Less dramatic, but perhaps more standardly structured is the nodejs/node
> > repository, with these stats:
> >
> > Repack Method    Pack Size       Time
> > ------------------------------------------
> > Hash v1             739.9M     71.18s
> > Hash v2             764.6M     67.82s
> > Path Walk           698.0M     75.10s
Same here.
> > Even the Linux kernel repository gains some benefits, even though the number
> > of hash collisions is relatively low due to a preference for short
> > filenames:
> >
> > Repack Method    Pack Size       Time
> > ------------------------------------------
> > Hash v1               2.5G    554.41s
> > Hash v2               2.5G    549.62s
> > Path Walk             2.2G    559.00s
OK, so here the savings are a little more substantial, and the
performance hit isn't too bad.
> This third one, v2 not performing much better than v1, is quite
> surprising.
I'm not sure... I think Stolee's "the number of hash collisions is
relatively low due to preference for short filenames" is why v2 behaves
so similarly to v1 here.
> > The drawbacks of the --path-walk feature is that it will be harder to
> > integrate it with bitmap features, specifically delta islands. This is not
> > insurmountable, but would require more work, such as a revision walk to
> > paint objects with reachability information before using that during delta
> > computations.
> >
> > However, there should still be significant benefits to Git clients trying to
> > save space and improve local performance.
>
> Sure. More experiments and more approaches will eventually give us
> overall improvement. I am hoping that we will be able to condense
> the result of these different approaches and their combinations into
> easy-to-choose-from canned choices (as opposed to a myriad of little
> knobs the users need to futz with without really understanding what
> they are tweaking).
In the above three examples we see some trade-offs between pack size and
the time it took to generate it. I think it's worth discussing whether
or not the potential benefit of such a trade-off is worth the
significant complexity and code that this feature will introduce. (To be
clear, I don't have a strong opinion here one way or the other, but I do
think that it's at least worth discussing).
I wonder how much of the benefits of path-walk over the hash v2 approach
could be had by simply widening the pack.window during delta selection?
I tried to run a similar experiment as you did above on the
microsoft/fluentui repository and got the following:
Repack Method          Pack Size       Time
-------------------------------------------
Hash v1                 447.2MiB    932.41s
Hash v2                 154.1MiB    404.35s
Hash v2 (window=20)     146.7MiB    472.66s
Hash v2 (window=50)     138.3MiB    622.13s
Path Walk               140.8MiB    168.86s
In your experiment above on the same repository, the path walk feature
represents an 11.873% reduction in pack size, but at the cost of a 30.9%
regression in runtime.
When I set pack.window to "50" (over the default value of "10"), I get a
~10.3% reduction in pack size at the cost of a 54% increase in runtime
(relative to just --name-hash-version=2 with the default pack.window
settings).
But when I set the pack.window to "20", the relative values (again
comparing against --name-hash-version=2 with the default pack.window)
are 4.8% reduction in pack size and a 16.9% increase in runtime.
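The percentages quoted above can be rechecked with a few lines of C. This is only a worked-arithmetic sketch: pct_change(), report() and report_all() are hypothetical helpers, and the inputs are the pack sizes and times from the two tables in this thread, each compared against its own --name-hash-version=2 baseline.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage change from 'base' to 'value'; negative means a reduction. */
static double pct_change(double base, double value)
{
	assert(base > 0);
	return (value - base) / base * 100.0;
}

/* Print size and time deltas of one repack run relative to a baseline. */
static void report(const char *label, double base_size, double size,
		   double base_secs, double secs)
{
	printf("%-20s size %+6.1f%%  time %+6.1f%%\n", label,
	       pct_change(base_size, size), pct_change(base_secs, secs));
}

/* Reproduce the three comparisons from the numbers quoted in this thread. */
static void report_all(void)
{
	/* Stolee's run: hash v2 (161.7M, 21.51s) vs. path walk. */
	report("path-walk", 161.7, 142.5, 21.51, 28.16);
	/* Taylor's run: hash v2 (154.1MiB, 404.35s) vs. wider windows. */
	report("hash v2, window=50", 154.1, 138.3, 404.35, 622.13);
	report("hash v2, window=20", 154.1, 146.7, 404.35, 472.66);
}
```

Note that the two baselines come from different machines, so only the relative changes within each run are comparable.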
But these numbers are pretty confusing to me, TBH. The reduction in pack
sizes makes sense, and here I see numbers that are on-par with what you
noted above for the same repository. But the runtimes are wildly
different (e.g., hash v1 takes you just 87s while mine takes 932s).
There must be something in our environment that is different. I'm
starting with a bare clone of microsoft/fluentui from GitHub, and made
several 'cp -al' copies of it for the different experiments. In the
penultimate one, I ran:
$ time git.compile -c pack.window=50 repack --name-hash-version=2 \
-adF --no-write-bitmap-index
, and similarly for the other experiments with appropriate values for
pack.window, --name-hash-version, and --path-walk, when applicable. All
of this was done on a -O2 build of Git with your patches on top.
So I'm not sure what to make of these results. Clearly on my machine
something is different that makes path-walk much faster than hash v2.
But on your machine it's slower, so I don't know how much I trust the
timing results from either machine.
In any event, it seems like at least in this example we can get
performance that is on-par with path-walk by simply widening the
pack.window when using hash v2. On my machine that seems to cost more
time than it does for you to the point where it's slower than my
path-walk. But I think I need to understand what the differences are
here before we can draw any conclusions on the size or timing.
If, in the overwhelming majority of cases where the --path-walk feature
presents a significant benefit over hash v2, we could get approximately
the same reduction in pack size with approximately the same end-to-end
runtime of 'git repack' by using hash v2 at various pack.window sizes,
then I feel we might want to reconsider whether or not the complexity of
this feature is worthwhile.
But if the --path-walk feature either gives us a significant size
benefit that we can't get with hash v2 and a wider pack.window without
paying a significant runtime cost (or vice-versa), then this feature
would indeed be worthwhile.
I also have no idea how representative the above is of your intended
use-case, which seems much more oriented around pushes than from-scratch
repacks, which would also affect our conclusions here.
Thanks,
Taylor
On the Git mailing list, Taylor Blau wrote (reply to this):
On Mon, Mar 10, 2025 at 01:50:43AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
>
> This will be helpful in a future change, which will reuse this logic.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> builtin/pack-objects.c | 53 +++++++++++++++++++++++-------------------
> 1 file changed, 29 insertions(+), 24 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 58a9b161262..1d0992a8dac 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3196,6 +3196,33 @@ static int add_ref_tag(const char *tag UNUSED, const char *referent UNUSED, cons
>  	return 0;
>  }
>
> +static int should_attempt_deltas(struct object_entry *entry)
> +{
> +	if (DELTA(entry))
> +		return 0;
> +
> +	if (!entry->type_valid ||
> +	    oe_size_less_than(&to_pack, entry, 50))
> +		return 0;
> +
> +	if (entry->no_try_delta)
> +		return 0;
> +
> +	if (!entry->preferred_base) {
> +		if (oe_type(entry) < 0)
> +			die(_("unable to get type of object %s"),
> +			    oid_to_hex(&entry->idx.oid));
> +	} else if (oe_type(entry) < 0) {
> +		/*
> +		 * This object is not found, but we
> +		 * don't have to include it anyway.
> +		 */
> +		return 0;
> +	}
> +
> +	return 1;
> +}
> +
>  static void prepare_pack(int window, int depth)
>  {
>  	struct object_entry **delta_list;
> @@ -3226,33 +3253,11 @@ static void prepare_pack(int window, int depth)
>  	for (i = 0; i < to_pack.nr_objects; i++) {
>  		struct object_entry *entry = to_pack.objects + i;
>
> -		if (DELTA(entry))
> -			/* This happens if we decided to reuse existing
> -			 * delta from a pack. "reuse_delta &&" is implied.
> -			 */
It looks like this comment went away when this part of prepare_pack()
was extracted into should_attempt_deltas().
> -			continue;
> -
> -		if (!entry->type_valid ||
> -		    oe_size_less_than(&to_pack, entry, 50))
> +		if (!should_attempt_deltas(entry))
>  			continue;
>
> -		if (entry->no_try_delta)
> -			continue;
> -
> -		if (!entry->preferred_base) {
> +		if (!entry->preferred_base)
>  			nr_deltas++;
Makes sense; should_attempt_deltas() doesn't itself change nr_deltas, so
we want to do it ourselves here. Looking good!
Thanks,
Taylor
Documentation/git-pack-objects.adoc
On the Git mailing list, Taylor Blau wrote (reply to this):
On Mon, Mar 10, 2025 at 01:50:44AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
>
> In order to more easily compute delta bases among objects that appear at
> the exact same path, add a --path-walk option to 'git pack-objects'.
>
> This option will use the path-walk API instead of the object walk given
> by the revision machinery. Since objects will be provided in batches
> representing a common path, those objects can be tested for delta bases
> immediately instead of waiting for a sort of the full object list by
> name-hash. This has multiple benefits, including avoiding collisions by
> name-hash.
>
> The objects marked as UNINTERESTING are included in these batches, so we
> are guaranteeing some locality to find good delta bases.
>
> After the individual passes are done on a per-path basis, the default
> name-hash is used to find other opportunistic delta bases that did not
> match exactly by the full path name.
>
> The current implementation performs delta calculations while walking
> objects, which is not ideal for a few reasons. First, this will cause
> the "Enumerating objects" phase to be much longer than usual. Second, it
> does not take advantage of threading during the path-scoped delta
> calculations. Even with this lack of threading, the path-walk option is
> sometimes faster than the usual approach. Future changes will refactor
> this code to allow for threading, but that complexity is deferred until
> later to keep this patch as simple as possible.
>
> This new walk is incompatible with some features and is ignored by
> others:
>
> * Object filters are not currently integrated with the path-walk API,
> such as sparse-checkout or tree depth. A blobless packfile could be
> integrated easily, but that is deferred for later.
>
> * Server-focused features such as delta islands, shallow packs, and
> using a bitmap index are incompatible with the path-walk API.
>
> * The path walk API is only compatible with the --revs option, not
> taking object lists or pack lists over stdin. These alternative ways
> to specify the objects currently ignore the --path-walk option
> without even a warning.
>
> Future changes will create performance tests that demonstrate the power
> of this approach.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> Documentation/git-pack-objects.adoc | 13 +-
> Documentation/technical/api-path-walk.adoc | 1 +
> builtin/pack-objects.c | 147 +++++++++++++++++++--
> t/t5300-pack-object.sh | 15 +++
> 4 files changed, 166 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
> index 7f69ae4855f..7dbbe6d54d2 100644
> --- a/Documentation/git-pack-objects.adoc
> +++ b/Documentation/git-pack-objects.adoc
> @@ -16,7 +16,7 @@ SYNOPSIS
> [--cruft] [--cruft-expiration=<time>]
> [--stdout [--filter=<filter-spec>] | <base-name>]
> [--shallow] [--keep-true-parents] [--[no-]sparse]
> - [--name-hash-version=<n>] < <object-list>
> + [--name-hash-version=<n>] [--path-walk] < <object-list>
>
>
> DESCRIPTION
> @@ -375,6 +375,17 @@ many different directories. At the moment, this version is not allowed
> when writing reachability bitmap files with `--write-bitmap-index` and it
> will be automatically changed to version `1`.
>
> +--path-walk::
> + By default, `git pack-objects` walks objects in an order that
> + presents trees and blobs in an order unrelated to the path they
> + appear relative to a commit's root tree. The `--path-walk` option
> + enables a different walking algorithm that organizes trees and
> + blobs by path. This has the potential to improve delta compression
> + especially in the presence of filenames that cause collisions in
> + Git's default name-hash algorithm. Due to changing how the objects
> + are walked, this option is not compatible with `--delta-islands`,
> + `--shallow`, or `--filter`.
I think from reading further below that this feature is somewhat
incompatible with --use-bitmap-index, at least in the sense that we
implicitly disable the latter whenever we see the former. Would that be
worth mentioning here?
> +static int add_objects_by_path(const char *path,
> + struct oid_array *oids,
> + enum object_type type,
> + void *data)
> +{
> + struct object_entry **delta_list;
> + size_t oe_start = to_pack.nr_objects;
> + size_t oe_end;
> + unsigned int sub_list_size;
> + unsigned int *processed = data;
> +
> + /*
> + * First, add all objects to the packing data, including the ones
> + * marked UNINTERESTING (translated to 'exclude') as they can be
> + * used as delta bases.
> + */
> + for (size_t i = 0; i < oids->nr; i++) {
> + int exclude;
> + struct object_info oi = OBJECT_INFO_INIT;
> + struct object_id *oid = &oids->oid[i];
> +
> + /* Skip objects that do not exist locally. */
> + if (exclude_promisor_objects &&
> + oid_object_info_extended(the_repository, oid, &oi,
> + OBJECT_INFO_FOR_PREFETCH) < 0)
> + continue;
> +
> + exclude = !is_oid_interesting(the_repository, oid);
> +
> + if (exclude && !thin)
> + continue;
> +
> + add_object_entry(oid, type, path, exclude);
> + }
> +
> + oe_end = to_pack.nr_objects;
> +
> + /* We can skip delta calculations if it is a no-op. */
> + if (oe_end == oe_start || !window)
> + return 0;
> +
> + sub_list_size = 0;
> + ALLOC_ARRAY(delta_list, oe_end - oe_start);
Makes sense, and seems all reasonable.
> + for (size_t i = 0; i < oe_end - oe_start; i++) {
> + struct object_entry *entry = to_pack.objects + oe_start + i;
> +
> + if (!should_attempt_deltas(entry))
> + continue;
> +
> + delta_list[sub_list_size++] = entry;
> + }
> +
> + /*
> + * Find delta bases among this list of objects that all match the same
> + * path. This causes the delta compression to be interleaved in the
> + * object walk, which can lead to confusing progress indicators. This is
> + * also incompatible with threaded delta calculations. In the future,
> + * consider creating a list of regions in the full to_pack.objects array
> + * that could be picked up by the threaded delta computation.
> + */
> + if (sub_list_size && window) {
> + QSORT(delta_list, sub_list_size, type_size_sort);
> + find_deltas(delta_list, &sub_list_size, window, depth, processed);
> + }
Interesting. I like the "regions in to_pack.objects" idea as a way of
threading the delta selection process in the future.
Thanks,
Taylor
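The "regions in to_pack.objects" idea from the code comment above can be sketched in isolation. What follows is a minimal, self-contained C model under stated assumptions, not the code from this series: `region_sketch`, `region_list_sketch`, `region_list_append()` and `region_list_for_each()` are hypothetical names, and the object array is represented only by indexes. During the walk, each per-path batch records the half-open index range it appended; a later pass, which could be threaded, visits those ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one contiguous, same-path run of objects. */
struct region_sketch {
	size_t start; /* index of the first object in the run */
	size_t nr;    /* number of objects in the run */
};

struct region_list_sketch {
	struct region_sketch *items;
	size_t nr, alloc;
};

/*
 * Record the half-open range [start, end) as a region. Empty batches
 * are skipped. Returns 0 on success and -1 if allocation fails.
 */
int region_list_append(struct region_list_sketch *list,
		       size_t start, size_t end)
{
	assert(end >= start);
	if (end == start)
		return 0;
	if (list->nr == list->alloc) {
		size_t new_alloc = list->alloc ? 2 * list->alloc : 16;
		struct region_sketch *grown =
			realloc(list->items, new_alloc * sizeof(*grown));
		if (!grown)
			return -1;
		list->items = grown;
		list->alloc = new_alloc;
	}
	list->items[list->nr].start = start;
	list->items[list->nr].nr = end - start;
	list->nr++;
	return 0;
}

/*
 * Deferred pass: visit every recorded region once the walk is done.
 * A threaded variant could hand each region to a worker instead of
 * calling fn inline. Returns the total number of objects covered.
 */
size_t region_list_for_each(const struct region_list_sketch *list,
			    void (*fn)(size_t start, size_t nr, void *data),
			    void *data)
{
	size_t total = 0;

	for (size_t i = 0; i < list->nr; i++) {
		if (fn)
			fn(list->items[i].start, list->items[i].nr, data);
		total += list->items[i].nr;
	}
	return total;
}
```

In this model the walk callback would call `region_list_append(&list, oe_start, oe_end)` in place of running the delta search inline, which keeps the enumeration phase short and leaves the expensive work to the later pass.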
On the Git mailing list, Taylor Blau wrote (reply to this):
On Mon, Mar 10, 2025 at 01:50:45AM +0000, Derrick Stolee via GitGitGadget wrote:
> ---
> Documentation/git-pack-objects.adoc | 14 +++++++-------
> builtin/pack-objects.c | 10 ++++++++--
> t/t0450/adoc-help-mismatches | 1 -
> 3 files changed, 15 insertions(+), 10 deletions(-)
Thanks for cleaning these up.
Thanks,
Taylor
On the Git mailing list, Taylor Blau wrote (reply to this):
On Mon, Mar 10, 2025 at 01:50:52AM +0000, Derrick Stolee via GitGitGadget wrote:
> ---
> builtin/pack-objects.c | 82 +++++++++++++++++++++++++-----------
> pack-objects.h | 12 ++++++
> t/t5300-pack-object.sh | 8 +++-
> t/t5530-upload-pack-error.sh | 6 ---
> 4 files changed, 75 insertions(+), 33 deletions(-)
Ah, nice :-). This is where you implement the idea that you were
mentioning back in
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index c756ce50dd7..c5a3129c88e 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3233,6 +3233,51 @@ static int should_attempt_deltas(struct object_entry *entry)
> return 1;
> }
>
> +static void find_deltas_for_region(struct object_entry *list UNUSED,
Interesting, it looks like "list" here is UNUSED in this patch. On first
read I figured that you were going to make use of it in later patches,
but it looks like it remains UNUSED at the tip of my local copy of this
series.
Is this a stray change from when you were writing this? Something else?
> + struct packing_region *region,
> + unsigned int *processed)
> +{
> + struct object_entry **delta_list;
> + uint32_t delta_list_nr = 0;
I know that we have a lot of "_nr" and "_alloc" variables in the
pack-objects code that use uint32_t as their type, but we should prefer
size_t for these in the future, even if it breaks the existing pattern.
As much as I can grok of the implementation through the rest of the
patch makes sense to me and looks good.
Thanks,
Taylor
This patch series was integrated into seen via git@4d04e2a.
There was a status update in the "New Topics" section about the branch "git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API. source: <[email protected]>
This patch series was integrated into seen via git@90b5c8b.
This patch series was integrated into seen via git@232845f.
There was a status update in the "Cooking" section about the branch "git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API. source: <[email protected]>
This patch series was integrated into seen via git@8f49f1a.
This patch series was integrated into seen via git@5bf988c.
This patch series was integrated into seen via git@4144693.
This patch series was integrated into seen via git@1326eb2.
This patch series was integrated into seen via git@f8374ed.
This patch series was integrated into seen via git@70c837f.
There was a status update in the "Cooking" section about the branch "git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API. source: <[email protected]>
Here is a full submission of the --path-walk feature for 'git pack-objects' and 'git repack'. It's been discussed in an RFC [1], as a future application for the path walk API [2], and is updated now that --name-hash-version=2 exists (as a replacement for the --full-name-hash option from the RFC) [3].
[1] https://lore.kernel.org/git/[email protected]/
[2] https://lore.kernel.org/git/[email protected]
[3] https://lore.kernel.org/git/[email protected]
This patch series does the following:

* Add a new '--path-walk' option to 'git pack-objects' that uses the path-walk API instead of the revision API to collect objects for delta compression.
* Add a new '--path-walk' option to 'git repack' to pass this option along to 'git pack-objects'.
* Add a new 'pack.usePathWalk' config option to opt into this option implicitly, such as in 'git push'.
* Optimize the '--path-walk' option using threading so it better competes with the existing multi-threaded delta compression mechanism.
* Update the path-walk API with a new 'edge_aggressive' option that pairs close to the --edge-aggressive option in the revision API. This is useful when creating thin packs inside shallow clones.
This feature works by using the path-walk API to emit groups of objects that appear at the same path. These groups are tracked so they can be tested for delta compression with each other, and then after those groups are tested a second pass using the name-hash attempts to find better (or first time) deltas across path boundaries. This second pass is much faster than a fresh pass since the existing deltas are used as a limit for the size of potentially new deltas, short-circuiting the checks when the delta size exceeds the current-best.
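The short-circuit in the second pass can be illustrated with a toy model. This is only a sketch under stated assumptions: `obj_sketch`, `toy_delta_cost()` and `try_base()` are hypothetical names, and the "delta size" is a made-up cost (length difference plus the number of differing byte positions), not Git's delta encoding. The point it shows is that once an object already has a delta, that delta's size becomes the limit for any new candidate, so a worse candidate is abandoned partway through.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a packed object; not Git's struct object_entry. */
struct obj_sketch {
	const char *data;
	size_t size;
	int has_delta;     /* nonzero once some base has been chosen */
	size_t delta_size; /* cost of the chosen delta, valid if has_delta */
	size_t delta_base; /* index of the chosen base, valid if has_delta */
};

/*
 * Toy cost model (an assumption, not Git's delta format): the length
 * difference plus the number of positions where the bytes differ.
 * Returns 1 and stores the cost in *out only if the cost stays below
 * `limit`; otherwise it gives up as soon as the limit is reached.
 */
int toy_delta_cost(const struct obj_sketch *trg, const struct obj_sketch *src,
		   size_t limit, size_t *out)
{
	size_t common = trg->size < src->size ? trg->size : src->size;
	size_t cost = trg->size > src->size ? trg->size - src->size
					    : src->size - trg->size;

	if (cost >= limit)
		return 0;
	for (size_t i = 0; i < common; i++) {
		if (trg->data[i] != src->data[i] && ++cost >= limit)
			return 0; /* short-circuit: cannot beat the limit */
	}
	*out = cost;
	return 1;
}

/*
 * Try objs[src] as a delta base for objs[trg]. The limit is the size of
 * the delta already chosen (for example by a per-path pass), or half the
 * object size if there is none yet. Returns 1 if the base was adopted.
 */
int try_base(struct obj_sketch *objs, size_t trg, size_t src)
{
	struct obj_sketch *t = &objs[trg];
	size_t limit = t->has_delta ? t->delta_size : t->size / 2;
	size_t cost;

	assert(trg != src);
	if (!toy_delta_cost(t, &objs[src], limit, &cost))
		return 0;
	t->has_delta = 1;
	t->delta_size = cost;
	t->delta_base = src;
	return 1;
}
```

In this model a first call to `try_base()` plays the role of the path-scoped pass, and later calls play the role of the name-hash pass: each later candidate either strictly improves on the stored delta or is rejected before the full comparison completes.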
The benefits of the --path-walk feature first come into play when the name hash functions have many collisions, so sorting by name hash value leads to unhelpful groupings of objects. Many of these benefits are improved by --name-hash-version=2, but collisions still exist with any hash-based approach. There are also performance benefits in some cases due to the isolation of delta compression testing within path groups.
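To make the collision problem concrete, here is a small sketch modeled on my understanding of the version-1 name hash, which weights the trailing characters of a path most heavily. The function name `name_hash_v1_sketch()` is hypothetical and the details should be treated as an assumption rather than a copy of Git's source.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of a "last characters count most" name hash. Each new
 * character lands in the top byte and everything before it is shifted
 * right by two bits, so characters far from the end of the path are
 * eventually shifted out entirely. Whitespace is skipped.
 */
uint32_t name_hash_v1_sketch(const char *name)
{
	uint32_t hash = 0;
	unsigned char c;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + ((uint32_t)c << 24);
	}
	return hash;
}
```

Under this sketch, `a/CHANGELOG.json` and `b/CHANGELOG.json` hash to the same value because the leading directory letter has been shifted out by the time the filename has been consumed. Sorting by such a hash therefore interleaves unrelated files that merely share a filename, which is the grouping problem that walking by full path avoids.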
All of the benefits of the --path-walk feature are less dramatic when compared to --name-hash-version=2, but they can still exist in many cases. I have also seen some cases where --name-hash-version=2 compresses better than --path-walk with --name-hash-version=1, but these options can be combined to get the best of both worlds.
Detailed statistics are provided within patch messages, but a few are highlighted here:
The microsoft/fluentui repository is a public JavaScript repo that suffers from many of the same name-hash collisions as internal repositories I've worked with. Here is a comparison of the compressed size and end-to-end time of the repack:
Less dramatic, but perhaps more standardly structured, is the nodejs/node repository, with these stats:
Even the Linux kernel repository gains some benefits, even though the number of hash collisions is relatively low due to a preference for short filenames:
The drawback of the --path-walk feature is that it will be harder to integrate with bitmap features, specifically delta islands. This is not insurmountable, but would require more work, such as a revision walk to paint objects with reachability information before using that during delta computations.
However, there should still be significant benefits to Git clients trying to save space and improve local performance.
This feature was shipped with similar features in microsoft/git as of v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo that had significant repository growth due to constructing a batch of beachball [5] CHANGELOG.[md|json] files and pushing them to a release branch. These pushes were frequently 70-200 MB due to poor delta compression. Using the 'pack.usePathWalk=true' config, these pushes dropped in size by 100x while improving performance. Since these CI machines were working with a shallow clone, the 'edge_aggressive' changes were required to enable the path-walk option.
[4] https://github.com/microsoft/git/releases/tag/v2.47.0.vfs.0.3
[5] https://github.com/microsoft/beachball
This version incorporates feedback from previous RFCs and reviewed patch series whenever possible. It also benefits from learned experience, much of which was already applied in the original path-walk API submission.
Thanks,
-Stolee
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]