Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.
This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending a batch based on
name-hash, as ll_find_deltas() attempts to do when grouping sections of
the object entry array. We re-use the 'list_size' and 'remaining' items
so that a thread that finishes its batch early can borrow in-progress
work from other "victim" threads.

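The splitting and borrowing described above can be sketched in a few
lines of C. This is a minimal, single-threaded model of the bookkeeping
only, not the code in this patch: 'struct region', 'struct
thread_params_sketch', split_regions() and steal_regions() are
hypothetical names, and a real implementation would perform the steal
under a lock. It assumes that a thread advances 'regions' and decrements
'list_size' and 'remaining' as it processes its batch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one path's range of objects. */
struct region { size_t start, nr; };

/* Hypothetical per-thread state; only the fields named above are modeled. */
struct thread_params_sketch {
	struct region *regions; /* next region owned by this thread */
	size_t list_size;       /* regions left in the current batch */
	size_t remaining;       /* regions not yet processed */
};

/* Split 'nr' regions evenly, with no name-hash-based batch extension. */
static void split_regions(struct region *regions, size_t nr,
			  struct thread_params_sketch *p, size_t nr_threads)
{
	size_t i;

	for (i = 0; i < nr_threads; i++) {
		size_t sub_size = nr / (nr_threads - i);

		p[i].regions = regions;
		p[i].list_size = sub_size;
		p[i].remaining = sub_size;
		regions += sub_size;
		nr -= sub_size;
	}
}

/*
 * Give the idle thread the back half of the busiest victim's remaining
 * regions. Returns 1 if work was borrowed, or 0 if no other thread has
 * more than one region left.
 */
static int steal_regions(struct thread_params_sketch *p, size_t nr_threads,
			 size_t idle)
{
	struct thread_params_sketch *victim = NULL;
	size_t i, sub_size;

	assert(idle < nr_threads);

	for (i = 0; i < nr_threads; i++)
		if (i != idle && p[i].remaining > 1 &&
		    (!victim || victim->remaining < p[i].remaining))
			victim = &p[i];
	if (!victim)
		return 0;

	sub_size = victim->remaining / 2;
	p[idle].regions = victim->regions + victim->list_size - sub_size;
	p[idle].list_size = sub_size;
	p[idle].remaining = sub_size;
	victim->list_size -= sub_size;
	victim->remaining -= sub_size;
	return 1;
}
```

With 10 regions and 3 threads, this split produces batches of 3, 3 and 4
regions. If the first thread finishes before the third has started, it
borrows the last 2 regions of the third thread's batch.
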
Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                                      HEAD~1    HEAD
------------------------------------------------------------------
5313.2: thin pack                         0.00      0.00     =
5313.3: thin pack size                    589       589      +0.0%
5313.4: thin pack with --path-walk        0.00      0.00     =
5313.5: thin pack size with --path-walk   589       589      +0.0%
5313.6: big pack                          2.84      2.80     -1.4%
5313.7: big pack size                     14.0M     14.1M    +0.3%
5313.8: big pack with --path-walk         5.46      3.77     -31.0%
5313.9: big pack size with --path-walk    13.2M     13.2M    -0.0%
5313.10: repack                           22.11     21.50    -2.8%
5313.11: repack size                      126.4M    126.2M   -0.2%
5313.12: repack with --path-walk          66.89     26.41    -60.5%
5313.13: repack size with --path-walk     109.6M    109.6M   +0.0%

This 60% reduction in 'git repack --path-walk' time is typical across
all repos I used for testing. What is interesting is to see when the
--path-walk time improves enough to outperform the standard case. Those
cases correlate with repositories whose data shape also gives a
significant size reduction with --path-walk.

For example, the microsoft/fluentui repo has a 439M to 122M size
reduction, and the repack time is now 36.6 seconds with --path-walk
compared to 95+ seconds without it:

Test                                      HEAD~1    HEAD
------------------------------------------------------------------
5313.2: thin pack                         0.41      0.42     +2.4%
5313.3: thin pack size                    1.2M      1.2M     +0.0%
5313.4: thin pack with --path-walk        0.08      0.05     -37.5%
5313.5: thin pack size with --path-walk   18.4K     18.4K    +0.0%
5313.6: big pack                          4.47      4.53     +1.3%
5313.7: big pack size                     19.6M     19.7M    +0.3%
5313.8: big pack with --path-walk         6.76      3.51     -48.1%
5313.9: big pack size with --path-walk    16.5M     16.4M    -0.2%
5313.10: repack                           96.87     99.05    +2.3%
5313.11: repack size                      439.5M    439.0M   -0.1%
5313.12: repack with --path-walk          95.68     36.55    -61.8%
5313.13: repack size with --path-walk     122.6M    122.6M   +0.0%

In a more extreme example, an internal repository that has a similar
name-hash collision issue to microsoft/fluentui reduces its size from
6.4G to 804M with the --path-walk option. This also reduces the
repacking time from 2,138 seconds to 478 seconds.

Test                                      HEAD~1    HEAD
------------------------------------------------------------------
5313.10: repack                           2138.22   2138.19  -0.0%
5313.11: repack size                      6.4G      6.4G     -0.0%
5313.12: repack with --path-walk          1351.46   477.91   -64.6%
5313.13: repack size with --path-walk     804.1M    804.1M   -0.0%

Finally, the Linux kernel repository is a good test for this repacking
time change, even though the space savings are more modest:

Test                                      HEAD~1    HEAD
------------------------------------------------------------------
5313.10: repack                           734.26    735.11   +0.1%
5313.11: repack size                      2.5G      2.5G     -0.0%
5313.12: repack with --path-walk          1457.23   598.17   -59.0%
5313.13: repack size with --path-walk     2.2G      2.2G     +0.0%

Signed-off-by: Derrick Stolee <[email protected]>