
Proposed repack optimisation #560

Open

Description

@earlchew

In #296 @jelmer wrote:

  • it currently caches all objects in memory while it repacks

This creates an issue in both space (the amount of memory used) and time (the time taken to complete the repack).

I have been thinking about this in the context of the mirroring scenario that I am currently working on:

  1. Clone repository
  2. Fetch to update repository
  3. Periodically pack loose objects, and periodically repack the pack files
  4. Go to 2

The observation behind the optimisation is that the primary pack created when the repository is cloned contains the bulk of the data. The periodic updates contribute relatively little data, which opens the door to making a relatively small change to the primary pack rather than reading, parsing and writing every object in the repository.

I was reading the description in https://codewords.recurse.com/issues/three/unpacking-git-packfiles and it occurred to me that it might be possible to use the following strategy to combine two pack files:

  1. Append the new pack to the old pack, rewriting the object count and checksum (see the first sketch below).
  2. Combine the new index file with the old index file, taking care to discard duplicate objects (see the second sketch below).
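As a rough illustration of step 1, here is a minimal Python sketch of appending one pack's object data to another and rewriting the header and checksum. It relies only on the pack format itself (a 12-byte header of "PACK" magic, version and big-endian object count, followed by the object data and a 20-byte SHA-1 trailer). The `append_packs` name and the file-path arguments are purely illustrative, not dulwich API, and the sketch reads both packs fully into memory for brevity where a real implementation would stream the data and update the SHA-1 incrementally.

```python
import hashlib
import struct

PACK_HEADER_LEN = 12   # 4-byte "PACK" magic, 4-byte version, 4-byte object count
PACK_TRAILER_LEN = 20  # SHA-1 checksum of everything that precedes it

def append_packs(old_path, new_path, out_path):
    # Read both packs in full for brevity; a real implementation would
    # stream the data and feed the SHA-1 incrementally.
    with open(old_path, 'rb') as f:
        old = f.read()
    with open(new_path, 'rb') as f:
        new = f.read()

    old_count = struct.unpack('>I', old[8:12])[0]
    new_count = struct.unpack('>I', new[8:12])[0]

    # Rewrite the header with the combined object count, keeping the
    # old pack's magic and version fields.
    header = old[:8] + struct.pack('>I', old_count + new_count)

    # Old object data followed by new object data, with each pack's
    # header and trailing checksum stripped.
    body = (old[PACK_HEADER_LEN:-PACK_TRAILER_LEN]
            + new[PACK_HEADER_LEN:-PACK_TRAILER_LEN])

    checksum = hashlib.sha1(header + body).digest()
    with open(out_path, 'wb') as f:
        f.write(header + body + checksum)
```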

I think this will be relatively fast since this strategy does not require parsing of any of the objects (although it does require reading and writing the pack files). Memory requirements are reduced because only the index files must be actively combined.
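For step 2, the index merge reduces to merging two sorted lists of (SHA-1, offset) entries, shifting the new pack's offsets by the size of the old pack's object section. The sketch below works at the level of plain entries; the function name is made up, and a real v2 pack index also contains a fanout table and per-object CRC32 values that would have to be written as well.

```python
PACK_HEADER_LEN = 12   # as in the previous sketch
PACK_TRAILER_LEN = 20

def merge_index_entries(old_entries, new_entries, old_pack_size):
    """Merge (sha, offset) entries from the old and new indexes.

    Offsets taken from the new pack shift by the length of the old pack's
    object section, because the new pack's header is dropped when the packs
    are concatenated.  Duplicate SHAs keep the old pack's copy.
    """
    shift = old_pack_size - PACK_HEADER_LEN - PACK_TRAILER_LEN
    merged = {sha: offset + shift for sha, offset in new_entries}
    merged.update(old_entries)        # old entries win on duplicates
    return sorted(merged.items())     # index entries are ordered by SHA
```

As far as I can tell, deltified objects remain resolvable after the merge: OFS_DELTA bases are addressed by a relative offset, which concatenation preserves within each pack, and REF_DELTA bases are addressed by SHA-1, which the combined index still resolves.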

I was originally thinking about this in terms of combining a relatively large older pack file with a relatively small newer pack file, but I think the strategy can be generalised to accommodate more than two pack files.
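To illustrate the generalisation, the same idea extends naturally to any number of packs: one rewritten header carrying the summed object count, all object sections concatenated in order, and one recomputed trailer. Again, `append_many` is a made-up name and the code reads everything into memory only for brevity.

```python
import hashlib
import struct

PACK_HEADER_LEN = 12
PACK_TRAILER_LEN = 20

def append_many(pack_paths, out_path):
    # Generalisation of the two-pack sketch: one rewritten header, all
    # object sections concatenated, one recomputed SHA-1 trailer.
    bodies, total, magic_and_version = [], 0, None
    for path in pack_paths:
        with open(path, 'rb') as f:
            data = f.read()
        if magic_and_version is None:
            magic_and_version = data[:8]             # "PACK" magic + version
        total += struct.unpack('>I', data[8:12])[0]  # object count
        bodies.append(data[PACK_HEADER_LEN:-PACK_TRAILER_LEN])
    header = magic_and_version + struct.pack('>I', total)
    body = b''.join(bodies)
    with open(out_path, 'wb') as f:
        f.write(header + body + hashlib.sha1(header + body).digest())
```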
