
Proposed repack optimisation #560

Open

Description

@earlchew

In #296 @jelmer wrote:

  • it currently caches all objects in memory while it repacks

This creates an issue in both space (the amount of memory used) and time (the time taken to complete the repack).

I have been thinking about this in the context of the mirroring scenario that I am currently working on:

  1. Clone repository
  2. Fetch to update repository
  3. Periodically pack loose objects, and periodically repack the pack files
  4. Go to 2

The observation behind the optimisation is that the primary pack created when the repository is cloned contains the bulk of the data. The periodic updates contribute relatively little data, which opens the door to making a relatively small change to the primary pack rather than reading, parsing and writing every object in the repository.

I was reading the description in https://codewords.recurse.com/issues/three/unpacking-git-packfiles and it occurred to me that it might be possible to use the following strategy to combine two pack files:

  1. Append the new pack to the old pack, rewriting the object count and checksum (see the first sketch below).
  2. Combine the new index file with the old index file, taking care to discard duplicate objects (see the second sketch below).
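As a rough illustration of step 1, here is a minimal Python sketch of appending one pack's object data to another and rewriting the header and checksum. It relies only on the pack format itself (a 12-byte header of "PACK" magic, version and big-endian object count, followed by the object data and a 20-byte SHA-1 trailer). The `append_packs` name and the file-path arguments are purely illustrative, not dulwich API, and the sketch reads both packs fully into memory for brevity where a real implementation would stream the data and update the SHA-1 incrementally.

```python
import hashlib
import struct

PACK_HEADER_LEN = 12   # 4-byte "PACK" magic, 4-byte version, 4-byte object count
PACK_TRAILER_LEN = 20  # SHA-1 checksum of everything that precedes it

def append_packs(old_path, new_path, out_path):
    # Read both packs in full for brevity; a real implementation would
    # stream the data and feed the SHA-1 incrementally.
    with open(old_path, 'rb') as f:
        old = f.read()
    with open(new_path, 'rb') as f:
        new = f.read()

    old_count = struct.unpack('>I', old[8:12])[0]
    new_count = struct.unpack('>I', new[8:12])[0]

    # Rewrite the header with the combined object count, keeping the
    # old pack's magic and version fields.
    header = old[:8] + struct.pack('>I', old_count + new_count)

    # Old object data followed by new object data, with each pack's
    # header and trailing checksum stripped.
    body = (old[PACK_HEADER_LEN:-PACK_TRAILER_LEN]
            + new[PACK_HEADER_LEN:-PACK_TRAILER_LEN])

    checksum = hashlib.sha1(header + body).digest()
    with open(out_path, 'wb') as f:
        f.write(header + body + checksum)
```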

I think this will be relatively fast since this strategy does not require parsing of any of the objects (although it does require reading and writing the pack files). Memory requirements are reduced because only the index files must be actively combined.
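For step 2, the index merge reduces to merging two sorted lists of (SHA-1, offset) entries, shifting the new pack's offsets by the size of the old pack's object section. The sketch below works at the level of plain entries; the function name is made up, and a real v2 pack index also contains a fanout table and per-object CRC32 values that would have to be written as well.

```python
PACK_HEADER_LEN = 12   # as in the previous sketch
PACK_TRAILER_LEN = 20

def merge_index_entries(old_entries, new_entries, old_pack_size):
    """Merge (sha, offset) entries from the old and new indexes.

    Offsets taken from the new pack shift by the length of the old pack's
    object section, because the new pack's header is dropped when the packs
    are concatenated.  Duplicate SHAs keep the old pack's copy.
    """
    shift = old_pack_size - PACK_HEADER_LEN - PACK_TRAILER_LEN
    merged = {sha: offset + shift for sha, offset in new_entries}
    merged.update(old_entries)        # old entries win on duplicates
    return sorted(merged.items())     # index entries are ordered by SHA
```

As far as I can tell, deltified objects remain resolvable after the merge: OFS_DELTA bases are addressed by a relative offset, which concatenation preserves within each pack, and REF_DELTA bases are addressed by SHA-1, which the combined index still resolves.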

I was originally thinking about this in terms of combining a relatively large older pack file with a relatively small newer pack file, but I think the strategy can be generalised to accommodate more than two pack files.
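To illustrate the generalisation, the same idea extends naturally to any number of packs: one rewritten header carrying the summed object count, all object sections concatenated in order, and one recomputed trailer. Again, `append_many` is a made-up name and the code reads everything into memory only for brevity.

```python
import hashlib
import struct

PACK_HEADER_LEN = 12
PACK_TRAILER_LEN = 20

def append_many(pack_paths, out_path):
    # Generalisation of the two-pack sketch: one rewritten header, all
    # object sections concatenated, one recomputed SHA-1 trailer.
    bodies, total, magic_and_version = [], 0, None
    for path in pack_paths:
        with open(path, 'rb') as f:
            data = f.read()
        if magic_and_version is None:
            magic_and_version = data[:8]             # "PACK" magic + version
        total += struct.unpack('>I', data[8:12])[0]  # object count
        bodies.append(data[PACK_HEADER_LEN:-PACK_TRAILER_LEN])
    header = magic_and_version + struct.pack('>I', total)
    body = b''.join(bodies)
    with open(out_path, 'wb') as f:
        f.write(header + body + hashlib.sha1(header + body).digest())
```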
