# How to cache git repositories with a reference repo

## The main idea

Use `git clone --reference /repo/with/cache https://some/repo/to/clone`.

What does it do?

- When cloning a repo, you can specify a reference repo that is searched first
  for the objects to download.
  (If an object is found in the reference repo, the clone points to it instead
  of saving the data. If it is not found, the object is downloaded and saved
  as usual.)

  From the `git-clone` man page:

  > `--reference[-if-able] <repository>`
  >
  > If the reference repository is on the local machine, automatically setup
  > `.git/objects/info/alternates` to obtain objects from the reference
  > repository. Using an already existing repository as an alternate will
  > require fewer objects to be copied from the repository being cloned,
  > reducing network and local storage costs. When using the
  > `--reference-if-able`, a non existing directory is skipped with a warning
  > instead of aborting the clone.

- The state of the reference repo is not important.
  (We don't need to care about force-pushes, merging, local changes, ...)
  Only what has been fetched counts.
- We can fetch commits of multiple repositories into one reference repository.
- Alternatively, we can use multiple cache repositories:
  `git clone --reference-if-able /cache/some-repo-to-clone https://some/repo/to/clone`
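The flow above can be sketched end-to-end with throw-away local repositories standing in for the real remote and the cache (all names and paths here are made up for illustration):

```shell
set -e
cd "$(mktemp -d)"

# Stand-in for the remote repository we would normally clone over the network.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# The cache: any repository that already contains upstream's objects.
git clone -q upstream cache

# Clone again, borrowing objects from the cache instead of copying them.
git clone -q --reference "$PWD/cache" upstream work

# The clone records the cache in .git/objects/info/alternates.
cat work/.git/objects/info/alternates
```

With a real remote, the command is the same; only objects missing from the cache are transferred over the network.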

## Pros & Cons

- Cloning is much faster for repositories that are in the cache.
- Cloning is slower for repositories that are not in the cache.
- Less memory is needed to clone repositories that are in the cache.
  (This makes it possible to clone the kernel, for example.)
- More memory is needed to clone repositories that are not in the cache.
  - e.g. 1000m was needed to clone ogr using a cache repo containing
    kernel-ark and systemd (800m was not enough)
- Less storage is needed for a cloned repo that is in the cache.
  (Only the current state of the repo is saved; historical objects
  reference the cache repo.)

## Where to store this cache repository?

- The cache does not need to be writable for cloning,
  only for creating/updating it.
- Persistent volumes can be used.
- How much storage can we afford?
  - Storage is cheap, but git repositories can become really big over time.
- I haven't tested the efficiency of the scenario with a lot of repositories
  in one cache repository.
- The data itself can be shared between stage/prod if we want.
  (Probably not wanted, and hard to do in OpenShift.)

## How to create the cache repository?

- Manually, on request: mount the volume once with more memory and fetch the
  needed repository.
- Manually, on a Sentry issue: as above, but gather the problematic repos from
  Sentry.
- Start with the kernel manually and add new repositories as we go.
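One way to bootstrap a shared cache, sketched below with local stand-ins (the repository names and paths are made up): create one bare repository, add each upstream as a named remote, and let a single `fetch --all` populate it.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Two stand-ins for real upstreams (think kernel, systemd); names are made up.
for name in repo-a repo-b; do
    git init -q "$name"
    git -C "$name" -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "initial commit in $name"
done

# One bare cache repository; every upstream becomes a remote in it.
git init -q --bare cache
git -C cache remote add repo-a "$tmp/repo-a"
git -C cache remote add repo-b "$tmp/repo-b"
git -C cache fetch -q --all

# Any cached upstream can now be cloned against the shared cache.
git clone -q --reference "$tmp/cache" "$tmp/repo-a" work-a
```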

## What repositories do we want there?

- Just the kernel.
- A group of hardcoded/configured repositories.
- All repositories matching some condition (at least some number of commits,
  some size, ...).
- All repositories. (Add a repo to the cache if it is not present.)

## When do we want to use this mechanism?

- Every time. It is possible, but it can be time/memory-consuming to use it
  for new repositories.
- Only for repos whose origin matches a repo in the cache.
  (Will not work for forks/renames.)
  We can cache the list so that we don't need to read it multiple times.
- Use `--reference-if-able` and have one cache repo for each project.
  (Similarly to the previous option, forks/renames will not work.)
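The per-project variant can derive the cache path from the clone URL; `--reference-if-able` degrades gracefully when a project has no cache yet. A sketch with a local stand-in (the cache layout, mount point, and repo name are hypothetical):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
cache_dir="$tmp/cache"    # hypothetical mount point, e.g. /cache in the service
mkdir -p "$cache_dir"

# Stand-in upstream; in reality this would be an https:// URL.
git init -q ogr
git -C ogr -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

url="$tmp/ogr"
name=$(basename "$url" .git)

# No cache for this project yet: the clone only prints a warning...
git clone -q --reference-if-able "$cache_dir/$name" "$url" first

# ...but once the per-project cache exists, it is picked up transparently.
git clone -q --bare "$url" "$cache_dir/$name"
git clone -q --reference-if-able "$cache_dir/$name" "$url" second
```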

## How to update the repositories?

Updating some of the bigger repositories can require more memory than the
workers usually have.

- Manually. (`git fetch --all`)
- As a cron job. (daily / hourly / weekly)
- As a Celery task. (As a reaction to an empty queue?)
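Whichever trigger we pick, the update step itself is just a re-fetch: new objects are added and nothing ever needs to be rewritten. A minimal sketch with a local stand-in (one bare cache repo with the upstream added as a remote; all paths are made up):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in upstream with one commit.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"

# The cache: a bare repo tracking the upstream as a remote.
git init -q --bare cache
git -C cache remote add upstream "$tmp/upstream"
git -C cache fetch -q --all

# New commits appear upstream...
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"

# ...and the periodic job (cron/Celery) only has to run:
git -C cache fetch -q --all --prune
```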

## How can this be implemented?

Currently, cloning is done lazily in `LocalProject`.

1. For the very basic version, we can keep packit agnostic of this by making
   it possible to configure additional arguments for the clone commands:
   - [packit] Enhance the schema of the user config.
   - [packit] Use this value when cloning (`LocalProject`).
   - [deployment] Add a persistent volume.
   - [deployment] Use `--reference /persistent/volume` in the service config.
2. Alternatively, we can make the cache directory configurable (optionally).
3. If we want more clever behaviour, we need to put the cloning logic into
   packit.
4. Or, we can pass in a method for handling the cloning.
   (Defined in the service repo, run in packit.)
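Option 1 boils down to injecting configured extra arguments into the existing clone call. A shell sketch of that idea (the variable name and the wrapper function are made up; in packit this would live in `LocalProject` and the user config):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-in upstream and cache (hypothetical paths).
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git clone -q --bare upstream "$tmp/cache"

# Hypothetical knob: extra arguments the service config would inject
# into every clone performed by the service.
CLONE_EXTRA_ARGS="--reference-if-able $tmp/cache"

clone_repo() {
    # Word-splitting of $CLONE_EXTRA_ARGS is intended here.
    # shellcheck disable=SC2086
    git clone -q $CLONE_EXTRA_ARGS "$1" "$2"
}

clone_repo "$tmp/upstream" work
```

With an empty `CLONE_EXTRA_ARGS`, the wrapper behaves exactly like a plain clone, so the feature stays opt-in.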

## Is this relevant for the CLI users?

Caching can help to avoid long cloning during some commands:

- using a temporary distgit repo (the default)
- using a URL instead of a local git repository as an input

To give it some value, we need to:

- add new repositories automatically
- provide a way to update the cache (automatically/manually)

It looks like this can be useful if the implementation can be shared,
but it does not make sense to spend a lot of time on CLI-only code.