Using the checkout action
Checking out the project repository is one of the most common operations in CI, as almost all CI jobs need access to the source code. For a large project with a long history such as PyTorch, this operation can consume significant resources in the form of computing time and transfer volume. Our fork of the checkout action adds a few features to help reduce that resource consumption. Here, we explain how to use it, how it works, and how to update the fork when the upstream action has a new release.
The fork may be used either directly as a replacement for the upstream checkout action or via a reusable workflow. We first describe the replacement of the upstream action and then list PyTorch reusable workflows that make use of the fork to expose new capabilities.
To benefit from our improvements, we need to take two steps. First, we need to use our fork instead of the upstream action; second, we need to activate the new features as appropriate.
To use the fork instead of the upstream action, replace the line
uses: actions/checkout@v4
with
uses: pytorch/test-infra/.github/actions/checkout@main
or, if the workflow uses an explicit checkout of the test-infra repository, as
many of the reusable workflows in the test-infra repository do,
uses: ./test-infra/.github/actions/checkout
To make full use of the features, we need to add filters that apply to both the
main repository and its submodules, and activate the single-branch feature.
These are explained in more detail below in the How it works section.
To activate them, add the following parameters to the with: block that
follows the uses: line we replaced above:
with:
  single-branch: true
  filter: 'tree:0'
  submodules-filter: 'tree:0'
A complete example for a step in a workflow is
- name: Checkout PyTorch
  continue-on-error: true
  uses: pytorch/test-infra/.github/actions/checkout@main
  with:
    ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
    fetch-depth: 0
    single-branch: true
    submodules: recursive
    show-progress: false
    filter: 'tree:0'
    submodules-filter: 'tree:0'
For some of our reusable workflows, we offer a simplified model to benefit from
these checkout techniques.
The supporting workflows and actions take a new option checkout-mode,
which can be one of normal (behavior as upstream), blobless (use filter
blob:none for both the main repository and submodules), or treeless (use
filter tree:0 for both).
To get the best benefit, use checkout-mode: treeless in their with: block.
- name: Checkout PyTorch
  uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
  with:
    no-sudo: true
    checkout-mode: treeless
To speed up the checkout action, we exploit Git internals to avoid downloading data that is not needed. In this section, we explain exactly how that works. We also detail the available options and how to use them.
Tip
The information in this section stems from the excellent series on Git's database internals by Derrick Stolee (Part I: Packed Object Store, II: Commit History Queries, III: File History Queries, IV: Distributed Synchronization, V: Scalability). If you want to dive deeper into the understanding of Git, it is recommended reading.
Git deals with exactly three kinds of objects: commits, trees, and blobs. Blobs correspond to file contents and trees correspond to directories in a filesystem: just as directories contain files and other directories, trees contain blobs and other trees. Commits contain the root tree reflecting the state of the working directory as well as some metadata, such as author, committer, time of commit, and parent commits. By following the chain of commits along their parent relationships, we get the commit graph that forms the history of a project. It is worth noting that every commit contains the full contents of its working directory, not only the delta to its parent's state. However, internally Git avoids duplicate data storage for identical objects and uses clever compression to achieve a relatively small data footprint while maintaining high speed.
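The three object kinds can be inspected directly with git cat-file. The following is a minimal sketch that builds a throwaway repository; the directory layout, file name, and the demo identity are made up for illustration.

```shell
#!/bin/sh
# Build a throwaway repository with a single commit containing dir/file.txt.
repo=$(mktemp -d)
git init -q "$repo"
mkdir "$repo/dir"
echo 'hello' > "$repo/dir/file.txt"
git -C "$repo" add .
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
  -c commit.gpgsign=false commit -q -m 'first commit'

# Ask Git for the type of each object reachable from HEAD.
commit_type=$(git -C "$repo" cat-file -t HEAD)              # the commit itself
root_tree_type=$(git -C "$repo" cat-file -t 'HEAD^{tree}')  # its root tree
subtree_type=$(git -C "$repo" cat-file -t HEAD:dir)         # the directory "dir"
blob_type=$(git -C "$repo" cat-file -t HEAD:dir/file.txt)   # the file contents

echo "$commit_type $root_tree_type $subtree_type $blob_type"
rm -rf "$repo"
```

The script prints the type of the commit, its root tree, the nested tree for dir, and the blob holding the file contents, in that order.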
On a regular fetch, Git will download all commits, all trees, and all blobs from all branches of the remote repository. This has the benefit of making all further operations completely local, which makes it a great default for developers who may intermittently be without access to the remote server and who need access not only to the latest state of their project, but also to its history. However, it is less ideal for CI jobs that are carried out frequently and that do not need access to the full history.
Git has various facilities to reduce the amount of data downloaded in a fetch.
One may restrict the download to a single branch, in which case commits that are
not part of that branch are ignored.
Another frequently used option is to do a shallow fetch, which limits the
downloaded information to only one commit or a fixed number of consecutive
commits, for example the 10 most recent commits; this option usually implies
single branch as well.
A third way is to put limits on the kind of objects that are downloaded using
filters.
The full specification for filters is quite powerful, but we are interested particularly
in blob:none and tree:0.
We call a clone performed with blob:none blobless.
It avoids downloading all file contents, while still getting all commits and
trees, allowing for operations like merge-base, as well as tests on the
presence of given paths.
We call a clone performed with tree:0 treeless.
In addition to avoiding all file contents, it also does not download any trees,
i.e. no path information is available, but the history formed by all commits
is part of the download, still allowing for merge-base, but not for tests on
paths.
Important
In all of these cases, Git will download any data needed for future checkout operations. In other words, if one does a checkout of a different branch, Git will download the corresponding information from that branch along with trees and blobs necessary to construct the working directory.
Warning
However, Git does not know about branches that have not been fetched. For that reason, branches that are needed, even as part of, for example, a merge-base operation, must be included in the fetch operation in single-branch mode.
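For orientation, the fetch strategies described above correspond to standard git clone options. The sketch below only prints the commands instead of running them; the repository URL, the target directory, and the clone_command helper are placeholders chosen for illustration and are not part of the action.

```shell
#!/bin/sh
# Placeholder remote; nothing below contacts it, the commands are only printed.
REPO_URL='https://github.com/pytorch/pytorch.git'

# clone_command <strategy>: print the plain git command for a fetch strategy.
clone_command() {
  case "$1" in
    full)          echo "git clone $REPO_URL pytorch" ;;
    single-branch) echo "git clone --single-branch --branch main $REPO_URL pytorch" ;;
    shallow)       echo "git clone --depth 10 $REPO_URL pytorch" ;;
    blobless)      echo "git clone --filter=blob:none $REPO_URL pytorch" ;;
    treeless)      echo "git clone --filter=tree:0 $REPO_URL pytorch" ;;
    *)             echo "unknown strategy: $1" >&2; return 1 ;;
  esac
}

for strategy in full single-branch shallow blobless treeless; do
  clone_command "$strategy"
done
```

The options can be combined, for example --single-branch together with --filter=tree:0, which is the combination the examples earlier in this document enable through the action's inputs.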
The upstream action already supports filters, but only for the main repository, not for submodules. It also supports single branch clones, but only for shallow clones, not when the entire history is included.
To leverage the Git facilities described above, we fork the upstream action and
add three new options to it: submodules-filter allows setting filters for
submodules; single-branch, if set to true, will do a single-branch clone also
for non-shallow clones; additional-fetch-refs allows us to add further refs,
e.g. branches, to fetch in a single-branch clone.
The additional-fetch-refs option defaults to refs/heads/main, the most
common merge target for pull requests, but accepts an arbitrary number of
additional refs to fetch.
These options are direct additions to the upstream options.
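Why additional refs matter in a single-branch clone can be seen with plain Git. The following sketch uses a throwaway local "remote" and illustrates the underlying Git behavior only; it is not the fork's implementation, and the branch name feature and the commit helper are made up for the example.

```shell
#!/bin/sh
# Throwaway "remote" with a main branch and a feature branch on top of it.
work=$(mktemp -d)
git init -q "$work/remote"
git -C "$work/remote" symbolic-ref HEAD refs/heads/main

commit() {  # commit <message>: create an empty commit in the remote
  git -C "$work/remote" -c user.name=demo -c user.email=demo@example.com \
    -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}
commit 'base on main'
base_sha=$(git -C "$work/remote" rev-parse HEAD)
git -C "$work/remote" checkout -q -b feature
commit 'work on feature'

# Single-branch clone of the feature branch: origin/main is unknown locally.
git clone -q --single-branch --branch feature "file://$work/remote" "$work/clone"
if git -C "$work/clone" rev-parse -q --verify origin/main >/dev/null; then
  main_known_before=yes
else
  main_known_before=no
fi

# Fetching the extra ref explicitly makes history operations against it possible.
git -C "$work/clone" fetch -q origin '+refs/heads/main:refs/remotes/origin/main'
merge_base=$(git -C "$work/clone" merge-base HEAD origin/main)

echo "origin/main known before the extra fetch: $main_known_before"
echo "merge-base: $merge_base"
rm -rf "$work"
```

Before the extra fetch, origin/main cannot be resolved in the clone, so a merge-base against it would fail; afterwards, the merge-base is the commit on main from which feature was branched.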
For reusable workflows, we simplify usage by introducing a new option,
checkout-mode, which can be one of normal (behavior as upstream),
blobless (use filter blob:none for both the main repository and submodules),
or treeless (use filter tree:0 for both).
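The mapping from checkout-mode to filters can be written down as a small lookup. This is a sketch of the mapping as described above, not the code used in the reusable workflows; the helper names filter_for_mode and checkout_inputs are hypothetical, and the real workflows may set further inputs.

```shell
#!/bin/sh
# filter_for_mode <mode>: print the filter that a checkout-mode stands for.
# An empty result means no filter, i.e. the upstream behavior.
filter_for_mode() {
  case "$1" in
    normal)   echo '' ;;
    blobless) echo 'blob:none' ;;
    treeless) echo 'tree:0' ;;
    *)        echo "unknown checkout-mode: $1" >&2; return 1 ;;
  esac
}

# checkout_inputs <mode>: print the filter inputs for the with: block.
# The same filter is applied to the main repository and to the submodules.
checkout_inputs() {
  filter=$(filter_for_mode "$1") || return 1
  if [ -n "$filter" ]; then
    printf "filter: '%s'\nsubmodules-filter: '%s'\n" "$filter" "$filter"
  fi
}

checkout_inputs treeless
```

For normal, checkout_inputs prints nothing, which corresponds to leaving both filter options unset.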
The current version is based on version 6.0.1 of actions/checkout.
To update to a new upstream release, we want to start from an updated copy of
the upstream sources and add our changes on top, adapting them individually as
needed.
In practice that means the following.
We will walk through the steps using the command line. Of course, it is possible to do this with other tools.
Our starting point is a current checkout of the test-infra repo including the
main branch, which will be the merge target, and the update-checkout-action
branch, which contains all our changes as well as the old base version from
upstream.
Our goal here is to create a new branch that contains the new base version from
upstream with our changes rebased on top of it. This leaves us with an updated
version of the update-checkout-action branch for future updates as well as a
dedicated branch for the current update that can be merged via a regular pull request.
As this branch is temporary, its name is somewhat arbitrary, but here we choose
update-checkout-action-<version>, where <version> is replaced by the version
to which we are updating.
First we choose the version that we want to update to and create the new update
branch from origin/main.
UPSTREAM_VERSION=v6.0.2
UPDATE_BRANCH="update-checkout-action-${UPSTREAM_VERSION}"
git switch -c ${UPDATE_BRANCH} origin/main
Next, we create the first commit in this branch with the new sources from upstream, completely replacing our current code.
cd .github/actions
rm -rf ./checkout
mkdir checkout
cd checkout
curl -L "https://github.com/actions/checkout/archive/${UPSTREAM_VERSION}.tar.gz" | tar -xz --strip-components=1
git add .
UPSTREAM_COMMIT=$(curl -sS -L -H "Accept: application/vnd.github+json" -H "X-GitHub-Api-Version: 2022-11-28" "https://api.github.com/repos/actions/checkout/git/matching-refs/tags/${UPSTREAM_VERSION}" |jq -r '.[0].object.sha')
git commit -m "Add ${UPSTREAM_VERSION} (https://github.com/actions/checkout/commit/${UPSTREAM_COMMIT}) of upstream actions/checkout action"
Then, we add our changes by rebasing the update-checkout-action branch on top
of the new base version.
Important
This is the most critical step as it requires careful adaptation of our changes to the changed upstream sources. Outright conflicts will lead to an interruption in the rebase operation and give us a chance to make the most important changes at the right time. However, make sure to check the final result of this operation commit-by-commit regardless of conflicts to ensure proper porting.
Note
You may also want to consult the upstream changelog and upstream version diff to identify likely points that need adaptation.
git checkout update-checkout-action
MERGE_BASE=$(git merge-base update-checkout-action ${UPDATE_BRANCH})
BASE_COMMIT=$(git log --format="format:%H" $MERGE_BASE..update-checkout-action |tail -n 1)
git rebase --onto ${UPDATE_BRANCH} ${BASE_COMMIT} update-checkout-action
Now we switch back to the dedicated update branch, bring it up to date, and then build and test. If everything works as expected, we can push our efforts.
git checkout ${UPDATE_BRANCH}
git rebase update-checkout-action
npm run build
npm run test
At this point the tests should all pass. If this is the case, we can proceed; otherwise we need to fix any problems and update both branches accordingly.
With all tests passing, we can add the final commit to the update branch and
then push both branches.
The branch that will be merged goes to our own fork, whereas the updated
update-checkout-action branch goes into test-infra.
git add dist README.md
git commit -m "Transpile"
git push -u myfork ${UPDATE_BRANCH}
git push origin update-checkout-action
We open the PR in the test-infra repo.
Note
Make sure to open a PR for the ${UPDATE_BRANCH} branch, but not for the
update-checkout-action branch, which should remain for future updates.
Once the PR is merged, our update is done.
To help with the adaptation of our changes, here follows a brief description of all the commits that make up the changes. They can be consulted to complete step 3 above.
This removes parts of the upstream sources that are not applicable, such as their contributions guide, and adapts the remainder of the code to the different directory layout dictated by the integration of several actions and other parts in the single test-infra repository.
This commit adds the submodules-filter option, allowing for filtering also in submodules.
This commit adds the single-branch option, allowing for single-branch clones even in the case of non-shallow clones.
This commit adds the additional-fetch-refs option, which is essential when
history operations involving branches other than the checkout branch are needed.
This commit updates the README.md file with the information specific to this fork.
This commit adds the changes from the previous commits to the dist/index.js file and updates this README.md file with the documentation for the new options.
It should be created using the npm run build command in the checkout directory.
Important
This commit should only be present in the ${UPDATE_BRANCH} branch, but not
in the update-checkout-action branch.