Using the checkout action

Checking out the project repository is one of the most common operations in CI, as almost all CI jobs need access to the source code. For a large project with a long history such as PyTorch, this operation can consume a significant amount of compute time and transfer volume. Our fork of the checkout action adds a few features to help reduce that resource consumption. Here, we explain how to use it, how it works, and how to update the fork when the upstream action has a new release.

Quick Start

The fork may be used either directly as a replacement for the upstream checkout action or via a reusable workflow. We first describe the replacement of the upstream action and then list PyTorch reusable workflows that make use of the fork to expose new capabilities.

Replacement of actions/checkout

To benefit from our improvements, we need to take two steps. First, we need to use our fork instead of the upstream action; second, we need to activate the new features as appropriate.

Using the fork

To use the fork instead of the upstream action, replace the line

uses: actions/checkout@v4

with

uses: pytorch/test-infra/.github/actions/checkout@main

or, if the workflow uses an explicit checkout of the test-infra repository as many of the reusable workflows in the test-infra repository do,

uses: ./test-infra/.github/actions/checkout

Activating advanced features

To make full use of the features, we need to add filters that apply to both the main repository and its submodules, and activate the single-branch feature. These are explained in more detail below in the How it works section. To activate them, add the following to the parameters in the with: block that follows the uses: line we replaced above:

with:
    single-branch: true
    filter: 'tree:0'
    submodules-filter: 'tree:0'

Example

A complete example for a step in a workflow is

    - name: Checkout PyTorch
      continue-on-error: true
      uses: pytorch/test-infra/.github/actions/checkout@main
      with:
        ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
        fetch-depth: 0
        single-branch: true
        submodules: recursive
        show-progress: false
        filter: 'tree:0'
        submodules-filter: 'tree:0'

Use in reusable workflows

For some of our reusable workflows, we offer a simplified model to benefit from these checkout techniques. The supporting workflows and actions take a new option checkout-mode, which can be one of normal (behavior as upstream), blobless (use filter blob:none for both the main repository and submodules), or treeless (use filter tree:0 for both).

To get the best benefit, use checkout-mode: treeless in their with: block.

Example

- name: Checkout PyTorch
  uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
  with:
    no-sudo: true
    checkout-mode: treeless

How it works

To speed up the checkout action, we exploit Git internals to avoid downloading data that is not needed. In this section, we explain how exactly that works. We also detail the available options and how to use them.

Git background

Tip

The information in this section stems from the excellent series on Git's database internals by Derrick Stolee (Part I: Packed Object Store, II: Commit History Queries, III: File History Queries, IV: Distributed Synchronization, V: Scalability). If you want to dive deeper into the understanding of Git, it is recommended reading.

How Git stores data

For our purposes, Git deals with three kinds of objects: commits, trees, and blobs (a fourth kind, the annotated tag, plays no role here). Blobs correspond to file contents and trees correspond to directories in a filesystem: just like directories contain files and other directories, trees contain blobs and other trees. Commits reference the root tree reflecting the state of the working directory and carry some metadata, such as author, committer, time of commit, and parent commits. By following the chain of commits along their parent relationships, we get the commit graph that forms the history of a project. It is worth noting that every commit represents the full contents of its working directory, not only the delta to its parent's state. However, internally Git avoids duplicate data storage for identical objects and uses clever compression to achieve a relatively small data footprint while maintaining high speed.
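To make the three object kinds tangible, the following sketch builds a throwaway repository and asks Git for the type of a commit, its root tree, and one file. The temporary directory, the demo identity, and the file name dir/file.txt are assumptions made up for this illustration.

```shell
set -eu
# Throwaway identity so the commit works without any global Git configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
mkdir "$repo/dir"
echo "hello" > "$repo/dir/file.txt"
git -C "$repo" add dir/file.txt
git -C "$repo" commit -q -m "Add a file"

# Ask Git for the type of the commit, its root tree, and the file contents.
commit_type=$(git -C "$repo" cat-file -t HEAD)
tree_type=$(git -C "$repo" cat-file -t 'HEAD^{tree}')
blob_type=$(git -C "$repo" cat-file -t HEAD:dir/file.txt)
echo "$commit_type $tree_type $blob_type"   # prints: commit tree blob

# The commit names its root tree; the root tree lists the entries of the top directory.
git -C "$repo" cat-file -p HEAD
git -C "$repo" cat-file -p 'HEAD^{tree}'
```

The last two commands print the raw commit (with its tree line and metadata) and the root tree (with one entry for dir), which mirrors the description above.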

What happens on fetch

On a regular fetch, Git will download all commits, all trees, and all blobs from all branches from the remote repository. This has the benefit of making all further operations completely local, which makes it a great default for developers who may intermittently be without access to the remote server and who need access not only to the latest state of their project, but also to its history. However, it is less ideal for CI jobs that are carried out frequently and that do not need access to the full history.

Git has various facilities to reduce the amount of data downloaded in a fetch. One may restrict the download to a single branch, in which case commits that are not part of that branch are ignored. Another frequently used option is to do a shallow fetch, which limits the downloaded information to only one commit or a fixed number of consecutive commits, for example the 10 most recent commits; this option usually implies single branch as well. A third way is to put limits on the kind of objects that are downloaded using filters. The full specification for filters is quite powerful, but we are interested particularly in blob:none and tree:0.
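As a small illustration of the shallow option, the sketch below clones a throwaway three-commit repository once in full and once with --depth 1, then counts the commits reachable from HEAD in each clone. The repository, the demo identity, and the directory names are assumptions for this example; a file:// URL is used so that Git goes through its regular transport and honors the depth.

```shell
set -eu
# Throwaway identity so the commits work without any global Git configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/src"
for n in 1 2 3; do
  git -C "$work/src" commit -q --allow-empty -m "commit $n"
done

# One full clone and one shallow clone limited to the most recent commit.
git clone -q "file://$work/src" "$work/full"
git clone -q --depth 1 "file://$work/src" "$work/shallow"

full_count=$(git -C "$work/full" rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
echo "full: $full_count, shallow: $shallow_count"   # prints: full: 3, shallow: 1
```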

We call a clone performed with blob:none blobless. It avoids downloading all file contents, while still getting all commits and trees, allowing for operations like merge-base, as well as tests on the presence of given paths.

We call a clone performed with tree:0 treeless. In addition to avoiding all file contents, it also does not download any trees, i.e. no path information is available, but the history formed by all commits is part of the download, still allowing for merge-base, but not for tests on paths.
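The difference between the two filters can be observed on a throwaway repository. The sketch below is an illustration under assumptions: it uses a local file:// repository as a stand-in for the server (which must have uploadpack.allowFilter enabled to honor filters), clones with --no-checkout so that nothing is fetched on demand, and uses a hypothetical helper object_types to list which kinds of objects are present locally.

```shell
set -eu
# Throwaway identity so the commit works without any global Git configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/src"
mkdir "$work/src/dir"
echo "hello" > "$work/src/dir/file.txt"
git -C "$work/src" add dir/file.txt
git -C "$work/src" commit -q -m "Add a file"
# The stand-in server only honors filters when this setting is enabled.
git -C "$work/src" config uploadpack.allowFilter true

# Hypothetical helper: list the distinct object types stored locally in a repository.
object_types() {
  git -C "$1" cat-file --batch-all-objects --batch-check='%(objecttype)' \
    | sort -u | paste -s -d, -
}

# --no-checkout prevents Git from fetching trees and blobs for the working directory.
git clone -q --no-checkout "file://$work/src" "$work/full"
git clone -q --no-checkout --filter=blob:none "file://$work/src" "$work/blobless"
git clone -q --no-checkout --filter=tree:0 "file://$work/src" "$work/treeless"

full_types=$(object_types "$work/full")
blobless_types=$(object_types "$work/blobless")
treeless_types=$(object_types "$work/treeless")
echo "full:     $full_types"
echo "blobless: $blobless_types"
echo "treeless: $treeless_types"
```

Under these assumptions the full clone should hold blobs, commits, and trees, the blobless clone commits and trees, and the treeless clone commits only.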

Important

In all of these cases, Git will download any data needed for future checkout operations. In other words, if one does a checkout of a different branch, Git will download the corresponding information from that branch along with trees and blobs necessary to construct the working directory.

Warning

However, Git does not know about branches that have not been fetched. For that reason, branches that are needed, even as part of, for example, a merge-base operation, must be included in the fetch operation in single-branch mode.
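The following sketch illustrates this warning on a throwaway repository with two branches, main and feature (both names and the directory layout are assumptions for this example). A single-branch clone of feature cannot compute a merge-base with main until main has been fetched explicitly.

```shell
set -eu
# Throwaway identity so the commits work without any global Git configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" commit -q --allow-empty -m "base"
git -C "$work/src" branch -M main
git -C "$work/src" checkout -q -b feature
git -C "$work/src" commit -q --allow-empty -m "feature work"

# Clone only the feature branch; main is unknown to this clone.
git clone -q --single-branch --branch feature "file://$work/src" "$work/dst"

if git -C "$work/dst" merge-base HEAD origin/main >/dev/null 2>&1; then
  before=found
else
  before=missing
fi

# Fetch main explicitly, after which the merge-base can be computed.
git -C "$work/dst" fetch -q origin "+refs/heads/main:refs/remotes/origin/main"
base=$(git -C "$work/dst" merge-base HEAD origin/main)
expected=$(git -C "$work/src" rev-parse main)
echo "before fetch: $before"
echo "merge-base after fetch: $base"
```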

How we use it

The upstream action already supports filters, but only for the main repository, not for submodules. It also supports single branch clones, but only for shallow clones, not when the entire history is included.

To leverage the Git facilities described above, we fork the upstream action and add three new options to it: submodules-filter allows setting filters for submodules; single-branch, if set to true, will do a single-branch clone also for non-shallow clones; additional-fetch-refs allows us to add further refs, e.g. branches, to fetch in a single-branch clone. The additional-fetch-refs option defaults to refs/heads/main, the most common merge target for pull requests, but accepts an arbitrary number of additional refs to fetch.

These options are direct additions to the upstream options. For reusable workflows, we simplify their use by introducing a new option, checkout-mode, which can be one of normal (behavior as upstream), blobless (use filter blob:none for both the main repository and submodules), or treeless (use filter tree:0 for both).
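The mapping from checkout-mode to the underlying filter can be written down as a small shell function. This is only an illustration of the mapping described above; the function name filter_for_checkout_mode is hypothetical and is not part of the action or the reusable workflows.

```shell
# Hypothetical helper: print the Git filter that a checkout-mode value stands for.
# An empty result means that no filter is applied (behavior as upstream).
filter_for_checkout_mode() {
  case "$1" in
    normal)   echo "" ;;
    blobless) echo "blob:none" ;;
    treeless) echo "tree:0" ;;
    *)        echo "unknown checkout-mode: $1" >&2; return 1 ;;
  esac
}

for mode in normal blobless treeless; do
  echo "$mode -> '$(filter_for_checkout_mode "$mode")'"
done
```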

How to update to a new upstream version

The current version is based on version 6.0.1 of actions/checkout. To update to a new upstream release, we want to start from an updated copy of the upstream sources and add our changes on top, adapting them individually as needed.

In practice that means the following.

Steps to update

We will walk through the steps using the command line. Of course, it is possible to do this with other tools.

Our starting point is a current checkout of the test-infra repo including the main branch, which will be the merge target, and the update-checkout-action branch, which contains all our changes as well as the old base version from upstream.

Our goal here is to create a new branch that holds the new base version from upstream with our changes rebased on top of it. This leaves us with an updated version of the update-checkout-action branch for future updates, as well as a dedicated branch for the current update that can be merged via a regular pull request. As the dedicated branch is temporary, its name is somewhat arbitrary; here we choose update-checkout-action-<version>, where <version> is replaced by the version to which we are updating.

1. Make a new branch

First we choose the version that we want to update to and create the new update branch from origin/main.

UPSTREAM_VERSION=v6.0.2
UPDATE_BRANCH="update-checkout-action-${UPSTREAM_VERSION}"
git switch -c ${UPDATE_BRANCH} origin/main

2. Replace current action sources with new upstream version

Next, we create the first commit in this branch with the new sources from upstream, completely replacing our current code.

cd .github/actions
rm -rf ./checkout
mkdir checkout
cd checkout
curl -L "https://github.com/actions/checkout/archive/${UPSTREAM_VERSION}.tar.gz" | tar -xz --strip-components=1
git add .
UPSTREAM_COMMIT=$(curl -sS -L -H "Accept: application/vnd.github+json" -H "X-GitHub-Api-Version: 2022-11-28" "https://api.github.com/repos/actions/checkout/git/matching-refs/tags/${UPSTREAM_VERSION}" |jq -r '.[0].object.sha')
git commit -m "Add ${UPSTREAM_VERSION} (https://github.com/actions/checkout/commit/${UPSTREAM_COMMIT}) of upstream actions/checkout action"

3. Add our changes

Then, we add our changes by rebasing the update-checkout-action branch on top of the new base version.

Important

This is the most critical step as it requires careful adaptation of our changes to the changed upstream sources. Outright conflicts will lead to an interruption in the rebase operation and give us a chance to make the most important changes at the right time. However, make sure to check the final result of this operation commit-by-commit regardless of conflicts to ensure proper porting.

Note

You may also want to consult the upstream changelog and upstream version diff to identify likely points that need adaptation.

git checkout update-checkout-action
MERGE_BASE=$(git merge-base update-checkout-action ${UPDATE_BRANCH})
BASE_COMMIT=$(git log --format="format:%H" $MERGE_BASE..update-checkout-action |tail -n 1)
git rebase --onto ${UPDATE_BRANCH} ${BASE_COMMIT} update-checkout-action
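To see what these commands do without touching the real repository, the sketch below replays them in a throwaway repository. The repository contents, the version labels v1 and v2, and the file names action.txt and fork.txt are assumptions made up for this illustration; only the last four Git commands before the checks are taken from the step above.

```shell
set -eu
# Throwaway identity so commits and the rebase work without global Git configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git commit -q --allow-empty -m "Initial state of main"
git branch -M main

# Long-lived branch: the old upstream import followed by one fork change.
git checkout -q -b update-checkout-action
echo "upstream v1" > action.txt
git add action.txt
git commit -q -m "Add v1 of upstream"
echo "fork change" > fork.txt
git add fork.txt
git commit -q -m "Add submodules filter"

# Temporary update branch: the new upstream import directly on top of main.
UPDATE_BRANCH=update-checkout-action-v2
git checkout -q -b "$UPDATE_BRANCH" main
echo "upstream v2" > action.txt
git add action.txt
git commit -q -m "Add v2 of upstream"

# The commands from step 3: BASE_COMMIT is the old upstream import, so only the
# commits after it (our changes) are replayed onto the update branch.
git checkout -q update-checkout-action
MERGE_BASE=$(git merge-base update-checkout-action "$UPDATE_BRANCH")
BASE_COMMIT=$(git log --format="format:%H" "$MERGE_BASE"..update-checkout-action | tail -n 1)
git rebase -q --onto "$UPDATE_BRANCH" "$BASE_COMMIT" update-checkout-action

replayed=$(git log --format=%s "$UPDATE_BRANCH"..update-checkout-action)
action_contents=$(git show update-checkout-action:action.txt)
echo "replayed commits: $replayed"
echo "action.txt now contains: $action_contents"
```

In this toy setup the old import is dropped, the fork change is the only commit replayed, and the branch ends up on top of the new upstream sources. In the real repository the replay can stop on conflicts, which is where the manual adaptation happens.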

4. Update the update branch and test

Now we switch back to the dedicated update branch, bring it up to date with the rebased update-checkout-action branch, and then build and test. If everything works as expected, we can push our efforts.

git checkout ${UPDATE_BRANCH}
git rebase update-checkout-action
npm run build
npm run test

At this point the tests should all pass. If this is the case, we can proceed; otherwise we need to fix any problems and update both branches accordingly.

5. Push update

With all tests passing, we can add the final commit to the update branch and then push both branches. The branch that will be merged goes to our own fork, whereas the updated update-checkout-action branch goes into test-infra.

git add dist README.md
git commit -m "Transpile"
git push -u myfork ${UPDATE_BRANCH}
git push origin update-checkout-action

6. Open the PR

We open the PR in the test-infra repo.

Note

Make sure to open a PR for the ${UPDATE_BRANCH} branch, but not for the update-checkout-action branch, which should remain for future updates.

Once the PR is merged, our update is done.

Update guidance per commit

To help with the adaptation of our changes, here is a brief description of all the commits that make up the changes. They can be consulted to complete step 3 above.

Cleanup fork

This removes parts of the upstream sources that are not applicable, such as their contributions guide, and adapts the remainder of the code to the different directory layout dictated by the integration of several actions and other parts in the single test-infra repository.

Add submodules filter

This commit adds the submodules-filter option, allowing for filtering also in submodules.

Add single-branch fetch option

This commit adds the single-branch option, allowing for single-branch clones even in the case of non-shallow clones.

Add additional fetch refs to input settings

This commit adds the additional-fetch-refs option, which is essential when history operations with other branches than the checkout branch are needed.

Update README.md

This commit updates the README.md file with the information specific to this fork.

Transpile

This commit adds the changes from the previous commits to the dist/index.js file and updates this README.md file with the documentation for the new options. It should be created using the npm run build command in the checkout directory.

Important

This commit should only be present in the ${UPDATE_BRANCH} branch, but not in the update-checkout-action branch.
