Skip to content

Commit 1ac27b7

Browse files
wyattgill9claude
andauthored
ci: parallel checks via nix-fast-build + warm /nix/store cache (#293)
## Summary Rework the `Check` workflow for the fastest correct CI, plus one matching simplification in `lib/per-system.nix`. No image, package, or znver5 optimization changes — only how the checks are built, aggregated, and cached. - **Parallel eval + build** via [`nix-fast-build`](https://github.com/Mic92/nix-fast-build) (pinned to 1.5.0 by commit) over `.#checks.x86_64-linux`, replacing the single `nix build .#checks.x86_64-linux.all` step. `nix-fast-build` evaluates checks with `nix-eval-jobs` and streams each derivation into a build pool as it resolves, instead of blocking on one linkFarm to finish evaluating. `--skip-cached` skips paths already in a substituter, so a warm run does almost no work. - **Warm `/nix/store` cache** via [`cache-nix-action`](https://github.com/nix-community/cache-nix-action) (v7) keyed on `flake.lock`, so the znver5 base is restored from GitHub's cache instead of re-substituted from Cachix on every run. - **One runner, on purpose.** The znver5 base is shared by every check; splitting checks across ephemeral runners would rebuild that base on each one when the cache is cold (thundering herd). A single job keeps the base built at most once and gets its parallelism from `nix-fast-build` + `max-jobs = auto` / `cores = 1` (one build per core, each single-threaded, so consumed cores stays at `max-jobs * NIX_BUILD_CORES = core count` rather than oversubscribing to core-count²). - **Runner: `ubuntu-latest`** (a standard 2-vCPU GitHub-hosted runner). The cache-cold whole-closure rebuild is CPU-bound, so swapping in a larger x86_64-linux runner label (if the org has one) would shorten it; `nix-fast-build` + `max-jobs = auto` scale to whatever cores the label provides. - **Dropped the `all` linkFarm** in `lib/per-system.nix`: it existed only so `nix build` had one aggregate target, and `nix-fast-build` enumerates every check itself. The required `flake-check` status, the push-only Cachix writer (`if: github.event_name == 'push'`), and `nix flake check -L --no-build` are all retained. The two-job gate (`check-group` + `flake-check`) collapses into a single `flake-check` job because there is no fan-out to aggregate; the required status name is unchanged. ## Why it is faster, and what CI timing should confirm The dominant cost is the custom-compiled znver5 closure (`nixpkgs.hostPlatform.gcc.arch = "znver5"`), which only the `indexable-inc` Cachix serves. Three independent levers attack it: 1. **Pipelined parallel eval/build.** The old step evaluated the whole `all` linkFarm before any build started; `nix-fast-build` overlaps evaluation and building and parallelizes both. *Confirm:* the "Build all flake checks" step starts producing build output well before evaluation finishes, and total wall time drops on multi-check runs. 2. **A second, faster cache layer in front of Cachix.** Restoring `/nix/store` from GitHub's same-datacenter cache is faster than re-substituting the same paths from Cachix over HTTPS. *Confirm:* on a warm PR (no `flake.lock` change) the cache-nix-action restore reports a hit and the build step is near-noop; compare warm-run wall time against a cold run. 3. **More cores for the cold path.** A `flake.lock` bump invalidates the store-cache key and must recompile the closure; this is CPU-bound, so the larger runner shortens the long pole. The `nix-${{ runner.os }}-` restore prefix still restores the previous lock's store as a partial warm base. *Confirm:* a lock-bump run is faster on the larger runner than baseline, and still restores a (mostly stale) store. First run on a new `flake.lock` (ideally a push to `main`) is the slow one: it builds, pushes to Cachix, and saves the store cache; every later run on that lock restores both. ## Coverage preserved (invariant 2) `nix eval --accept-flake-config --json .#checks.x86_64-linux --apply builtins.attrNames`, before vs after: - before: `["agents-md","all","eval","lint","loader-manifests","run-records-session","rust-package-tests","site-case-tests","site-test"]` - after: `["agents-md","eval","lint","loader-manifests","run-records-session","rust-package-tests","site-case-tests","site-test"]` The only difference is the removed `all`, which was a pure aggregation linkFarm over the other eight (its sole consumer was the CI `nix build … .all`). The eight real checks are byte-for-byte unchanged. ## Action items and tradeoffs for humans - **Runner label.** `flake-check` runs on `ubuntu-latest`, so CI schedules and can go green on its own. If the org provisions a larger x86_64-linux runner, swap its label into `runs-on:` to give the CPU-bound cache-cold rebuild more cores; no other change is needed. - **10 GB cache ceiling.** `gc-max-store-size-linux: 8G` trims the store before save so the compressed cache stays under GitHub's 10 GB per-repo limit. If the real check closure is much larger than 8 GB, the cache covers only the hottest 8 GB and the rest still substitutes from Cachix (still correct, partial speedup). Tune the cap upward while watching the compressed cache size in the repo's Actions cache list; a save that exceeds 10 GB is rejected wholesale. - **znver5 scope deliberately unchanged (invariant 3).** I did not narrow `gcc.arch = "znver5"` to fewer packages even though that would cut build time, because it changes what is built. If the team wants it, it should be evaluated as a separate behavior change. - **cache-nix-action ↔ determinate-nix-action ordering.** cache-nix-action documents DeterminateSystems as a compatible installer and merges the Nix DB on restore, so install-then-restore is the intended order. This interaction is exercised only on the Linux runner (see below). ## Validation Run locally on an aarch64-darwin dev host: - attrNames before/after diff above — only `all` removed. - `nixfmt --check lib/per-system.nix` using the repo's pinned `formatter` — clean. - `actionlint .github/workflows/check.yml` — the only finding is the intended `LARGER_RUNNER_LABEL_TODO` unknown-label note; YAML and embedded shell are otherwise clean. - `git diff --check` — clean. - `nix-fast-build 1.5.0 --help` — confirmed `--flake`, `--skip-cached`, `--no-nom`, `--no-link`, and `--option` exist at the pinned commit. - Action SHAs resolved via `gh api` (both are lightweight tags resolving directly to commits). Deferred to this PR's CI (x86_64-linux), and **not** verified locally: - A full `nix flake check -L --no-build` and the actual check builds. The dev host is aarch64-darwin; forcing the checks' `drvPath` triggers the repo's IFD (cargo-unit's generated `cargo-units.nix`), which must build x86_64-linux derivations that cannot be realized on darwin. The `attrNames` eval, which does not force IFD, succeeds cleanly. - The cache-nix-action store restore/save and its DB merge under determinate-nix-action. ## Test plan - [x] `runs-on` set to `ubuntu-latest` (swap in a larger runner label if one is available). - [ ] First run builds and (on push to `main`) populates Cachix + the store cache. - [ ] A follow-up warm PR shows a cache-nix-action restore hit and a near-noop build step. - [ ] Confirm the saved cache stays under 10 GB compressed. --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 94262c9 commit 1ac27b7

2 files changed

Lines changed: 108 additions & 70 deletions

File tree

.github/workflows/check.yml

Lines changed: 75 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -18,8 +18,20 @@ concurrency:
1818
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
1919

2020
jobs:
21-
check-group:
22-
name: check / all checks and schema
21+
# One job on one runner, on purpose. Every check shares a single
22+
# znver5-tuned nixpkgs base (see the gccarch-znver5 note below). Fanning the
23+
# checks across several ephemeral runners would rebuild that shared base on
24+
# each one whenever the cache is cold (a thundering herd), so we keep one
25+
# runner and get parallelism *inside* it from nix-fast-build plus
26+
# `max-jobs = auto`. Branch protection requires a status named `flake-check`;
27+
# this job is it, and it fails iff a check fails to build or the flake stops
28+
# evaluating.
29+
flake-check:
30+
# Standard GitHub-hosted runner. A cache-cold run (e.g. a flake.lock bump)
31+
# recompiles the whole znver5 closure and is CPU-bound, so if the org
32+
# provisions a larger x86_64-linux runner, swapping its label in here gives
33+
# the cold path more cores; nix-fast-build and `max-jobs = auto` scale to
34+
# whatever core count the label provides.
2335
runs-on: ubuntu-latest
2436
steps:
2537
- name: Checkout repository
@@ -28,45 +40,78 @@ jobs:
2840
- name: Install Determinate Nix
2941
uses: DeterminateSystems/determinate-nix-action@bafaa638b9d5ec0e7e3ac1a7fc80453ef1fd265f # v3
3042
with:
31-
# Every image in this repo pins `nixpkgs.hostPlatform.gcc.arch =
32-
# "znver5"`, which marks each derivation as needing the
33-
# `gccarch-znver5` system feature. GitHub runners don't advertise
34-
# that feature by default, so flake-check fails with
35-
# `missing system features ... Required features: {gccarch-znver5}`
36-
# before evaluating any image.
43+
# `extra-system-features = gccarch-znver5`: every image pins
44+
# `nixpkgs.hostPlatform.gcc.arch = "znver5"`, which marks each
45+
# derivation in the closure as needing the `gccarch-znver5` system
46+
# feature. GitHub runners don't advertise it, so without this line Nix
47+
# refuses the builds with `missing system features ... Required
48+
# features: {gccarch-znver5}` before a single image evaluates.
49+
#
50+
# `accept-flake-config = true`: consume the flake's `nixConfig` (the
51+
# indexable-inc Cachix substituter) without an interactive prompt, so
52+
# the znver5 closure substitutes instead of recompiling.
3753
#
38-
# `accept-flake-config` lets CI consume the flake's `nixConfig`
39-
# (the Cachix substituter) without an interactive prompt.
54+
# `max-jobs = auto` runs one build per core; `cores = 1` keeps each
55+
# build single-threaded. Consumed cores is `max-jobs *
56+
# NIX_BUILD_CORES`, so pairing `auto` with `cores = 0` (every build
57+
# grabs every core) oversubscribes to core-count^2 and thrashes on
58+
# context switches. The closure is wide, so parallelism comes from
59+
# building many derivations at once (max-jobs), not from threads
60+
# inside one build (cores).
61+
# https://nix.dev/manual/nix/2.28/advanced-topics/cores-vs-jobs
4062
extra-conf: |
4163
extra-system-features = gccarch-znver5
4264
accept-flake-config = true
65+
max-jobs = auto
66+
cores = 1
67+
68+
# Restore the prior /nix/store so the warm path skips re-fetching the
69+
# znver5 base from Cachix on every run. The key tracks `flake.lock`: the
70+
# closure only churns when an input moves. The `nix-${{ runner.os }}-`
71+
# prefix lets a flake.lock bump still restore the previous lock's store as
72+
# a warm base (shared compiled paths survive the bump). The post-job save
73+
# (success only) trims the store to `gc-max-store-size-linux` so the
74+
# compressed cache stays under GitHub's 10 GB ceiling; a cache miss just
75+
# falls back to Cachix.
76+
- name: Cache /nix/store
77+
uses: nix-community/cache-nix-action@7df957e333c1e5da7721f60227dbba6d06080569 # v7
78+
with:
79+
primary-key: nix-${{ runner.os }}-${{ hashFiles('flake.lock') }}
80+
restore-prefixes-first-match: nix-${{ runner.os }}-
81+
gc-max-store-size-linux: 8G
4382

44-
# Push-event jobs from main and release tags publish store paths to
45-
# Cachix. Pull requests already consume the public substituter through
46-
# the flake's accepted nixConfig, so they skip this writer setup.
83+
# Push-event builds from main and release tags publish store paths to
84+
# Cachix. Pull requests only consume the substituter (through the accepted
85+
# nixConfig above), so they skip this writer setup and never hold the
86+
# token.
4787
- name: Set up Cachix
4888
if: github.event_name == 'push'
4989
uses: cachix/cachix-action@5f2d7c5294214f71b873db4b969586b980625e71 # v17
5090
with:
5191
name: indexable-inc
5292
authToken: ${{ secrets.CACHIX_AUTH_TOKEN }}
5393

54-
- name: Run checks and schema
94+
# nix-fast-build evaluates the checks with nix-eval-jobs (parallel) and
95+
# streams each derivation into a build pool as it resolves, instead of
96+
# waiting for one big linkFarm to finish evaluating the way `nix build
97+
# .#checks.x86_64-linux.all` did. `--skip-cached` drops paths already in a
98+
# substituter, so a warm run does almost no work; `--no-nom` keeps plain
99+
# CI logs; `--no-link` avoids leaving result symlinks. It exits nonzero
100+
# iff a build or eval fails, which is the gate. `--option
101+
# accept-flake-config true` makes both the eval and the builds honor the
102+
# flake's Cachix substituter. Pinned by commit to nix-fast-build 1.5.0.
103+
- name: Build all flake checks
55104
run: |
56-
nix build -L .#checks.x86_64-linux.all
57-
nix flake check -L --no-build
105+
nix run github:Mic92/nix-fast-build/7f185e0ec37b65b4730f892e0de9a831b0610f3a -- \
106+
--flake ".#checks.x86_64-linux" \
107+
--skip-cached \
108+
--no-nom \
109+
--no-link \
110+
--option accept-flake-config true
58111
59-
flake-check:
60-
needs:
61-
- check-group
62-
if: always()
63-
runs-on: ubuntu-latest
64-
steps:
65-
- name: Require all check groups
66-
env:
67-
CHECK_GROUP_RESULT: ${{ needs.check-group.result }}
68-
run: |
69-
if [ "${CHECK_GROUP_RESULT}" != "success" ]; then
70-
echo "::error::One or more check groups failed (${CHECK_GROUP_RESULT})."
71-
exit 1
72-
fi
112+
# Schema/eval gate for the whole flake (packages, modules, formatter, ...),
113+
# broader than the checks nix-fast-build built. `--no-build` keeps it to
114+
# evaluation, so it is the cheap invariant guard rather than a second
115+
# build pass.
116+
- name: Validate flake schema
117+
run: nix flake check -L --no-build

lib/per-system.nix

Lines changed: 33 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -346,48 +346,41 @@ in
346346
cargo-unit-real-workspaces = tests.cargoUnitRealWorkspaces;
347347
}
348348
// rustPackageTests;
349-
350-
checkAttrs = {
351-
inherit (tests) eval;
352-
agents-md = pkgs.runCommand "agents-md-check" { nativeBuildInputs = [ agentsMd ]; } ''
353-
agents-md --check ${paths.root}
354-
mkdir -p "$out"
355-
'';
356-
# Offline schema gate for the loader manifests. `deepSeq` forces
357-
# every Paper / Velocity / Fabric per-version lock through
358-
# `readLoaderManifest` in `lib/artifacts.nix`, so malformed JSON or a
359-
# missing key fires here before any image starts evaluating. The
360-
# forced surface is the parsed-and-validated manifest data, not the
361-
# wrapped `fetchurl` derivations, to keep this check pure eval.
362-
loader-manifests =
363-
let
364-
forced = builtins.deepSeq ix.artifacts.minecraft.loaderManifests "ok";
365-
in
366-
pkgs.runCommand "loader-manifests-check" { } ''
367-
printf '%s\n' '${forced}' > "$out"
368-
'';
369-
run-records-session = repoPackages.run.passthru.tests.recordsSession;
370-
lint = pkgs.runCommand "ix-images-lint" { nativeBuildInputs = [ pkgs.coreutils ]; } ''
371-
cp -R ${lintSource} source
372-
chmod -R u+w source
373-
cd source
374-
${lib.getExe lint}
375-
mkdir -p "$out"
376-
'';
377-
rust-package-tests = pkgs.linkFarm "rust-package-tests" (
378-
lib.mapAttrsToList (name: path: { inherit name path; }) rustChecks
379-
);
380-
site-case-tests = pkgs.linkFarm "site-case-tests" (
381-
lib.mapAttrsToList (name: path: { inherit name path; }) siteTests.cases
382-
);
383-
site-test = siteTests.all;
384-
};
385349
in
386-
checkAttrs
387-
// {
388-
all = pkgs.linkFarm "ix-images-checks" (
389-
lib.mapAttrsToList (name: path: { inherit name path; }) checkAttrs
350+
{
351+
inherit (tests) eval;
352+
agents-md = pkgs.runCommand "agents-md-check" { nativeBuildInputs = [ agentsMd ]; } ''
353+
agents-md --check ${paths.root}
354+
mkdir -p "$out"
355+
'';
356+
# Offline schema gate for the loader manifests. `deepSeq` forces
357+
# every Paper / Velocity / Fabric per-version lock through
358+
# `readLoaderManifest` in `lib/artifacts.nix`, so malformed JSON or a
359+
# missing key fires here before any image starts evaluating. The
360+
# forced surface is the parsed-and-validated manifest data, not the
361+
# wrapped `fetchurl` derivations, to keep this check pure eval.
362+
loader-manifests =
363+
let
364+
forced = builtins.deepSeq ix.artifacts.minecraft.loaderManifests "ok";
365+
in
366+
pkgs.runCommand "loader-manifests-check" { } ''
367+
printf '%s\n' '${forced}' > "$out"
368+
'';
369+
run-records-session = repoPackages.run.passthru.tests.recordsSession;
370+
lint = pkgs.runCommand "ix-images-lint" { nativeBuildInputs = [ pkgs.coreutils ]; } ''
371+
cp -R ${lintSource} source
372+
chmod -R u+w source
373+
cd source
374+
${lib.getExe lint}
375+
mkdir -p "$out"
376+
'';
377+
rust-package-tests = pkgs.linkFarm "rust-package-tests" (
378+
lib.mapAttrsToList (name: path: { inherit name path; }) rustChecks
379+
);
380+
site-case-tests = pkgs.linkFarm "site-case-tests" (
381+
lib.mapAttrsToList (name: path: { inherit name path; }) siteTests.cases
390382
);
383+
site-test = siteTests.all;
391384
}
392385
);
393386

0 commit comments

Comments
 (0)