CI: pool the per-config runner matrices into parallel make-check jobs #10667
Open
julek-wolfssl wants to merge 6 commits into
Replace the one-runner-per-configuration matrices across the
make-check workflow family with a generic pooled runner,
.github/scripts/parallel-make-check.py. Each workflow keeps its
configuration list as JSON next to the invocation; one runner (or a
small fixed set of shards, balanced by measured per-config minutes)
builds every config in its own out-of-tree (VPATH) build directory off
a single checkout/autogen, on a pool of one-per-CPU worker threads,
longest first. Concurrent checks are isolated with bubblewrap network
namespaces, compilations are cached with ccache, the first failure
aborts the rest (fail-fast, with --no-fail-fast to run everything),
and per-config timings plus pool efficiency land in the step summary.
Failure logs upload as artifacts. smoke-test.yml is likewise reworked
into a single pooled job that runs its nine configs on one runner.
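The scheduling policy described above (one worker per CPU, longest config first, first failure stops the rest) can be sketched roughly as follows. This is an illustration, not the code in parallel-make-check.py: the function name `run_pool`, the shape of `configs` (name mapped to estimated minutes) and the `run_one` callback are assumptions, and the sketch only skips configs that have not started yet, whereas the real script also kills builds that are already running.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def run_pool(configs, run_one, workers=None, fail_fast=True):
    """Run run_one(name) for every config, longest estimate first.

    configs: dict of config name -> estimated minutes (hypothetical shape).
    run_one: callable returning True on success; it stands in for
             "configure + make + make check" in a per-config build dir.
    Returns a dict of name -> "pass" | "fail" | "skipped".
    """
    workers = workers or os.cpu_count() or 1
    stop = threading.Event()      # set as soon as any config fails
    results = {}
    lock = threading.Lock()

    def worker(name):
        if fail_fast and stop.is_set():
            status = "skipped"    # dequeued after a failure: never started
        else:
            ok = run_one(name)
            status = "pass" if ok else "fail"
            if not ok:
                stop.set()
        with lock:
            results[name] = status

    # Longest first, so the slowest configs do not end up alone at the tail.
    order = sorted(configs, key=configs.get, reverse=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, order))
    return results
```

Passing `fail_fast=False` corresponds to the --no-fail-fast behaviour: every config runs regardless of earlier failures.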
Converted workflows (runner jobs per full pass):
  os-check.yml               101 -> 8  (92 Ubuntu configs -> 4 shards; the macOS
                                       matrix, the user-settings jobs and the
                                       standalone
                                       macos-apple-native-cert-validation.yml
                                       fold into one macOS runner; Windows
                                       unchanged)
  pq-all.yml                  21 -> 2  shards
  disable-pk-algs.yml         15 -> 1
  wolfCrypt-Wconversion.yml   11 -> 1
  trackmemory.yml              7 -> 1
  cryptocb-only.yml            8 -> 1  (incl. the two new SHA512 entries)
  multi-compiler.yml           6 -> 1
  smallStackSize.yml           6 -> 1
  multi-arch.yml               6 -> 1
  async.yml                    5 -> 1
  psk.yml                      5 -> 1
  no-malloc.yml                3 -> 1
  wolfsm.yml                   3 -> 1
  opensslcoexist.yml           2 -> 1
Measured against current upstream passing runs (job execution time,
queue excluded): ~200 runner jobs / ~374 runner-minutes per full pass
become 23 jobs / ~168 runner-minutes, with more coverage than before.
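The description says shards are balanced by measured per-config minutes but does not say how the split is computed. One common way to get such a split is the greedy longest-processing-time heuristic, sketched below. The function name `balance_shards` and the input shape are assumptions made for illustration; the PR may well assign shards by hand or by another rule.

```python
import heapq


def balance_shards(minutes, shard_count):
    """Greedy split: give each config, longest first, to the lightest shard.

    minutes: dict of config name -> measured minutes (hypothetical input).
    Returns a list of (total_minutes, [names]) with one entry per shard.
    """
    # Min-heap of (current load, shard index); the lightest shard pops first.
    heap = [(0.0, i) for i in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]

    # Sort by descending minutes, then by name so ties are deterministic.
    for name in sorted(minutes, key=lambda n: (-minutes[n], n)):
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + minutes[name], idx))

    totals = [sum(minutes[n] for n in shard) for shard in shards]
    return list(zip(totals, shards))
```

The heuristic is not optimal, but it keeps the gap between the heaviest and lightest shard no larger than the longest single config, which is usually good enough when the goal is similar wall time per runner.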
multi-arch's old matrix combined an "include" list of four
architectures with an "opts" axis; GitHub's include-merge rules made
each arch entry overwrite the previous one, so only the armel
combinations actually ran. The pooled list restores the intended
aarch64/armhf/riscv64 coverage (23 combinations; riscv64 x sp-math is
omitted as invalid - configure rejects sp-math without SP, and
--enable-riscv-asm, unlike --enable-sp-asm, does not bring SP in).
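The count of 23 is consistent with four architectures crossed with six option sets, minus the one invalid pair. The sketch below reproduces that arithmetic. The architecture names come from the description; the figure of six option sets is inferred from the old job count, and every option label other than "sp-math" is a placeholder, not a name from the workflow.

```python
from itertools import product

ARCHES = ["armel", "aarch64", "armhf", "riscv64"]
# Placeholder labels: only "sp-math" is named in the description.
OPTS = ["sp-math", "opt-b", "opt-c", "opt-d", "opt-e", "opt-f"]
# configure rejects sp-math without SP, and riscv-asm does not enable SP.
INVALID = {("riscv64", "sp-math")}


def combinations(arches=ARCHES, opts=OPTS, invalid=INVALID):
    """Full arch x opts cross product with the invalid pairs filtered out."""
    return [(a, o) for a, o in product(arches, opts) if (a, o) not in invalid]
```

Writing the list out explicitly, rather than relying on matrix include-merging, is what makes the intended coverage visible: every (arch, opts) pair is an entry that either exists or does not.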
Out-of-tree build fixes this depends on:
- Makefile.am: symlink the read-only test data (certs/, tests/ config
files, sniffer captures and helpers, examples/crypto_policies,
input, quit) into the build tree via a BUILT_SOURCES stamp, removed
again in distclean-local. ChangeToWolfRoot() and the script tests
resolve everything relative to the working directory, so out-of-tree
make check and make distcheck now pass.
- scripts/multi-msg-record.py: locate the client binary from the build
tree working directory rather than the script's source directory.
- configure.ac + wolfssl/include.am: run
support/gen-debug-trace-error-codes.sh from $srcdir; it reads the
error-code headers from the source tree and generates into the build
tree.
- tests/swdev: a WOLFBUILD variable points the sub-make at the build
tree for the configure-generated headers (wolfssl/options.h,
wolfssl/version.h); the in-tree-only guards are dropped.
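The first fix in the list above, symlinking read-only test data into the build tree and recording completion with a stamp, lives in Makefile.am. The Python sketch below only illustrates the idea. The function name, the stamp file name and the three-entry default list are assumptions; the real recipe covers more paths than certs, input and quit.

```python
import os
from pathlib import Path


def stage_test_data(srcdir, builddir, entries=("certs", "input", "quit"),
                    stamp=".test-data.stamp"):
    """Symlink test data from the source tree into an out-of-tree build dir.

    Returns the entries that were newly linked. The stamp is written only
    after every link succeeded, so a failure part-way through (os.symlink
    raising) never leaves the tree marked as complete.
    """
    src, build = Path(srcdir).resolve(), Path(builddir).resolve()
    if src == build:
        return []                      # in-tree build: nothing to stage
    made = []
    for rel in entries:
        link = build / rel
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink() or link.exists():
            continue                   # already staged (or privatized)
        os.symlink(src / rel, link)
        made.append(rel)
    (build / stamp).touch()
    return made
```

Because tests resolve paths relative to the working directory, a link such as `build/certs -> src/certs` is enough for them to find their inputs without any change to the tests themselves.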
Portions of PR wolfSSL#10649 are incorporated: the cross-platform
ccache-setup composite action, repository_owner gates on check-headers
and check-source-text, the docs-only paths-ignore on os-check, and the
libspdm timeout bumps.
Pull request overview
This PR restructures the GitHub Actions “make check” workflow family to run many per-configuration builds on a smaller pooled set of runners, driven by a new generic runner script. To enable this, it also fixes multiple out-of-tree (VPATH) build/test path assumptions across the autotools build, scripts, and swdev test integration, and expands ccache usage to reduce rebuild time.
Changes:
- Add `.github/scripts/parallel-make-check.py` to build/test many configs concurrently (or sharded) from JSON config lists, with fail-fast, per-config timing, summaries, and log artifacts.
- Update autotools + ancillary scripts to support out-of-tree `make check` (symlink test data into build tree; ensure generated headers and binaries are found correctly).
- Convert numerous CI workflows (including smoke-test and os-check) from large matrices to pooled/sharded pooled runs, plus ccache-setup improvements and timeouts.
Reviewed changes
Copilot reviewed 30 out of 31 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| wolfssl/include.am | Run debug-trace error-code header generator from $(top_srcdir) for VPATH builds. |
| wolfcrypt/test/include.am | Allow swdev to build out-of-tree by passing WOLFBUILD instead of failing VPATH builds. |
| tests/swdev/README.md | Update swdev docs to reflect VPATH support via WOLFBUILD. |
| tests/swdev/Makefile | Add WOLFBUILD include path and prerequisite targeting build-tree generated headers. |
| support/gen-debug-trace-error-codes.sh | Read input headers from $srcdir and generate outputs into the build tree. |
| scripts/multi-msg-record.py | Locate the examples/client/client binary from the build-tree working directory for VPATH checks. |
| Makefile.am | Add VPATH make check support via a BUILT_SOURCES stamp that symlinks required test data into the build tree and cleans it on distclean. |
| configure.ac | Remove in-tree-only swdev guard; run debug-trace header generation from $srcdir. |
| .gitignore | Ignore the new stamp file and pooled build dirs (/build-*). |
| .github/workflows/wolfsm.yml | Convert to pooled parallel make-check runner with ccache + artifact logs. |
| .github/workflows/wolfCrypt-Wconversion.yml | Convert warning-check builds to pooled runner with JSON configs and ccache + artifact logs. |
| .github/workflows/trackmemory.yml | Convert to pooled runner with bwrap, ccache, and JSON config list. |
| .github/workflows/smoke-test.yml | Convert smoke matrix to a single pooled job using the new script, with ccache and bwrap notes. |
| .github/workflows/smallStackSize.yml | Convert to pooled runner with explicit run of testwolfcrypt per config. |
| .github/workflows/psk.yml | Convert to pooled runner with per-config JSON list and ccache + artifact logs. |
| .github/workflows/pq-all.yml | Convert to 2-shard pooled runner; shard-balanced by measured minutes; ccache + artifact logs per shard. |
| .github/workflows/os-check.yml | Convert Ubuntu configs to 4 pooled shards + single pooled macOS job; restore intended multi-arch coverage; add docs-only paths-ignore. |
| .github/workflows/opensslcoexist.yml | Convert to pooled runner with JSON configs and ccache + artifact logs. |
| .github/workflows/no-malloc.yml | Convert to pooled runner and preserve “build + testwolfcrypt only” behavior via check: false + run. |
| .github/workflows/multi-compiler.yml | Convert to pooled runner building multiple compilers in one job using per-config compiler selection. |
| .github/workflows/multi-arch.yml | Convert to pooled runner restoring intended multi-arch × opts coverage; run testwolfcrypt under qemu-user per config. |
| .github/workflows/macos-apple-native-cert-validation.yml | Remove standalone workflow; folded into os-check macOS pooled config list. |
| .github/workflows/libspdm.yml | Increase timeouts to reduce flaky cancellations on loaded runners. |
| .github/workflows/disable-pk-algs.yml | Convert to pooled runner with JSON configs and ccache + artifact logs. |
| .github/workflows/cryptocb-only.yml | Convert to pooled runner; inline former BASE_CONFIG into each JSON entry; add artifact logs. |
| .github/workflows/check-source-text.yml | Gate execution to wolfssl org to avoid fork CI spend. |
| .github/workflows/check-headers.yml | Gate execution to wolfssl org to avoid fork CI spend. |
| .github/workflows/async.yml | Convert to pooled runner with JSON configs and ccache + artifact logs. |
| .github/scripts/parallel-make-check.py | New pooled runner implementation: JSON schema, sharding, parallel workers, fail-fast killing, summaries, and log capture. |
| .github/actions/wait-for-smoke/action.yml | Increase default smoke wait timeout to 3600s. |
| .github/actions/ccache-setup/action.yml | Make ccache install cross-platform and set PATH for compiler wrappers on Linux/macOS. |
Address the Copilot review:
- parallel-make-check.py: validate "configure" (list of strings) and cflags/ldflags (strings) so a malformed entry fails the load instead of exploding a string into per-character configure arguments; print a single line for passing configs instead of dumping their full make-check.log into the CI log (failure dumps unchanged; the logs remain in build-<name>/ for the failure artifacts).
- Makefile.am: use rm -rf for the certs/input/quit setup and distclean cleanup. A --private-dir run replaces the certs symlink with a private directory copy that rm -f cannot remove (verified: make distclean in a build dir with a privatized certs/ now succeeds and removes it).
- psk.yml, disable-pk-algs.yml: normalize the single-dash tokens (-disable-rsa, -disable-ecc, -disable-aescbc, -enable-cryptonly) carried verbatim from the old matrices to the canonical double-dash form. No coverage change: configure honors single-dash spellings (verified -disable-rsa sets NO_RSA with no unrecognized-option warning), so these were always in effect; both touched configs re-validated end-to-end.

The --cc default stays "ccache gcc": ccache resolves the compiler through its own masquerade symlinks (verified: no recursion and normal cache hits with /usr/lib/ccache prepended to PATH), and the explicit CC= also covers jobs that use ccache without the PATH masquerade.
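The load-time validation mentioned in this commit guards against a classic Python pitfall: a string is iterable, so a "configure" value written as a bare string would be spread into one argument per character. A minimal sketch of such a check follows. The keys "configure", "cflags" and "ldflags" come from the commit message; the "name" key, the function name and the error wording are assumptions.

```python
def validate_entry(entry):
    """Reject a malformed config entry at load time; return it if valid."""
    if not isinstance(entry, dict):
        raise ValueError("config entry must be a JSON object")
    name = entry.get("name", "<unnamed>")   # "name" key is an assumption

    configure = entry.get("configure", [])
    # A str passes a naive "is it iterable" test, so reject it explicitly.
    if not isinstance(configure, list) or not all(
        isinstance(arg, str) for arg in configure
    ):
        raise ValueError(f"{name}: 'configure' must be a list of strings")

    for key in ("cflags", "ldflags"):
        if key in entry and not isinstance(entry[key], str):
            raise ValueError(f"{name}: '{key}' must be a string")
    return entry
```

Without the check, something like `list("--enable-all")` would yield thirteen single-character arguments, and the failure would only surface later as a confusing configure error.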
Two fixes from the second Copilot review round:

A process spawned between abort_others()' live_procs snapshot and its registration escaped the kill sweep, leaving that build/check running to completion after fail-fast had begun. Re-check stop_event right after registering the process and SIGTERM its process group if the abort already started: either the registration happened before the sweep's snapshot (the sweep kills it) or it happened after stop_event was set (the re-check sees it), so the window is closed.

Exceptions from callable steps (user_settings staging, private-dir copies) used to escape the worker thread and crash the whole script with no summary. They are now recorded as that config's step failure with the exception written to its make-check.log, e.g. a bad "user_settings" path reports FAIL (stage <path>) while the other configs keep running; the fail-fast bookkeeping is shared with the nonzero-exit path via record_failure().
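The race fix in this commit follows a register-then-recheck pattern, which can be sketched as below. The names stop_event, live_procs and abort_others come from the commit message; `spawn`, `_signal_group` and the lock are illustrative assumptions, and the sketch assumes a POSIX host where each child is started in its own session so that its pid doubles as its process-group id.

```python
import os
import signal
import subprocess
import threading

stop_event = threading.Event()   # set once fail-fast has begun
live_procs = set()               # processes eligible for the kill sweep
live_lock = threading.Lock()


def _signal_group(proc, sig):
    """Signal the child's whole process group, ignoring already-gone groups."""
    try:
        os.killpg(proc.pid, sig)
    except (ProcessLookupError, PermissionError):
        pass


def spawn(cmd):
    """Start cmd in its own process group and register it for the sweep."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    with live_lock:
        live_procs.add(proc)
    # Re-check AFTER registering: if the abort began in the meantime, the
    # sweep's snapshot may have missed this process, so signal it here.
    if stop_event.is_set():
        _signal_group(proc, signal.SIGTERM)
    return proc


def abort_others():
    """Begin fail-fast: set the flag first, then sweep a snapshot."""
    stop_event.set()
    with live_lock:
        snapshot = list(live_procs)
    for proc in snapshot:
        _signal_group(proc, signal.SIGTERM)
```

The ordering is what closes the window: the flag is set before the snapshot is taken, and each spawner checks the flag after it has registered, so a process is covered by at least one of the two paths.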
Third Copilot review round:
- Makefile.am: run the test-data stamp recipe body under set -e. A failed symlink mid-loop previously did not fail the compound command (only the last command's status counted), so a partially-populated build tree could be stamped complete. Now any failed setup command aborts the recipe and the stamp is not created.
- parallel-make-check.py: fail-fast sent SIGTERM only, so a test that traps or ignores SIGTERM could keep the job alive until the workflow timeout. abort_others() now polls the swept processes and SIGKILLs whatever is still alive after a 10 s grace period, and the post-registration race-window kill escalates the same way (bounded wait, then SIGKILL). Verified with a config running "trap '' TERM; sleep 300": the run completes in ~10 s with the stubborn config reported as aborted and no surviving processes.
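The SIGTERM-then-SIGKILL escalation described here can be sketched for a single process group as follows. This is a simplified illustration rather than the script's abort_others(): the function name `terminate_group` and its return values are assumptions, it only watches the group leader, and it expects the child to have been started with `start_new_session=True` on a POSIX host.

```python
import os
import signal
import subprocess
import time


def terminate_group(proc, grace=10.0, poll=0.1):
    """SIGTERM proc's process group; SIGKILL it if still alive after grace.

    Returns "terminated" if the leader exited within the grace period,
    otherwise "killed".
    """
    def send(sig):
        try:
            os.killpg(proc.pid, sig)
        except (ProcessLookupError, PermissionError):
            pass                      # group already gone

    send(signal.SIGTERM)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if proc.poll() is not None:   # leader exited after SIGTERM
            return "terminated"
        time.sleep(poll)

    # Bounded wait is over: SIGKILL cannot be trapped or ignored.
    send(signal.SIGKILL)
    proc.wait()
    return "killed"
```

The bounded wait is the important part: a test that traps or ignores SIGTERM can delay the abort by at most the grace period instead of holding the job until the workflow timeout.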
The two jobs that manage their ccache cache manually rely on ccache's XDG default (~/.cache/ccache) matching the actions/cache path. That holds today, but nothing enforces it: a later change that sets CCACHE_DIR (e.g. adopting the ccache-setup composite, which uses ~/.ccache) would silently decouple the build's cache from the saved/restored directory. Pin CCACHE_DIR explicitly to the cached path so the pairing is visible and cannot drift.
wolfSSL's configure enables make's jobserver by default (AX_AM_JOBSERVER([yes]) -> AM_MAKEFLAGS += -j<nproc+1> in aminclude.am), and automake passes that explicit -j to every recursive sub-make, where it overrides the invoking make's job limit. The script's -j therefore only ever scheduled the outermost recursion hop: --jobs was inert. Measured on a 4-CPU host with 10 build-only configs oversaturating the worker pool, the jobserver default is also the better policy: capping sub-makes via --disable-jobserver and -j2 dropped CPU utilization from 96% to 89% and lengthened the wall time, because configs' serial phases (configure, link) stopped being backfilled by other configs' compile jobs. So make is now invoked with no -j at all - parallelism within a config comes from the configure-default jobserver - and the misleading knob is gone, including the macOS job's --jobs 3.
retest this please