Build test_suites concurrently in local_build.sh with a capped pool by jonathanspw · Pull Request #510 · GoogleCloudPlatform/cloud-image-tests

jonathanspw · 2026-05-18T17:00:23Z

Supersedes #337

Summary

Parallelizes the test-suite builds in local_build.sh with a self-tuning
concurrency cap. Cold full-suite builds drop from ~5m29s to ~2m30s on a
32-core host — a 54% wall-clock reduction — with safe self-throttling on
smaller hardware.

Problem

local_build.sh builds 41 test suites × 4 GOOS/GOARCH variants each in a
single serial loop. On a 32-core/124-GiB host the average CPU usage during
a build is ~4 cores out of 32 — most of the machine sits idle. Locally
this is just slow; in CI it directly blocks every presubmit job that
rebuilds the test binaries.

Naively parallelizing (launch all 41 suites concurrently) is slower than
it should be: 41 simultaneous go test -c invocations race to compile the
same shared dependencies (the imagetest framework, compute-daisy, the
GCE compute APIs, stdlib). Go's build cache is filesystem-locked so the
work is safe, but only one winner gets cached per package; the other
~40 simultaneous compiles of the same dep are pure waste. Measured: user
CPU climbs to ~6000s with only a modest wall-clock improvement (4m36s),
and on memory-constrained hosts the process tree can OOM.

Changes

- New `-j N` flag on `local_build.sh`. `-j 0` (default) means auto:
  `max(1, (nproc-1)/3)`. The `(nproc-1)/3` formula matches the
  observation that each `go test -c` spawns ~3 hot worker processes
  (compile, vet, link), so a cap of K*3 ≤ cores-1 keeps the worker
  count close to the core count without leaving the system starved.
- A FIFO-based semaphore throttles the per-suite background subshells to
  `$jobs` concurrent suites. Each suite still builds its 4 arch variants
  serially inside its slot, so a hung suite can't starve more than one
  slot.
- The per-suite build commands are hoisted into a `build_suite()`
  function: same commands, same outputs, just callable.
- Failure tracking: per-suite exit codes are collected; on any failure
  the script reports which suite(s) failed and exits non-zero.
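The semaphore and failure-tracking pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `run_pool`, the stand-in `build_suite`, the suite names, and the per-suite `.rc` files are all assumptions made for the example, and it relies on opening a FIFO read/write, which works on Linux but is not guaranteed by POSIX.

```shell
#!/bin/sh
# Sketch of a FIFO-token semaphore. Hypothetical names throughout;
# this is not the implementation in local_build.sh.

# Stand-in for the real per-suite build: suites named broken_* "fail".
build_suite() {
  case $1 in broken_*) return 1 ;; esac
  return 0
}

# run_pool JOBS SUITE...
# Runs build_suite for every suite with at most JOBS running at once.
# Sets the global `failed` to a space-separated list of failed suites.
run_pool() {
  jobs=$1; shift
  failed=
  tmp=$(mktemp -d) || return 2
  mkfifo "$tmp/sem" || return 2
  exec 3<>"$tmp/sem"          # read/write open so neither end blocks

  # Preload one token (an empty line) per slot.
  i=0
  while [ "$i" -lt "$jobs" ]; do echo >&3; i=$((i + 1)); done

  for suite in "$@"; do
    read -r token <&3         # take a token; blocks while all slots are busy
    (
      rc=0
      build_suite "$suite" || rc=$?
      echo "$rc" > "$tmp/$suite.rc"   # record this suite's exit code
      echo >&3                        # return the token, pass or fail
    ) &
  done
  wait

  # Collect exit codes in suite order.
  for suite in "$@"; do
    [ "$(cat "$tmp/$suite.rc")" = 0 ] || failed="$failed${failed:+ }$suite"
  done
  exec 3>&-
  rm -rf "$tmp"
  [ -z "$failed" ]
}

pool_status=0
run_pool 2 alpha broken_beta gamma delta broken_epsilon || pool_status=$?
echo "status=$pool_status failed=$failed"
# prints: status=1 failed=broken_beta broken_epsilon
```

Because a token is taken before each subshell starts and returned when it ends, the number of running suites never exceeds the cap, and a suite that hangs holds only its own token.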

Benchmark

Cold cache, 41 suites × 4 archs = 164 build outputs. Single trial each on
a 32-core / 124 GiB host. All runs produce the same 210 artifacts.

| Variant | Wall clock | User CPU | Effective cores | Notes |
| --- | --- | --- | --- | --- |
| Serial (today) | 5m29s | 1180s | ~4.3 | baseline |
| Parallel unlimited (41 jobs) | 4m36s | 5947s | ~25 | heavy duplicate compilation, OOM risk |
| Parallel K=10 (default on 32 cores) | 2m30s | 2601s | ~20 | shipped default |
| Parallel K=16 (cores/2) | 2m54s | 3258s | ~22 | oversubscribed enough to be slower than K=10 |

Per-suite wall times in the K=10 run are 8–52 seconds: the first wave of
suites starts cold and pays for the shared compile, the rest finish in
seconds against a warm cache. User CPU drops 2.3× compared to the
unlimited variant, so most of the previously wasted duplicate work is gone.
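The headline ratios can be rechecked from the table with a few lines of shell. The helper names `pct_reduction` and `ratio` are made up for this check; the inputs are the wall-clock and user-CPU figures reported above.

```shell
#!/bin/sh
# Recompute the benchmark ratios from the reported numbers.

# pct_reduction BEFORE AFTER: percent reduction, rounded to an integer.
pct_reduction() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a - b) * 100 / a }'
}

# ratio A B: A divided by B, to one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

serial=$((5 * 60 + 29))   # 5m29s = 329 seconds
capped=$((2 * 60 + 30))   # 2m30s = 150 seconds

echo "wall-clock reduction: $(pct_reduction "$serial" "$capped")%"
echo "user CPU, unlimited vs K=10: $(ratio 5947 2601)x"
# prints:
# wall-clock reduction: 54%
# user CPU, unlimited vs K=10: 2.3x
```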

Behavior on smaller hardware

The formula degrades gracefully:

| Cores | Default `K` |
| --- | --- |
| 4 | 1 (serial) |
| 8 | 2 |
| 16 | 5 |
| 32 | 10 |
| 64 | 21 |

-j N is honored verbatim for environments that want explicit control
(CI runners with a known core/memory budget). -j 0 resets to auto.
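As a rough illustration of the auto formula and the `-j` override, the sketch below reproduces the table with integer arithmetic. The function names `auto_jobs` and `resolve_jobs` are hypothetical, and the core count is passed in explicitly here, whereas the real script would presumably read it from `nproc`.

```shell
#!/bin/sh
# Sketch of the default concurrency cap: max(1, (cores-1)/3).

# auto_jobs CORES: print the auto cap, using integer division with a floor of 1.
auto_jobs() {
  k=$(( ($1 - 1) / 3 ))
  [ "$k" -ge 1 ] || k=1
  echo "$k"
}

# resolve_jobs REQUESTED CORES: 0 means auto; any other value is used verbatim.
resolve_jobs() {
  if [ "$1" -eq 0 ]; then
    auto_jobs "$2"
  else
    echo "$1"
  fi
}

for cores in 4 8 16 32 64; do
  echo "$cores cores -> K=$(auto_jobs "$cores")"
done
# prints:
# 4 cores -> K=1
# 8 cores -> K=2
# 16 cores -> K=5
# 32 cores -> K=10
# 64 cores -> K=21
```

The floor of 1 matters on very small machines: with 1 to 3 cores the division yields 0, and the cap falls back to a single serial slot.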

Default cap is (nproc-1)/3 (floor 1), matching the ~3 hot worker processes each `go test -c` spawns (compile, vet, link). Pass -j N to override; -j 0 restores the auto formula. A FIFO semaphore throttles to $jobs concurrent suites; each suite still builds its 4 GOOS/GOARCH variants serially inside the slot. On a 32-core host this cuts a cold all-suites build from ~5m29s to ~2m30s (-54%) and self-throttles cleanly on smaller hardware (4 cores → K=1, effectively serial).

google-oss-prow · 2026-05-18T17:00:52Z

Hi @jonathanspw. Thanks for your PR.

I'm waiting for a GoogleCloudPlatform member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

shenpai35 · 2026-05-18T17:11:12Z

/ok-to-test

-- 4eb3f49 by Jonathan Wright <jonathan@almalinux.org>:

Build test_suites concurrently in local_build.sh with a capped pool

Default cap is (nproc-1)/3 (floor 1), matching the ~3 hot worker processes each `go test -c` spawns (compile, vet, link). Pass -j N to override; -j 0 restores the auto formula. A FIFO semaphore throttles to $jobs concurrent suites; each suite still builds its 4 GOOS/GOARCH variants serially inside the slot. On a 32-core host this cuts a cold all-suites build from ~5m29s to ~2m30s (-54%) and self-throttles cleanly on smaller hardware (4 cores → K=1, effectively serial).

FUTURE_COPYBARA_INTEGRATE_REVIEW=#510 from jonathanspw:local_build_parallel 4eb3f49
PiperOrigin-RevId: 926948257

google-oss-prow Bot added size/M needs-ok-to-test labels May 18, 2026

google-oss-prow Bot added ok-to-test and removed needs-ok-to-test labels May 18, 2026

drewhli approved these changes Jun 4, 2026

copybara-service Bot mentioned this pull request Jun 4, 2026

Copybara import of the project: #523

Merged

copybara-service Bot merged commit d123233 into GoogleCloudPlatform:main Jun 4, 2026
10 checks passed
