ci: speed up test workflow with parallel jobs, sccache, and mold by Evrard-Nil · Pull Request #831 · nearai/cloud-api

Evrard-Nil · 2026-06-19T10:57:44Z

Summary

Splits the single sequential test job into five parallel jobs and adds sccache + the mold linker to cut CI wall-clock time. Targets the cloud-api test workflow, which already runs on our self-hosted runner (gpu11) but was slow because everything ran sequentially in one job.

Does it already run on our runners?

Yes — both the old lint and test jobs already used runs-on: [self-hosted, infra]. This PR keeps that. The slowness comes from (a) sequential execution, (b) no object-level compile cache, (c) slow GNU ld linking of the large test binaries, and (d) a dead ALTER SYSTEM + docker restart step.

Results

| Job | Time |
| --- | --- |
| Lint | 3m59s |
| Unit tests | 2m50s |
| Integration tests | 2m58s |
| E2E tests | 5m58s |
| Total wall-clock (parallel) | ~6m |

Baseline (old workflow on main, sequential): ~10-11m. The new workflow is ~40% faster on a cold cache; warm sccache runs should be faster still.
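As a rough check of the ~40% figure, the sketch below assumes wall-clock time is bounded by the slowest parallel job (E2E at 5m58s) and compares it with both ends of the ~10-11m baseline. The helper name `pct_faster` is hypothetical and the arithmetic is integer-only.

```shell
# Percentage reduction in wall-clock time, in whole percent.
# Usage: pct_faster <baseline_seconds> <new_seconds>
pct_faster() {
  echo $(( ($1 - $2) * 100 / $1 ))
}

# Slowest parallel job from the Results table: E2E at 5m58s.
e2e=$(( 5 * 60 + 58 ))

echo "vs 10m baseline: $(pct_faster 600 "$e2e")%"
echo "vs 11m baseline: $(pct_faster 660 "$e2e")%"
```

With these inputs the two lines report 40% and 45%, which brackets the figure quoted above.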

Changes

Job split (parallel execution)

Old: lint ‖ test (where test ran unit → e2e → vLLM → release sequentially)

New: lint ‖ unit-test ‖ integration-test ‖ e2e-test ‖ build-release (all parallel)

unit-test (cargo nextest run --lib --bins): no PostgreSQL needed — in-crate #[test]s use mocks.
integration-test (cargo nextest run --test integration_tests): no PostgreSQL needed — MockProvider is used by default (USE_REAL_VLLM is not set).
e2e-test (cargo nextest run --test e2e_all): the only job with the PostgreSQL service container.
build-release (cargo build --release): main-push only, split out so it doesn't block tests and doesn't pull in PostgreSQL (there is no build.rs).
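To reproduce a single job locally, the mapping above can be written as a small lookup. This is only a sketch: `job_command` is a hypothetical helper, and the lint job is omitted because its command is not given in this description.

```shell
# Print the cargo command that a given CI job runs (commands taken from the list above).
job_command() {
  case "$1" in
    unit-test)        echo "cargo nextest run --lib --bins" ;;
    integration-test) echo "cargo nextest run --test integration_tests" ;;
    e2e-test)         echo "cargo nextest run --test e2e_all" ;;
    build-release)    echo "cargo build --release" ;;
    *)                echo "unknown job: $1" >&2; return 1 ;;
  esac
}

# Show what the e2e job would run; only this job needs PostgreSQL.
job_command e2e-test
```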

sccache (`RUSTC_WRAPPER=sccache`)

Caches compiled objects at the object-file granularity, complementing swatinem/rust-cache (which caches target/). Survives Cargo.lock bumps and feature-flag changes better. On a self-hosted runner the default cache dir (~/.cache/sccache) persists on disk across runs, so no actions/cache step is needed.
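A minimal sketch of how a setup step might enable the wrapper defensively, assuming sccache is already installed on the runner. The helper name `enable_sccache` is hypothetical; the composite action in this PR may do this differently.

```shell
# Export RUSTC_WRAPPER only when sccache is actually on PATH, so a missing
# binary degrades to an uncached build instead of breaking every rustc call.
enable_sccache() {
  if command -v sccache >/dev/null 2>&1; then
    export RUSTC_WRAPPER=sccache
    return 0
  fi
  echo "sccache not found on PATH; building without a compile cache" >&2
  return 1
}

if enable_sccache; then
  echo "RUSTC_WRAPPER=$RUSTC_WRAPPER"
fi
```

After a build, `sccache --show-stats` reports hit and miss counts, which is a quick way to confirm that the on-disk cache is being reused across runs.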

mold linker (`RUSTFLAGS="-C link-arg=-fuse-ld=mold"`)

2-5× faster linking of the ~60MB test binary. Installed via GitHub release binary (both mold and ld.mold) — not apt — to avoid dpkg lock contention between parallel jobs. Configured via RUSTFLAGS env var rather than in .cargo/config.toml so:

The Docker reproducible build (which copies .cargo/config.toml and does NOT set RUSTFLAGS) is unaffected and keeps using ld.
RUSTFLAGS env replaces target.cfg.rustflags from config (cargo treats them as mutually exclusive), which is fine for CI — the dropped flags (--remap-path-prefix, --build-id=none, --hash-style=gnu, --no-undefined) are reproducibility-only and irrelevant for tests. -C debuginfo=0 is kept for faster test compiles.

Mold integration gotcha: gcc -fuse-ld=mold looks for ld.mold on PATH (not mold). Rustc also passes its own -fuse-ld=lld + a -B path to its bundled lld; the last -fuse-ld flag wins, so -fuse-ld=mold takes effect. Installing ld.mold from the mold tarball resolves the collect2: cannot find 'ld' error.
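The gotcha above suggests checking for both names before pointing the linker at mold. The sketch below does that; `missing_mold_binaries` is a hypothetical helper, and the RUSTFLAGS value is the one quoted in this PR.

```shell
# List which of the two entry points (mold, ld.mold) are missing from PATH.
# gcc -fuse-ld=mold resolves ld.mold, so both must be present.
missing_mold_binaries() {
  local name missing=""
  for name in mold ld.mold; do
    command -v "$name" >/dev/null 2>&1 || missing="$missing $name"
  done
  echo "${missing# }"
}

missing=$(missing_mold_binaries)
if [ -z "$missing" ]; then
  # Both names resolve: safe to route linking through mold.
  export RUSTFLAGS="-C debuginfo=0 -C link-arg=-fuse-ld=mold"
else
  echo "not enabling mold, missing on PATH: $missing" >&2
fi
```

Failing early here is cheaper than discovering the problem as a collect2 error at the end of a long compile.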

Cleanup

Removed the ALTER SYSTEM SET max_connections=150 + docker restart step (was test.yml:74-87). .config/nextest.toml already caps the e2e-db test group at max-threads = 16 (16 × 4 pool conns = 64 < PG default max_connections = 100), so the override was dead code adding ~15-30s and a failure point.
Switched vLLM integration tests from cargo test --test integration_tests -- --nocapture to cargo nextest run --test integration_tests for consistent tooling.
Stripped vestigial env vars from the release build (no build.rs → cargo build --release needs none of DATABASE_* / MODEL_DISCOVERY_* / AUTH_*).
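The connection budget behind removing the max_connections override can be checked with a few lines of arithmetic. This sketch assumes one pool per concurrently running test, as the description states; the helper name `peak_connections` is hypothetical.

```shell
# Worst-case PostgreSQL connections: concurrent tests times pool size per test.
peak_connections() {
  echo $(( $1 * $2 ))
}

max_threads=16   # e2e-db group cap in .config/nextest.toml
pool_size=4      # connections per test pool
pg_max=100       # PostgreSQL default max_connections

peak=$(peak_connections "$max_threads" "$pool_size")
if [ "$peak" -lt "$pg_max" ]; then
  echo "ok: peak $peak < max_connections $pg_max"
else
  echo "over budget: peak $peak >= max_connections $pg_max"
fi
```

This prints `ok: peak 64 < max_connections 100`. If either the thread cap or the pool size is raised later, the same check shows when the override would become necessary again.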

Composite action (`.github/actions/setup-rust-ci`)

New composite action centralizes: Rust toolchain, cargo-nextest, sccache, mold, and swatinem/rust-cache (with per-job cache-key for isolation). Avoids step duplication across 5 jobs.

Verification

All 5 jobs pass on this PR (Lint 3m59s, Unit 2m50s, Integration 2m58s, E2E 5m58s)
YAML validated (5 jobs, correct runs-on, only e2e-test has postgres service, only build-release is main-gated)
Confirmed no build.rs anywhere in the workspace (cargo build --release needs no env vars)
Confirmed integration_tests target lives in crates/inference_providers/tests/ and uses MockProvider by default
Confirmed .config/nextest.toml already caps e2e concurrency at 16 threads (justifies removing the PG restart step)
mold v2.41.0 release verified to exist and tarball structure confirmed (bin/mold + bin/ld.mold)

Follow-up issues (not in this PR)

e2e sharding with nextest --partition hash i/N (~3× e2e speedup) — needs a separate runner label or matrix to be effective.
Dedicated CI VM to remove gpu11 prod-host contention (the root cause of CI-duration variability).
Pre-built CI Docker image with deps compiled (cargo test --no-run), refreshed on Cargo.lock changes.
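For the sharding follow-up, nextest's partition flag takes the form `hash:M/N`. The sketch below only prints the per-shard invocations; `shard_commands` is a hypothetical helper, and the shard count of 3 is an assumption chosen to match the ~3× estimate.

```shell
# Print one nextest invocation per shard of the e2e_all test binary.
shard_commands() {
  local total="$1" i=1
  while [ "$i" -le "$total" ]; do
    echo "cargo nextest run --test e2e_all --partition hash:$i/$total"
    i=$(( i + 1 ))
  done
}

shard_commands 3
```

Each printed command would run in its own job, typically driven by a matrix over the shard index.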

Follow-up issues

ci: shard e2e tests with nextest --partition hash for ~3x speedup #832 — e2e sharding with nextest --partition hash (~3× e2e speedup)
ci: pre-build CI Docker image with deps compiled for faster cold builds #833 — pre-build CI Docker image with deps compiled
nearai/infra#184 — dedicated CI VM to remove gpu11 prod-host contention

Split the single 'test' job into five parallel jobs (lint, unit-test, integration-test, e2e-test, build-release) so they run concurrently on the self-hosted runner instead of sequentially. Unit and integration tests do not need PostgreSQL and no longer pull in the service container.

Tooling improvements (shared via a new composite action):
- sccache (RUSTC_WRAPPER=sccache): caches compiled objects across runs, complementing swatinem/rust-cache. Persistent on the self-hosted runner.
- mold linker: 2-5x faster linking of the large test binaries. Installed via release binary (not apt) to avoid dpkg lock contention between parallel jobs. Configured via env var so the Docker reproducible build (which copies .cargo/config.toml) is unaffected and keeps using ld.

Cleanup:
- Remove the 'ALTER SYSTEM SET max_connections=150 + docker restart' step: .config/nextest.toml already caps the e2e-db group at 16 threads (16 * 4 pool conns = 64 < PG default 100), so the override is dead code that only added ~15-30s and a failure point.
- Move 'cargo build --release' (main-push only) into its own job with no PostgreSQL service or test env vars (there is no build.rs).
- Switch vLLM integration tests from 'cargo test' to 'cargo nextest' for consistent tooling. MockProvider is used by default (USE_REAL_VLLM is not set), so these run without external dependencies.

gemini-code-assist

Code Review

This pull request introduces a new composite GitHub Action, setup-rust-ci, to centralize Rust CI setup steps including the Rust toolchain, sccache, mold linker, cargo-nextest, and caching. Feedback on the PR identifies a critical issue in the mold installation step where using --strip-components=2 with -C /usr/local extracts the binary to /usr/local/mold instead of /usr/local/bin/mold, which will cause subsequent commands to fail as /usr/local is typically not in the system PATH. The reviewer provided actionable suggestions to fix the extraction path or use a local directory with $GITHUB_PATH.


gemini-code-assist · 2026-06-19T10:59:03Z

+        TARBALL="mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux.tar.gz"
+        URL="https://github.com/rui314/mold/releases/download/v${MOLD_VERSION}/${TARBALL}"
+        curl -fsSL "$URL" -o "/tmp/${TARBALL}"
+        sudo tar -xzf "/tmp/${TARBALL}" -C /usr/local --strip-components=2 "mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux/bin/mold"


Using --strip-components=2 with -C /usr/local will strip both mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux and bin, extracting the mold binary directly into /usr/local/mold. Since /usr/local is typically not in the system PATH (unlike /usr/local/bin), the subsequent mold --version check and any cargo builds using mold as a linker will fail with a command/linker not found error.

To fix this, you can change --strip-components to 1 so that bin/mold is extracted into /usr/local, resulting in /usr/local/bin/mold.

Alternatively, to avoid requiring sudo and to prevent race conditions when multiple parallel jobs run on the same self-hosted runner concurrently, you could extract it to a local directory and append it to $GITHUB_PATH:

BIN_DIR="$HOME/.local/bin"
mkdir -p "$BIN_DIR"
tar -xzf "/tmp/${TARBALL}" -C "$BIN_DIR" --strip-components=2 "mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux/bin/mold"
echo "$BIN_DIR" >> "$GITHUB_PATH"

sudo tar -xzf "/tmp/${TARBALL}" -C /usr/local --strip-components=1 "mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux/bin/mold"
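The effect of `--strip-components` can be reproduced offline with a stand-in tarball. In this sketch the package name `mold-0.0.0-x86_64-linux` is hypothetical and only mirrors the `<name>/bin/mold` layout described above; GNU tar is assumed.

```shell
work=$(mktemp -d)
pkg="mold-0.0.0-x86_64-linux"   # hypothetical name, same layout as the release tarball

# Build a stand-in tarball containing <pkg>/bin/mold.
mkdir -p "$work/src/$pkg/bin"
echo "fake mold" > "$work/src/$pkg/bin/mold"
tar -czf "$work/pkg.tar.gz" -C "$work/src" "$pkg"

mkdir -p "$work/strip2" "$work/strip1"
# Strips "<pkg>/" and "bin/": the file lands at the root of the target dir.
tar -xzf "$work/pkg.tar.gz" -C "$work/strip2" --strip-components=2 "$pkg/bin/mold"
# Strips only "<pkg>/": the file lands under bin/ in the target dir.
tar -xzf "$work/pkg.tar.gz" -C "$work/strip1" --strip-components=1 "$pkg/bin/mold"

(cd "$work" && find strip1 strip2 -type f | sort)
```

The listing shows `strip1/bin/mold` and `strip2/mold`, matching the two outcomes discussed: with `-C /usr/local`, only the first variant puts the binary in `/usr/local/bin`.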

claude · 2026-06-19T11:00:40Z

Review: CI parallelization + sccache + mold

Nicely scoped and well-documented PR. I verified the central claim behind removing the ALTER SYSTEM max_connections step: .config/nextest.toml does cap the e2e-db group at max-threads = 16, the override filter binary(/^e2e/) does match the e2e_all binary, and create_test_pool builds a single 4-connection pool (db_setup.rs:134, max_size: 4). So 16 × 4 = 64 < 100 holds and dropping the restart step is safe. 👍

A few issues worth addressing before merge:

⚠️ 1. Mold install has a race across parallel jobs (medium)

setup-rust-ci is used by 5 jobs that run concurrently on the same self-hosted host (you confirm they share gpu11). On a cold runner (mold not yet installed, e.g. first run after merge or after a reimage), the command -v mold guard fails for all jobs simultaneously, so all of them:

curl ... -o /tmp/${TARBALL} to the same path, and
sudo tar -xzf ... -C /usr/local writing the same /usr/local/bin/mold

concurrently. Interleaved writes to the shared tarball / target binary can yield a truncated tarball (tar error) or a corrupt mold, breaking the linker for every job. You replaced the dpkg-lock contention with an unsynchronized write race.

Make the install atomic, e.g. download to a unique temp file and flock + atomic mv into place:

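# Download to a unique temp file so parallel jobs never share a partial tarball.
TMP=$(mktemp /tmp/mold.XXXXXX.tar.gz)
curl -fsSL "$URL" -o "$TMP"
# Serialize the install: fd 9 holds an exclusive lock until this shell exits.
exec 9>/tmp/.mold-install.lock
flock 9
# Re-check under the lock; another job may have installed mold while we waited.
if ! command -v mold >/dev/null 2>&1; then
  # Strip only the top-level directory so the binary lands in /usr/local/bin.
  sudo tar -xzf "$TMP" -C /usr/local --strip-components=1 ".../bin/mold"
fi
rm -f "$TMP"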
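The "flock + atomic mv" half of that suggestion can be sketched as a standalone function that works on local files. `install_once` and its lock-file naming are hypothetical, and `flock` from util-linux is assumed to be available on the runner.

```shell
# install_once <source_file> <dest_path>
# Installs source_file at dest_path exactly once, even when called concurrently.
install_once() {
  local src="$1" dest="$2" tmp
  (
    flock 9                              # wait until this process owns the lock
    if [ ! -x "$dest" ]; then            # re-check after acquiring the lock
      # Stage next to the destination so the final rename stays on one filesystem.
      tmp=$(mktemp "$(dirname "$dest")/.install.XXXXXX")
      cp "$src" "$tmp"
      chmod +x "$tmp"
      mv -f "$tmp" "$dest"               # readers see the old state or the full file
    fi
  ) 9>"$dest.lock"
}
```

Because the file is staged under a temporary name and renamed into place, a job that skips the lock and only checks for the binary never observes a half-written executable.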

⚠️ 2. `integration-test` references environment-scoped secrets without the environment (minor)

The job sets VLLM_BASE_URL / VLLM_API_KEY from secrets but, unlike e2e-test, has no environment: Cloud API test env. If those are environment-scoped secrets they resolve to empty strings here. Harmless today (MockProvider is used since USE_REAL_VLLM is unset), but it's dead/misleading config that would silently fail the day someone flips on real vLLM. Either drop the two env: lines or add the environment: binding.
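One way to turn that silent failure into a loud one is a guard step that runs before the tests. This is a sketch only: `check_vllm_env` is a hypothetical helper, and it treats any non-empty USE_REAL_VLLM as "set", which is an assumption about how the tests read the variable.

```shell
# Fail fast when real-vLLM mode is requested but the secrets resolved to empty.
check_vllm_env() {
  # Mock path: nothing to validate.
  [ -n "${USE_REAL_VLLM:-}" ] || return 0
  if [ -z "${VLLM_BASE_URL:-}" ] || [ -z "${VLLM_API_KEY:-}" ]; then
    echo "USE_REAL_VLLM is set but VLLM_BASE_URL or VLLM_API_KEY is empty" >&2
    echo "hint: environment-scoped secrets need an environment binding on the job" >&2
    return 1
  fi
}
```

Calling `check_vllm_env` at the start of the job is a no-op with MockProvider and an immediate, explained failure once real vLLM is enabled without the environment binding.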

🔎 3. Five separate `rust-cache` keys may thrash GitHub's 10 GB cache (minor)

swatinem/rust-cache still uses the Actions cache backend; with distinct keys (lint/unit/integration/e2e/release) each saving a multi-GB target/, you can blow past the 10 GB per-repo limit and trigger LRU eviction → cache misses. sccache (persistent on-disk) covers object-level recompiles so the net is probably fine, but consider a shared-key for the test jobs that build the same profile to reduce duplication.

None of these block correctness of the test results; #1 is the one most likely to cause an intermittent CI failure.

⚠️ (issues found)

The previous --strip-components=2 with -C /usr/local extracted mold to /usr/local/mold instead of /usr/local/bin/mold, so 'command -v mold' failed and cargo could not find the linker.

Setting CARGO_TARGET_*_LINKER=mold made rustc invoke mold directly as a linker driver, but mold is a linker, not a driver, and rejects gcc driver flags like -m64 ('unknown -m argument: 64'). Switch to RUSTFLAGS='-C debuginfo=0 -C link-arg=-fuse-ld=mold' so the default cc driver invokes mold via -fuse-ld. This replaces the target.cfg.rustflags from .cargo/config.toml (cargo treats RUSTFLAGS and target.cfg.rustflags as mutually exclusive), which is fine for CI: the dropped flags are reproducibility-only. debuginfo=0 is kept for faster test compiles. The Docker reproducible build doesn't set RUSTFLAGS, so it keeps the full config.toml rustflags with ld.

gcc's -fuse-ld=mold looks for 'ld.mold' on PATH, not 'mold'. Without it, collect2 falls back to searching for 'ld' and fails ('cannot find ld') because rustc's -B path only contains an lld symlink named 'ld'. Installing both mold and ld.mold from the release tarball resolves this.

Evrard-Nil · 2026-06-19T14:32:00Z

/ocr

github-actions · 2026-06-19T14:36:15Z

✅ OpenCodeReview: No comments generated. Looks good to me.

PierreLeGuen

CI-only change that splits the test workflow into parallel jobs and adds sccache + mold via a setup-rust-ci composite action. All five jobs (lint, unit, integration, e2e, build-release) pass on the head; prior review feedback (mold extraction path) was addressed in c9f94d7. Looks good to merge.

Optional, non-blocking follow-ups:

.github/actions/setup-rust-ci/action.yml: on a cold runner, all parallel jobs hit the command -v mold guard simultaneously and download/extract to the same /tmp path and /usr/local/bin, which could intermittently corrupt the tarball or race the extraction. No production impact, but may cause spurious CI failures on first run after a runner reimage.
.github/workflows/test.yml (integration-test job): VLLM_BASE_URL/VLLM_API_KEY are passed without environment: Cloud API test env, so if those secrets are environment-scoped they resolve to empty. Harmless today since integration_tests.rs:21 defaults to MockProvider unless USE_REAL_VLLM is set — only matters if you later run real-vLLM tests here.

Checks: YAML parse of workflow + action files; gh pr checks 831 (all green); verified test targets e2e_all (crates/api/Cargo.toml:82) and integration_tests exist; confirmed -C debuginfo=0 already in .cargo/config.toml; confirmed nextest.toml e2e-db cap of 16 threads and that its filter excludes integration tests.

Evrard-Nil had a problem deploying to Cloud API test env June 19, 2026 10:57 — with GitHub Actions Failure

gemini-code-assist Bot reviewed Jun 19, 2026


fix(ci): extract mold to /usr/local/bin so it lands on PATH

c9f94d7


Evrard-Nil had a problem deploying to Cloud API test env June 19, 2026 11:06 — with GitHub Actions Failure

Evrard-Nil had a problem deploying to Cloud API test env June 19, 2026 11:13 — with GitHub Actions Failure

Evrard-Nil temporarily deployed to Cloud API test env June 19, 2026 13:46 — with GitHub Actions Inactive

This was referenced Jun 19, 2026

ci: shard e2e tests with nextest --partition hash for ~3x speedup #832

Open

ci: pre-build CI Docker image with deps compiled for faster cold builds #833

Open

Merge branch 'main' into ci/speed-up-test-workflow

237a09b

Evrard-Nil temporarily deployed to Cloud API test env June 19, 2026 14:04 — with GitHub Actions Inactive

Evrard-Nil requested a review from PierreLeGuen June 19, 2026 14:31

PierreLeGuen approved these changes Jun 19, 2026

Evrard-Nil wants to merge 5 commits into main from ci/speed-up-test-workflow
