ci: speed up test workflow with parallel jobs, sccache, and mold #831

Open
Evrard-Nil wants to merge 5 commits into main from ci/speed-up-test-workflow

@Evrard-Nil Evrard-Nil commented Jun 19, 2026

Collaborator

Summary

Splits the single sequential test job into five parallel jobs and adds sccache + the mold linker to cut CI wall-clock time. Targets the cloud-api test workflow, which already runs on our self-hosted runner (gpu11) but was slow because everything ran sequentially in one job.

Does it already run on our runners?

Yes — both the old lint and test jobs already used runs-on: [self-hosted, infra]. This PR keeps that. The slowness comes from (a) sequential execution, (b) no object-level compile cache, (c) slow GNU ld linking of the large test binaries, and (d) a dead ALTER SYSTEM + docker restart step.

Results

| Job | Time |
| --- | --- |
| Lint | 3m59s |
| Unit tests | 2m50s |
| Integration tests | 2m58s |
| E2E tests | 5m58s |
| Total wall-clock (parallel) | ~6m |

Baseline (old workflow on main, sequential): ~10-11m. This is ~40% faster on a cold cache; sccache will improve further on warm runs.
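
As a rough cross-check of the figures above, the following sketch recomputes the parallel wall-clock time (bounded by the slowest job) and the reduction against the 10-11 minute baseline. The `to_seconds` helper is hypothetical and only handles the `XmYs` format used in the table.

```shell
# Convert a duration such as "5m58s" into whole seconds.
to_seconds() {
  d="$1"
  m="${d%%m*}"   # minutes: text before "m"
  s="${d#*m}"    # remainder after "m", e.g. "58s"
  s="${s%s}"     # drop the trailing "s"
  s="${s#0}"     # drop one leading zero so "08" is not read as octal
  echo $(( m * 60 + ${s:-0} ))
}

# Parallel wall-clock time is bounded by the slowest job in the table.
wall=0
for d in 3m59s 2m50s 2m58s 5m58s; do
  secs=$(to_seconds "$d")
  if [ "$secs" -gt "$wall" ]; then
    wall=$secs
  fi
done

# Percentage reduction against the 10 and 11 minute ends of the baseline.
low=$(( (600 - wall) * 100 / 600 ))
high=$(( (660 - wall) * 100 / 660 ))
echo "wall-clock: ${wall}s, reduction: ${low}-${high}%"
```

With integer division this lands on the ~40% figure quoted for the 10-minute end of the baseline.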

Changes

Job split (parallel execution)

Old: lint, test (where test ran unit → e2e → vLLM → release sequentially)

New: lint, unit-test, integration-test, e2e-test, build-release (all parallel)

  • unit-test (cargo nextest run --lib --bins): no PostgreSQL needed — in-crate #[test]s use mocks.
  • integration-test (cargo nextest run --test integration_tests): no PostgreSQL needed — MockProvider is used by default (USE_REAL_VLLM is not set).
  • e2e-test (cargo nextest run --test e2e_all): the only job with the PostgreSQL service container.
  • build-release (cargo build --release): main-push only, split out so it doesn't block tests and doesn't pull in PostgreSQL (there is no build.rs).
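
The split above can be summarized as a small lookup, sketched here in shell. The cargo commands are taken from the list; the `job_command` and `needs_postgres` helpers are hypothetical names used only for illustration, and the lint command is omitted because the description does not state it.

```shell
# Map each CI job name to the cargo command the description assigns to it.
job_command() {
  case "$1" in
    unit-test)        echo "cargo nextest run --lib --bins" ;;
    integration-test) echo "cargo nextest run --test integration_tests" ;;
    e2e-test)         echo "cargo nextest run --test e2e_all" ;;
    build-release)    echo "cargo build --release" ;;
    *) echo "unknown job: $1" >&2; return 1 ;;
  esac
}

# Only e2e-test is described as needing the PostgreSQL service container.
needs_postgres() {
  [ "$1" = "e2e-test" ]
}

for job in unit-test integration-test e2e-test build-release; do
  if needs_postgres "$job"; then db="postgres"; else db="no database"; fi
  echo "$job: $(job_command "$job") ($db)"
done
```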

sccache (RUSTC_WRAPPER=sccache)

Caches compiled objects at the object-file granularity, complementing swatinem/rust-cache (which caches target/). Survives Cargo.lock bumps and feature-flag changes better. On a self-hosted runner the default cache dir (~/.cache/sccache) persists on disk across runs, so no actions/cache step is needed.
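
A minimal sketch of the environment this implies, assuming sccache is already on PATH on the runner. Setting `SCCACHE_DIR` explicitly is an assumption made here for auditability; the PR relies on the default location.

```shell
# Route rustc invocations through sccache, as the workflow does.
export RUSTC_WRAPPER=sccache
# Assumption: pin the cache directory to the default named in the description.
export SCCACHE_DIR="${SCCACHE_DIR:-$HOME/.cache/sccache}"

if command -v sccache >/dev/null 2>&1; then
  # Print hit/miss counters; useful after a build to confirm the cache is warm.
  sccache --show-stats || echo "could not query sccache stats" >&2
else
  echo "sccache is not on PATH; cargo would fail with RUSTC_WRAPPER set" >&2
fi
```

Running the stats command before and after a build is one way to check whether warm runs actually hit the cache.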

mold linker (RUSTFLAGS="-C link-arg=-fuse-ld=mold")

2-5× faster linking of the ~60MB test binary. Installed via GitHub release binary (both mold and ld.mold) — not apt — to avoid dpkg lock contention between parallel jobs. Configured via RUSTFLAGS env var rather than in .cargo/config.toml so:

  • The Docker reproducible build (which copies .cargo/config.toml and does NOT set RUSTFLAGS) is unaffected and keeps using ld.
  • RUSTFLAGS env replaces target.<cfg>.rustflags from config (cargo treats them as mutually exclusive sources and uses only the first one that is set), which is fine for CI: the dropped flags (--remap-path-prefix, --build-id=none, --hash-style=gnu, --no-undefined) are reproducibility-only and irrelevant for tests. -C debuginfo=0 is carried over into RUSTFLAGS for faster test compiles.
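
The replace-not-merge behavior can be modeled with a short sketch. It follows cargo's documented order of sources (CARGO_ENCODED_RUSTFLAGS, then RUSTFLAGS, then target rustflags from config, then build.rustflags). The `effective_rustflags` helper is hypothetical, and it treats an empty string as "unset", which is a simplification of what cargo does.

```shell
# Model of cargo's precedence for extra rustc flags: the first source that is
# present wins and the others are ignored, not merged.
# Args: <CARGO_ENCODED_RUSTFLAGS> <RUSTFLAGS> <target rustflags> <build.rustflags>
effective_rustflags() {
  for src in "$1" "$2" "$3" "$4"; do
    if [ -n "$src" ]; then
      echo "$src"
      return 0
    fi
  done
  echo ""
}

ci_flags="-C debuginfo=0 -C link-arg=-fuse-ld=mold"
config_flags="(flags from .cargo/config.toml)"   # placeholder, not the real list

echo "CI:     $(effective_rustflags "" "$ci_flags" "$config_flags" "")"
echo "Docker: $(effective_rustflags "" "" "$config_flags" "")"
```

This is why the CI value has to repeat any config flag it still wants, such as -C debuginfo=0.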

Mold integration gotcha: gcc -fuse-ld=mold looks for ld.mold on PATH (not mold). Rustc also passes its own -fuse-ld=lld + a -B path to its bundled lld; the last -fuse-ld flag wins, so -fuse-ld=mold takes effect. Installing ld.mold from the mold tarball resolves the collect2: cannot find 'ld' error.
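
A preflight check along these lines could catch the missing ld.mold before any cargo step runs. This is a sketch, not the action's actual code; `missing_mold_binaries` is a hypothetical helper.

```shell
# Report which of the two linker entry points are missing from PATH.
# gcc's -fuse-ld=mold resolves "ld.mold", so "mold" alone is not enough.
missing_mold_binaries() {
  missing=""
  for bin in mold ld.mold; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      missing="$missing $bin"
    fi
  done
  echo "${missing# }"
}

missing=$(missing_mold_binaries)
if [ -z "$missing" ]; then
  export RUSTFLAGS="-C debuginfo=0 -C link-arg=-fuse-ld=mold"
  echo "mold ready; RUSTFLAGS=$RUSTFLAGS"
else
  echo "not enabling mold, missing on PATH: $missing" >&2
fi
```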

Cleanup

  • Removed the ALTER SYSTEM SET max_connections=150 + docker restart step (was test.yml:74-87). .config/nextest.toml already caps the e2e-db test group at max-threads = 16 (16 × 4 pool conns = 64 < PG default max_connections = 100), so the override was dead code adding ~15-30s and a failure point.
  • Switched vLLM integration tests from cargo test --test integration_tests -- --nocapture to cargo nextest run --test integration_tests for consistent tooling.
  • Stripped vestigial env vars from the release build (there is no build.rs, so cargo build --release needs none of DATABASE_* / MODEL_DISCOVERY_* / AUTH_*).
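
The connection budget behind removing the restart step works out as follows. The sketch assumes each concurrently running e2e test holds at most one pool of the stated size.

```shell
max_threads=16          # e2e-db test group cap in .config/nextest.toml
pool_size=4             # connections per test pool
pg_max_connections=100  # PostgreSQL default

peak=$(( max_threads * pool_size ))
headroom=$(( pg_max_connections - peak ))

if [ "$peak" -lt "$pg_max_connections" ]; then
  echo "peak $peak connections, $headroom below the default limit"
else
  echo "peak $peak connections exceeds the limit of $pg_max_connections" >&2
fi
```

If either the thread cap or the pool size grows, the same check shows when the default limit would stop being enough.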

Composite action (.github/actions/setup-rust-ci)

New composite action centralizes: Rust toolchain, cargo-nextest, sccache, mold, and swatinem/rust-cache (with per-job cache-key for isolation). Avoids step duplication across 5 jobs.

Verification

  • All 5 jobs pass on this PR (Lint 3m59s, Unit 2m50s, Integration 2m58s, E2E 5m58s)
  • YAML validated (5 jobs, correct runs-on, only e2e-test has postgres service, only build-release is main-gated)
  • Confirmed no build.rs anywhere in the workspace (cargo build --release needs no env vars)
  • Confirmed integration_tests target lives in crates/inference_providers/tests/ and uses MockProvider by default
  • Confirmed .config/nextest.toml already caps e2e concurrency at 16 threads (justifies removing the PG restart step)
  • mold v2.41.0 release verified to exist and tarball structure confirmed (bin/mold + bin/ld.mold)
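
The tarball layout check in the last item could be reproduced with something like the sketch below. To stay runnable offline it builds a stand-in archive with the same layout instead of downloading the release; the `tarball_has_mold_bins` helper and the x86_64 directory name are assumptions.

```shell
# Succeeds only if the archive contains both linker entry points under
# <prefix>/bin/. Arguments: <tarball> <top-level directory name>.
tarball_has_mold_bins() {
  listing=$(tar -tzf "$1") || return 1
  for bin in mold ld.mold; do
    echo "$listing" | grep -qxF "$2/bin/$bin" || return 1
  done
}

# Offline stand-in for the real release asset, built with the same layout.
prefix="mold-2.41.0-x86_64-linux"   # assumption: x86_64 runner
work=$(mktemp -d)
mkdir -p "$work/$prefix/bin"
touch "$work/$prefix/bin/mold" "$work/$prefix/bin/ld.mold"
tar -czf "$work/mold.tar.gz" -C "$work" "$prefix"

if tarball_has_mold_bins "$work/mold.tar.gz" "$prefix"; then
  echo "layout ok"
fi
```

Pointing the function at the downloaded release tarball instead of the stand-in would perform the real check.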

Follow-up issues (not in this PR)

  • e2e sharding with nextest --partition hash:i/N (~3× e2e speedup): needs a separate runner label or matrix to be effective.
  • Dedicated CI VM to remove gpu11 prod-host contention (the root cause of CI-duration variability).
  • Pre-built CI Docker image with deps compiled (cargo test --no-run), refreshed on Cargo.lock changes.
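
For the sharding follow-up, the per-shard invocations might look like the output of this sketch. It assumes nextest's `hash:<shard>/<total>` partition syntax with shards numbered from 1; `shard_commands` is a hypothetical helper and the shard count of 3 simply mirrors the ~3× estimate.

```shell
# Print one nextest invocation per shard of the e2e_all test binary.
shard_commands() {
  total="$1"
  i=1
  while [ "$i" -le "$total" ]; do
    echo "cargo nextest run --test e2e_all --partition hash:$i/$total"
    i=$(( i + 1 ))
  done
}

shard_commands 3
```

In a workflow each printed line would become one matrix entry, which is why the list above notes that a matrix or extra runner capacity is needed for the split to pay off.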

Split the single 'test' job into five parallel jobs (lint, unit-test,
integration-test, e2e-test, build-release) so they run concurrently on the
self-hosted runner instead of sequentially. Unit and integration tests do
not need PostgreSQL and no longer pull in the service container.

Tooling improvements (shared via a new composite action):
- sccache (RUSTC_WRAPPER=sccache): caches compiled objects across runs,
  complementing swatinem/rust-cache. Persistent on the self-hosted runner.
- mold linker: 2-5x faster linking of the large test binaries. Installed
  via release binary (not apt) to avoid dpkg lock contention between
  parallel jobs. Configured via env var so the Docker reproducible build
  (which copies .cargo/config.toml) is unaffected and keeps using ld.

Cleanup:
- Remove the 'ALTER SYSTEM SET max_connections=150 + docker restart' step:
  .config/nextest.toml already caps the e2e-db group at 16 threads
  (16 * 4 pool conns = 64 < PG default 100), so the override is dead code
  that only added ~15-30s and a failure point.
- Move 'cargo build --release' (main-push only) into its own job with no
  PostgreSQL service or test env vars (there is no build.rs).
- Switch vLLM integration tests from 'cargo test' to 'cargo nextest' for
  consistent tooling. MockProvider is used by default (USE_REAL_VLLM is
  not set), so these run without external dependencies.
@Evrard-Nil Evrard-Nil had a problem deploying to Cloud API test env June 19, 2026 10:57 — with GitHub Actions Failure

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new composite GitHub Action, setup-rust-ci, to centralize Rust CI setup steps including the Rust toolchain, sccache, mold linker, cargo-nextest, and caching. Feedback on the PR identifies a critical issue in the mold installation step where using --strip-components=2 with -C /usr/local extracts the binary to /usr/local/mold instead of /usr/local/bin/mold, which will cause subsequent commands to fail as /usr/local is typically not in the system PATH. The reviewer provided actionable suggestions to fix the extraction path or use a local directory with $GITHUB_PATH.

TARBALL="mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux.tar.gz"
URL="https://github.com/rui314/mold/releases/download/v${MOLD_VERSION}/${TARBALL}"
curl -fsSL "$URL" -o "/tmp/${TARBALL}"
sudo tar -xzf "/tmp/${TARBALL}" -C /usr/local --strip-components=2 "mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux/bin/mold"

Contributor

high

Using --strip-components=2 with -C /usr/local will strip both mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux and bin, extracting the mold binary directly into /usr/local/mold. Since /usr/local is typically not in the system PATH (unlike /usr/local/bin), the subsequent mold --version check and any cargo builds using mold as a linker will fail with a command/linker not found error.

To fix this, you can change --strip-components to 1 so that bin/mold is extracted into /usr/local, resulting in /usr/local/bin/mold.

Alternatively, to avoid requiring sudo and to prevent race conditions when multiple parallel jobs run on the same self-hosted runner concurrently, you could extract it to a local directory and append it to $GITHUB_PATH:

# Extract into a per-user directory (no sudo), then expose it to later steps.
BIN_DIR="$HOME/.local/bin"
mkdir -p "$BIN_DIR"
tar -xzf "/tmp/${TARBALL}" -C "$BIN_DIR" --strip-components=2 "mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux/bin/mold"
echo "$BIN_DIR" >> "$GITHUB_PATH"

Suggested change:

sudo tar -xzf "/tmp/${TARBALL}" -C /usr/local --strip-components=1 "mold-${MOLD_VERSION}-${TARBALL_ARCH}-linux/bin/mold"
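
The difference between the two --strip-components values can be seen on a throwaway archive. This sketch uses a placeholder directory name and temporary paths rather than the real tarball or /usr/local.

```shell
# Build a throwaway archive with the same layout as the mold release tarball.
top="mold-VERSION-ARCH-linux"   # placeholder for the real directory name
work=$(mktemp -d)
mkdir -p "$work/src/$top/bin" "$work/strip1" "$work/strip2"
touch "$work/src/$top/bin/mold"
tar -czf "$work/mold.tar.gz" -C "$work/src" "$top"

# Stripping one component keeps bin/, stripping two drops it.
tar -xzf "$work/mold.tar.gz" -C "$work/strip1" --strip-components=1 "$top/bin/mold"
tar -xzf "$work/mold.tar.gz" -C "$work/strip2" --strip-components=2 "$top/bin/mold"

(cd "$work" && find strip1 strip2 -type f | sort)
```

The listing shows the binary at strip1/bin/mold and strip2/mold, which corresponds to /usr/local/bin/mold versus /usr/local/mold when the target directory is /usr/local.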

@claude

claude Bot commented Jun 19, 2026


Review: CI parallelization + sccache + mold

Nicely scoped and well-documented PR. I verified the central claim behind removing the ALTER SYSTEM max_connections step: .config/nextest.toml does cap the e2e-db group at max-threads = 16, the override filter binary(/^e2e/) does match the e2e_all binary, and create_test_pool builds a single 4-connection pool (db_setup.rs:134, max_size: 4). So 16 × 4 = 64 < 100 holds and dropping the restart step is safe. 👍

A few issues worth addressing before merge:

⚠️ 1. Mold install has a race across parallel jobs (medium)

setup-rust-ci is used by 5 jobs that run concurrently on the same self-hosted host (you confirm they share gpu11). On a cold runner (mold not yet installed, e.g. first run after merge or after a reimage), the command -v mold guard fails for all jobs simultaneously, so all of them:

  • curl ... -o /tmp/${TARBALL} to the same path, and
  • sudo tar -xzf ... -C /usr/local writing the same /usr/local/bin/mold

concurrently. Interleaved writes to the shared tarball / target binary can yield a truncated tarball (tar error) or a corrupt mold, breaking the linker for every job. You replaced the dpkg-lock contention with an unsynchronized write race.

Make the install atomic, e.g. download to a unique temp file and flock + atomic mv into place:

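# Download to a unique temp file so parallel jobs never share a partial tarball.
TMP=$(mktemp /tmp/mold.XXXXXX.tar.gz)
curl -fsSL "$URL" -o "$TMP"
# Serialize the install; the lock is released when fd 9 closes at script exit.
exec 9>/tmp/.mold-install.lock
flock 9
# Re-check under the lock: another job may have installed mold while we waited.
if ! command -v mold >/dev/null 2>&1; then
  sudo tar -xzf "$TMP" -C /usr/local --strip-components=1 ".../bin/mold"
fi
rm -f "$TMP"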
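
The "atomic mv into place" half of that suggestion might look like the sketch below: stage the file under a temporary name in the destination directory, then rename it. `install_atomically` is a hypothetical helper, and the demo uses throwaway paths in place of the extracted binary and /usr/local/bin.

```shell
# Install a file so that readers see either the old copy or the complete new
# one: write to a temporary name next to the destination, then rename.
install_atomically() {
  src="$1"
  dest="$2"
  tmp="$dest.tmp.$$"        # unique per process, same filesystem as dest
  cp "$src" "$tmp" || return 1
  chmod 755 "$tmp"
  mv -f "$tmp" "$dest"      # a rename within one filesystem is atomic
}

# Demo with throwaway paths.
work=$(mktemp -d)
mkdir -p "$work/bin"
printf '#!/bin/sh\necho fake-mold\n' > "$work/extracted-mold"
install_atomically "$work/extracted-mold" "$work/bin/mold"
ls "$work/bin"
```

Because each job stages under its own temporary name, two jobs installing at once would each publish a complete file, whichever rename lands last.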

⚠️ 2. integration-test references environment-scoped secrets without the environment (minor)

The job sets VLLM_BASE_URL / VLLM_API_KEY from secrets but, unlike e2e-test, has no environment: Cloud API test env. If those are environment-scoped secrets they resolve to empty strings here. Harmless today (MockProvider is used since USE_REAL_VLLM is unset), but it's dead/misleading config that would silently fail the day someone flips on real vLLM. Either drop the two env: lines or add the environment: binding.
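
One way to make the "silently fail" case loud is a guard step like this sketch. `check_vllm_env` is a hypothetical helper, and it assumes any non-empty USE_REAL_VLLM value enables real-vLLM mode.

```shell
# Fail fast when real-vLLM mode is requested but the endpoint settings are
# empty, which is what unresolved environment-scoped secrets look like.
check_vllm_env() {
  if [ -z "${USE_REAL_VLLM:-}" ]; then
    echo "MockProvider mode: vLLM settings not required"
    return 0
  fi
  for name in VLLM_BASE_URL VLLM_API_KEY; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is empty but USE_REAL_VLLM is set" >&2
      return 1
    fi
  done
  echo "real vLLM mode: settings present"
}

check_vllm_env
```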

🔎 3. Five separate rust-cache keys may thrash GitHub's 10 GB cache (minor)

swatinem/rust-cache still uses the Actions cache backend; with distinct keys (lint/unit/integration/e2e/release) each saving a multi-GB target/, you can blow past the 10 GB per-repo limit and trigger LRU eviction → cache misses. sccache (persistent on-disk) covers object-level recompiles so the net is probably fine, but consider a shared-key for the test jobs that build the same profile to reduce duplication.

None of these block correctness of the test results; #1 is the one most likely to cause an intermittent CI failure.

⚠️ (issues found)

The previous --strip-components=2 with -C /usr/local extracted mold to
/usr/local/mold instead of /usr/local/bin/mold, so 'command -v mold' failed
and cargo could not find the linker.
@Evrard-Nil Evrard-Nil had a problem deploying to Cloud API test env June 19, 2026 11:06 — with GitHub Actions Failure
Setting CARGO_TARGET_*_LINKER=mold made rustc invoke mold directly as a
linker driver, but mold is a linker, not a driver, and rejects gcc driver
flags like -m64 ('unknown -m argument: 64').

Switch to RUSTFLAGS='-C debuginfo=0 -C link-arg=-fuse-ld=mold' so the
default cc driver invokes mold via -fuse-ld. This replaces the
target.<cfg>.rustflags from .cargo/config.toml (cargo treats RUSTFLAGS and
target.<cfg>.rustflags as mutually exclusive), which is fine for CI: the
dropped flags are reproducibility-only. debuginfo=0 is kept for faster
test compiles. The Docker reproducible build doesn't set RUSTFLAGS, so it
keeps the full config.toml rustflags with ld.
@Evrard-Nil Evrard-Nil had a problem deploying to Cloud API test env June 19, 2026 11:13 — with GitHub Actions Failure
gcc's -fuse-ld=mold looks for 'ld.mold' on PATH, not 'mold'. Without it,
collect2 falls back to searching for 'ld' and fails ('cannot find ld')
because rustc's -B path only contains an lld symlink named 'ld'. Installing
both mold and ld.mold from the release tarball resolves this.
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env June 19, 2026 14:04 — with GitHub Actions Inactive
@Evrard-Nil Evrard-Nil requested a review from PierreLeGuen June 19, 2026 14:31
@Evrard-Nil

Collaborator Author

/ocr

@github-actions


OpenCodeReview: No comments generated. Looks good to me.

@PierreLeGuen PierreLeGuen left a comment

Contributor

CI-only change that splits the test workflow into parallel jobs and adds sccache + mold via a setup-rust-ci composite action. All five jobs (lint, unit, integration, e2e, build-release) pass on the head; prior review feedback (mold extraction path) was addressed in c9f94d7. Looks good to merge.

Optional, non-blocking follow-ups:

  • .github/actions/setup-rust-ci/action.yml: on a cold runner, all parallel jobs hit the command -v mold guard simultaneously and download/extract to the same /tmp path and /usr/local/bin, which could intermittently corrupt the tarball or race the extraction. No production impact, but may cause spurious CI failures on first run after a runner reimage.
  • .github/workflows/test.yml (integration-test job): VLLM_BASE_URL/VLLM_API_KEY are passed without environment: Cloud API test env, so if those secrets are environment-scoped they resolve to empty. Harmless today since integration_tests.rs:21 defaults to MockProvider unless USE_REAL_VLLM is set — only matters if you later run real-vLLM tests here.

Checks: YAML parse of workflow + action files; gh pr checks 831 (all green); verified test targets e2e_all (crates/api/Cargo.toml:82) and integration_tests exist; confirmed -C debuginfo=0 already in .cargo/config.toml; confirmed nextest.toml e2e-db cap of 16 threads and that its filter excludes integration tests.
