CI: structural improvements (post-PR-1302 follow-ups) #1304

@thekevinbot

Context

The Tests workflow had been broken on main for 21 months (the last green run predates the action-version cliff in 2024-07). Unblocking it required a sequence of root-cause fixes: action bumps, Puppeteer no-sandbox, Browserstack device drift, and a webpack/Docusaurus schema mismatch. This issue tracks the larger structural improvements that are out of scope for the unblocking PR but worth doing once CI is green.

High-impact improvements

1. Test sharding for browser integration tests

Integration / Browserstack, Integration / Clientside, Integration / Memory Leaks, and Models / Browser each take 15–30+ minutes. Splitting each of them across strategy.matrix.shard would cut wall-clock time roughly in proportion to the shard count (about half with two shards). Vitest supports this natively via --shard=<index>/<count>, e.g. --shard=1/2.
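
A minimal sketch of what a sharded job could look like, assuming the suite is run through Vitest and that two shards are enough. The job name, shard count, and test command are illustrative assumptions, not the repo's actual configuration.

```yaml
# Sketch only: job name, shard count, and test command are assumptions.
jobs:
  integration-clientside:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false          # let the other shard finish if one fails
      matrix:
        shard: [1, 2]
    steps:
      - uses: actions/checkout@v4
      # ... existing setup steps (pnpm, Node, build) go here ...
      - name: Run tests (shard ${{ matrix.shard }} of 2)
        run: pnpm vitest run --shard=${{ matrix.shard }}/2
```

Vitest shards by test file, so a suite dominated by one long file will see little benefit until that file is split.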

2. Pull setup-pnpm/action.yml upstream pins out

The composite action hard-codes two things:

  • A pin to pnpm@8 (now multiple majors behind)
  • Global installs of node-gyp and @mapbox/node-pre-gyp (unclear if still needed; likely a relic of tfjs-node install pain)

Audit whether either is still load-bearing; bumping pnpm to a current major and removing the global installs would simplify the setup path significantly.
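
One possible shape for the slimmed-down composite action, assuming the pnpm version moves to an input and the global installs turn out to be unnecessary. The input name, default, and step layout are hypothetical and do not reflect the current action.yml.

```yaml
# Sketch only: input name, default, and steps are assumptions.
name: Setup pnpm
description: Install pnpm and project dependencies
inputs:
  pnpm-version:
    description: pnpm major version to install
    default: "9"              # placeholder; use whichever major the audit settles on
runs:
  using: composite
  steps:
    - uses: pnpm/action-setup@v4
      with:
        version: ${{ inputs.pnpm-version }}
    - name: Install dependencies
      shell: bash             # composite run steps must declare a shell
      run: pnpm install --frozen-lockfile
```

The global node-gyp and @mapbox/node-pre-gyp installs are deliberately left out here; that is only safe if the audit confirms nothing depends on them.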

3. Replace browserstack manual device list with capability matrix

internals/webdriver/browserStackOptions.json is a hand-maintained JSON list. It silently drifts as BrowserStack rotates supported devices and OS versions (the most recent break: Pixel 5 v12.0 and iPhone 12 Pro Max v16). Switching to "latest" versions per OS family plus a small device sample (e.g., os_version: "latest", browserName: "chrome") avoids the per-device rot.

4. Decouple docs build from test workflow

build-docs lives in tests.yml but is unrelated to test correctness; it's a smoke test of the Docusaurus build. It also pulls webpack into the resolution graph, which caused the ProgressPlugin schema break we just patched (Docusaurus 2.4.3 vs webpack 5.97+ schema validation). Either:

The webpack@5.96.1 override is a temporary unblocker.

5. Replace mxschmitt/action-tmate debug stubs with reusable workflow input

Multiple jobs have commented-out tmate steps. A shared workflow_dispatch input (debug_with_tmate, type: boolean), with each tmate step gated on if: inputs.debug_with_tmate, removes the toggle-then-commit pattern.
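
A sketch of the input and a gated step, assuming the tmate step stays inline in each job. The job name and the action-tmate version are illustrative.

```yaml
# Sketch only: job name and action version are assumptions.
on:
  workflow_dispatch:
    inputs:
      debug_with_tmate:
        description: Open a tmate session for interactive debugging
        type: boolean
        default: false

jobs:
  example-job:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Debug with tmate
        # inputs is empty on push / pull_request events, so this is skipped there
        if: ${{ github.event_name == 'workflow_dispatch' && inputs.debug_with_tmate }}
        uses: mxschmitt/action-tmate@v3
```

With this in place, a debug session is requested from the "Run workflow" dialog instead of by committing an uncommented step.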

6. Concurrency at job level, not just workflow

Workflow-level cancel-in-progress (added in #1302) helps for new pushes, but the long-running browserstack tunnel jobs sometimes leak tunnelmole processes when canceled mid-test. A cleanup if: always() step that explicitly tears down tunnels would prevent occasional zombie tunnel charges.
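
A sketch of such a cleanup step. The test command and the teardown command are assumptions; the real teardown depends on how the tunnel is started in this repo.

```yaml
# Sketch only: both run commands are hypothetical placeholders.
steps:
  - name: Run Browserstack integration tests
    run: pnpm test:integration:browserstack
  - name: Tear down tunnels
    if: always()                       # runs on success, failure, and cancellation
    run: pkill -f tunnelmole || true   # ignore "no process found"
```

always() is what makes the step run when the job is canceled by cancel-in-progress; a plain step would be skipped in that case.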

7. CodeCov upload only on PR runs to main

upload-to-codecov runs on every push (every branch). For a one-maintainer repo this is mostly noise. Gate on if: github.ref == 'refs/heads/main' || github.event_name == 'pull_request'.
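
A sketch of the gate applied at job level, assuming upload-to-codecov is a job rather than a single step. The steps and the codecov-action version are placeholders.

```yaml
# Sketch only: steps are placeholders; the condition is the one proposed above.
jobs:
  upload-to-codecov:
    if: github.ref == 'refs/heads/main' || github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... download or generate coverage reports here ...
      - uses: codecov/codecov-action@v4   # version is an assumption
```

Putting the condition on the job skips runner allocation entirely for pushes to other branches.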

8. Stop running pnpm install twice in build-docs

The setup composite action already runs pnpm install; the explicit Install dependencies step that follows it is redundant.

9. Memory leak test runs as a single sequential file

The memory-leak suite is structured as one test file with a shared puppeteer page; a single browser crash takes the whole suite down. Splitting into per-test isolated browsers would make CI failures attributable to specific tests instead of "browser is undefined" blanket failures.

10. tfjs-node-gpu as a devDep is a footgun

Both @tensorflow/tfjs-node and @tensorflow/tfjs-node-gpu are listed as root devDeps (and as peer deps). On CI runners without CUDA the GPU build either silently no-ops or attempts to fetch ~1GB of binaries. Confirm whether GPU is actually used in tests; if not, remove from devDeps and gate the peer dep behind an optional install path.

Already addressed (PR #1302)

  • Action version bumps (checkout, setup-node, setup-pnpm, upload-artifact, cache, upload-pages-artifact, deploy-pages, codecov-action)
  • Workflow-level concurrency.cancel-in-progress
  • timeout-minutes per job
  • pnpm store cache enabled
  • PUPPETEER_SKIP_DOWNLOAD=true on no-browser jobs
  • Puppeteer --no-sandbox for Ubuntu 24.04 runners
  • Browserstack device list refresh
  • Webpack pnpm override for Docusaurus 2.4.3 ProgressPlugin schema

CI Gate (PR #1303)

Aggregator job (clankerbot/pr-monitor@v1) so branch protection has a single required check instead of N per job.
