Context
The Tests workflow had been broken on main for 21 months (the last green run predates the action-version cliff in 2024-07). Fixing it required a sequence of root-cause fixes (action bumps, puppeteer no-sandbox, browserstack device drift, webpack/Docusaurus schema mismatch). This issue tracks the larger structural improvements that are out of scope for the unblocking PR but worth doing once CI is green.
High-impact improvements
1. Test sharding for browser integration tests
Integration / Browserstack, Integration / Clientside, Integration / Memory Leaks, and Models / Browser each take 15–30+ minutes. Splitting each across strategy.matrix.shard would cut wall-clock time roughly in proportion to the shard count (about half with two shards, assuming test time is spread evenly across files). Vitest supports --shard=1/N natively.
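A minimal sketch of what a sharded job could look like. The job name, the shard count of 2, and the `pnpm vitest run` invocation are assumptions for illustration; the real jobs may call a wrapper script that would need to forward the `--shard` flag.

```yaml
# Sketch only: job name, shard count, and test command are assumptions.
jobs:
  integration-clientside:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false          # let the remaining shards finish if one fails
      matrix:
        shard: [1, 2]
    steps:
      - uses: actions/checkout@v4
      # ... existing setup steps (setup-pnpm composite, etc.) ...
      - name: Run tests (shard ${{ matrix.shard }} of 2)
        run: pnpm vitest run --shard=${{ matrix.shard }}/2
```

Each matrix entry runs as its own job, so the shards execute in parallel and report separately in the checks list.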
2. Pull setup-pnpm/action.yml upstream pins out
The composite action hard-pins:
- pnpm@8 (now multiple majors behind)
- Globally installs node-gyp and @mapbox/node-pre-gyp (unclear if still needed; likely a relic of tfjs-node install pain)
Audit whether either global install is still load-bearing; bumping pnpm to a current major and removing the global installs would simplify the setup path significantly.
3. Replace browserstack manual device list with capability matrix
internals/webdriver/browserStackOptions.json is a hand-maintained JSON list. It silently drifts as browserstack rotates supported devices/OS versions (the most recent break: Pixel 5 v12.0 and iPhone 12 Pro Max v16). Switching to "latest" + per-OS-family + a small sample (e.g., os_version: "latest", browserName: "chrome") avoids the per-device rot.
4. Decouple docs build from test workflow
build-docs lives in tests.yml but is unrelated to test correctness; it's a smoke test of the Docusaurus build. It also pulls webpack into the resolution graph, which caused the ProgressPlugin schema break we just patched (Docusaurus 2.4.3 vs webpack 5.97+ schema validation). Two options: restrict the docs build with a paths: filter, or rework the dependency setup so the webpack pnpm override added in #1302 is no longer needed.
The webpack@5.96.1 override is a temporary unblocker.
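One way the docs build could be decoupled is a separate workflow that only triggers when docs-related files change. The file name, the `docs/**` path, and the elided build step below are assumptions, not taken from the repository.

```yaml
# .github/workflows/docs.yml (hypothetical file name)
name: Docs
on:
  pull_request:
    paths:
      - 'docs/**'               # assumption: docs sources live under docs/
      - 'pnpm-lock.yaml'        # rebuild when dependency resolution changes
  push:
    branches: [main]
    paths:
      - 'docs/**'
      - 'pnpm-lock.yaml'

jobs:
  build-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... setup-pnpm composite and the existing docs build step ...
```

Keeping the lockfile in the filter means a webpack or Docusaurus bump still exercises the docs build even when no docs file changed.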
5. Replace mxschmitt/action-tmate debug stubs with reusable workflow input
Multiple jobs have commented-out tmate steps. A shared workflow_dispatch input (debug_with_tmate: bool) gated on if: inputs.debug_with_tmate removes the toggle-then-commit pattern.
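A sketch of the gated debug step, assuming the input is added alongside the workflow's existing triggers. The job name and the `@v3` tag are illustrative.

```yaml
on:
  workflow_dispatch:
    inputs:
      debug_with_tmate:
        description: Open a tmate session for debugging
        type: boolean
        required: false
        default: false

jobs:
  example-job:                  # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Debug with tmate
        # Only fires on manual runs with the box ticked; inputs is empty otherwise.
        if: ${{ github.event_name == 'workflow_dispatch' && inputs.debug_with_tmate }}
        uses: mxschmitt/action-tmate@v3
```

The step stays in the workflow permanently and is skipped on push and pull_request runs, so nothing has to be uncommented and committed to start a debug session.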
6. Concurrency at job level, not just workflow
Workflow-level cancel-in-progress (added in #1302) helps for new pushes, but the long-running browserstack tunnel jobs sometimes leak tunnelmole processes when canceled mid-test. A cleanup if: always() step that explicitly tears down tunnels would prevent occasional zombie tunnel charges.
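A rough sketch of such a cleanup step. The `pkill` pattern is an assumption about how the tunnel processes appear in the process list; if the tunnel is started through a wrapper, the teardown should call that wrapper's own shutdown path instead.

```yaml
steps:
  # ... tunnel start and test steps ...
  - name: Tear down tunnels
    if: always()                # also runs when earlier steps fail or the run is cancelled
    # Assumption: tunnel processes have "tunnelmole" in their command line.
    run: pkill -f tunnelmole || true
```

The `|| true` keeps the step green when no tunnel process is left to kill.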
7. CodeCov upload only on PR runs to main
upload-to-codecov runs on every push (every branch). For a one-maintainer repo this is mostly noise. Gate on if: github.ref == 'refs/heads/main' || github.event_name == 'pull_request'.
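Applied to the workflow, the gate could look like the following. Treating upload-to-codecov as a job is an assumption; the same `if` expression works unchanged on a single step.

```yaml
jobs:
  upload-to-codecov:
    # Run for pushes to main and for pull requests; skip pushes to other branches.
    if: github.ref == 'refs/heads/main' || github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... existing coverage download and codecov-action upload steps ...
```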
8. Stop running pnpm install twice in build-docs
Setup composite already installs; the explicit Install dependencies step that follows is redundant.
9. Memory leak test runs as a single sequential file
The memory-leak suite is structured as one test file with a shared puppeteer page; a single browser crash takes the whole suite down. Splitting into per-test isolated browsers would make CI failures attributable to specific tests instead of "browser is undefined" blanket failures.
10. tfjs-node-gpu as a devDep is a footgun
Both @tensorflow/tfjs-node and @tensorflow/tfjs-node-gpu are listed as root devDeps (and as peer deps). On CI runners without CUDA the GPU build either silently no-ops or attempts to fetch ~1GB of binaries. Confirm whether GPU is actually used in tests; if not, remove from devDeps and gate the peer dep behind an optional install path.
Already addressed (PR #1302)
- Action version bumps (checkout, setup-node, setup-pnpm, upload-artifact, cache, upload-pages-artifact, deploy-pages, codecov-action)
- Workflow-level concurrency.cancel-in-progress
- timeout-minutes per job
- pnpm store cache enabled
- PUPPETEER_SKIP_DOWNLOAD=true on no-browser jobs
- Puppeteer --no-sandbox for Ubuntu 24.04 runners
- Browserstack device list refresh
- Webpack pnpm override for Docusaurus 2.4.3 ProgressPlugin schema
CI Gate (PR #1303)
Aggregator job (clankerbot/pr-monitor@v1) so branch protection has a single required check instead of N per job.