Skip to content

Add Windows E2E tests to CI pipeline#107

Open
jancurn wants to merge 9 commits intomainfrom
claude/add-windows-e2e-tests-I8s2h
Open

Add Windows E2E tests to CI pipeline#107
jancurn wants to merge 9 commits intomainfrom
claude/add-windows-e2e-tests-I8s2h

Conversation

@jancurn
Copy link
Member

@jancurn jancurn commented Mar 24, 2026

Summary

  • Add a windows-latest E2E test job to the CI workflow, running tests via Git Bash
  • Make the e2e test framework (framework.sh, run.sh) cross-platform by replacing Unix-specific tools/patterns with portable alternatives
  • Key cross-platform fixes: pkilltaskkill, atomic symlink locking → atomic mkdir, bc → integer arithmetic, /tmp$TEMP/$TMPDIR, export -f + xargs → background jobs on Windows

Test plan

  • Existing Linux E2E tests (Node.js and Bun) still pass
  • New Windows E2E job runs successfully on windows-latest
  • Unit tests unaffected (430 tests pass)
  • Lint passes

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B

claude added 6 commits March 25, 2026 20:57
Make the e2e test framework cross-platform so it runs on Windows via
Git Bash. Key changes:
- Add cross-platform helpers (is_windows, _kill_tree, _find_python, _md5_short, _TMPDIR)
- Replace pkill with taskkill on Windows for process cleanup
- Replace atomic symlink locking with atomic mkdir (works on all platforms)
- Replace bc-based floating-point math with integer arithmetic in wait_for()
- Use $TEMP/$TMPDIR instead of hardcoded /tmp paths
- Use background jobs instead of export -f + xargs on Windows
- Skip symlink creation on Windows (requires elevated privileges)
- Add e2e-windows job to CI workflow (windows-latest, parallel 1)

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B
Mirrors the Linux CI structure where Node.js and Bun e2e tests run as
separate parallel jobs, improving overall CI throughput on Windows.

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B
Root cause: _kill_tree used taskkill with Git Bash PIDs, but taskkill
only understands Windows PIDs. The process survived, and the subsequent
wait() blocked forever.

Fixes:
- _kill_tree: use bash kill -9 (understands MSYS PIDs) on Windows,
  with taskkill as best-effort fallback for native child processes
- _wait_killed: skip blocking wait on Windows, use brief sleep instead
- Per-test 5-minute timeout in run.sh Windows path to prevent hangs
- Fix hardcoded /tmp paths in env-vars and expired tests for Windows

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B
… shutdown

Four root causes addressed:

1. MSYS path conversion: Config file paths in composite CLI args
   (path:entry) weren't converted by MSYS2. Added to_native_path()
   helper using cygpath and NATIVE_TEST_TMP for tool-call args.

2. Process detection: process.kill(pid, 0) unreliable on Windows/Bun.
   Added tasklist-based fallback in isProcessAlive() and matching
   tasklist checks in test assertions.

3. Bridge graceful shutdown: SIGTERM = immediate TerminateProcess on
   Windows, preventing HTTP DELETE on close. stopBridge() now sends
   IPC shutdown message on Windows, giving the bridge time to clean up.

4. JSON escaping: Windows named pipe paths had invalid backslash
   escapes in manually-crafted sessions.json test fixtures.

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B
1. cygpath flag collision: to_native_path() was called on all args
   including flags like -y, causing "cygpath: unknown option -- y".
   Now only converts args that look like file paths (start with /, ~, .).

2. Session expiry after bridge kill: On Linux, SIGTERM triggers
   graceful shutdown (HTTP DELETE removes server session). On Windows,
   force-kill prevents cleanup so server still has the session active.
   Tests now call /control/expire-session after killing the bridge on
   Windows, then /control/reset before restart to allow reconnection.

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B
… tests

The Windows process check used `require('child_process')` which doesn't
exist in Node.js ESM modules (only works in Bun which supports require
in ESM). This caused every isProcessAlive() call to throw, be caught,
and return false — making all bridge processes appear dead.

Cascading effects: sessions always showed "crashed", PIDs were null
(bridges got unnecessarily restarted), close couldn't send HTTP DELETE
(thought bridge was dead), and proxy stayed alive after close.

Fix: import execFileSync at module top level (ESM-compatible) and use
execFileSync instead of execSync (also avoids shell injection).

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B
@jancurn jancurn force-pushed the claude/add-windows-e2e-tests-I8s2h branch from ddeceef to c83c6ef Compare March 25, 2026 20:59
claude added 3 commits March 25, 2026 21:14
…ist only for Bun

isProcessAlive() was using tasklist for ALL Windows runtimes, but
tasklist spawns a process taking ~1-2s per call. With dozens of calls
per test run (every status check, cleanup, etc.), this caused the
Node.js Windows E2E suite to hang past the CI timeout.

The original Bun false-positive issue with process.kill(pid, 0) only
affects Bun, not Node.js. Now:
- Node.js on Windows: uses fast process.kill(pid, 0)
- Bun on Windows: uses tasklist (slow but needed for correctness)
- All platforms on Unix: uses process.kill(pid, 0)

Also increased waitForProcessExit polling interval from 100ms to 500ms
to reduce tasklist invocations when running under Bun.

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B
…tart

Three optimizations to reduce Windows E2E time from ~15min to closer
to Linux times:

1. Cache tasklist results for 2 seconds — one tasklist call fetches ALL
   PIDs, subsequent isProcessAlive checks are instant Set lookups.
   Previously each check spawned tasklist.exe (~1-2s × hundreds of calls).

2. Only use IPC graceful shutdown in closeSession (which needs HTTP
   DELETE), not in restartBridge/reconnect/clean. Saves 2-5s per
   non-graceful stop on Windows.

3. Reduce sendBridgeShutdown connect timeout from 5s to 2s, and
   waitForProcessExit timeout from 3s to 2s.

Also consolidate CHANGELOG entries into a single record.

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B
process.kill(pid, 0) on Windows uses OpenProcess() which can return
success for zombie processes whose handles haven't been released.
This is a Windows OS behavior, not runtime-specific. Use tasklist
for all Windows runtimes with the 2s cache for performance.

https://claude.ai/code/session_01QA2R1dDEWRrdGUAbX9Qz9B
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants