fix: eliminate flaky 404s in link/QR live tests and fix qrCodeBytes type#143
Merged
Conversation
The link integration tests intermittently failed with 'Fetch failed with status 404'. Audit findings and fixes: - The stats test listed the organization's short codes (newest first) and picked a random one. Concurrent CI runs (Node version matrix plus the coverage workflow) share the organization and delete their temporary codes within seconds, so the picked code often vanished before getCodeStats resolved. The test now creates and uses its own code. - Operations that run immediately after a create can transiently 404 while the API is eventually consistent. All live calls now go through a retry helper with backoff (replacing the one-off 500ms sleep), and list assertions poll until the created items are visible. - Cleanup deletes were asserted, so a transient cleanup failure failed tests whose subject had already passed. Cleanup is now best-effort with a warning; dedicated delete tests assert the delete paths explicitly. Also fixes qrCodeBytes to be Uint8Array: new Uint16Array(buffer) widened each byte to 16 bits, which corrupts the QR image when the bytes are written out. https://claude.ai/code/session_01Jw4fk5in8dEyFKCs3pLY9g
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #143 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 8 8
Lines 492 492
Branches 105 101 -4
=========================================
Hits 492 492 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
pnpm testintermittently fails withError: Fetch failed with status 404in the link/QR integration tests. This PR audits the QR code area, fixes the root causes of the flakes, and fixes a real QR bug the audit turned up.Audit findings
1. The stats test raced concurrent CI runs (primary 404 source)
should get code stats for a short codelisted the org's short codes and calledgetCodeStatson a random one. Three things make that race likely, not rare:sdk-testcodes (leaked by past failed runs), and the newest entries are the temporary codes of whatever test runs are in flight right now.tests.yamlNode 22/24/26 matrix +code-coverage.yaml), each deleting its temp codes within seconds.So the random pick frequently landed on another run's code that was deleted before
getCodeStatsresolved → 404. Verified live:GET .../codes/{id}/statson a deleted code returns 404.Fix: the test now creates its own short code, gets stats for it, and cleans it up. (Verified live that a fresh, zero-click code returns 200 with all stats fields populated.)
2. Eventual consistency after writes
Reads, QR operations, and deletes that run immediately after
createShortCodecan transiently 404 before the write is visible — the suite already worked around this in one test with a 500 ms sleep and a comment. All live calls now go through a smallretryhelper with backoff, and list assertions poll until the created items are visible instead of asserting on the first response.3. Cleanup failures failed unrelated tests
Most tests asserted their cleanup
deleteShortCode(...)returnedtrue, so a transient cleanup 404 failed a test whose subject behavior had already passed. Cleanup is now best-effort (retry + warn — the pattern the suite already used in one place), and explicit delete coverage lives in dedicated tests (should create and delete a short code,should delete a QR code by ID).4.
qrCodeByteshad the wrong typed array (QR bug)createQrCode,getQrCode, andgetQrCodesdecoded the base64 PNG withnew Uint16Array(buffer), which widens every byte to a 16-bit element — writing those bytes out produces a corrupted, double-size image. Changed toUint8Array(type + 3 call sites).CreateQrCodeResponse.qrCodeBytestype; it's a bugfix, but worth flagging in release notes.Should we run the tests in parallel?
No — parallelism is the cause here, not the cure. Within one run, vitest already runs test files in parallel and tests within a file sequentially, which is fine because only
link.test.tstouches the link API. The conflicts come from multiple suites sharing one organization (CI matrix + coverage workflow + local runs). With every test now self-contained (operating only on codes it created) plus retries for eventual consistency, the suite is safe under that existing parallelism.Verification
pnpm test: 235 tests pass, lint clean, statements and branches back at 100%.link.test.tssuites run concurrently (simulating the CI matrix): all pass.pnpm buildandtsc --noEmitpass.Follow-up recommendations (not in this PR)
sdk-testcodes from past failed runs. Consider a scheduled sweeper forsdk-test-tagged codes older than an hour, or a dedicated org for CI.@cacheable/net's genericFetch failed with status N, so the SDK's contextualFailed to create QR code: ...branches are unreachable (they'rev8 ignored). Wrapping network calls to add operation context would make future failures much easier to attribute.https://claude.ai/code/session_01Jw4fk5in8dEyFKCs3pLY9g
Generated by Claude Code