Skip to content

fix(App): eliminate clientMsgDeviceId=0 race via sessionStorage cache (closes #256, depends on #255)#274

Open
MrAladdin wants to merge 23 commits into
Mininglamp-OSS:mainfrom
MrAladdin:fix/issue-256-clientmsgdeviceid-race
Open

fix(App): eliminate clientMsgDeviceId=0 race via sessionStorage cache (closes #256, depends on #255)#274
MrAladdin wants to merge 23 commits into
Mininglamp-OSS:mainfrom
MrAladdin:fix/issue-256-clientmsgdeviceid-race

Conversation

@MrAladdin

Copy link
Copy Markdown

Summary

Eliminates the clientMsgDeviceId = 0 race window in App.startMain, where outbound chatManager.send() messages get clientMsgNo = <uuid>_0_3 instead of <uuid>_<real-id>_3 during the ~50-300ms window between connectIM() (synchronous WebSocket open) and /user/devices/{deviceId} (async numeric id fetch).

Approach — Option D+ hybrid (cache + first-login await):

  • Persist the numeric device id in sessionStorage after the first successful /user/devices response
  • On subsequent sessions (refresh / cache-hit): set WKSDK.config.clientMsgDeviceId from cache before connectIM()zero latency, zero race
  • On first-login (cache-miss): await the device fetch with a 5s timeout safety net before connectIM() — ~200ms WS connect delay only for the first session, zero race
  • Cache cleared on logout via clearLocalLoginState so user-switch on the same tab always re-traverses the first-login path correctly

No server changes. No SDK changes.

⚠️ Depends on #255 (PR-stacking note)

This branch was developed on top of #255 (Plan F stale-resource interceptor). Of the 26 commits in this PR's diff, 22 belong to #255 and 4 belong to #256.

The 4 new commits:

4411d649  fix(App): short-circuit cache-miss path if Plan F triggers logout (#256)
0fdaa32b  fix(App): clear clientMsgDeviceId cache on logout (#256)
1f300888  fix(App): dual-track startMain to eliminate clientMsgDeviceId=0 race (#256)
8f27af44  feat(Service): add clientMsgDeviceIdCache helper for issue #256

Recommended merge order: merge #255 first; this PR will then auto-rebase to a clean 4-commit diff. Until then, review either by filtering to those 4 SHAs or by waiting for #255 to land.

Plan F integration

#256 uses Plan F's existing staleLocalResourceCallback interceptor for stale-device error recovery (unchanged). The new await path in cache-miss explicitly short-circuits via this.loggingOut guard after the await — without it, a stale-device-triggered logout would let connectIM/contactsSync/prohibitWordsSync fire against a just-cleared token, causing 401 chains and Uncaught (in promise) console noise. This was a real regression caught by end-to-end browser testing during development; the guard prevents it.

Files changed (just the 4 #256 commits)

packages/dmworkbase/src/App.tsx                                          | +106 / -12
packages/dmworkbase/src/Service/__tests__/clientMsgDeviceIdCache.test.ts | +81 (NEW)
packages/dmworkbase/src/Service/clientMsgDeviceIdCache.ts                | +55 (NEW)

Race elimination truth table

Scenario Before After D+ hybrid
First-login (no cache) ~200ms race window ~200ms WS connect delay, no race
Refresh logged-in (cache hit) ~200ms race window zero delay, no race
Logout + relogin same tab (cache cleared) ~200ms race window Same as first-login
New tab (sessionStorage fresh) ~200ms race window Same as first-login

Test plan

  • Unit tests: cd packages/dmworkbase && pnpm test → 46 tests pass (35 Plan F baseline + 11 new clientMsgDeviceIdCache.test.ts)
  • Lint clean on touched files
  • Manual E2E (verified locally against local octo-server during development):
    • Fresh user first refresh → WebSocket connects AFTER /user/devices returns (await path proven; observed t=550ms after t=537ms fetch end)
    • Subsequent refresh → WebSocket connects BEFORE /user/devices settles (cache-hit fast path proven; observed t=550.6ms vs t=551.0ms)
    • Logout → sessionStorage.clientMsgDeviceIdNumeric cleared (verified directly)
    • Plan F stale device path: delete server-side device → refresh → user lands on login page; no Uncaught (in promise) in console (Task E guard verified working — pre-Task-E had 2× unhandled rejections; post-Task-E has 0)

Risks & mitigations

Risk Mitigation
First-login fetch hangs → WS never connects 5s Promise.race timeout falls back to current pre-#256 behavior
Concurrent stale/auth callbacks racing Plan F's loggingOut guard (already in place from #255) + new outer post-await guard
Cache stale after server-side device replacement Background refresh updates cache; Plan F handles stale-not-found
sessionStorage write throw (Safari private mode) Helper try/catch swallows; cache becomes best-effort
Cache-miss await → Plan F logout → post-await side effects (401 noise) Task E short-circuit if (this.loggingOut) return; after await

Roll-back plan

Each task is its own commit; revert specific layers if needed:

Closes #256

🤖 Generated with Claude Code

MrAladdin and others added 23 commits June 5, 2026 12:26
Pure function handleDeviceFetchError() detects 400 +
err.server.user.device_not_found and invokes injected
recovery handlers. No App/WKApp imports — fully unit-testable.

Wiring into App.startMain follows in a later commit.

Refs Mininglamp-OSS#76
…ininglamp-OSS#76)

Regression guard: auth-expired errors must remain on the
APIClient's existing auth path, not our stale-device path.

Refs Mininglamp-OSS#76
Mininglamp-OSS#76)

When the request fails before reaching the server (timeout,
DNS, offline), status is undefined and we must not logout.

Refs Mininglamp-OSS#76
… handler (Mininglamp-OSS#76)

Only err.server.user.device_not_found should trigger the
recovery flow. Other validation errors at the same HTTP
status are unrelated and must pass through.

Refs Mininglamp-OSS#76
…lamp-OSS#76)

Wires handleDeviceFetchError into the device fetch in
App.startMain. On 400 err.server.user.device_not_found:
clear sessionStorage deviceId, reset in-memory deviceId,
trigger logout. logout() ends in window.location.reload(),
which re-runs App.startup → getDeviceIdFromStorage() generates
a fresh UUID; the user's next login re-registers the device
record server-side.

Without this handler the device fetch silently fails, leaving
WKSDK.config.clientMsgDeviceId at its default 0. Outbound
messages then build clientMsgNo as "<uuid>_0_3" instead of
"<uuid>_<real-device-id>_3", breaking per-device message
dedup and cross-device tracking — a degrading symptom, not a
visible UI freeze.

Trade-off acknowledged: this force-logs the user out, which
violates the issue reporter's "no manual intervention"
preference. A silent re-register path would require a new
server endpoint (POST /v1/user/devices/register or allowing
re-login with existing token) — out of scope for this PR.
See PR description for forward suggestion.

Local logout() is intentionally used over logoutUserInitiated()
to avoid triggering OIDC end-session for a system-detected
condition (see deviceFailure.ts module doc).

Fixes Mininglamp-OSS#76
…nglamp-OSS#76)

Browser empirical test revealed err.status === 400 never matches
in production. APIClient.normalizeApiError exposes err.status from
the response body's err.http_status field (=404), not from the
actual HTTP status line (400). The two diverge for this endpoint
because the server's response body is internally inconsistent
(HTTP 400 Bad Request but body.error.http_status=404).

Switch the gate to err.code-only matching. The code string
'err.server.user.device_not_found' is the stable backend contract
and unambiguously identifies the stale-device condition.

Tests updated:
- Test 1: status value changed 400 → 404 (matches production reality)
- New test 5: explicit defensive case that asserts the gate fires
  on the device_not_found code regardless of status field value

Refs Mininglamp-OSS#76
Address final code review feedback: replace `normalized: undefined as any`
with a real (minimal) NormalizedApiError stub `{ message: '', raw: undefined }`,
and drop the unnecessary `status: undefined as any` baseline (status is already
optional in APIClientRejectedError; omitting from baseline is equivalent).

Removes the outer `as APIClientRejectedError` cast — the returned object is
now structurally complete enough to satisfy the type without coercion.

If handleDeviceFetchError ever begins to read `normalized.*`, the stub will
need real values rather than failing silently against `undefined`.

Refs Mininglamp-OSS#76
…Mininglamp-OSS#76)

Address review feedback (PR Mininglamp-OSS#255): silently ignoring non-stale device
fetch errors hides any future regression on /user/devices. Add a single
console.warn for any err.code that is NOT err.server.user.device_not_found.

The stale-device path still calls handleDeviceFetchError unchanged; this
warn is purely additive observability. Status/code/msg are logged in
structured form for easy filtering.

Refs Mininglamp-OSS#76
Address review feedback (PR Mininglamp-OSS#255): the order
removeDeviceId → resetInMemoryDeviceId → triggerLogout is
load-bearing — triggerLogout fires window.location.reload(),
and the post-reload App.startup → getDeviceIdFromStorage() only
regenerates a fresh UUID if sessionStorage.deviceId is empty.

Pin the order with an explicit array-tracking test so any future
refactor that accidentally reorders the calls fails the test.

Refs Mininglamp-OSS#76
First entry: err.server.user.device_not_found. Code-only matching because
server returns HTTP 400 with body.http_status=404 — neither status field
is the stable contract. Future siblings (e.g., space.not_found) can be
added to the set without interceptor changes.

Part of Plan F (octo-web#270 / follow-up to Mininglamp-OSS#76).
See docs/superpowers/specs/2026-06-05-plan-f-stale-resource-interceptor-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the existing logoutCallback/isAuthExpiredApiError pattern. Branch
is else-if (mutually exclusive with auth-expired) — a single error
response fires at most one recovery callback.

Part of Plan F (octo-web#270 / follow-up to Mininglamp-OSS#76).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ity, cousin code, concurrency, code propagation

Adds 5 cases to APIClient.staleResource.test.ts:
- auth-expired still routes via logoutCallback only
- login_device_expired (HTTP 401 cousin) routes via auth path, not stale
- concurrent stale responses fire callback per response (re-entry contract)
- unrelated 404 codes don't trigger stale callback
- code string propagates correctly to callback argument

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan F adds a new callback field on APIClient.shared. The shared singleton
means a test that sets a mock callback would leak it to subsequent tests
without an explicit reset. Adding this line proactively (rather than after
hitting cross-test pollution) is bundled fix Mininglamp-OSS#2 from the spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Registers the Plan F callback next to logoutCallback. Dispatches the
device_not_found code to the existing handleDeviceFetchError recovery
handler (unchanged). Future sibling codes plug in here by adding else-if
branches — no APIClient or interceptor changes needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… interceptor

Plan F: stale device recovery now fires in the APIClient response
interceptor via staleLocalResourceCallback. startMain's catch only
consumes the promise rejection to prevent Uncaught noise and logs
non-stale failures for observability. Recovery semantics are unchanged
(handleDeviceFetchError still invoked, just from module.tsx now).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multiple concurrent API errors (auth-expired + stale-device, or N parallel
stale-device responses) could call logout() rapidly before the page
actually unloads, causing clearLocalLoginState() to run N times. Idempotent
in practice but wasteful. Guard is one-way per session: only a real page
reload resets the flag, which matches production semantics (the flag's
lifetime equals the JS context's lifetime).

Bundled fix Mininglamp-OSS#1 from Plan F spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan F Task 7's one-way loggingOut guard had a latent wedge: if
clearLocalLoginState threw (Safari private mode, QuotaExceededError,
SecurityError on raw localStorage/sessionStorage), the guard would be
set to true but the reload would never run, and every subsequent logout
attempt would short-circuit on the guard. The user would be permanently
stuck with no path back to login except a manual browser refresh.

Wrap in try/catch and let the reload happen regardless. After reload,
the next bootstrap will detect the broken/missing auth state and route
to login normally.

Surfaced by Plan F final code review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment mistakenly said try/finally; the code uses try/catch.
Surfaced by Plan F final review nit. One-word fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… convention

Addresses PR Mininglamp-OSS#255 blocking review (code-review bot): although Vitest 4.1.5
ships toHaveBeenCalledExactlyOnceWith as a built-in matcher, no other test
in this repo uses it. Substitute the equivalent two-line form to match
project convention. Semantics identical (count==1 AND called-with-arg).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-OSS#256

Three pure helpers (loadCachedNumericDeviceId, saveNumericDeviceId,
clearCachedNumericDeviceId) wrapping StorageService.shared. Parse-with-
validation rejects NaN/0/negative on load. Save silently no-ops on storage
write failures (Safari private mode, QuotaExceededError) so a write
failure degrades to "no cache" rather than breaking the caller.

Used by App.startMain in the next commit to eliminate the
clientMsgDeviceId=0 race window (issue Mininglamp-OSS#256).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ininglamp-OSS#256)

Cache hit path: load cached numeric id from sessionStorage, set
WKSDK.config.clientMsgDeviceId BEFORE connectIM(), then refresh in
background. Zero WS connect latency, zero race window for repeat users.

Cache miss path (first login / new tab / post-logout-relogin): await the
device fetch with a 5s timeout safety net, then connectIM(). ~200ms WS
connect delay only for first-login, but zero race. Timeout fallback
degrades to pre-Mininglamp-OSS#256 behavior (proceed with clientMsgDeviceId=0); the
underlying fetch keeps running in background and will update the cache
if it eventually resolves.

The fetchPromise has its own .catch to prevent late rejections from
becoming Uncaught (in promise) when timeout wins the race. Plan F
interceptor still handles stale device errors centrally.

Risk A guard: skip cache writes if `loggingOut` is set between dispatch
and resolve — prevents the background .then from repopulating a cache
that clearLocalLoginState just emptied.

startMain becomes async. Both callers (App.tsx inside addOnLogin and
inside isLogined()) are intentionally fire-and-forget; the inner .catch
handlers handle errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
clearLocalLoginState now removes the cached numeric device id. Without
this, a same-tab relogin or user-switch would inherit the previous
user's cached id, leading to wrong-id messages until the next device
fetch updates the cache. With this clear, the post-logout session
always re-traverses the first-login path → correct id for the new auth.

Also consumes the clearCachedNumericDeviceId import that was staged
in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ninglamp-OSS#256)

Regression discovered during E2E: when the cache-miss path's await fetch
returns stale device, Plan F's APIClient interceptor fires logout
synchronously. After the await resolves, control returns to startMain
and continues with connectIM / contactsSync / prohibitWordsSync — but
the token/uid are already cleared, so all three fire requests that 401.
The 401 chains produce unhandled Uncaught (in promise) noise in the
console before the pending reload completes.

Fix: check this.loggingOut after the await. If logout was triggered
during the await, return early. The pending window.location.reload()
will reset the page; no further work needs to happen in this function.

The cache-hit branch already returns BEFORE any await, so this same
guard is unnecessary there — connectIM fires before any logout signal
could land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MrAladdin MrAladdin requested a review from a team as a code owner June 5, 2026 05:53
@github-actions github-actions Bot added the size/XL PR size: XL label Jun 5, 2026
Jerry-Xin
Jerry-Xin previously approved these changes Jun 5, 2026

@Jerry-Xin Jerry-Xin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is relevant to Mininglamp-OSS/octo-web and the implementation is generally sound: it addresses the startup clientMsgDeviceId=0 race with a cache/await split, adds interceptor-based stale-device recovery, and covers the new helper/interceptor behavior with focused tests.

💬 Non-blocking

🟡 Warning — startMain() is now async, but all call sites ignore its returned promise. Most expected failures are handled internally, but any unexpected synchronous throw in loadCachedNumericDeviceId(), connectIM(), contactsSync(), or ProhibitwordsService.shared.sync() will now surface as an unhandled promise rejection rather than a synchronous exception. Consider wrapping the top-level body or changing callers to void this.startMain().catch(...). Reference: packages/dmworkbase/src/App.tsx:854, packages/dmworkbase/src/App.tsx:862, packages/dmworkbase/src/App.tsx:883.

🔵 Suggestion — loadCachedNumericDeviceId() uses parseInt, so corrupted values like "42abc" or "1.5" are accepted as valid numeric IDs. The comment says the cache must be a positive finite integer; Number(raw) plus Number.isInteger() would match that contract more tightly. Reference: packages/dmworkbase/src/Service/clientMsgDeviceIdCache.ts:24.

🔵 Suggestion — clearCachedNumericDeviceId() does not catch storage removal errors, while saveNumericDeviceId() explicitly treats storage as best-effort. logout() catches clearLocalLoginState() failures, so this is not blocking, but making clear symmetric with save would keep the helper contract cleaner. Reference: packages/dmworkbase/src/Service/clientMsgDeviceIdCache.ts:53.

🔵 Suggestion — The APIClient tests mutate axios.defaults.adapter globally and do not restore it. Existing tests already follow this pattern in places, but this new file adds more global mutation. Adding afterEach restoration would reduce order-dependent failures. Reference: packages/dmworkbase/src/Service/__tests__/APIClient.staleResource.test.ts:20.

✅ Highlights

The cache-miss path correctly awaits the device lookup before connectIM() and short-circuits after stale-device logout, which closes the main race without adding a permanent startup delay. Reference: packages/dmworkbase/src/App.tsx:869.

The cache-hit path sets WKSDK.shared().config.clientMsgDeviceId before opening the websocket and refreshes the value in the background with rejection handling. Reference: packages/dmworkbase/src/App.tsx:857.

The interceptor keeps auth-expired and stale-resource handling mutually exclusive, and the added tests cover the important classifier boundaries. Reference: packages/dmworkbase/src/Service/APIClient.ts:96.

I could not run the targeted Vitest suite in this checkout because dependencies are not installed and pnpm exec vitest is unavailable.

lml2468
lml2468 previously approved these changes Jun 5, 2026

@lml2468 lml2468 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: octo-web #274 — fix(App): eliminate clientMsgDeviceId=0 race

Commit: 4411d64
Verdict: APPROVED

Blocking

  • None.

Non-blocking

  • loadCachedNumericDeviceId() uses parseInt, so malformed stored values with numeric prefixes like 42abc or 1.5 would be accepted as valid ids. Since this key is only written by the new helper, I do not consider this blocking, but Number(raw) plus Number.isInteger() would match the helper contract more tightly.
  • The new test coverage is helper-level. A focused startMain() ordering test would better lock the actual race invariant: cache hit sets WKSDK.config.clientMsgDeviceId before connectIM(), and cache miss awaits the device fetch before connectIM().

Summary

Reviewed only the stacked delta from c0ed0f218a4c to 4411d6498603: App.tsx, clientMsgDeviceIdCache.ts, and the new cache tests. The implementation preserves the intended ordering: cache hits synchronously set the SDK numeric device id before opening the WebSocket, cache misses await /user/devices/{deviceId} before connecting, and the 5s timeout degrades to the prior clientMsgDeviceId=0 behavior rather than wedging startup. Plan F stale-device recovery is also handled correctly: fetch rejections are consumed, loggingOut prevents post-logout connect/sync work on the cache-miss path, and clearLocalLoginState() clears the numeric-id cache.

Verification: pnpm exec vitest run packages/dmworkbase/src/Service/__tests__/clientMsgDeviceIdCache.test.ts passed locally (11 tests). A direct tsc -p packages/dmworkbase/tsconfig.json --noEmit is not a useful signal in this repo state because it fails broadly on pre-existing dependency/storybook/test-global typing issues unrelated to this delta.

REVIEW_STATE: APPROVED

yujiawei
yujiawei previously approved these changes Jun 5, 2026

@yujiawei yujiawei left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #274 (octo-web)

Verdict: APPROVED

This change cleanly closes the clientMsgDeviceId = 0 race for the common cache-hit path and meaningfully narrows it for first login, with no server or SDK changes. The new helper is small, well-guarded, and well-tested. No P0/P1 blocking issues were found, including a focused pass on the auth/token boundary (this PR was flagged security-sensitive). The notes below are non-blocking hardening suggestions.

What was reviewed

The 4 #256 commits only (the diff also contains 22 commits belonging to #255 / Plan F, reviewed separately):

  • packages/dmworkbase/src/Service/clientMsgDeviceIdCache.ts (new helper)
  • packages/dmworkbase/src/Service/__tests__/clientMsgDeviceIdCache.test.ts (new, 11 tests)
  • packages/dmworkbase/src/App.tsx (startMain dual-track, clearLocalLoginState cache clear)

Verification

  • Cache-hit fast path closes the race. loadCachedNumericDeviceId() returns null for "0", negatives, NaN and empty (clientMsgDeviceIdCache.ts:28), so a non-null cache value is guaranteed > 0. It is assigned to WKSDK.config.clientMsgDeviceId before connectIM() (App.tsx:861-862). WS never opens with 0 on this path. ✅
  • First-login path awaits before connecting with a 5s Promise.race timeout safety net (App.tsx:874, 949-963). On success within 5s the race is also closed; on timeout/non-stale-error it gracefully degrades to the pre-#256 behavior (WS opens with 0, in-flight fetch updates later). This is documented in the inline comments. ✅
  • Self-handled internal promises. Both fetchAndCacheDeviceId and awaitDeviceIdOrTimeout chain an inline .catch, so fetchPromise is non-rejecting and only the timeout can reject Promise.race — no Uncaught (in promise) from the device fetch. ✅
  • loggingOut short-circuit after the await (App.tsx:882) correctly prevents connectIM/contactsSync/prohibitWordsSync from firing against a just-cleared token when Plan F's stale-device interceptor triggers logout during the await. The .then handlers also re-check loggingOut (902, 934). ✅
  • Cache cleared on logout. clearCachedNumericDeviceId() is wired into clearLocalLoginState (App.tsx:988), which runs on every logout path (logout, logoutUserInitiated, OIDC post-logout cleanup). A same-tab user-switch therefore always re-traverses the first-login flow → no risk of one user's device id leaking into another's session. ✅
  • Storage scope is correct. The cache key clientMsgDeviceIdNumeric is not in CROSS_TAB_KEYS, so it lives in sessionStorage only (per-tab) — same scope as the deviceId UUID, keeping each tab's (UUID, numeric id) pair consistent. ✅
  • Save is defensive. saveNumericDeviceId no-ops on n <= 0 / non-finite and swallows storage write failures (Safari private mode / quota), degrading to the first-login path rather than throwing. ✅
  • Unit tests cover the helper's load/save/clear and the important guards (rejecting "0", negatives, NaN; quota-exceeded swallow; round-trip). 11 tests. ✅

Non-blocking suggestions

P2 — Unguarded fire-and-forget at the startMain call sites. startMain is now async but is called without .catch at App.tsx:697 and 701. A synchronous throw originating in startMain's own body (e.g. from connectIM()) would now surface as an Uncaught (in promise) rather than a synchronous exception. This is a label change, not a regression (the same throw was unhandled before), and realistic throw probability is low — but since the rest of this PR carefully eliminates unhandled rejections, consider void this.startMain().catch(e => console.warn("[startMain] failed", e)); for consistency.

P2 — Verification wording overstates "race eliminated". The header comment / linked issue say the WS is never opened with clientMsgDeviceId=0. That is exactly true only for cache hits; for first login it is closed on success within 5s and falls back to pre-#256 behavior on timeout or non-stale fetch error. The code itself is correct and the comments do concede the fallback — just suggest the summary read "eliminated for cache hits; narrowed (with timeout fallback) for first login."

Nit — if (this.loggingOut) return; is on the first-login branch but not the cache-hit branch (App.tsx:857-866). The cache-hit branch is fully synchronous so the window is essentially unreachable, and a reload() is already pending in any logout, so worst case is a benign 401. Hoisting the guard to the top of startMain would cover both branches uniformly — cheap defense-in-depth.

Nit — if (res?.id) is a truthiness guard. A negative res.id (server contract violation) would be assigned to the SDK while saveNumericDeviceId rejects it. The producing endpoint (GET /user/devices/:id) returns a positive int64, so this is not reachable in practice; validating Number.isFinite(n) && n > 0 once before both the SDK assignment and the cache write would make the two paths share a single validated value.

Nit — test coverage is helper-only. The async orchestration in startMain (timeout race, cache-hit/miss branch selection, loggingOut short-circuit) has no direct tests; the defect-prone pure logic (the validation guards) is fully covered, so this is acceptable, but a couple of startMain tests with a mocked apiClient and fake timers would lock in the timeout/short-circuit behavior.

For human verification (security-sensitive)

  • Confirm GET /user/devices/{deviceId} returns id as a JSON number (not a quoted string). The backend struct (int64, no ,string tag) indicates a number, which is what makes saveNumericDeviceId persist correctly; if that ever changes, the cache would silently stop persisting (the Number.isFinite(n) && n > 0 suggestion above would also future-proof this).
  • Confirm there is no logout path that bypasses clearLocalLoginState (all known paths route through it), so the per-tab device-id cache can never survive an auth-session change.

@MrAladdin MrAladdin dismissed stale reviews from lml2468 and Jerry-Xin via 9039a31 June 11, 2026 05:47
@MrAladdin MrAladdin force-pushed the fix/issue-256-clientmsgdeviceid-race branch from 4411d64 to 9039a31 Compare June 11, 2026 05:47
@github-actions

Copy link
Copy Markdown

❌ PR title does not follow Conventional Commits

Your title: fix(App): eliminate clientMsgDeviceId=0 race via sessionStorage cache (closes #256, depends on #255)

Expected format: type(optional-scope): description

Allowed types: feat, fix, docs, style, refactor, perf, test, build, ci, chore, revert

Examples:

  • feat(auth): add OAuth2 PKCE flow
  • fix(api): handle null upstream response
  • docs: update installation steps

This repository uses squash-and-merge, so the PR title becomes the commit subject on main and feeds release notes. Please update the title — this check re-runs automatically.

@mochashanyao mochashanyao left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Octo-Q · automated review]

Verdict: Approve — no blocking findings; notes below (data-flow traced).


octo-web PR#274 Review Report

Reviewer: Octo-Q (automated review)
PR: #274
Head SHA: 9039a31717cebd17bb0ff70f44c5736d71b91d59
Title: fix(App): eliminate clientMsgDeviceId=0 race via sessionStorage cache (closes #256, depends on #255)
Routing: complexity=security_sensitive (automated review)


1. Verification Summary

Item Status Evidence
Race elimination (cache-hit) App.tsx:861 sets clientMsgDeviceId = cached BEFORE connectIM() at :862
Race elimination (cache-miss) awaitDeviceIdOrTimeout resolves with res.id set at :936 before post-await connectIM() at :884
Timeout safety net Promise.race with 5s timer at :947-949; catch at :951-959 logs and falls back to SDK default
loggingOut guard (post-await) :882 short-circuits if Plan F fired during await
loggingOut guard (background refresh) :898 in fetchAndCacheDeviceId().then() skips writes if logout in-flight
Cache clear on logout clearLocalLoginState() at :991 calls clearCachedNumericDeviceId()
Plan F interceptor mutual exclusion APIClient.ts:92-100 uses else if — at most one callback per error
Unhandled rejection prevention Both fetchAndCacheDeviceId and awaitDeviceIdOrTimeout have .catch() chains; fetchPromise is never-rejecting
startMain() async safety Callers at :697/:701 don't await, but awaitDeviceIdOrTimeout cannot reject (self-handled), so no unhandled promise rejection
Test coverage 4 new test files: clientMsgDeviceIdCache.test.ts (11 tests), deviceFailure.test.ts (6 tests), APIClient.staleResource.test.ts (6 tests), apiError.test.ts additions (3 tests)

2. Findings

No P0 or P1 issues found.

P2-1: { code } as any type cast in staleLocalResourceCallback wiring

File: module.tsx:218
Diff-scope: new (introduced by this PR)

handleDeviceFetchError(
  { code } as any,  // ← only `code` populated; rest of APIClientRejectedError is undefined
  { ... }
);

handleDeviceFetchError currently only reads err.code, so this works. But the as any cast masks the incomplete construction. If a future refactor adds field access to handleDeviceFetchError (e.g., err.status for logging), it would silently get undefined.

Suggestion: Either construct a fuller APIClientRejectedError stub, or refactor handleDeviceFetchError to accept just { code: string } as its first argument (since that's all it uses).

P2-2: else if ordering in APIClient interceptor creates latent shadowing risk

File: APIClient.ts:92-100
Diff-scope: new (Plan F interceptor addition)

if (isAuthExpiredApiError(normalized) && self.logoutCallback) {
    self.logoutCallback()
} else if (
    isStaleLocalResourceApiError(normalized) && ...
) {
    self.staleLocalResourceCallback(normalized.code)
}

Currently safe because:

  • Auth codes (err.shared.auth.*) and stale codes (err.server.user.device_not_found) are disjoint sets.
  • Server returns http_status=404 in body for device_not_found, so isAuthExpiredApiError (which falls back to httpStatus === 401) doesn't match.

Latent risk: If the server ever changes body.error.http_status to 401 for device_not_found, isAuthExpiredApiError would match first via the httpStatus fallback, and the stale-device handler would never fire. The stale UUID would persist in storage → post-reload re-use → infinite logout loop.

Suggestion: Consider checking stale-resource codes BEFORE auth-expired (swap the if/else if order), or add a code-based exclusion in isAuthExpiredApiError that skips known stale codes. Low urgency — the current server contract makes this unreachable.

P2-3: startMain() callers don't handle the returned promise

File: App.tsx:697, App.tsx:701
Diff-scope: amplified (function changed from sync to async)

WKApp.endpoints.addOnLogin(() => {
  this.startMain();  // ← returned promise is discarded
});
if (WKApp.loginInfo.isLogined()) {
  this.startMain();  // ← same
}

Currently safe because awaitDeviceIdOrTimeout is designed to never reject (self-handled .catch() chain). However, any future code added after the await that could throw would produce an unhandled rejection.

Suggestion: Add .catch(e => console.error(...)) to the call sites, or add a void prefix (void this.startMain()) to signal intentional discard.

3. Suggestions

  1. P2-1 fix: Change handleDeviceFetchError signature to (err: Pick<APIClientRejectedError, 'code'>, handlers: ...) to match actual usage and eliminate the as any.
  2. P2-2 fix: Not urgent. Add a comment in APIClient.ts near the else if documenting the mutual-exclusion assumption on code sets.
  3. P2-3 fix: Add void prefix or .catch() at the two startMain() call sites.

4. Additional Observations

  • Double handling of stale-device errors (benign): Both the APIClient interceptor (→ staleLocalResourceCallbackhandleDeviceFetchErrorlogout()) AND the .catch() in fetchAndCacheDeviceId/awaitDeviceIdOrTimeout process the same rejection. The .catch() correctly identifies device_not_found and stays silent. The loggingOut guard prevents post-logout side effects. No bug, but worth noting for future maintainers.
  • Concurrent stale-device responses: The test "concurrent stale device responses each invoke callback" (APIClient.staleResource.test.ts) confirms N callbacks fire for N concurrent stale responses. The loggingOut one-way guard in logout() correctly handles this — second+ calls short-circuit at if (this.loggingOut) return.
  • sessionStorage scope: clientMsgDeviceIdNumeric is stored in sessionStorage (per-tab) via StorageService. This matches the UUID scope, keeping (UUID, numeric id) pairs consistent per tab. New tabs fall through to the first-login path — correct behavior.

5. Data-Flow Trace

Cache-hit path (refresh / subsequent session)

loadCachedNumericDeviceId()
  → StorageService.shared.getItem("clientMsgDeviceIdNumeric")
    → sessionStorage.getItem("clientMsgDeviceIdNumeric")
    → parseInt(raw, 10) → n > 0 ? n : null
  → returns number | null

[cached !== null]:
  WKSDK.shared().config.clientMsgDeviceId = cached  ← SET BEFORE CONNECT
  this.connectIM()                                   ← WS opens with real id
  WKApp.dataSource.contactsSync()
  ProhibitwordsService.shared.sync()
  this.fetchAndCacheDeviceId()                       ← background refresh
    → WKApp.apiClient.get(`/user/devices/${deviceId}`)
    → .then: if (!loggingOut && res?.id) → update SDK config + saveNumericDeviceId
    → .catch: log non-stale errors; stale handled by Plan F interceptor

Cache-miss path (first login / new tab)

awaitDeviceIdOrTimeout(5000)
  → fetchPromise = WKApp.apiClient.get(`/user/devices/${deviceId}`)
    → .then: if (!loggingOut && res?.id) → set SDK config + saveNumericDeviceId
    → .catch: log non-stale errors (never rejects — self-handled)
  → timeoutPromise = setTimeout 5s → reject("FIRST_LOGIN_TIMEOUT")
  → Promise.race: try { await race } catch { log timeout warning }

[post-await]:
  if (this.loggingOut) return  ← Plan F guard
  this.connectIM()             ← WS opens with real id (or 0 on timeout)
  WKApp.dataSource.contactsSync()
  ProhibitwordsService.shared.sync()

Plan F stale-device interceptor chain

API response: { code: "err.server.user.device_not_found", http_status: 404 }
  → APIClient response interceptor
    → normalizeApiError → { code: "err.server.user.device_not_found", httpStatus: 404 }
    → isAuthExpiredApiError? NO (code not in authExpiredCodes, httpStatus ≠ 401)
    → isStaleLocalResourceApiError? YES (code in staleLocalResourceCodes)
    → staleLocalResourceCallback("err.server.user.device_not_found")
      → module.tsx callback
        → handleDeviceFetchError({ code })
          → removeDeviceId(): StorageService.shared.removeItem("deviceId")
          → resetInMemoryDeviceId(): WKApp.shared.deviceId = ""
          → triggerLogout(): WKApp.shared.logout()
            → if (loggingOut) return  ← idempotent guard
            → loggingOut = true
            → clearLocalLoginState()
              → loginInfo.logout()
              → clearAuthStorage()
              → clearCachedNumericDeviceId()  ← clears numeric cache
            → window.location.reload()

Logout cache-clear path

WKApp.shared.logout()
  → loggingOut = true (one-way guard)
  → clearLocalLoginState()
    → WKApp.loginInfo.logout()
    → clearAuthStorage()          ← clears token/uid/etc, NOT deviceId
    → clearCachedNumericDeviceId() ← clears "clientMsgDeviceIdNumeric"
  → window.location.reload()

6. R5 Blind-Spot Checklist (security_sensitive)

C1 — Dual-path parity:

  • Cache read ↔ write ↔ clear: loadCachedNumericDeviceId / saveNumericDeviceId / clearCachedNumericDeviceId — all three are present and correctly wired. Cache is cleared in clearLocalLoginState() (logout path). ✅ Clear.
  • fetchAndCacheDeviceId (background) vs awaitDeviceIdOrTimeout (first-login): both have identical loggingOut guards and .catch() handling. Symmetric. ✅ Clear.
  • No create/delete asymmetry found.

C2 — Control-flow ordering / nesting reuse:

  • loggingOut one-way guard: checked at 3 points (post-await :882, background .then() :898, logout() :1001). All correctly ordered — guard is set BEFORE clearLocalLoginState() and BEFORE window.location.reload(). ✅ Clear.
  • else if in APIClient interceptor: mutual exclusion verified (see P2-2 above for latent risk). ✅ Clear for current server contract.
  • Promise.race ordering: fetchPromise is never-rejecting (.catch() chained inline), so only timeoutPromise can reject. The try/catch correctly handles this. ✅ Clear.

C3 — Authorization boundary ≠ capability boundary:

  • The cached value is a numeric integer (server-issued device id). No secrets, tokens, or PII stored in the cache. ✅ N/A.
  • sessionStorage scope is per-tab, matching the UUID scope. No cross-tab leakage. ✅ Clear.
  • No new endpoints or tools exposed by this PR. ✅ N/A.

Verdict

No P0 or P1 issues found. Three P2 findings (type cast fragility, latent interceptor shadowing, unhandled promise discard) are all maintainability concerns that don't affect current production behavior.

The race elimination design is sound: cache-hit sets the id before connectIM(), cache-miss awaits the fetch with a timeout safety net, and the loggingOut one-way guard correctly prevents post-logout side effects. Data-flow tracing confirms all consumed values (cached id, fetched id, loggingOut flag) are properly populated at their source and correctly flow to their consumption points.

Test coverage is thorough: 26 new tests across 4 test files covering the cache module, device failure handler, APIClient interceptor, and error classification.

[Octo-Q] verdict: APPROVE + 理由:No P0/P1 findings. Three P2 maintainability suggestions (type cast, interceptor ordering comment, promise discard) don't block landing. Race elimination is correctly implemented with proper guards and comprehensive test coverage.

@Jerry-Xin Jerry-Xin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary: The PR is relevant to octo-web and the main approach is sound: it sets clientMsgDeviceId before IM connection on cache hits, awaits the first fetch on cache misses, clears cache on logout, and handles stale-device recovery centrally.

💬 Non-blocking

🟡 Warning — Cache identity is only keyed by clientMsgDeviceIdNumeric, not by uid or deviceId.
packages/dmworkbase/src/Service/clientMsgDeviceIdCache.ts:10 and packages/dmworkbase/src/App.tsx:855
If a same-tab account switch happens without the normal logout path, or if deviceId changes independently, the cache-hit path can briefly connect using a numeric ID from a previous identity/device until the background refresh resolves. Consider storing { uid, deviceId, numericId } and validating both before using the fast path.

🔵 Suggestion — Tighten numeric parsing.
packages/dmworkbase/src/Service/clientMsgDeviceIdCache.ts:27
parseInt("123abc", 10) returns 123, despite the comment saying the stored value must be a positive finite integer. Number(raw) plus Number.isInteger() would better match the contract and tests could cover malformed numeric strings.

🔵 Suggestion — Add focused tests for App.startMain.
packages/dmworkbase/src/App.tsx:854
The helper and interceptor tests are useful, but the highest-risk behavior is orchestration: cache hit sets SDK config before connectIM, cache miss awaits before connectIM, timeout falls back, and stale-device logout short-circuits. A small mocked WKApp/SDK test would lock down the actual race fix.

✅ Highlights

The cache-miss path correctly awaits the device fetch before opening IM, with a bounded timeout fallback. The loggingOut guard after the await is a good regression guard against stale-device logout side effects.

The interceptor keeps auth-expired and stale-local-resource recovery mutually exclusive, which avoids double recovery on one response.

I could not run the focused tests in this checkout because the available package scripts/tooling did not expose a working test/vitest command from this environment.

@yujiawei yujiawei left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #274 (octo-web)

Summary

This PR eliminates the clientMsgDeviceId=0 race (issue #256) by caching the server-issued numeric device id in sessionStorage and reading it back before WKSDK.connect(). It also bundles a generic "stale local resource" recovery path (Plan F): an API-client interceptor branch that classifies err.server.user.device_not_found by error code (not HTTP status) and routes it to a central dispatcher that clears the device id and forces logout.

The change is well-structured, defensively coded, and the extracted helpers (clientMsgDeviceIdCache, deviceFailure, apiError.isStaleLocalResourceApiError, and the APIClient interceptor branch) are covered by 40 passing unit tests (all green locally).

Verdict: APPROVE. No P0/P1 (correctness / security / data-loss / build-break) issues found. A handful of P2/maintainability nits are listed below — none block merge.

What I verified (✅)

  • Core race fix is correct. In the cache-hit path, WKSDK.shared().config.clientMsgDeviceId = cached is assigned before connectIM() (App.tsx:854-861), so the WS is never opened with clientMsgDeviceId=0 when a valid cache exists. On cache miss the code awaits the device fetch (with a 5s timeout) before connecting (App.tsx:879-901).
  • Mutual exclusivity of recovery callbacks. The interceptor uses else if (APIClient.ts:95-103) so a single error response triggers at most one of logoutCallback / staleLocalResourceCallback. Confirmed by the auth-expired and login_device_expired (HTTP 401, cousin code) tests in APIClient.staleResource.test.ts.
  • Code-only gating for stale device. isStaleLocalResourceApiError matches the string code only (apiError.ts:93-95), which is the stable contract given the backend returns HTTP 400 with body.http_status=404. The "regardless of status field value" test in deviceFailure.test.ts locks this in.
  • Cross-user safety on same tab. clearCachedNumericDeviceId() is called from clearLocalLoginState() (App.tsx:991), so a relogin / user-switch on the same tab re-traverses the first-login flow and cannot reuse a previous user's numeric id. sessionStorage (per-tab) scope matches the existing deviceId UUID scope.
  • Logout is wedge-proof. The loggingOut in-flight guard (App.tsx:1013-1014) plus the try/catch around clearLocalLoginState() (App.tsx:1024-1028) guarantee window.location.reload() always runs even if storage writes throw (Safari private mode / QuotaExceededError) — preventing a permanent short-circuit wedge.
  • Task E short-circuit. After the cache-miss await, if (this.loggingOut) return (App.tsx:892) prevents connectIM/contactsSync/prohibitWordsSync from firing against a just-cleared token when Plan F's interceptor already triggered logout — avoiding Uncaught (in promise) 401 noise.
  • No unhandled rejections from async startMain. Both call sites (App.tsx:707, 711) are fire-and-forget, but startMain never rejects: the cache-hit branch returns synchronously, and the cache-miss branch's awaitDeviceIdOrTimeout swallows all errors internally (Promise.race wrapped in try/catch; the fetchPromise has its own inline .catch, so only the timeout can reject the race and that is caught).
  • No secret/credential exposure. The cached value is a server-issued numeric routing id, not a token or PII. Security-sensitive review of the auth/logout paths surfaced nothing that leaks or weakens auth state.

Findings

P2 — Duplicated fetch/then/catch logic between fetchAndCacheDeviceId and awaitDeviceIdOrTimeout

App.tsx:909-959. The .then body (if (this.loggingOut) return; if (res?.id) { set clientMsgDeviceId; saveNumericDeviceId }) and the .catch body (warn on non-stale code) are duplicated almost verbatim across the two methods. Extracting a single private applyDeviceIdResult(res) / logNonStaleFetchError(err) pair would remove the drift risk (e.g., if the loggingOut guard or the cache-write rule changes in one place but not the other). Not blocking.

P2 — startMain dual-track integration is not unit-tested

The extracted helpers are well covered, but the orchestration in startMain itself — cache-hit fast path, cache-miss await, the 5s timeout fallback, and the if (this.loggingOut) return short-circuit — has no direct test (no App.test additions in this PR). These are the highest-value behaviors of #256. A focused test that stubs loadCachedNumericDeviceId / WKApp.apiClient.get and asserts (a) cache hit sets clientMsgDeviceId before connectIM, (b) the loggingOut short-circuit skips connectIM, would lock in the fix against future regressions. Suggested follow-up, not a merge blocker.

P2 — { code } as any in the dispatcher

module.tsx:218. The Plan F dispatcher constructs a partial APIClientRejectedError as { code } as any. handleDeviceFetchError only reads err.code, so this is harmless today, but it bypasses the type system; if handleDeviceFetchError ever starts reading another field it will silently get undefined. Consider narrowing the handler's parameter type to Pick<APIClientRejectedError, "code"> so the dispatcher can pass { code } without the as any.

Nit — saveNumericDeviceId silently drops non-numeric res.id

clientMsgDeviceIdCache.ts:saveNumericDeviceId guards on Number.isFinite(n) && n > 0, so if the backend ever returns res.id as a numeric string, the cache silently never populates (each session falls back to the first-login fetch). The original code assigned res.id directly to clientMsgDeviceId, implying it is numeric — so this is fine today. Flagging only as a robustness note if the API contract is not strictly typed.

Notes for human verification (security-sensitive)

  • Confirm the backend contract: err.server.user.device_not_found is returned as HTTP 400 with body.error.http_status=404. The code-only classifier depends on the code string remaining stable; if the backend renames it, stale-device recovery silently stops working (no test can catch a server-side rename). Worth a contract test or a shared constant if one exists across services.
  • Confirm that forcing logout on device_not_found is the desired UX (vs. silently re-registering the device). Current behavior: clear deviceId UUID + numeric cache, then window.location.reload() → re-bootstrap. This is reasonable, but it is a user-visible session drop, so it should be an intentional product decision.

Conclusion

Correctness and security checks pass; the fix targets the documented race precisely and degrades gracefully (timeout → current pre-#256 behavior). The findings above are maintainability/test-coverage improvements, not blockers.

@lml2468 lml2468 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: clientMsgDeviceId=0 race fix (closes #256)

Change overview

  • 11 files, +649/-7; linked to #256.
  • Eliminates the clientMsgDeviceId=0 race by caching the numeric device id in sessionStorage and gating first-login on a device fetch (with a 5s timeout fallback), plus a central stale-local-resource interceptor that clears the device id and logs out on device_not_found.

Verification

  • Tests pass: true; no new failures: true (baseline 57 -> 53).
  • Build: true; Lint: true.

Findings

  • 2 blocking issues; 9 minor/nit issues.
  • 9 initially-suspected findings were dropped after verification as false positives.

The two blocking issues are both about missing test coverage on the highest-risk new logic:

  1. App.startMain dual-track logic (cache-hit / first-login timeout / loggingOut short-circuit) has zero coverage of the #256 race-fix correctness guarantees.
  2. logout()'s concurrent in-flight guard and the clearLocalLoginState try/catch (a clearly-described permanent-stuck failure mode) are untested.

The fix itself looks functionally correct and is a real improvement; the blockers are about protecting these invariants with regression tests before merge.

Verdict

Request changes — please add coverage for the two blocking items above. The minor/nit items are non-blocking quality improvements.

*
* Issue #256: this read happens BEFORE WKSDK.connect() in App.startMain to
* eliminate the race window where outbound messages would carry
* clientMsgNo = <uuid>_0_3 instead of <uuid>_<real-id>_3.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] loadCachedNumericDeviceId uses parseInt, accepting trailing garbage.

parseInt(raw, 10) returns 42 for values like "42abc" or "42.9" instead of rejecting them. Currently saveNumericDeviceId only writes String(positiveInt), so the normal path never produces such values; but if sessionStorage is written with dirty data by an external/older version, it would be silently accepted as a valid id. Tests only cover "not-a-number" (pure non-numeric -> NaN -> null), not partially-parseable garbage like "123abc".

Suggestion: use Number(raw) combined with Number.isInteger, or an explicit /^\d+$/ check before parsing, so any invalid input falls back to the first-login path.

// Cache miss → await device fetch with 5s timeout, then connectIM.
// Either way, the WS is never opened with clientMsgDeviceId=0 unless the
// first-login fetch times out (graceful degradation to current behavior).
async startMain() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] First-login path serializes IM connect / contacts sync / prohibited-words sync behind the device fetch (blocks up to 5s).

The original startMain() (main branch) was synchronous: connectIM(), contactsSync(), and ProhibitwordsService.shared.sync() fired immediately, in parallel with the /user/devices fetch. After this change, the cache-hit path (L850-859) stays parallel and is a net perf win; but the cache-miss (first login) path at L866 runs await this.awaitDeviceIdOrTimeout(5000), deferring connectIM/contactsSync/prohibitWordsSync (L875-877) until the device fetch resolves or the 5s timeout elapses. On first login the WS connection, contacts sync, and prohibited-words sync all wait an extra device-API RTT, and in the worst case (server hangs) IM connect is delayed a full 5s.

This is an intentional trade-off to fix the #256 clientMsgDeviceId=0 race and is functionally correct, but it is a real latency regression on the first-login path.

Suggestion: if IM connect doesn't require clientMsgDeviceId to be in place at the exact moment of connect (the SDK may read config.clientMsgDeviceId after connecting), consider also calling connectIM()/contactsSync()/sync() immediately on the first-login path and gating only the first outbound message on device-id readiness, eliminating this serial wait; or lower the 5s timeout to something commensurate with acceptable first-message latency (e.g. 2-3s) to shrink the worst-case blocking window. If product confirms one extra RTT on first login is acceptable, keep as-is and let this comment record the trade-off.

this.connectIM();
WKApp.dataSource.contactsSync(); // 同步通讯录
ProhibitwordsService.shared.sync(); // 同步敏感词
// Issue #256: dual-track to eliminate the clientMsgDeviceId=0 race window.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] The App.startMain dual-track logic (cache-hit / first-login timeout / loggingOut short-circuit) has zero test coverage.

The highest-risk, most-branched new logic in this PR lives in startMain / awaitDeviceIdOrTimeout / fetchAndCacheDeviceId, yet nothing under packages/dmworkbase/src exercises them (grep startMain|awaitDeviceIdOrTimeout|fetchAndCacheDeviceId|loggingOut hits 0 test files). Uncovered critical paths:

  1. cache hit -> fast path + background refresh;
  2. cache miss -> after 5s timeout, continue with default clientMsgDeviceId=0 (graceful degradation);
  3. a late fetch after timeout still writes back to cache without an Uncaught (in promise);
  4. if (this.loggingOut) return short-circuit -- after the stale-device interceptor triggers logout, startMain no longer calls connectIM/contactsSync (the comment specifically notes this avoids 401 noise on an already-cleared token).

These are exactly the correctness guarantees of the #256 race-window fix, currently resting on manual reasoning with no regression net.

Suggestion: add integration tests for the WKApp instance methods (mock WKApp.apiClient / WKSDK / loadCachedNumericDeviceId) asserting at least: on cache hit, clientMsgDeviceId is optimistically set and connectIM is called immediately; the timeout branch does not throw an unhandled rejection; and when loggingOut=true, connectIM is not called.

@@ -883,7 +997,27 @@ export default class WKApp extends ProviderListener {

// 登出
logout() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] logout() concurrent in-flight guard and the clearLocalLoginState try/catch are untested.

logout() adds two defensive pieces: if (this.loggingOut) return to prevent concurrent re-entry, and a try/catch wrapping clearLocalLoginState (the comment explicitly notes: if clearLocalLoginState throws in Safari private mode / on QuotaExceeded with no catch, loggingOut stays permanently true and the reload is skipped, leaving the user permanently stuck). This is a clearly described severe failure mode, yet no test verifies 'still reload on throw' or 'concurrent logout only runs clearLocalLoginState once'. deviceFailure.test.ts only tests the three handler calls of the pure handleDeviceFetchError, and does not cover these two new WKApp.logout invariants at all.

Suggestion: add a test mocking clearLocalLoginState to throw, asserting window.location.reload is still called and loggingOut is set; and a test calling logout() twice in a row, asserting clearLocalLoginState runs only once.

// can be added to the staleLocalResourceCodes set in apiError.ts and
// routed here without touching the interceptor or business catch sites.
APIClient.shared.staleLocalResourceCallback = (code: string) => {
if (code === "err.server.user.device_not_found") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] The staleLocalResourceCallback dispatcher closure in module.tsx is untested; key handler behavior is unverified.

module.tsx assembles the real recovery handlers inside staleLocalResourceCallback: removeDeviceId = StorageService.shared.removeItem('deviceId'), resetInMemoryDeviceId = WKApp.shared.deviceId='', triggerLogout = WKApp.shared.logout(). deviceFailure.test.ts verifies these three handlers are called in the right order (with vi.fn() placeholders), but the wiring -- 'is the key passed to removeItem actually "deviceId"', 'is deviceId actually cleared', 'does a non-matching code avoid triggering' -- is never tested. APIClient.staleResource.test.ts only verifies the interceptor invokes the callback with the correct code, stopping at the callback boundary. Net result: the end-to-end chain from server device_not_found to actually clearing the local deviceId has a blind spot at the module.tsx assembly layer. The order assertion in deviceFailure (its comment stresses removeDeviceId must precede triggerLogout, else a reload race reuses the stale UUID and loops forever) protects placeholder fns, not the real production handlers in module.tsx.

Suggestion: add tests for the callback registered in module.tsx, asserting that on device_not_found StorageService.removeItem is called with 'deviceId', WKApp.shared.deviceId becomes empty, and a different code does not trigger.

APIClient.shared.staleLocalResourceCallback = (code: string) => {
if (code === "err.server.user.device_not_found") {
handleDeviceFetchError(
{ code } as any,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] module.tsx fakes an APIClientRejectedError with { code } as any, bypassing the type contract.

handleDeviceFetchError's signature declares its first parameter as a full APIClientRejectedError (deviceFailure.ts:23), but the only call site (module.tsx:218) passes { code } as any -- a code-only object with the type check erased via as any. This shows the function's parameter type (the whole rejected error) is over-engineered: it actually only needs the code string.

This is a sign of a mismatch between the abstraction boundary and real usage: the function claims to consume APIClientRejectedError (and deviceFailure.test.ts's makeRejected builds a pile of error/msg/normalized fields), but the production path has no real rejected-error object to pass (the interceptor only forwards code to the callback). The as any will mask breakage from future APIClientRejectedError shape changes.

Suggestion: if keeping the helper, narrow the signature to handleDeviceFetchError(code: string, handlers) or ({ code }: Pick<APIClientRejectedError,'code'>, handlers) to remove the as any; if the layer is deleted per the related comment, this is resolved too.

WKApp.dataSource.contactsSync();
ProhibitwordsService.shared.sync();
this.fetchAndCacheDeviceId();
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Device-error recovery has two paths; responsibility is split across the interceptor and App.startMain.

The same device_not_found failure is handled both via the central APIClient interceptor -> staleLocalResourceCallback -> logout path, and again explicitly distinguished in the .catch of App.startMain's fetchAndCacheDeviceId / awaitDeviceIdOrTimeout (if (err?.code !== 'err.server.user.device_not_found') then warn, else comment 'handled centrally'). That is, the device-fetch catch sites must know which codes are centrally handled vs. self-logged, leaking the interceptor's classification knowledge back into business catch blocks; the two must stay in sync.

This is a coupling point: ownership of device_not_found (central vs. local) is simultaneously encoded in apiError.ts's staleLocalResourceCodes and in the string comparisons in both App.tsx catches. Adding a new stale code means that if you forget to update the App.tsx catch, that code gets incorrectly warned as 'non-stale'.

Suggestion: have the catch sites stop hardcoding the code string and reuse isStaleLocalResourceApiError(err) to decide whether to stay silent (sharing one classification source with the interceptor), eliminating the string duplication; or document in a comment that this is intentional ('interceptor already handled; this is just a fallback log') and accept the coupling.

this.connectIM();
WKApp.dataSource.contactsSync(); // 同步通讯录
ProhibitwordsService.shared.sync(); // 同步敏感词
// Issue #256: dual-track to eliminate the clientMsgDeviceId=0 race window.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] Comments reference internal planning codenames (Plan F / Task E / Task 9 / Risk A / bundled fix #1) that are meaningless to later readers.

New code uses internal planning jargon as the subject of many comments: // Plan F: ... (App.tsx:849, 888, 919, 1000), // Task E guard: (875), // Plan F Task 9 hardening: (~1008), // Risk A guard: (895, 928), plus // Plan F: blocks in apiError.ts:33, deviceFailure.ts, module.tsx:207. These codenames exist only in this PR/issue discussion; once merged, readers cannot look up what 'Plan F' / 'Task E' mean, so the comments lose explanatory value and add noise.

Suggestion: drop the codename prefixes, keep the substantive 'what/why'. E.g. rewrite // Task E guard: if the await resolved because Plan F's interceptor fired ... as // if the device fetch triggered staleLocalResourceCallback -> logout(), loggingOut is set and a reload is imminent; return here to avoid 401 noise from requests on an already-cleared token. Put issue-linking info in one place (e.g. // see #256 / octo-web#76) instead of repeating it in every block.

// APIClient.normalizeApiError reads the body's inconsistent `http_status`
// field (=404) into err.status, not the actual HTTP status code. The code
// string is the stable backend contract — that's what we match on.
if (err.code === 'err.server.user.device_not_found') {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] The magic string err.server.user.device_not_found is scattered across 5 places with no single source of truth, hurting readability and consistency.

The same backend error-code string is hardcoded in: apiError.ts:41 (the staleLocalResourceCodes set), App.tsx:909 and 941 (two .catch blocks' err?.code !== "err.server.user.device_not_found"), deviceFailure.ts:30, and module.tsx:214 (if (code === "err.server.user.device_not_found")). A reader must cross-check 5 locations to confirm they refer to the same code, and any spelling drift in one place won't be caught by the compiler.

Suggestion: export a named constant in apiError.ts (e.g. export const DEVICE_NOT_FOUND_CODE = "err.server.user.device_not_found") and reference it everywhere; the set and every check then derive from a single source, improving readability and removing the spelling-mismatch risk.

//
// Risk A guard: same as fetchAndCacheDeviceId — skip writes if logout
// started between dispatch and resolve.
private async awaitDeviceIdOrTimeout(timeoutMs: number): Promise<void> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] fetchAndCacheDeviceId and awaitDeviceIdOrTimeout contain duplicated .then/.catch device-fetch blocks.

App.tsx:898-916 and 931-948 each inline a nearly verbatim WKApp.apiClient.get(/user/devices/${...}).then(res => { if(loggingOut) return; if(res?.id){ set + saveNumericDeviceId } }).catch(err => { if(err?.code !== "...device_not_found") console.warn(...) }) block. Only the log text differs. The duplication forces readers to diff word-by-word to confirm the two behave identically, and future edits risk changing only one copy.

Suggestion: extract a private method (e.g. private fetchDeviceIdAndApply(logLabel: string): Promise<void>) reused by both paths; fetchAndCacheDeviceId and awaitDeviceIdOrTimeout then keep only their own 'fire-and-forget' vs. 'race+timeout' shells.

@lml2468 lml2468 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ APPROVE — #274 9039a317(校准 workflow 的 request-changes → approve)

跑了 pr-review workflow(自动判 request-changes,2 个 "blocker"),但我独立核实了核心竞态修复与那两条 blocker:两条所谓 blocker 实为 major 级「缺测试」,非功能缺陷;核心 #256 竞态修复正确、CI 全绿、另三位 reviewer 已 approve。据此校准为 approve,缺测试列为强烈建议补的 fast-follow。

本 PR depends on #255 且把 #255 改动卷入自身 diff(#255 未合并)。#255 部分我已单独审过;此处聚焦 #274 增量(clientMsgDeviceIdCache.ts + startMain 双轨)。

核心竞态修复正确(已独立验证)

#256:出站消息在 clientMsgDeviceId=0 时带 <uuid>_0_3 而非真实 id。修复在 WKSDK.connect() 前读 sessionStorage 缓存:

  • 缓存命中 → 乐观设值 + 立即 connectIM/sync + 后台刷新;缓存未命中await awaitDeviceIdOrTimeout(5s) 再 connect。WS 永不在 id=0 时打开(除非首登超时,优雅降级到旧行为)。✓
  • loadCachedNumericDeviceId 返回 null(非 0)区分「无 id / 真 id」,与 SDK 默认 0 语义对齐;sessionStorage per-tab 与 UUID 同 scope,配对一致。✓
  • 未处理 rejection 防御严密awaitDeviceIdOrTimeoutfetchPromise 自带 inline .catch 使其 never-reject,只有 timeout 能让 Promise.race reject——迟到的 fetch rejection 不会变成 Uncaught。✓
  • logout 竞态守卫.thenif (this.loggingOut) return 避免回种已清空缓存;startMain 的 Task E guard 避免对已清 token 发 401 噪声。✓

🟡 强烈建议补(should-fix,缺测试;不阻断)

  1. startMain 双轨逻辑无测试 App.tsx:849 — 缓存命中快路径 / 首登 5s 超时降级 / 超时后迟到 fetch 写回不产生 Uncaught / loggingOut 短路,这些是 #256 修复的核心正确性保证,目前无回归网。建议 mock apiClient/WKSDK/loadCachedNumericDeviceId 补集成测试。
  2. logout 守卫 + try/catch 无测试 App.tsx:999loggingOut 并发守卫、clearLocalLoginState 抛错仍 reload(注释明确「否则用户永久卡死」)这两个不变量未测。建议 mock clearLocalLoginState 抛错断言 reload 仍调用 + 连续两次 logout 只清理一次。
  3. module.tsx 装配层无测试 :216 — interceptor→handler 的 'deviceId' key / deviceId='' / 非匹配 code 不触发未测(纯函数测的是 vi.fn() 桩)。

🟡 Fast-follow(非阻塞)

  1. loggingOut 守卫未覆盖 OIDC 主动登出(同我在 #255 提的那条,随 #255 卷入)— OIDC 分支不设 loggingOut,并发下双导航 + clearLocalLoginState 两次(幂等、终态都是登出,低危)。建议 OIDC 分支也设守卫。
  2. loadCachedNumericDeviceIdparseInt :22 — 接受 "42abc"→42 尾随垃圾。当前写路径只写正整数,但 sessionStorage 被脏数据污染时会静默接受。建议 /^\d+$/ 校验或 Number+Number.isInteger
  3. { code } as any module.tsx:218 — handler 只读 code,签名收窄为 Pick<…,'code'> 去掉 as any。
  4. device_not_found 魔法字符串散落 5 处 — 建议导出 DEVICE_NOT_FOUND_CODE 常量;catch 站点改用 isStaleLocalResourceApiError 复用分类源,消除字符串重复 + 双路径耦合。
  5. 首登路径串行延迟 :854 — connectIM/contactsSync/sync 被推到 device fetch 之后(最坏 5s)。有意取舍修 #256,但确是首登延迟回归;建议评估 connect 是否可不等 id 就绪、仅对首条消息门控,或下调超时到 2-3s。
  6. nitfetchAndCacheDeviceId/awaitDeviceIdOrTimeout 两处设备拉取代码块近乎逐字重复,建议抽 fetchDeviceIdAndApply(label)Plan F/Task E/Risk A 内部代号注释无法溯源,建议改引可定位 issue。

核心竞态修复正确、CI 全绿、零功能阻断 → APPROVE。三条缺测试请尽快补(逻辑复杂、分支多,最值得回归网保护)。


校准说明:workflow 这次又把 major 级「缺测试」判成 blocker 触发 request-changes(同 #365/#368 模式)。我读了 clientMsgDeviceIdCache.ts + startMain/awaitDeviceIdOrTimeout/fetchAndCacheDeviceId 全部源码,确认竞态修复、never-reject 防御、logout 守卫均正确,无功能缺陷,故校准为 approve。缺测试是真实 should-fix,但不应拦一个核心逻辑正确的竞态修复。

@github-actions

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity. Please add an update or it will be closed.

@github-actions github-actions Bot added the stale label Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: startMain connectIM races with device fetch — outbound messages built with clientMsgDeviceId=0 during the gap

5 participants