Skip to content

[CLAUDE ROUTINE]: Reliability enhancement — make writeCacheEntry in lib/codex-sessions.ts race-safe so concurrent dashboard requests can't clobber each other's cache writes #276

@NiveditJain

Description

@NiveditJain

Summary

writeCacheEntry keeps a small ~/.failproofai/cache/codex-session-paths.json map that lets findCodexTranscript skip the multi-level directory scan on subsequent lookups — a nice perf win. The current implementation does a classic read-modify-write cycle (readCache() → mutate → writeFileSync) without any locking or atomic rename, so when the dashboard kicks off several Codex session lookups in parallel (which it does on the projects page), two writes that race can stomp on each other and one of the entries vanishes from the cache.

This is purely a "miss" — the next lookup just falls back to the directory scan and re-populates — so the user-visible effect is "the page is occasionally a hair slower than it should be." Tightening this is a friendly perf-on-the-margin improvement that aligns Codex with the atomic-write pattern we already use in hook-activity-store.ts.

Where

lib/codex-sessions.ts:50-59

function writeCacheEntry(sessionId: string, path: string): void {
  try {
    mkdirSync(dirname(CACHE_PATH), { recursive: true });
    const cache = readCache();          // ← step 1: read whole map
    cache[sessionId] = path;            // ← step 2: mutate in memory
    writeFileSync(CACHE_PATH, JSON.stringify(cache), "utf-8");  // ← step 3: clobber-write
  } catch {
    // Cache is best-effort
  }
}

For comparison, src/hooks/hook-activity-store.ts:201-209 already uses the right pattern:

const tmpPath = join(storeDir, `stats.json.${process.pid}.tmp`);
writeFileSync(tmpPath, JSON.stringify(s), "utf-8");
renameSync(tmpPath, join(storeDir, STATS_FILE));   // atomic on POSIX

…and src/auth/token-store.ts:46-56 does a similar tmp-file-then-rename for auth.json.

Why this matters

sequenceDiagram
    participant R1 as Request A<br/>(sessionId X)
    participant R2 as Request B<br/>(sessionId Y)
    participant FS as codex-session-paths.json<br/>{}

    R1->>FS: readCache → {}
    R2->>FS: readCache → {}
    R1->>R1: cache[X] = pathX → {X: pathX}
    R2->>R2: cache[Y] = pathY → {Y: pathY}
    R1->>FS: writeFileSync({X: pathX})
    R2->>FS: writeFileSync({Y: pathY})
    Note over FS: Final state = {Y: pathY}<br/>X is silently dropped
Loading

How this surfaces today:

  • The dashboard's projects view calls getCodexSessionLog(sessionId) for many sessions in parallel via batchAll (lib/concurrency.ts). Each one hits findCodexTranscript, and the misses each call writeCacheEntry.
  • A few of those writes will overlap; whichever lands last wins.
  • On the next page load, the dropped entries miss the cache and fall back to the full year/month/day scan — visible as a small extra latency for those few sessions.
  • Worst case in disk-write timing: writeFileSync is non-atomic, so if the process is interrupted mid-write the file can also be left as a partial JSON, which readCache's try/catch already converts into "treat as empty" — so we never crash, we just throw away the whole cache. Annoying but not destructive.

There is no security or data-integrity impact — this cache is purely an optimisation, and findCodexTranscript always falls back to the on-disk truth when the cache is wrong.

Suggested approach

The simplest fix mirrors what hook-activity-store.ts and token-store.ts already do — and it doesn't introduce any new locking primitives:

function writeCacheEntry(sessionId: string, path: string): void {
  try {
    mkdirSync(dirname(CACHE_PATH), { recursive: true });
    const cache = readCache();
    cache[sessionId] = path;
    const tmp = `${CACHE_PATH}.${process.pid}.tmp`;
    writeFileSync(tmp, JSON.stringify(cache), "utf-8");
    renameSync(tmp, CACHE_PATH);   // atomic replace on POSIX
  } catch {
    // best-effort
  }
}

That alone removes the partial-write window. To also preserve the other request's entry under the race shown above, batch the read+write under a tiny advisory lock — same shape as acquireLock/releaseLock at src/hooks/hook-activity-store.ts:85-110. Optional and only worth it if telemetry shows the cache miss rate is high.

A nice cherry on top: bound the cache to, say, 500 entries with simple FIFO drop, so a long-running dashboard process doesn't grow codex-session-paths.json unboundedly. (Not strictly required to close this issue.)

Customer impact

  • Today: Dashboard projects page may take a small extra hop on subsequent loads for the small fraction of sessions whose cache entry got clobbered. Imperceptible for ≤ 100 sessions; noticeable for users with thousands of Codex sessions.
  • After fix: Cache hit rate is what we'd intuit it to be from the code; partial-write windows go away; behaviour matches the rest of the codebase's atomic-write pattern.
  • Zero impact on correctness or hot path — writeCacheEntry is already fire-and-forget.

Acceptance criteria

  • writeCacheEntry writes to a tmp path and renames into place.
  • readCache never sees a half-written JSON blob.
  • (Stretch) writeCacheEntry preserves concurrent writers' entries under load.
  • Optional: small FIFO bound on cache size.

Related

https://claude.ai/code/session_01SAwaAnE9bTuLujnvksoCYN

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions