Skip to content

Filedb improv#537

Closed
satyakigh wants to merge 2 commits intomainfrom
filedb-improv
Closed

Filedb improv#537
satyakigh wants to merge 2 commits intomainfrom
filedb-improv

Conversation

@satyakigh
Copy link
Copy Markdown
Collaborator

@satyakigh satyakigh commented Apr 23, 2026

Purpose

Prepare the CloudFormation Language Server to migrate its persistence layer from LMDB to an encrypted file-based datastore on all platforms (currently only Windows uses it). The migration is gated behind a new feature flag and the file store itself is hardened for production use.


Two Core Changes

1. New FileDb Feature Flag

A new feature flag controls whether the file-based datastore is used instead of LMDB on non-Windows systems.

Environment Enabled Fleet %
Alpha Yes 100%
Beta No 100%
Prod No 0%

Only alpha is opted in

2. EncryptedFileStore Hardening

The file-based datastore received significant reliability improvements across six areas:

  • Atomic writes with fsync — After writing a temp file, data is flushed to disk before the atomic rename. The parent directory is also fsynced to ensure the directory entry is durable. Windows silently skips directory fsync since NTFS journaling handles durability.

  • Rename retry logic — File renames now retry up to 10 times on transient OS errors (EPERM, EACCES, EBUSY, ENOENT). This handles Windows-specific issues like antivirus or indexers holding file handles. Failed temp files are cleaned up on final failure.

  • Serialized write queue — Writes are chained through an internal promise queue so multiple async operations within the same process don't interleave. Each write waits for the previous one to finish before acquiring the cross-process file lock.

  • Improved locking strategy — The lock target changed from the data file itself to the parent directory with an explicit lockfile path. This avoids issues where the data file gets replaced via rename while a lock still references it. Lock retry parameters were retuned: more retries, shorter intervals, and randomized jitter to reduce contention. The synchronous lock used during construction now retries instead of failing immediately.

  • Stale temp file cleanup — On startup, the store scans its directory and removes leftover .tmp files from prior crashed processes. Only files matching the store's own name prefix are removed.

  • Resilience to external file deletion — If the data file is deleted between operations (e.g., by another process), the store gracefully starts from an empty state instead of throwing an error.


@satyakigh satyakigh requested a review from a team as a code owner April 23, 2026 20:36
Comment thread tools/telemetry-generator.ts
const LOCK_OPTIONS: LockOptions = { ...LOCK_OPTIONS_SYNC, retries: { retries: 20, minTimeout: 50, maxTimeout: 1000 } };
const STALE_MS = 30_000;
const LOCK_OPTIONS_SYNC: LockOptions = { stale: STALE_MS, realpath: false };
const LOCK_OPTIONS: LockOptions = {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cleanupStaleTmpFiles() runs before the lock is acquired. Since tmp files are named {name}.enc.{pid}.{counter}.tmp and the cleanup matches any PID prefix for this store name, there's a small race with other language server processes (e.g. multiple VS Code windows sharing the same storageDir) — process A could delete a .tmp file that process B is about to rename.

Moving this call inside the lock block (after lockSyncWithRetry, before the existsSync check) would close that window.

this.saveSync();
}
} else {
this.saveSync();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lockSyncWithRetry does up to 200 × 50ms of Atomics.wait, which hard-blocks the Node.js event loop for up to 10 seconds. This runs in the constructor during LSP initialization, so the editor gets no responses while it's spinning.

The main scenario: process A crashes while holding the lock. The stale timeout is 30s, so process B retries synchronously until proper-lockfile considers the lock stale. If that takes longer than the 10s retry budget, the server fails to start.

This is an improvement over the previous no-retry lockSync (which would just fail immediately), but worth noting the tradeoff. Could the constructor fail fast and defer store initialization to an async path to avoid blocking?

@satyakigh satyakigh closed this Apr 26, 2026
@satyakigh satyakigh deleted the filedb-improv branch May 1, 2026 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants