Skip to content

feat(pegboard): actor startup KV preloading#4452

Open
NathanFlurry wants to merge 1 commit intomainfrom
ralph/actor-startup-kv-preload
Open

feat(pegboard): actor startup KV preloading#4452
NathanFlurry wants to merge 1 commit intomainfrom
ralph/actor-startup-kv-preload

Conversation

@NathanFlurry
Copy link
Member

Summary

  • Eliminate KV round-trips during actor startup by pre-fetching required keys from FoundationDB and delivering them alongside the CommandStartActor protocol message (v8)
  • Preloads persist data, connections, inspector token, queue metadata, workflow data, and SQLite VFS chunks in a single FDB snapshot transaction
  • Adds PreloadMap interface with binary search for efficient subsystem consumption, WriteCollector for batching new-actor initialization writes, and #expectNoKvRoundTrips detection flag
  • Configurable per-deployment limits (1 MB global, 768 KB SQLite, 128 KB workflows, 64 KB connections) with per-actor overrides via actor config options

Implementation

Engine (Rust)

  • US-001: Preload config fields in engine/packages/config/src/config/pegboard.rs
  • US-002: v8 runner protocol schema with PreloadedKv type and v7/v8 converters
  • US-003: batch_preload() function in actor_kv/preload.rs with single FDB snapshot transaction
  • US-004: Populate preloaded KV at send time in InsertAndSendCommands activity (not persisted in workflow history)
  • US-005: Runner stores preload config from prepopulateActorNames metadata

TypeScript (RivetKit)

  • US-006: Parse preloaded KV in runner SDK from v8 protocol
  • US-007: Per-actor preload size overrides in actor config schema
  • US-008: PreloadMap with binary search (no hex string conversion)
  • US-009: Subsystems consume PreloadMap before issuing KV reads
  • US-010: SQLite VFS preload integration with partial preload support
  • US-011: Wire preloaded data into ActorInstance.start()
  • US-012: WriteCollector batches new-actor init writes into single kvBatchPut
  • US-013: #expectNoKvRoundTrips flag warns on unexpected KV reads during startup
  • US-014: Eliminate redundant persist data read in engine driver

Bugfixes (from adversarial review)

  • US-016: Fix requested_prefixes/requested_get_keys including skipped entries (critical: prevented silent data loss for actors exceeding preload budgets)
  • US-017: Fix entry_size to include metadata in budget calculation
  • US-018: Call clearPreload() on SQLite VFS store after migration (768 KB memory leak fix)
  • US-019: Remove dead actor_preload_configs field from runner conn
  • US-020: Avoid building PreloadMap twice from same data

Docs

  • US-015: Updated limits.mdx and configuration.mdx with preload limits

Test plan

  • File system driver tests pass (pnpm test driver-file-system)
  • Engine driver tests pass (pnpm test driver-engine)
  • cargo build -p pegboard succeeds
  • cargo build -p rivet-runner-protocol succeeds
  • cargo build -p rivet-pegboard-runner succeeds
  • Verify actor startup with preloaded data (new and existing actors)
  • Verify actor startup without preloaded data (v7 engine fallback)
  • Verify connections/workflows exceeding budget fall back to KV reads
  • Verify SQLite partial preload with KV fallback on miss

🤖 Generated with Claude Code

@railway-app
Copy link

railway-app bot commented Mar 18, 2026

🚅 Deployed to the rivet-pr-4452 environment in rivet-frontend

Service Status Web Updated (UTC)
website 😴 Sleeping (View Logs) Web Mar 24, 2026 at 8:34 am
frontend-cloud 😴 Sleeping (View Logs) Web Mar 24, 2026 at 8:33 am
frontend-inspector 😴 Sleeping (View Logs) Web Mar 24, 2026 at 8:33 am
mcp-hub ✅ Success (View Logs) Web Mar 24, 2026 at 8:23 am
ladle ❌ Build Failed (View Logs) Web Mar 18, 2026 at 6:20 am

@pkg-pr-new
Copy link

pkg-pr-new bot commented Mar 18, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4452

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4452

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4452

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4452

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4452

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4452

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4452

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4452

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4452

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4452

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4452

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4452

commit: 27824e8

@claude
Copy link

claude bot commented Mar 18, 2026

PR Review: feat(pegboard): actor startup KV preloading

This is a well-architected optimization. The three-way preload return semantics (Uint8Array / null / undefined), budget tracking in a single FDB snapshot transaction, and WriteCollector batching are solid design choices. The adversarial review bugfixes (US-016 through US-020) show thorough self-review. Some issues to address:


Bug: get-key budget check adds key to requested_get_keys before value-size check

File: engine/packages/pegboard/src/actor_kv/preload.rs

In the get-keys loop, a key is pushed to requested_get_keys unconditionally before checking whether the entry value fits within the budget. If a key exists in FDB but its value exceeds the remaining budget, the key lands in requested_get_keys but not in entries. On the TypeScript side, preload.get(KEY) returns null ("requested but not found"), causing the subsystem to treat it as absent instead of falling back to KV. For PERSIST_DATA specifically, this would start the actor with missing state.

Fix: only add to requested_get_keys after confirming the entry fits budget. Keys absent from FDB (builder stays None) should still be recorded since "not found" is a valid preloaded answer:

if let Some(b) = builder {
    let (k, v, m) = b.build()?;
    let size = entry_size(&k, &v, &m);
    if total_bytes + size <= max_total_bytes {
        total_bytes += size;
        requested_get_keys.push(key.clone()); // only after confirming it fits
        entries.push(...);
    }
    // else: not in requested_get_keys, actor falls back to KV (correct)
} else {
    // Key not found in FDB; record as scanned-but-absent.
    requested_get_keys.push(key.clone());
}

Minor: unwrap_or_default() silently uses empty actor name

File: engine/packages/pegboard/src/workflows/actor/runtime.rs, insert_and_send_commands

The .unwrap_or_default() call when extracting the actor name from commands silently falls back to an empty string. This is guarded by has_start_cmd above, so find_map should always succeed. An empty actor name produces wrong preload config with no error. Prefer an explicit error:

.ok_or_else(|| anyhow::anyhow!("CommandStartActor not found after has_start_cmd check"))?,

Nit: expectNoKvRoundTrips only warns once per startup

File: rivetkit-typescript/packages/rivetkit/src/actor/instance/mod.ts

The warnUnexpectedKvRoundTrip method resets the flag to false after the first warning, which silences all subsequent unexpected KV round-trips during startup. If multiple subsystems fall back to KV (e.g., after a preload regression), only the first will be visible in logs. Consider logging without resetting, or deduplicating by method name.


Nit: .ralph/ planning artifacts in the repo

Files: .ralph/prd.json, .ralph/progress.txt, scripts/ralph/progress.txt

These look like AI-assisted planning artifacts. Unless the team intentionally tracks them, they probably belong in .gitignore rather than committed to source.


Minor: prefix_range trailing-zero assumption

File: engine/packages/pegboard/src/actor_kv/preload.rs

The prefix_range function silently no-ops if the trailing 0 byte is absent when popping from the packed key. A silent no-op produces an incorrect range. If the null terminator is guaranteed by tuple encoding, an assertion would catch regressions immediately:

assert_eq!(start.pop(), Some(0), "expected null terminator from tuple encoding");

Overall

The implementation is solid. The critical item is the get-key budget / requested_get_keys tracking bug - the rest are minor. The version migration plumbing (v7 to v8 converters, backward-compat preloaded_kv: None for old runners) is correct, and the missed-commands reconnect path injecting fresh preloaded KV is an important correctness detail that's handled well.

@claude
Copy link

claude bot commented Mar 18, 2026

Code Review: feat(pegboard): actor startup KV preloading

This is a substantial and well-implemented feature. The adversarial review pass (US-016–US-020) already caught and fixed several critical correctness issues. Here are remaining observations.


Issues

1. Fragile prefix range construction in preload.rs

The prefix_range function silently no-ops if the last byte of the packed key is not 0:

if let Some(&0) = start.last() {
    start.pop();
}

If the tuple encoding changes or an empty prefix is passed, the computed range is wrong with no error or assertion. Consider asserting the byte is present or returning a Result.

2. Preloaded KV fetch can fail actor startup with no fallback

fetch_preloaded_kv is documented as "no silent fallback", so a transient FDB error during preloading fails the entire CommandStartActor. Since preloading is a pure optimization, falling back to the existing non-preloaded startup path on FDB error would improve resilience without risking correctness.

3. #expectNoKvRoundTrips silences after first warning

If a subsystem makes multiple unexpected KV reads, only the first is reported (the flag is cleared after the first hit). This makes the detection less useful when multiple regressions are introduced at once. Consider logging all occurrences during startup and clearing the flag only at the end of start().

4. preloaded_kv cloned per command in insert_and_send_commands

preloaded_kv can be up to 1 MB. The clone is correct (computed once and reused), but if multiple CommandStartActor commands were ever batched together this would clone 1 MB per command. A comment explaining that only one start command is expected per batch would help future reviewers.

5. Missed-command preload in conn.rs reads metadata once per command

fetch_preloaded_kv internally reads the actor name metadata from FDB before calling batch_preload. In conn.rs this runs inside a for loop over missed commands, so the metadata key is re-read once per missed CommandStartActor. Unlikely in practice but worth a comment explaining why this is acceptable.

6. clearPreload() not called on error paths in db/mod.ts and db/drizzle/mod.ts

kvStoreRef?.clearPreload() is called in onReady (after migrations) but not if createClient throws. If migrations fail, preloaded entries remain in memory until GC. Not a correctness issue but deterministic cleanup would be cleaner.


Positive observations

  • Three-way return semantics on PreloadMap.get() (Uint8Array / null / undefined) is the right design. The requested_get_keys/requested_prefixes tracking in Rust correctly ensures actors never mistake "not scanned" for "scanned and not found" -- the US-016 fix is critical.

  • partial: true for SQLite, partial: false for connections/workflows is semantically correct. SQLite VFS can handle individual cache misses; a partial connection list would give an inconsistent view.

  • WriteCollector cleanly batches 3 sequential init writes into 1. The fallback to individual kvBatchPut when no collector is provided maintains backward compatibility.

  • Protocol versioning looks correct: PROTOCOL_MK2_VERSION is bumped in both Rust and TypeScript, and the v7-to-v8 converters handle preloaded_kv: None for older runners properly.

  • Hardcoded byte prefixes in build_startup_preload_params are a maintenance risk. The CLAUDE.md guidance is thorough, but a compile-time or test-time assertion comparing them to the TypeScript key constants would catch out-of-sync regressions before they reach production.

  • Test coverage for PreloadMap and WriteCollector is solid. An integration test asserting #expectNoKvRoundTrips never fires during a standard actor startup would make the optimization regression-proof going forward.


Minor

  • preloadedKvStr in stringify.ts checks === null but the TypeScript type is PreloadedKv | undefined. At runtime (bare deserialization) both may appear; the check is practically correct but tightening the type would remove the ambiguity.
  • StreamingMode::WantAll for exact get-key lookups and StreamingMode::Iterator for prefix scans are the right choices for their respective access patterns.

@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4452 March 18, 2026 06:31 Destroyed
Copy link
Member Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry force-pushed the ralph/actor-startup-kv-preload branch from 27824e8 to 7cb044c Compare March 24, 2026 07:32
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4452 March 24, 2026 07:32 Destroyed
@NathanFlurry NathanFlurry force-pushed the ralph/actor-startup-kv-preload branch from 7cb044c to c7a03c0 Compare March 24, 2026 07:34
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4452 March 24, 2026 07:34 Destroyed
Preloads KV data during actor startup to avoid sequential round-trips
to FoundationDB. The engine reads KV proactively and includes the data
in the CommandStartActor protocol message (v8). The TypeScript runtime
reads from a local PreloadMap instead of issuing KV calls.

Warm start: 5+ round-trips -> 0. Cold start: 8 -> 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 24, 2026

PR Review: Actor Startup KV Preloading

This is a well-structured optimization that eliminates FDB round-trips during actor startup. The three-way semantics (Uint8Array | null | undefined) for distinguishing "found", "not found but was queried", and "not preloaded" are elegant and correct. Here are my observations:


Critical: WriteCollector.flush() is not protected by try/finally in start()

In mod.ts (around line 3800-3814 of the diff), the initialization sequence is:

await this.#loadState(preload, writeCollector);
await this.queueManager.initialize(preload, writeCollector);
await this.#initializeInspectorToken(preload, writeCollector);
await writeCollector.flush();  // never reached if any of the above throws

If any of those three awaits throws, flush() is never called and new-actor initialization writes (persist data, queue metadata, inspector token) are silently dropped. The actor would subsequently start without its initial state written. Suggest wrapping in try/finally:

try {
    await this.#loadState(preload, writeCollector);
    await this.queueManager.initialize(preload, writeCollector);
    await this.#initializeInspectorToken(preload, writeCollector);
} finally {
    await writeCollector.flush();
}

High: Rust-TypeScript byte prefix sync is load-bearing and fragile

engine/packages/pegboard/src/actor_kv/preload.rs hardcodes byte prefixes (e.g., vec![8u8, 1], vec![6u8, 1], vec![2u8]) that must match keys.ts constants. The prefixes are currently in sync, but updating STORAGE_VERSION.SQLITE in keys.ts without updating Rust would silently preload the wrong data with no error. This is documented in CLAUDE.md, but worth adding an automated check (e.g., a unit test that verifies the preload params match the TS key constants) so drift is caught automatically rather than relying on human diligence.


Medium: Protocol metadata included but discarded in TypeScript

v8.bare defines PreloadedKvEntry with a metadata: KvMetadata field (version + timestamp), but preload-map.ts buildPreloadMap() only uses key and value, discarding metadata entirely. This wastes bandwidth (every preloaded entry carries an unnecessary version vec + i64). If metadata is intentionally excluded, consider dropping it from PreloadedKvEntry in a future schema version, or add a comment explaining the intentional discard.


Medium: KV_PRELOADING.md has an incorrect prefix for SQLite

The table in KV_PRELOADING.md lists SQLite prefix as [8], but the actual prefix used is [8, 1] (SQLITE_PREFIX + STORAGE_VERSION.SQLITE). The Rust code correctly uses vec![8u8, 1]. A maintainer following the doc would hardcode the wrong prefix.


Low: Binary search helper duplication in preload-map.ts

The file has three separate binary search implementations: binarySearch(), binarySearchExists(), and an inline lower-bound search in listPrefix(). The listPrefix() version replicates the same logic without reusing the helpers. Extracting a shared binarySearchLowerBound() helper would reduce duplication and the surface area for divergence.


Low: WARN_UNEXPECTED_KV_ROUND_TRIP warning silences itself after one hit

After the first warning, #expectNoKvRoundTrips is set to false, so a second unexpected round-trip during the same startup is silently ignored. Since this is a development-only diagnostic, consider logging all occurrences or using a counter so engineers can see the full scope of missed preloads.


Non-issues / Observations

  • The partial: true vs partial: false semantics for SQLite vs connections/workflows are clearly reasoned and correctly implemented in both Rust and TypeScript.
  • The decision to fetch preloaded KV at send time (not persisted in workflow history) is correct and avoids stale data on replay.
  • The requested_get_keys / requested_prefixes fix (US-016) is critical and correctly prevents the actor from treating "skipped due to budget" as "found and empty". Good catch.
  • The entry_size fix (US-017) including metadata bytes in budget calculation prevents underestimating the budget for entries with version history.
  • All byte prefix constants are currently in sync between Rust and TypeScript (verified against keys.ts).
  • WriteCollector is a clean, minimal abstraction. The empty-flush short-circuit is correct.
  • The #workflowPreloadEntries lifetime (held through startup, released eagerly after) is correctly bounded.

Overall this is solid work. The main items worth addressing before merge are the try/finally on writeCollector.flush() and the KV_PRELOADING.md documentation inconsistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant