Design: Use libcartesi fast load/store to keep remote-machine-server offline during normal operation #725

@vfusco

Description

Context / Problem

With the upcoming libcartesi version 0.20.0, machines can be loaded/stored very quickly. This unlocks a potentially simpler and cheaper operational model: keep the cartesi-jsonrpc-machine offline for all running applications and instead load/store machine state as part of the input processing loop.

Today, the node architecture assumes an always-on remote machine service for running applications. The current input feed / execution loop and snapshotting strategy were not designed with fast load/store in mind, so adopting this capability will require design work, experimentation, and likely some refactoring.

Primary goal: reduce cloud footprint and cost by removing the need for a continuously running cartesi-jsonrpc-machine while preserving correctness, determinism, and the same externally observable results.

Goals

  • Allow running applications with cartesi-jsonrpc-machine offline, using load/store of machine state on demand.
  • Maintain deterministic execution and consistent outputs/hashes.
  • Keep a safe path for incremental rollout and fallback to the current mode if needed.

Non-goals

  • Full rewrite of the execution engine.
  • Introducing new emulator features beyond what is necessary for load/store integration.
  • Changing external APIs unless required (prefer internal implementation changes).

Proposed work (design + prototype)

1) Understand and validate the new primitives

  • Identify the exact libcartesi APIs/semantics for fast load/store (file format, performance profile, atomicity guarantees, compatibility).
  • Define error handling and recovery expectations (partial writes, corrupted snapshots, incompatible versions).

2) Define the target runtime model

Design how the node processes inputs without a remote machine service:

  • Where machine state lives (local disk, volume, object storage, etc.).
  • When machine state is loaded (per input, per epoch, etc.).
  • When machine state is stored (after each input, after each epoch, etc.).
  • Concurrency rules (single-writer per app; read-only access patterns; parallelism across apps).
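To make the target model concrete, here is a minimal sketch of a per-batch processing loop, in illustrative Python. The names `load`, `store`, and `advance` are hypothetical stand-ins for whatever libcartesi fast load/store and advance primitives the design settles on; they are not real APIs.

```python
from dataclasses import dataclass

@dataclass
class ProcessedInput:
    index: int
    outputs_hash: str

def process_batch(load, store, advance, snapshot_path, inputs):
    """Load machine state, advance it over a batch of inputs, store it back.

    The machine only exists in memory between load() and store(); no
    always-on remote machine service is involved.
    """
    machine = load(snapshot_path)              # fast load from local storage
    results = []
    for inp in inputs:
        results.append(advance(machine, inp))  # deterministic state transition
    store(machine, snapshot_path)              # fast store; machine goes offline again
    return results
```

The cadence questions above (per input vs. per epoch) then reduce to how `inputs` is batched and how often `store` is called inside the loop.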

3) Revisit snapshot strategy

Decide whether to:

  • Keep the current snapshot approach unchanged and only alter how machines are executed, OR
  • Replace the current snapshot approach with a new design.

Consider special cases:

  • Initial syncing / catch-up: do we avoid storing and only persist at the end? Or do periodic stores to support restarts?
  • Shutdown / restart semantics: what must be persisted to resume safely? Just the machine hash?
  • Disk usage vs recovery time tradeoffs.
  • Usage in disputes: which machine states must remain available for dispute resolution?
  • Garbage collection of old images from accepted epochs?
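One way to frame the garbage-collection question as code, as a hypothetical retention policy rather than a decision:

```python
def prune_snapshots(snapshots, last_accepted_epoch, keep_latest=1):
    """Return snapshot paths that are safe to delete.

    snapshots: dict mapping epoch number -> list of snapshot paths.
    Policy sketch (hypothetical): snapshots from epochs at or before
    `last_accepted_epoch` are no longer needed for disputes, but the
    newest `keep_latest` accepted epochs are kept as restart points so
    a node never has to replay from genesis.
    """
    accepted = sorted(e for e in snapshots if e <= last_accepted_epoch)
    keep = set(accepted[-keep_latest:]) if keep_latest > 0 else set()
    deletable = []
    for epoch in accepted:
        if epoch not in keep:
            deletable.extend(snapshots[epoch])
    return deletable
```

The `keep_latest` knob is where the disk-usage vs. recovery-time tradeoff above would surface in configuration.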

4) Decide how to phase out the always-online cartesi-jsonrpc-machine mode

  • Optional flag/config to enable “offline machine mode” per environment.
  • Explicit fallback strategy: if load/store fails, can we fall back to the remote server, or should we fail fast?
  • Observability requirements: metrics/logs to compare performance and detect regressions.
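A sketch of what the flag and fallback decision could look like as configuration (names and defaults are illustrative, not proposed APIs):

```python
from dataclasses import dataclass
from enum import Enum

class MachineMode(Enum):
    REMOTE = "remote"    # current always-on cartesi-jsonrpc-machine
    OFFLINE = "offline"  # proposed load/store-on-demand mode

@dataclass
class MachineRuntimeConfig:
    mode: MachineMode = MachineMode.REMOTE  # default preserves current behavior
    fallback_to_remote: bool = True         # on load/store failure, revert per app
    fail_fast: bool = False                 # alternative: abort loudly instead

def mode_after_failure(cfg: MachineRuntimeConfig) -> MachineMode:
    """Resolve the fallback question for a failed fast load/store."""
    if cfg.mode is MachineMode.OFFLINE and cfg.fallback_to_remote:
        return MachineMode.REMOTE
    if cfg.fail_fast:
        raise RuntimeError("offline machine mode failed and fail_fast is set")
    return cfg.mode
```

Defaulting to the current remote mode keeps the rollout incremental: environments opt in per application, and a failure path is explicit rather than implied.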

5) Prototype

  • Implement a minimal prototype behind a feature flag that can:
    • Load machine for an app,
    • Process a small input batch,
    • Store machine,
    • Restart and resume from stored state,
    • Produce identical outputs/hashes compared to the current mode.
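The last prototype requirement, output parity between modes, can be checked with a simple digest comparison. This is a generic sketch (any hash over the ordered output bytes works), not the node's actual verification path:

```python
import hashlib

def outputs_digest(outputs):
    """Collapse an ordered list of output byte strings into one digest.

    Each output is length-prefixed so that, e.g., [b"ab"] and
    [b"a", b"b"] produce different digests.
    """
    h = hashlib.sha256()
    for out in outputs:
        h.update(len(out).to_bytes(8, "big"))
        h.update(out)
    return h.hexdigest()

def check_parity(outputs_current_mode, outputs_offline_mode):
    """Functional parity: both modes must yield byte-identical outputs in order."""
    return outputs_digest(outputs_current_mode) == outputs_digest(outputs_offline_mode)
```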

Deliverables

  • Design doc (in-repo) covering:
    • runtime model (load/store cadence, storage layout, concurrency)
    • snapshot strategy decision
    • initial sync behavior
    • failure modes & recovery strategy
    • rollout plan + flags/config knobs
  • Prototype implementation behind a feature flag:
    • Offline mode that does not keep cartesi-jsonrpc-machine running
    • Minimal integration into the input feed loop
  • Validation plan + results:
    • Functional parity (outputs/hashes) vs current mode on at least one representative dapp workload
    • Basic performance measurements (e.g., startup time, per-input processing time, disk I/O)
  • Follow-up issues created for:
    • production hardening / machine storage
    • metrics/alerts
    • CI coverage additions

Acceptance criteria

  • A written design is available and reviewed by relevant stakeholders.
  • Prototype can run an application end-to-end with cartesi-jsonrpc-machine shut down after each input and still:
    • produce the same outputs and computation/machine hashes as the current architecture (within the same libcartesi version)
    • survive a restart and resume from persisted machine state
  • The design includes a clear decision on snapshot handling (keep vs change) and justifies it with tradeoffs.

Notes / Open questions to resolve in the design

  • How do we ensure atomic stores and prevent corruption on crash/power loss?
  • How do we bound disk growth across many apps and long-running nodes? What retention policy do we need?
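For the atomicity question, the standard POSIX pattern (write to a temp file in the same directory, fsync, then rename over the destination) is one likely answer, sketched here under the assumption that snapshots live on a local filesystem:

```python
import os
import tempfile

def atomic_store(data: bytes, dest_path: str) -> None:
    """Crash-safe store sketch: write a temp file in the destination's
    directory, fsync it, then atomically rename it into place (POSIX
    rename is atomic within a filesystem). After a crash or power loss,
    dest_path holds either the old or the new snapshot, never a partial
    write."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".snapshot-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # data is on disk before the rename
        os.rename(tmp_path, dest_path)
        dir_fd = os.open(dest_dir, os.O_RDONLY)
        try:
            os.fsync(dir_fd)           # make the rename itself durable
        finally:
            os.close(dir_fd)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Whether this composes with the libcartesi store format (single file vs. directory of files) is one of the things step 1 needs to establish.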
