Context / Problem
With the upcoming libcartesi version 0.20.0, machines can be loaded/stored very quickly. This unlocks a potentially simpler and cheaper operational model: keep the cartesi-jsonrpc-machine offline for all running applications and instead load/store machine state as part of the input processing loop.
Today, the node architecture assumes an always-on remote machine service for running applications. The current input feed / execution loop and snapshotting strategy were not designed with fast load/store in mind, so adopting this capability will require design work, experimentation, and likely some refactoring.
Primary goal: reduce cloud footprint and cost by removing the need for a continuously running cartesi-jsonrpc-machine while preserving correctness, determinism, and the same externally observable results.
Goals
- Allow running applications with cartesi-jsonrpc-machine offline, using load/store of machine state on demand.
- Maintain deterministic execution and consistent outputs/hashes.
- Keep a safe path for incremental rollout and fallback to the current mode if needed.
Non-goals
- Full rewrite of the execution engine.
- Introducing new emulator features beyond what is necessary for load/store integration.
- Changing external APIs unless required (prefer internal implementation changes).
Proposed work (design + prototype)
1) Understand and validate the new primitives
- Identify the exact libcartesi APIs/semantics for fast load/store (file format, performance profile, atomicity guarantees, compatibility).
- Define error handling and recovery expectations (partial writes, corrupted snapshots, incompatible versions).
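A thin abstraction over load/store could make the expected error categories and atomicity requirements explicit before committing to a design. This is only a hypothetical Go sketch: MachineStore, MachineHandle, and the error values are placeholders to be replaced by whatever libcartesi 0.20.0 and its bindings actually expose.

```go
package machinestate

import "errors"

// Error categories the design must distinguish, whatever the final API is.
var (
	ErrSnapshotCorrupted   = errors.New("machine snapshot is corrupted or truncated")
	ErrIncompatibleVersion = errors.New("snapshot was created by an incompatible libcartesi version")
)

// MachineStore abstracts fast load/store of machine state so node code does
// not depend directly on the emulator bindings.
type MachineStore interface {
	// Load reads a machine snapshot from path and returns an opaque handle.
	Load(path string) (MachineHandle, error)
	// Store persists the machine state at path. It must either complete fully
	// or leave any previous snapshot at path untouched (atomicity requirement).
	Store(m MachineHandle, path string) error
}

// MachineHandle is a placeholder for whatever handle the bindings provide.
type MachineHandle interface {
	// RootHash returns the machine's merkle root hash, used for parity checks.
	RootHash() ([32]byte, error)
	Close() error
}
```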
2) Define the target runtime model
Design how the node processes inputs without a remote machine service (a minimal loop is sketched after this list):
- Where machine state lives (local disk, volume, object storage, etc.).
- When machine state is loaded (per input, per epoch, etc.).
- When machine state is stored (after each input, after each epoch, etc.).
- Concurrency rules (single-writer per app; read-only access patterns; parallelism across apps).
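As a concrete starting point for that discussion, one possible per-input cadence is sketched below, reusing the MachineStore/MachineHandle placeholders from the earlier sketch. The advance parameter stands in for however an input is actually fed to the machine; none of these names are existing node APIs, and per-epoch load/store is an equally valid cadence.

```go
// processNextInput loads machine state on demand, advances it by one input,
// and stores the result, instead of talking to a long-running
// cartesi-jsonrpc-machine. Single-writer access per application is assumed
// to be enforced by the caller.
func processNextInput(
	store MachineStore,
	snapshotPath string,
	input []byte,
	advance func(MachineHandle, []byte) error, // how one input is applied to the machine
) error {
	m, err := store.Load(snapshotPath) // load state only when there is work to do
	if err != nil {
		return err
	}
	defer m.Close()

	if err := advance(m, input); err != nil {
		return err
	}

	// Store after every input; storing per epoch instead trades durability
	// for fewer disk writes.
	return store.Store(m, snapshotPath)
}
```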
3) Revisit snapshot strategy
Decide whether to:
- Keep the current snapshot approach unchanged and only alter how machines are executed, OR
- Replace the current snapshot approach with a new design.
Consider special cases:
- Initial syncing / catch-up: do we avoid storing and only persist at the end? Or do periodic stores to support restarts?
- Shutdown / restart semantics: what must be persisted to resume safely? Just the machine hash?
- Disk usage vs recovery time tradeoffs.
- Usage during disputes.
- Garbage collection of old images from accepted epochs?
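For the garbage-collection question, a retention routine along these lines could run once an epoch is accepted. This is only a sketch: the epoch-<n> directory layout is illustrative, and dispute windows may require keeping more than keepLast epochs.

```go
package snapshots

import (
	"fmt"
	"os"
	"path/filepath"
)

// pruneSnapshots removes machine images of epochs that are old enough to no
// longer be needed for restarts or disputes. It assumes an illustrative flat
// layout of one <dir>/epoch-<n> snapshot per stored epoch.
func pruneSnapshots(dir string, lastAcceptedEpoch, keepLast uint64) error {
	if lastAcceptedEpoch < keepLast {
		return nil // nothing old enough to delete yet
	}
	for epoch := uint64(0); epoch <= lastAcceptedEpoch-keepLast; epoch++ {
		path := filepath.Join(dir, fmt.Sprintf("epoch-%d", epoch))
		// RemoveAll is a no-op when the snapshot was already pruned.
		if err := os.RemoveAll(path); err != nil {
			return fmt.Errorf("pruning snapshot for epoch %d: %w", epoch, err)
		}
	}
	return nil
}
```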
4) Decide how to phase out the always-online cartesi-jsonrpc-machine mode
- Optional flag/config to enable “offline machine mode” per environment.
- Explicit fallback strategy: if load/store fails, can we fall back to the remote server, or should we fail fast? (See the configuration sketch after this list.)
- Observability requirements: metrics/logs to compare performance and detect regressions.
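The rollout switch could be as small as a config block like the one below. Field names, defaults, and how they are surfaced (flags, env vars, config file) are all placeholders for the design doc to pin down.

```go
// MachineRuntimeConfig is a hypothetical configuration sketch for an
// incremental rollout of offline machine mode.
type MachineRuntimeConfig struct {
	// OfflineMode enables load/store-based execution instead of keeping
	// cartesi-jsonrpc-machine running for each application.
	OfflineMode bool
	// FallbackToRemote retries an input through the remote server when a
	// local load/store fails; when false, the node fails fast instead.
	FallbackToRemote bool
	// SnapshotDir is where per-application machine state is persisted.
	SnapshotDir string
}
```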
5) Prototype
- Implement a minimal prototype behind a feature flag that can:
- Load the machine for an application
- Process a small batch of inputs
- Store the machine
- Restart and resume from the stored state
- Produce identical outputs/hashes compared to the current mode (see the parity-check sketch below)
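The parity requirement in the last bullet boils down to a check like this, again using the MachineHandle placeholder from the first sketch; expectedHash would be recorded from a run of the current always-online mode on the same inputs and the same libcartesi version.

```go
import "fmt"

// checkParity fails if the offline-mode machine diverged from the hash that
// the current architecture produced for the same input batch.
func checkParity(m MachineHandle, expectedHash [32]byte) error {
	got, err := m.RootHash()
	if err != nil {
		return err
	}
	if got != expectedHash {
		return fmt.Errorf("offline mode diverged: got %x, want %x", got, expectedHash)
	}
	return nil
}
```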
Deliverables
- Design doc (in-repo) covering:
- runtime model (load/store cadence, storage layout, concurrency)
- snapshot strategy decision
- initial sync behavior
- failure modes & recovery strategy
- rollout plan + flags/config knobs
- Prototype implementation behind a feature flag:
- Offline mode that does not keep cartesi-jsonrpc-machine running
- Minimal integration into the input feed loop
- Validation plan + results:
- Functional parity (outputs/hashes) vs current mode on at least one representative dapp workload
- Basic performance measurements (e.g., startup time, per-input processing time, disk I/O)
- Follow-up issues created for:
- production hardening / machine storage
- metrics/alerts
- CI coverage additions
Acceptance criteria
- A written design is available and reviewed by relevant stakeholders.
- Prototype can run an application end-to-end with cartesi-jsonrpc-machine being shut down after each input and still:
- produce the same outputs and computation/machine hashes as the current architecture (within the same libcartesi version)
- survive a restart and resume from persisted machine state
- The design includes a clear decision on snapshot handling (keep vs change) and justifies it with tradeoffs.
Notes / Open questions to resolve in the design
- How do we ensure atomic stores and prevent corruption on crash/power loss?
- How do we bound disk growth across many apps and long-running nodes? Do we need a retention policy?
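For the atomicity question, one common pattern (a sketch, not a decision) is to write the snapshot next to its final location, flush it, and then rename it into place: on POSIX filesystems the rename is atomic, so a crash leaves either the previous snapshot or the new one, never a partial file. If libcartesi stores a machine as a directory rather than a single file, the same trick applies to a fully written temporary directory. Whether the new store API already provides this guarantee is exactly what step 1 should confirm.

```go
package snapshots

import (
	"os"
	"path/filepath"
)

// atomicWrite persists data at finalPath without ever exposing a partially
// written snapshot, even across a crash or power loss.
func atomicWrite(finalPath string, data []byte) error {
	dir := filepath.Dir(finalPath)
	tmp, err := os.CreateTemp(dir, ".snapshot-*") // same filesystem as finalPath
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup; harmless after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // make sure the bytes hit the disk first
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), finalPath) // atomic on POSIX filesystems
}
```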