Design: Use libcartesi fast load/store to keep remote-machine-server offline during normal operation #725

@vfusco

Description

Context / Problem

With the upcoming libcartesi version 0.20.0, machines can be loaded/stored very quickly. This unlocks a potentially simpler and cheaper operational model: keep the cartesi-jsonrpc-machine offline for all running applications and instead load/store machine state as part of the input processing loop.

Today, the node architecture assumes an always-on remote machine service for running applications. The current input feed / execution loop and snapshotting strategy were not designed with fast load/store in mind, so adopting this capability will require design work, experimentation, and likely some refactoring.

Primary goal: reduce cloud footprint and cost by removing the need for a continuously running cartesi-jsonrpc-machine while preserving correctness, determinism, and the same externally observable results.

Goals

  • Allow running applications with cartesi-jsonrpc-machine offline, using load/store of machine state on demand.
  • Maintain deterministic execution and consistent outputs/hashes.
  • Keep a safe path for incremental rollout and fallback to the current mode if needed.

Non-goals

  • Full rewrite of the execution engine.
  • Introducing new emulator features beyond what is necessary for load/store integration.
  • Changing external APIs unless required (prefer internal implementation changes).

Proposed work (design + prototype)

1) Understand and validate the new primitives

  • Identify the exact libcartesi APIs/semantics for fast load/store (file format, performance profile, atomicity guarantees, compatibility).
  • Define error handling and recovery expectations (partial writes, corrupted snapshots, incompatible versions).

2) Define the target runtime model

Design how the node processes inputs without a remote machine service:

  • Where machine state lives (local disk, volume, object storage, etc.).
  • When machine state is loaded (per input, per epoch, etc.).
  • When machine state is stored (after each input, after each epoch, etc.).
  • Concurrency rules (single-writer per app; read-only access patterns; parallelism across apps).
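To make the target model concrete, here is a minimal sketch of a per-batch processing loop, in illustrative Python. The names `load`, `store`, and `advance` are hypothetical stand-ins for whatever libcartesi fast load/store and advance primitives the design settles on; they are not real APIs.

```python
from dataclasses import dataclass

@dataclass
class ProcessedInput:
    index: int
    outputs_hash: str

def process_batch(load, store, advance, snapshot_path, inputs):
    """Load machine state, advance it over a batch of inputs, store it back.

    The machine only exists in memory between load() and store(); no
    always-on remote machine service is involved.
    """
    machine = load(snapshot_path)              # fast load from local storage
    results = []
    for inp in inputs:
        results.append(advance(machine, inp))  # deterministic state transition
    store(machine, snapshot_path)              # fast store; machine goes offline again
    return results
```

The cadence questions above (per input vs. per epoch) then reduce to how `inputs` is batched and how often `store` is called inside the loop.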

3) Revisit snapshot strategy

Decide whether to:

  • Keep the current snapshot approach unchanged and only alter how machines are executed, OR
  • Replace the current snapshot approach with a new design.

Consider special cases:

  • Initial syncing / catch-up: do we avoid storing and only persist at the end? Or do periodic stores to support restarts?
  • Shutdown / restart semantics: what must be persisted to resume safely? Just the machine hash?
  • Disk usage vs recovery time tradeoffs.
  • Usage in disputes: which machine states must remain available for dispute resolution?
  • Garbage collection of old images from accepted epochs?
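One way to frame the garbage-collection question as code, as a hypothetical retention policy rather than a decision:

```python
def prune_snapshots(snapshots, last_accepted_epoch, keep_latest=1):
    """Return snapshot paths that are safe to delete.

    snapshots: dict mapping epoch number -> list of snapshot paths.
    Policy sketch (hypothetical): snapshots from epochs at or before
    `last_accepted_epoch` are no longer needed for disputes, but the
    newest `keep_latest` accepted epochs are kept as restart points so
    a node never has to replay from genesis.
    """
    accepted = sorted(e for e in snapshots if e <= last_accepted_epoch)
    keep = set(accepted[-keep_latest:]) if keep_latest > 0 else set()
    deletable = []
    for epoch in accepted:
        if epoch not in keep:
            deletable.extend(snapshots[epoch])
    return deletable
```

The `keep_latest` knob is where the disk-usage vs. recovery-time tradeoff above would surface in configuration.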

4) Decide how to phase out the always-online cartesi-jsonrpc-machine mode

  • Optional flag/config to enable “offline machine mode” per environment.
  • Explicit fallback strategy: if load/store fails, can we fall back to the remote server, or should we fail fast?
  • Observability requirements: metrics/logs to compare performance and detect regressions.
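A sketch of what the flag and fallback decision could look like as configuration (names and defaults are illustrative, not proposed APIs):

```python
from dataclasses import dataclass
from enum import Enum

class MachineMode(Enum):
    REMOTE = "remote"    # current always-on cartesi-jsonrpc-machine
    OFFLINE = "offline"  # proposed load/store-on-demand mode

@dataclass
class MachineRuntimeConfig:
    mode: MachineMode = MachineMode.REMOTE  # default preserves current behavior
    fallback_to_remote: bool = True         # on load/store failure, revert per app
    fail_fast: bool = False                 # alternative: abort loudly instead

def mode_after_failure(cfg: MachineRuntimeConfig) -> MachineMode:
    """Resolve the fallback question for a failed fast load/store."""
    if cfg.mode is MachineMode.OFFLINE and cfg.fallback_to_remote:
        return MachineMode.REMOTE
    if cfg.fail_fast:
        raise RuntimeError("offline machine mode failed and fail_fast is set")
    return cfg.mode
```

Defaulting to the current remote mode keeps the rollout incremental: environments opt in per application, and a failure path is explicit rather than implied.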

5) Prototype

  • Implement a minimal prototype behind a feature flag that can:
    • Load machine for an app,
    • Process a small input batch,
    • Store machine,
    • Restart and resume from stored state,
    • Produce identical outputs/hashes compared to the current mode.
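The last prototype requirement, output parity between modes, can be checked with a simple digest comparison. This is a generic sketch (any hash over the ordered output bytes works), not the node's actual verification path:

```python
import hashlib

def outputs_digest(outputs):
    """Collapse an ordered list of output byte strings into one digest.

    Each output is length-prefixed so that, e.g., [b"ab"] and
    [b"a", b"b"] produce different digests.
    """
    h = hashlib.sha256()
    for out in outputs:
        h.update(len(out).to_bytes(8, "big"))
        h.update(out)
    return h.hexdigest()

def check_parity(outputs_current_mode, outputs_offline_mode):
    """Functional parity: both modes must yield byte-identical outputs in order."""
    return outputs_digest(outputs_current_mode) == outputs_digest(outputs_offline_mode)
```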

Deliverables

  • Design doc (in-repo) covering:
    • runtime model (load/store cadence, storage layout, concurrency)
    • snapshot strategy decision
    • initial sync behavior
    • failure modes & recovery strategy
    • rollout plan + flags/config knobs
  • Prototype implementation behind a feature flag:
    • Offline mode that does not keep cartesi-jsonrpc-machine running
    • Minimal integration into the input feed loop
  • Validation plan + results:
    • Functional parity (outputs/hashes) vs current mode on at least one representative dapp workload
    • Basic performance measurements (e.g., startup time, per-input processing time, disk I/O)
  • Follow-up issues created for:
    • production hardening / machine storage
    • metrics/alerts
    • CI coverage additions

Acceptance criteria

  • A written design is available and reviewed by relevant stakeholders.
  • Prototype can run an application end-to-end with cartesi-jsonrpc-machine shut down after each input and still:
    • produce the same outputs and computation/machine hashes as the current architecture (within the same libcartesi version)
    • survive a restart and resume from persisted machine state
  • The design includes a clear decision on snapshot handling (keep vs change) and justifies it with tradeoffs.

Notes / Open questions to resolve in the design

  • How do we ensure atomic stores and prevent corruption on crash/power loss?
  • How do we bound disk growth across many apps and long-running nodes? What retention policy do we need?
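For the atomicity question, the standard POSIX pattern (write to a temp file in the same directory, fsync, then rename over the destination) is one likely answer, sketched here under the assumption that snapshots live on a local filesystem:

```python
import os
import tempfile

def atomic_store(data: bytes, dest_path: str) -> None:
    """Crash-safe store sketch: write a temp file in the destination's
    directory, fsync it, then atomically rename it into place (POSIX
    rename is atomic within a filesystem). After a crash or power loss,
    dest_path holds either the old or the new snapshot, never a partial
    write."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".snapshot-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # data is on disk before the rename
        os.rename(tmp_path, dest_path)
        dir_fd = os.open(dest_dir, os.O_RDONLY)
        try:
            os.fsync(dir_fd)           # make the rename itself durable
        finally:
            os.close(dir_fd)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Whether this composes with the libcartesi store format (single file vs. directory of files) is one of the things step 1 needs to establish.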
