docs: Design for daemon formula persistence #3121

Draft
kriskowal wants to merge 1 commit into master from kriskowal-doc-formula-persistence

@kriskowal

This change introduces a design rationale for formula persistence.

changeset-bot bot commented Mar 8, 2026

⚠️ No Changeset found

Latest commit: aefc1b8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


kriskowal requested a review from erights on March 8, 2026 03:25

Comment on lines +28 to +29
loss of connectivity is treated as temporary. The program never observes a broken
reference; it simply waits. Waterken achieves this by combining Joe-E (a

Some problems with initial phrasing:

  • Promises still reject/break, so "never observes a broken reference" is wrong. "Never observes a reference broken due to partition" would be correct.
  • "it simply waits": what simply waits? The Waterken computation model is as non-blocking as E or Endo: communicating event loops. Programs never wait in the conventional sense of blocking.
Suggested change
- loss of connectivity is treated as temporary. The program never observes a broken
- reference; it simply waits. Waterken achieves this by combining Joe-E (a
+ loss of connectivity is treated as temporary. A message sent during loss of connectivity will still be delivered exactly once, after connectivity is reestablished. Waterken achieves this by combining Joe-E (a

Comment on lines +36 to +37
"partitioned," the system can be made deterministic over all communicating
programs.

There is still arrival order non-determinism, which is fundamental. If vatA and vatB both send a message to vatC at the "same time", they will arrive in some order, but the order is not determined by prior distributed semantic state.
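
The arrival-order point can be made concrete with a small sketch. Assuming each sender's messages stay in point-to-point FIFO order, the hypothetical helper `arrivalOrders` below (an illustration only, not part of Endo, E, or Waterken) enumerates every order in which vatC could observe the messages. Nothing in prior distributed state selects among them.

```typescript
// Hypothetical illustration: enumerate every arrival order at vatC that
// preserves each sender's own (point-to-point FIFO) order.
function arrivalOrders(fromA: string[], fromB: string[]): string[][] {
  if (fromA.length === 0) return [[...fromB]];
  if (fromB.length === 0) return [[...fromA]];
  const [headA, ...restA] = fromA;
  const [headB, ...restB] = fromB;
  return [
    // Either vatA's next message arrives first...
    ...arrivalOrders(restA, fromB).map(rest => [headA, ...rest]),
    // ...or vatB's next message does.
    ...arrivalOrders(fromA, restB).map(rest => [headB, ...rest]),
  ];
}

// vatA sends a1 then a2; vatB sends b1 at the "same time".
const orders = arrivalOrders(['a1', 'a2'], ['b1']);
```

With two messages from vatA and one from vatB there are three admissible orders, and `a1` precedes `a2` in each of them.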


**Disadvantages:**
- Sacrifices availability: a single partitioned dependency stalls all dependents
- Entangled distributed heaps require distributed garbage collection

Waterken did not do distributed gc at all. It simply had all distributed references leak forever, i.e., as long as those vats lived.

Comment on lines +47 to +50
- Differences in incentives among participants necessitate market-based
approaches to garbage collection (see "The market-sweep algorithms" in Drexler
and Miller, "Incentive Engineering for Computational Resource Management,"
1988)

Neither Waterken nor any other system has yet actually implemented market-based gc. Check back in another 40 years ;)

Comment on lines +51 to +52
- Upgrading programs in flight is difficult; the heap snapshot encodes
assumptions about program behavior that the upgrade may violate

Yeah, Tyler has a completely different perspective on upgrade (https://waterken.sourceforge.net/upgrade/). I don't know if anyone has actually tried this. It is interesting.

### Exposed partition and revival per-reference (E model)

At the other end, partition and revival are exposed at every individual reference. A
program must be written so that any dereference or message send to a potentially

Suggested change
- program must be written so that any dereference or message send to a potentially
+ program must be written so that any dereference or message send of a potentially

Comment on lines +58 to +59
remote reference might fail due to partition. Recovery requires reconstructing
the chain of computation that led to the broken reference, after partition heals.

> Recovery requires reconstructing the chain of computation

Recovery from sturdy refs need not, and often does not, reconstruct the original chain of computation that produced the sturdy ref.


**Advantages:**
- Simpler runtime implementation
- Does not sacrifice availability to the extent of the Waterken model

Tyler argues that E's "react to partition" translates in Waterken to "react to timeout". Tyler also points out what's different: E's partition atomically breaks all references that are multiplexed over the partitioned collection. But I think this detail is below the level you're trying to explain.

Comment on lines +64 to +65
- No obligation to retain "offline capabilities" (sturdyrefs and out-of-band
URL-like references) indefinitely. Both are necessarily weak references.

In some version of E, SturdyRefs had timeouts. Not clear what "obligation" means when nothing can force the counter-party to retain, but I certainly consider the host to be obligated to retain the reference until the timeout (or deadline) expires. IIRC, modern E dropped the timeouts. Does that mean the host is obligated to retain forever? I'm unsure how to answer the question.

This comparison with "weak references" is interesting. I never thought to describe it this way, but I think it is valid.

Comment on lines +66 to +68
Sturdyrefs are like out-of-band references but can participate in "distributed
confinement" without revealing cryptographic material to a confined program
with parts running on multiple peers.

A way to think of Waterken in E terms is that all Waterken references are sturdy. Just like in E, computation denied the ability to convert between opaque references and bits could be confined in Waterken. I'm not sure about this, but it is definitely consistent with the Waterken model.

Comment on lines +71 to +72
- More complex programming model: every dependent computation must handle
mid-process recovery

What does "mid-process" mean?

**Disadvantages:**
- More complex programming model: every dependent computation must handle
mid-process recovery
- Programs must reconstruct chains of computation defensively

Is this a distinct point?

Comment on lines +78 to +79
locator) that weakly retains a capability on a peer and can be redeemed for a
live reference. In the Waterken model, these must be persisted indefinitely, or

Waterken had no separate notion of a live reference. All references are sturdy. Because partitions are masked, you just send messages on these sturdy references and they are still delivered exactly once.

confinement" without revealing cryptographic material to a confined program
with parts running on multiple peers.

**Disadvantages:**

In E, a partition drops messages in flight, without E itself providing any bookkeeping to ascertain after the fact which messages were lost. Thus, in E, messages are delivered at most once. This creates way more application complexity than an exactly-once guarantee.
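
A minimal sketch of the difference, under stated assumptions: the link silently drops traffic while partitioned, the exactly-once side numbers its messages and keeps them until acknowledged, and the receiver ignores anything out of sequence. Every name here (`makeLink`, `makeDedupReceiver`, `makeRetryingSender`) is hypothetical and models neither E's nor Waterken's actual protocol.

```typescript
type Envelope = { seq: number; body: string };
type Arrive = (envelope: Envelope) => number;

// A link that silently drops traffic while partitioned.
const makeLink = (arrive: Arrive) => {
  let connected = true;
  return {
    setConnected: (value: boolean) => { connected = value; },
    // Returns the receiver's acknowledgement, or undefined if dropped.
    transmit: (envelope: Envelope): number | undefined =>
      connected ? arrive(envelope) : undefined,
  };
};

// Delivers each sequence number once, in order, and acknowledges with the
// count of messages delivered so far. Duplicates are ignored.
const makeDedupReceiver = () => {
  const delivered: string[] = [];
  const arrive: Arrive = ({ seq, body }) => {
    if (seq === delivered.length) delivered.push(body);
    return delivered.length;
  };
  return { delivered, arrive };
};

// Retains every unacknowledged message and retransmits it after a partition.
const makeRetryingSender = (link: ReturnType<typeof makeLink>) => {
  const log: Envelope[] = [];
  let acked = 0;
  const flush = () => {
    for (const envelope of log.slice(acked)) {
      const ack = link.transmit(envelope);
      if (ack === undefined) return; // still partitioned; retry on a later flush
      acked = ack;
    }
  };
  const send = (body: string) => {
    log.push({ seq: log.length, body });
    flush();
  };
  return { send, flush };
};

// At-most-once: the sender keeps no record, so the dropped message is gone.
const lossyInbox: string[] = [];
const lossyLink = makeLink(({ body }) => lossyInbox.push(body));
lossyLink.transmit({ seq: 0, body: 'deposit' });
lossyLink.setConnected(false);
lossyLink.transmit({ seq: 1, body: 'withdraw' });
lossyLink.setConnected(true);
lossyLink.transmit({ seq: 2, body: 'audit' });

// Exactly-once: the same partition only delays 'withdraw'.
const receiver = makeDedupReceiver();
const link = makeLink(receiver.arrive);
const sender = makeRetryingSender(link);
sender.send('deposit');
link.setConnected(false);
sender.send('withdraw');
link.setConnected(true);
sender.send('audit'); // retransmits 'withdraw' before sending 'audit'
link.transmit({ seq: 0, body: 'deposit' }); // a stray duplicate is ignored
```

In the at-most-once run the application is left to discover on its own that `withdraw` never arrived. In the exactly-once run the bookkeeping lives below the application, which is the complexity trade the comment describes.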

Both models share the notion of a URL or URL-like reference (sturdy reference,
locator) that weakly retains a capability on a peer and can be redeemed for a
live reference. In the Waterken model, these must be persisted indefinitely, or
all dependent distributed processes are silently corrupted (they continue waiting

Corruption is loss of integrity. The case you're describing is at most loss of availability.

locator) that weakly retains a capability on a peer and can be redeemed for a
live reference. In the Waterken model, these must be persisted indefinitely, or
all dependent distributed processes are silently corrupted (they continue waiting
for references that will never return). In E, sturdy references and locators

References don't return. They resolve/forward/settle/fulfill/break, or not.

capabilities), patterns, and message passing. Other systems built on the same
Endo components make different choices along the entangled dimensions:

- The choice of **CapTP** determines message ordering.

We've converged on the data model of OCapN, whose order is point-to-point FIFO, with the assumption that we will add an "after" operation eventually to express stronger orders. But in any case, the CapTP choices in Endo should not determine any message order other than point-to-point FIFO (+ "after"), yes?

Comment on lines +333 to +335
For example, the Agoric chain uses Endo components with orthogonal persistence to
ensure that all honest validators produce the same deterministic computation,
independent of whether they crashed and restarted or simply continued. Formula

Agoric's use of Endo follows the Waterken model

In both models, petname systems are expected to be built *on top of* these
reference mechanisms.

## Formula Persistence: Inverting the Relationship

Need to see a concrete example early with an actual concrete formula.

Comment on lines +173 to +175
The formula graph is acyclic across peers, but admits limited cycles among
certain groups of formulas that must present unique, unforgeable identifiers to
the network while being constructed as facets of a shared underlying capability.

Run-on sentence. Too hard for me to parse even though I'm already more oriented than your readers.

capabilities from their formulas, restoring the user's prior policy decisions
without requiring the user to re-confirm them.

### Revocation by withdrawal of construction

Is this selective revocation? If not, it fails to express the main use case for revocation: revoking the access I gave to Bob without revoking the access I gave to Carol.
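
For reference, the classic way to get selective revocation is the caretaker pattern: each grantee receives its own revocable forwarder to the shared target. The sketch below is illustrative only; `makeRevocable` and the `Counter` type are hypothetical and say nothing about how the proposed design implements revocation.

```typescript
// Hypothetical sketch of the caretaker pattern.
type Counter = { increment: () => number };

const makeRevocable = (target: Counter) => {
  let current: Counter | undefined = target;
  // The facet forwards to the target until it is revoked.
  const facet: Counter = {
    increment: () => {
      if (current === undefined) throw new Error('revoked');
      return current.increment();
    },
  };
  const revoke = () => { current = undefined; };
  return { facet, revoke };
};

let count = 0;
const counter: Counter = { increment: () => (count += 1) };

// One forwarder per grantee, so each grant can be withdrawn independently.
const forBob = makeRevocable(counter);
const forCarol = makeRevocable(counter);

forBob.facet.increment();
forBob.revoke();            // Bob's access ends here
forCarol.facet.increment(); // Carol's access is unaffected
```

If withdrawing a construction corresponds to dropping one such per-grantee forwarder, the revocation is selective. If it tears down the shared target, it is not.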

Comment on lines +292 to +294
## Why Not Orthogonal Persistence?

### The upgrade problem dissolves the distinction

I might put this as "whether or not persistence is orthogonal, upgrade cannot be orthogonal"


I do not suggest you go into any depth on the Agoric upgrade model. But just noting:

Agoric's current support for upgrade treats messages sent during an upgrade, under some circumstances, as "deliver at most once", creating all the application defensiveness burden Agoric tried to avoid by adopting the Waterken persistence and communication model. Instead, Agoric's next refinement of its upgrade model, based on durable promises, will preserve the "deliver exactly once" guarantee even across upgrades.

However, I do suggest that for formula persistence you do discuss what the message delivery guarantees are across distributed traumas.

Comment on lines +315 to +319
- **Determinism:** Reconstruction from formula may produce observably different
results from one incarnation to the next (e.g., if a dependency's behavior
has changed).
- **Ephemeral state:** Heap state that is not captured in a formula or in
manually persisted storage is lost across incarnations.

This is true for Agoric as well, since incarnations are upgrade boundaries, not crash recovery boundaries.

A formula is not a snapshot of state. It is a recipe for producing state. The
system persists *construction*, not *content*.

### Destruction by cohort, reconstruction on demand

I don't get cohorts yet. Probably will once there's an example with a concrete formula?
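
As a toy illustration of the quoted claim that a formula is "a recipe for producing state," the sketch below persists only formulas and rebuilds values lazily in each incarnation. The formula shapes (`literal`, `sum`) and the `makeIncarnation` helper are invented for this example. They are assumptions, not the daemon's actual formula types, and the sketch does not attempt to model cohorts.

```typescript
// Hypothetical formula shapes; the daemon's real formula types may differ.
type Formula =
  | { type: 'literal'; value: number }
  | { type: 'sum'; operands: string[] };

const makeIncarnation = (formulas: Record<string, Formula>) => {
  const values = new Map<string, number>(); // in-memory only, lost on restart
  let evaluations = 0;
  // Produce the value for an identifier, evaluating its formula on first use.
  const provide = (id: string): number => {
    const cached = values.get(id);
    if (cached !== undefined) return cached;
    const formula = formulas[id];
    if (formula === undefined) throw new Error(`no formula for ${id}`);
    evaluations += 1;
    const value =
      formula.type === 'literal'
        ? formula.value
        : formula.operands.map(provide).reduce((a, b) => a + b, 0);
    values.set(id, value);
    return value;
  };
  return { provide, evaluationCount: () => evaluations };
};

// Only this record is persisted: construction, not content.
const persisted: Record<string, Formula> = {
  two: { type: 'literal', value: 2 },
  three: { type: 'literal', value: 3 },
  five: { type: 'sum', operands: ['two', 'three'] },
};

const first = makeIncarnation(persisted);
const fiveBeforeRestart = first.provide('five');

// After a restart, a new incarnation rebuilds only what is asked for.
const second = makeIncarnation(persisted);
const twoAfterRestart = second.provide('two');
```

The second incarnation evaluates a single formula, since nothing has yet asked for `five`.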

| Persistence mechanism | Orthogonal | Manual + sturdy refs | Formula graph |
| Programming model | Simple (no partition code) | Defensive (per-reference) | Moderate (cohort-aware) |
| Restart cost | Snapshot restore | Reference re-establishment | Formula evaluation (lazy) |
| Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) |

For E, "natural" is way too generous. It is correct wrt recovering connectivity. But E's persistence was manual, so upgrade was no additional burden. It was just the same manual revival from what was manually stored. But this burden of manual storage and revival was awkward and "unnatural".

| Programming model | Simple (no partition code) | Defensive (per-reference) | Moderate (cohort-aware) |
| Restart cost | Snapshot restore | Reference re-establishment | Formula evaluation (lazy) |
| Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) |
| Retention: live references | Indefinite (partition masked) | Distributed acyclic GC | Scoped to cohort |

In E, live refs are also only retained up to partition.

| Restart cost | Snapshot restore | Reference re-establishment | Formula evaluation (lazy) |
| Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) |
| Retention: live references | Indefinite (partition masked) | Distributed acyclic GC | Scoped to cohort |
| Retention: durable references | Indefinite (web-keys) | Weak (sturdyrefs) | Local reference counting (formula graph) |

On second thought, for E, "Weak" is too weak. It is part of the application semantics, rather than the platform semantics, whether a host must honor a sturdy ref. It has to be a host/app choice. Otherwise it is impossible to write any correct distributed system. For example, an ERTP issuer must retain all purses for which there may be an outstanding sturdy ref.

"Weak" implies that it is not an app choice, but rather a platform choice based on strong reachability. Yes, one can build ERTP's obligations on top by arranging permanent strong reachability. But this goes against the connotations of "weak".

| Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) |
| Retention: live references | Indefinite (partition masked) | Distributed acyclic GC | Scoped to cohort |
| Retention: durable references | Indefinite (web-keys) | Weak (sturdyrefs) | Local reference counting (formula graph) |
| Availability | Sacrificed for consistency | Maintained per-reference | Maintained per-cohort |

For E, I don't understand what "Availability: Maintained per-reference" means.

By using timeouts where E uses partition, Waterken still provides extremely high availability.

I also don't yet understand "per cohort" because I don't yet understand "cohort". Will revisit once I do.

Comment on lines +463 to +464
market-based solutions. Formula Persistence sidesteps this obligation by keeping
the formula graph acyclic and locally reference-counted.

For formula persistence, when is a host required to preserve state on behalf of remote clients? This answer cannot be "none" or "never". If there is such an obligation, as there always must be in any significant distributed app, then you're not sidestepping the need for incentives. You're ignoring the incentives, just as every other distributed system has always done. Including, ironically, Agoric.
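
For readers unfamiliar with the mechanism named in the quoted lines, here is a minimal sketch of local reference counting over an acyclic dependency graph. The store API (`add`, `retain`, `release`) and the formula identifiers are hypothetical. The sketch shows only why acyclicity makes purely local counting sufficient, and does not address the incentive question raised above.

```typescript
// Hypothetical sketch: formula identifiers and this API are assumptions.
const makeFormulaStore = () => {
  const dependencies = new Map<string, string[]>();
  const refCounts = new Map<string, number>();

  const retain = (id: string) => {
    refCounts.set(id, (refCounts.get(id) ?? 0) + 1);
  };

  // Add a formula. Dependencies must already exist, which keeps the graph acyclic.
  const add = (id: string, deps: string[] = []) => {
    if (dependencies.has(id)) throw new Error(`duplicate formula ${id}`);
    for (const dep of deps) {
      if (!dependencies.has(dep)) throw new Error(`unknown dependency ${dep}`);
    }
    dependencies.set(id, deps);
    refCounts.set(id, 0);
    deps.forEach(dep => retain(dep));
  };

  // Drop one reference. At zero, collect the formula and release its dependencies.
  const release = (id: string) => {
    const remaining = (refCounts.get(id) ?? 0) - 1;
    if (remaining > 0) {
      refCounts.set(id, remaining);
      return;
    }
    const deps = dependencies.get(id) ?? [];
    dependencies.delete(id);
    refCounts.delete(id);
    deps.forEach(dep => release(dep));
  };

  return { add, retain, release, has: (id: string) => dependencies.has(id) };
};

const store = makeFormulaStore();
store.add('storage');
store.add('worker');
store.add('app', ['storage', 'worker']);
store.add('report', ['storage']);
store.retain('app');    // e.g. a name still refers to it
store.retain('report');
store.release('app');   // collects app and worker; storage survives via report
```

Because no formula can depend on one added later, a count reaching zero is always final, so no cycle detection or cross-peer tracing is needed for this part of the graph.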


### Petname systems

Marc Stiegler's petname systems describe a naming architecture with three

Grumble. AFAIK Marc everywhere acknowledges that I invented petname systems. But he's certainly done much much more to explain, implement, and promulgate them than I have.

My first and only petname paper: https://erights.org/elib/capability/pnml.html
It has the notion of embedded cards, with different terminology. It also has paths. Worth a look when you have time ;)

Having gotten all that off my chest, please keep "Marc Stiegler's petname systems". No change suggested.


The Acknowledgements section at the bottom of the PNML paper is a nice historical record of proper credit.
