docs: Design for daemon formula persistence by kriskowal · Pull Request #3121 · endojs/endo

kriskowal · 2026-03-08T03:25:35Z

This change introduces a design rationale for formula persistence.

changeset-bot · 2026-03-08T03:25:40Z

⚠️ No Changeset found

Latest commit: aefc1b8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


erights · 2026-03-08T04:25:40Z

designs/daemon-persistence.md

+loss of connectivity is treated as temporary. The program never observes a broken
+reference; it simply waits. Waterken achieves this by combining Joe-E (a


Some problems with initial phrasing:

promises still reject/break, so "never observes a broken reference" is wrong. "never observes a reference broken due to partition" would be correct.

"it simply waits" what simply waits? The Waterken computation model is as non-blocking as E or Endo -- communicating event loops. Programs never wait in the conventional sense of blocking.

Suggested change

-loss of connectivity is treated as temporary. The program never observes a broken
-reference; it simply waits. Waterken achieves this by combining Joe-E (a
+loss of connectivity is treated as temporary. A message sent during loss of connectivity will still be delivered exactly once, after connectivity is reestablished. Waterken achieves this by combining Joe-E (a
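To make the suggested exactly-once wording concrete, here is a minimal TypeScript sketch of one common way to get that guarantee: the sender retains each message until it is acknowledged, retransmits after connectivity returns, and the receiver discards duplicates by sequence number. This is an illustration under stated assumptions, not Waterken's implementation; `Sender`, `Receiver`, and the `Link` states are hypothetical, and a persistent system would also need the outbox and `nextExpected` to survive crashes, which this in-memory sketch omits.

```typescript
// Hypothetical sketch (not Waterken's implementation): exactly-once delivery
// over a link that can go down, by retaining messages until acknowledged
// and discarding duplicates on receipt.

type Envelope = { seq: number; body: string };
type Link = "down" | "ackLost" | "up";

class Receiver {
  private nextExpected = 0;
  readonly delivered: string[] = [];

  // Deliver in sequence order; ignore duplicates and out-of-order arrivals.
  // Returns the highest contiguous sequence number received (the ack).
  receive(env: Envelope): number {
    if (env.seq === this.nextExpected) {
      this.delivered.push(env.body);
      this.nextExpected += 1;
    }
    return this.nextExpected - 1;
  }
}

class Sender {
  private nextSeq = 0;
  private outbox: Envelope[] = [];

  get pending(): number {
    return this.outbox.length;
  }

  send(body: string): void {
    this.outbox.push({ seq: this.nextSeq, body });
    this.nextSeq += 1;
  }

  // Transmit everything not yet acknowledged. A message leaves the outbox
  // only after the receiver's ack actually arrives.
  flush(receiver: Receiver, link: Link): void {
    if (link === "down") return; // partition: nothing sent, nothing lost
    let acked = -1;
    for (const env of this.outbox) {
      acked = receiver.receive(env);
    }
    if (link === "ackLost") return; // delivered, but the sender cannot know
    this.outbox = this.outbox.filter((env) => env.seq > acked);
  }
}

const receiver = new Receiver();
const sender = new Sender();
sender.send("a");
sender.flush(receiver, "down"); // "a" is retained in the outbox
sender.send("b");
sender.flush(receiver, "ackLost"); // "a" and "b" delivered; acks lost
sender.flush(receiver, "up"); // retransmitted; receiver drops duplicates
// receiver.delivered is ["a", "b"] and sender.pending is 0
```

The program that called `send` never sees a failure caused by the partition; it only sees a delay before delivery.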

erights · 2026-03-08T04:28:01Z

designs/daemon-persistence.md

+"partitioned," the system can be made deterministic over all communicating
+programs.


There is still arrival order non-determinism, which is fundamental. If vatA and vatB both send a message to vatC at the "same time", they will arrive in some order, but the order is not determined by prior distributed semantic state.
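Per-sender FIFO order constrains, but does not fix, the arrival order at vatC. The following illustrative TypeScript sketch (`interleavings` is a hypothetical helper, not an Endo API) enumerates every arrival order consistent with each sender's own FIFO order; the receiver may observe any one of them.

```typescript
// All arrival orders at a receiver that preserve each sender's FIFO order.
function interleavings<T>(xs: readonly T[], ys: readonly T[]): T[][] {
  if (xs.length === 0) return [[...ys]];
  if (ys.length === 0) return [[...xs]];
  // Either the next arrival is the head of xs, or it is the head of ys.
  const takeX = interleavings(xs.slice(1), ys).map((rest) => [xs[0], ...rest]);
  const takeY = interleavings(xs, ys.slice(1)).map((rest) => [ys[0], ...rest]);
  return [...takeX, ...takeY];
}

const fromVatA = ["A1", "A2"]; // sent by vatA, in this order
const fromVatB = ["B1"]; // sent by vatB at "the same time"
const orders = interleavings(fromVatA, fromVatB).map((o) => o.join(" -> "));
// orders is ["A1 -> A2 -> B1", "A1 -> B1 -> A2", "B1 -> A1 -> A2"]
```

A1 always precedes A2, but nothing in the prior distributed state decides where B1 lands.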

erights · 2026-03-08T04:30:07Z

designs/daemon-persistence.md

+
+**Disadvantages:**
+- Sacrifices availability: a single partitioned dependency stalls all dependents
+- Entangled distributed heaps require distributed garbage collection


Waterken did not do distributed gc at all. It simply had all distributed references leak forever, i.e., as long as those vats lived.

erights · 2026-03-08T04:34:32Z

designs/daemon-persistence.md

+- Differences in incentives among participants necessitate market-based
+  approaches to garbage collection (see "The market-sweep algorithms" in Drexler
+  and Miller, "Incentive Engineering for Computational Resource Management,"
+  1988)


Neither Waterken nor any other system has yet actually implemented market-based gc. Check back in another 40 years ;)

erights · 2026-03-08T04:36:52Z

designs/daemon-persistence.md

+- Upgrading programs in flight is difficult; the heap snapshot encodes
+  assumptions about program behavior that the upgrade may violate


Yeah, Tyler has a completely different perspective on upgrade: https://waterken.sourceforge.net/upgrade/ . I don't know if anyone has actually tried this. It is interesting.

erights · 2026-03-08T04:37:43Z

designs/daemon-persistence.md

+### Exposed partition and revival per-reference (E model)
+
+At the other end, partition and revival are exposed at every individual reference. A
+program must be written so that any dereference or message send to a potentially


Suggested change

-program must be written so that any dereference or message send to a potentially
+program must be written so that any dereference or message send of a potentially

erights · 2026-03-08T04:39:59Z

designs/daemon-persistence.md

+remote reference might fail due to partition. Recovery requires reconstructing
+the chain of computation that led to the broken reference, after partition heals.


Recovery requires reconstructing the chain of computation

Recovery from sturdy refs need not, and often does not, reconstruct the original chain of computation that produced the sturdy ref.

erights · 2026-03-08T04:42:36Z

designs/daemon-persistence.md

+
+**Advantages:**
+- Simpler runtime implementation
+- Does not sacrifice availability to the extent of the Waterken model


Tyler argues that E's "react to partition" translates in Waterken to "react to timeout". Tyler also points out what's different: E's partition atomically breaks all references that are multiplexed over the partitioned collection. But I think this detail is below the level you're trying to explain.

erights · 2026-03-08T04:48:28Z

designs/daemon-persistence.md

+- No obligation to retain "offline capabilities" (sturdyrefs and out-of-band
+  URL-like references) indefinitely. Both are necessarily weak references.


In some version of E, SturdyRefs had timeouts. Not clear what "obligation" means when nothing can force the counter-party to retain, but I certainly consider the host to be obligated to retain the reference until the timeout (or deadline) expires. IIRC, modern E dropped the timeouts. Does that mean the host is obligated to retain forever? I'm unsure how to answer the question.

This comparison with "weak references" is interesting. I never thought to describe it this way, but I think it is valid.
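To illustrate the timeout variant recalled above, here is a hedged TypeScript sketch of a host-side sturdy-ref table in which the host honors each ref until a deadline and then lets the obligation lapse. It also shows that redemption depends only on the key, not on reconstructing the computation that originally produced the target. `SturdyRefTable`, `makeSturdyRef`, and `redeem` are hypothetical names; keying by a random "swiss number" follows E terminology, but this is not E's or Endo's implementation. Whether a host should instead retain forever remains a host/application choice.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical sturdy-ref table: the host commits to honoring each ref
// until its deadline, after which the obligation lapses.
class SturdyRefTable {
  private readonly entries = new Map<string, { target: object; deadline: number }>();
  private readonly now: () => number;

  constructor(now: () => number) {
    this.now = now;
  }

  // Register `target`; return an unguessable key honored for `ttlMs`.
  makeSturdyRef(target: object, ttlMs: number): string {
    const swissNumber = randomBytes(16).toString("hex");
    this.entries.set(swissNumber, { target, deadline: this.now() + ttlMs });
    return swissNumber;
  }

  // Redeem for a live reference. How the target was originally reached
  // does not matter; only the key and the deadline do.
  redeem(swissNumber: string): object {
    const entry = this.entries.get(swissNumber);
    if (entry === undefined || this.now() > entry.deadline) {
      this.entries.delete(swissNumber);
      throw new Error("sturdy ref unknown or expired");
    }
    return entry.target;
  }
}

let clock = 0; // injected fake clock, in milliseconds
const table = new SturdyRefTable(() => clock);
const purse = { balance: 10 };
const ref = table.makeSturdyRef(purse, 1000);
const live = table.redeem(ref); // the same object as `purse`
clock = 1001; // past the deadline: redeem(ref) now throws
```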

erights · 2026-03-08T04:51:13Z

designs/daemon-persistence.md

+  Sturdyrefs are like out-of-band references but can participate in "distributed
+  confinement" without revealing cryptographic material to a confined program
+  with parts running on multiple peers.


A way to think of Waterken in E terms is that all Waterken references are sturdy. Just like in E, computation denied the ability to convert between opaque references and bits could be confined in Waterken. I'm not sure about this, but it is definitely consistent with the Waterken model.

erights · 2026-03-08T04:51:36Z

designs/daemon-persistence.md

+- More complex programming model: every dependent computation must handle
+  mid-process recovery


What does "mid-process" mean?

erights · 2026-03-08T04:52:05Z

designs/daemon-persistence.md

+**Disadvantages:**
+- More complex programming model: every dependent computation must handle
+  mid-process recovery
+- Programs must reconstruct chains of computation defensively


Is this a distinct point?

erights · 2026-03-08T04:53:49Z

designs/daemon-persistence.md

+locator) that weakly retains a capability on a peer and can be redeemed for a
+live reference. In the Waterken model, these must be persisted indefinitely, or


Waterken had no separate notion of a live reference. All references are sturdy. Because partitions are masked, you just send messages on these sturdy references and they are still delivered exactly once.

erights · 2026-03-08T04:56:06Z

designs/daemon-persistence.md

+  confinement" without revealing cryptographic material to a confined program
+  with parts running on multiple peers.
+
+**Disadvantages:**


In E, a partition drops messages in flight, without E itself providing any bookkeeping to ascertain after the fact which messages were lost. Thus, in E, messages are delivered at most once. This creates way more application complexity than an exactly-once guarantee.
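The ambiguity behind at-most-once delivery can be shown with a small hypothetical model in TypeScript. `Connection`, `remoteProcesses`, and `partition` are illustrative names, not E's or Endo's CapTP API, and fulfillment of results is omitted for brevity. One message takes effect remotely before its reply arrives, the partition then breaks every pending result, and the sender is left unable to tell which messages were processed.

```typescript
// Hypothetical model of at-most-once delivery under partition.
// Not E's or Endo's CapTP; names are illustrative only.

type Result = "pending" | "broken"; // fulfillment omitted for brevity

class Connection {
  private queue: { id: number; msg: string }[] = [];
  private results = new Map<number, Result>();
  private nextId = 0;
  readonly remoteLog: string[] = []; // effects that actually happened remotely

  send(msg: string): number {
    const id = this.nextId;
    this.nextId += 1;
    this.queue.push({ id, msg });
    this.results.set(id, "pending");
    return id;
  }

  // The remote vat processes up to `n` queued messages. Their replies are
  // still in flight, so the sender's results remain pending.
  remoteProcesses(n: number): void {
    for (const { msg } of this.queue.splice(0, n)) {
      this.remoteLog.push(msg);
    }
  }

  // Partition: in-flight messages are dropped and every unresolved result
  // breaks. Nothing records which messages the remote side processed.
  partition(): void {
    this.queue = [];
    for (const [id, r] of this.results) {
      if (r === "pending") this.results.set(id, "broken");
    }
  }

  result(id: number): Result {
    return this.results.get(id) ?? "broken";
  }
}

const conn = new Connection();
const deposit = conn.send("deposit 10");
const withdraw = conn.send("withdraw 5");
conn.remoteProcesses(1); // "deposit 10" takes effect remotely
conn.partition(); // "withdraw 5" is dropped in flight
// Both results are "broken", yet conn.remoteLog is ["deposit 10"]:
// the sender cannot tell which messages took effect.
```

Recovering from this state is the application's job, which is the extra complexity that an exactly-once guarantee avoids.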

erights · 2026-03-08T04:57:24Z

designs/daemon-persistence.md

+Both models share the notion of a URL or URL-like reference (sturdy reference,
+locator) that weakly retains a capability on a peer and can be redeemed for a
+live reference. In the Waterken model, these must be persisted indefinitely, or
+all dependent distributed processes are silently corrupted (they continue waiting


Corruption is loss of integrity. The case you're describing is at most loss of availability.

erights · 2026-03-08T04:58:44Z

designs/daemon-persistence.md

+locator) that weakly retains a capability on a peer and can be redeemed for a
+live reference. In the Waterken model, these must be persisted indefinitely, or
+all dependent distributed processes are silently corrupted (they continue waiting
+for references that will never return). In E, sturdy references and locators


References don't return. They resolve/forward/settle/fulfill/break, or not.

erights · 2026-03-08T05:10:31Z

designs/daemon-persistence.md

+capabilities), patterns, and message passing. Other systems built on the same
+Endo components make different choices along the entangled dimensions:
+
+- The choice of **CapTP** determines message ordering.


We've converged on the data model of ocapn whose order is point-to-point fifo, with the assumption that we will add an "after" operation eventually to express stronger orders. But in any case, the CapTP choices in Endo should not determine any message order other than point-to-point fifo (+ "after"), yes?

erights · 2026-03-08T05:12:24Z

designs/daemon-persistence.md

+For example, the Agoric chain uses Endo components with orthogonal persistence to
+ensure that all honest validators produce the same deterministic computation,
+independent of whether they crashed and restarted or simply continued. Formula


Agoric's use of Endo follows the Waterken model

erights · 2026-03-08T05:16:56Z

designs/daemon-persistence.md

+In both models, petname systems are expected to be built *on top of* these
+reference mechanisms.
+
+## Formula Persistence: Inverting the Relationship


Need to see a concrete example early with an actual concrete formula.

erights · 2026-03-08T05:19:40Z

designs/daemon-persistence.md

+The formula graph is acyclic across peers, but admits limited cycles among
+certain groups of formulas that must present unique, unforgeable identifiers to
+the network while being constructed as facets of a shared underlying capability.


Run-on sentence. Too hard for me to parse even though I'm already more oriented than your readers.

erights · 2026-03-08T05:24:09Z

designs/daemon-persistence.md

+capabilities from their formulas, restoring the user's prior policy decisions
+without requiring the user to re-confirm them.
+
+### Revocation by withdrawal of construction


Is this selective revocation? If not, it fails to express the main use case for revocation: revoking the access I gave to Bob without revoking the access I gave to Carol.
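For reference, the conventional ocap expression of selective revocation is the caretaker pattern: each grantee receives its own revocable forwarder to the shared capability, so revoking one forwarder leaves the others intact. This TypeScript sketch uses JavaScript's built-in `Proxy.revocable` as a shallow caretaker (it does not wrap objects returned through it, as a membrane would). The `Counter` capability is hypothetical, and this is not the design's formula-based revocation mechanism.

```typescript
// Caretaker pattern: one revocable forwarder per grantee.
interface Counter {
  increment(): number;
}

// Hypothetical shared capability.
function makeCounter(): Counter {
  let count = 0;
  return {
    increment: () => {
      count += 1;
      return count;
    },
  };
}

const counter = makeCounter();
const forBob = Proxy.revocable(counter, {});
const forCarol = Proxy.revocable(counter, {});

forBob.proxy.increment(); // Bob's access works: 1
forCarol.proxy.increment(); // Carol's access works: 2

forBob.revoke(); // withdraw only the access given to Bob

let bobRevoked = false;
try {
  forBob.proxy.increment();
} catch (err) {
  bobRevoked = err instanceof TypeError; // revoked proxies throw TypeError
}
const carolCount = forCarol.proxy.increment(); // Carol is unaffected: 3
```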

erights · 2026-03-08T05:25:59Z

designs/daemon-persistence.md

+## Why Not Orthogonal Persistence?
+
+### The upgrade problem dissolves the distinction


I might put this as "whether or not persistence is orthogonal, upgrade cannot be orthogonal"

I do not suggest you go into any depth on the Agoric upgrade model. But just noting:

Agoric's current support for upgrade treats messages sent during an upgrade, under some circumstances, as "deliver at most once", creating all the application defensiveness burden Agoric tried to avoid by adopting the Waterken persistence and communication model. Instead, Agoric's next refinement of its upgrade model, based on durable promises, will preserve the "deliver exactly once" guarantee even across upgrades.

However, I do suggest that for formula persistence you do discuss what the message delivery guarantees are across distributed traumas.

erights · 2026-03-08T05:34:36Z

designs/daemon-persistence.md

+- **Determinism:** Reconstruction from formula may produce observably different
+  results from one incarnation to the next (e.g., if a dependency's behavior
+  has changed).
+- **Ephemeral state:** Heap state that is not captured in a formula or in
+  manually persisted storage is lost across incarnations.


This is true for Agoric as well, since incarnations are upgrade boundaries, not crash recovery boundaries.

erights · 2026-03-08T05:38:05Z

designs/daemon-persistence.md

+A formula is not a snapshot of state. It is a recipe for producing state. The
+system persists *construction*, not *content*.
+
+### Destruction by cohort, reconstruction on demand


I don't get cohorts yet. Probably will once there's an example with a concrete formula?

erights · 2026-03-08T05:42:35Z

designs/daemon-persistence.md

+| Persistence mechanism | Orthogonal | Manual + sturdy refs | Formula graph |
+| Programming model | Simple (no partition code) | Defensive (per-reference) | Moderate (cohort-aware) |
+| Restart cost | Snapshot restore | Reference re-establishment | Formula evaluation (lazy) |
+| Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) |


For E, "natural" is way too generous. It is correct wrt recovering connectivity. But E's persistence was manual, so upgrade was no additional burden. It was just the same manual revival from what was manually stored. But this burden of manual storage and revival was awkward and "unnatural".

erights · 2026-03-08T05:44:07Z

designs/daemon-persistence.md

+| Programming model | Simple (no partition code) | Defensive (per-reference) | Moderate (cohort-aware) |
+| Restart cost | Snapshot restore | Reference re-establishment | Formula evaluation (lazy) |
+| Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) |
+| Retention: live references | Indefinite (partition masked) | Distributed acyclic GC | Scoped to cohort |


In E, live refs also only retained up to partition.

erights · 2026-03-08T05:51:20Z

designs/daemon-persistence.md

+| Restart cost | Snapshot restore | Reference re-establishment | Formula evaluation (lazy) |
+| Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) |
+| Retention: live references | Indefinite (partition masked) | Distributed acyclic GC | Scoped to cohort |
+| Retention: durable references | Indefinite (web-keys) | Weak (sturdyrefs) | Local reference counting (formula graph) |


On second thought, for E, "Weak" is too weak. It is part of the application semantics, rather than the platform semantics, whether a host must honor a sturdy ref. It has to be a host/app choice. Otherwise it is impossible to write any correct distributed system. For example, an ERTP issuer must retain all purses for which there may be an outstanding sturdy ref.

"Weak" implies that it is not an app choice, but rather a platform choice based on strong reachability. Yes, one can build ERTP's obligations on top by arranging permanent strong reachability. But this goes against the connotations of "weak".

erights · 2026-03-08T05:54:38Z

designs/daemon-persistence.md

+| Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) |
+| Retention: live references | Indefinite (partition masked) | Distributed acyclic GC | Scoped to cohort |
+| Retention: durable references | Indefinite (web-keys) | Weak (sturdyrefs) | Local reference counting (formula graph) |
+| Availability | Sacrificed for consistency | Maintained per-reference | Maintained per-cohort |


For E, I don't understand "availability" "Maintained per-reference".

By using timeouts where E uses partition, Waterken still provides extremely high availability.

I also don't yet understand "per cohort" because I don't yet understand "cohort". Will revisit once I do.

erights · 2026-03-08T05:59:52Z

designs/daemon-persistence.md

+market-based solutions. Formula Persistence sidesteps this obligation by keeping
+the formula graph acyclic and locally reference-counted.


For formula persistence, when is a host required to preserve state on behalf of remote clients? This answer cannot be "none" or "never". If there is such an obligation, as there always must be in any significant distributed app, then you're not sidestepping the need for incentives. You're ignoring the incentives, just as every other distributed system has always done. Including, ironically, Agoric.

erights · 2026-03-08T06:04:52Z

designs/daemon-persistence.md

+
+### Petname systems
+
+Marc Stiegler's petname systems describe a naming architecture with three


Grumble. AFAIK Marc everywhere acknowledges that I invented petname systems. But he's certainly done much much more to explain, implement, and promulgate them than I have.

My first and only petname paper: https://erights.org/elib/capability/pnml.html
Has the notion of embedded cards with different terminology. Has paths. Worth a look when you have time ;)

Having gotten all that off my chest, please keep "Marc Stiegler's petname systems". No change suggested.

The Acknowledgements at the bottom of the PNML paper is a nice historical record of proper credit.

docs: Design for daemon formula persistence

aefc1b8

kriskowal requested a review from erights March 8, 2026 03:25

erights reviewed Mar 8, 2026
