docs: Design for daemon formula persistence #3121
|
| loss of connectivity is treated as temporary. The program never observes a broken | ||
| reference; it simply waits. Waterken achieves this by combining Joe-E (a |
Some problems with initial phrasing:
- promises still reject/break, so "never observes a broken reference" is wrong. "never observes a reference broken due to partition" would be correct.
- "it simply waits": what simply waits? The Waterken computation model is as non-blocking as E or Endo -- communicating event loops. Programs never wait in the conventional sense of blocking.
| loss of connectivity is treated as temporary. The program never observes a broken | |
| reference; it simply waits. Waterken achieves this by combining Joe-E (a | |
| loss of connectivity is treated as temporary. A message sent during loss of connectivity will still be delivered exactly once, after connectivity is reestablished. Waterken achieves this by combining Joe-E (a |
| "partitioned," the system can be made deterministic over all communicating | ||
| programs. |
There is still arrival order non-determinism, which is fundamental. If vatA and vatB both send a message to vatC at the "same time", they will arrive in some order, but the order is not determined by prior distributed semantic state.
|
| **Disadvantages:** | ||
| - Sacrifices availability: a single partitioned dependency stalls all dependents | ||
| - Entangled distributed heaps require distributed garbage collection |
Waterken did not do distributed gc at all. It simply had all distributed references leak forever, i.e., as long as those vats lived.
| - Differences in incentives among participants necessitate market-based | ||
| approaches to garbage collection (see "The market-sweep algorithms" in Drexler | ||
| and Miller, "Incentive Engineering for Computational Resource Management," | ||
| 1988) |
Neither Waterken nor any other system has yet actually implemented market-based gc. Check back in another 40 years ;)
| - Upgrading programs in flight is difficult; the heap snapshot encodes | ||
| assumptions about program behavior that the upgrade may violate |
Yeah, Tyler has a completely different perspective on upgrade: https://waterken.sourceforge.net/upgrade/ . I don't know if anyone has actually tried this. It is interesting.
| ### Exposed partition and revival per-reference (E model) | ||
|
| At the other end, partition and revival are exposed at every individual reference. A | ||
| program must be written so that any dereference or message send to a potentially |
| program must be written so that any dereference or message send to a potentially | |
| program must be written so that any dereference or message send of a potentially |
| remote reference might fail due to partition. Recovery requires reconstructing | ||
| the chain of computation that led to the broken reference, after partition heals. |
> Recovery requires reconstructing the chain of computation
Recovery from sturdy refs need not, and often does not, reconstruct the original chain of computation that produced the sturdy ref.
|
| **Advantages:** | ||
| - Simpler runtime implementation | ||
| - Does not sacrifice availability to the extent of the Waterken model |
Tyler argues that E's "react to partition" translates in Waterken to "react to timeout". Tyler also points out what's different: E's partition atomically breaks all references that are multiplexed over the partitioned collection. But I think this detail is below the level you're trying to explain.
| - No obligation to retain "offline capabilities" (sturdyrefs and out-of-band | ||
| URL-like references) indefinitely. Both are necessarily weak references. |
In some version of E, SturdyRefs had timeouts. Not clear what "obligation" means when nothing can force the counter-party to retain, but I certainly consider the host to be obligated to retain the reference until the timeout (or deadline) expires. IIRC, modern E dropped the timeouts. Does that mean the host is obligated to retain forever? I'm unsure how to answer the question.
This comparison with "weak references" is interesting. I never thought to describe it this way, but I think it is valid.
| Sturdyrefs are like out-of-band references but can participate in "distributed | ||
| confinement" without revealing cryptographic material to a confined program | ||
| with parts running on multiple peers. |
A way to think of Waterken in E terms is that all Waterken references are sturdy. Just like in E, computation denied the ability to convert between opaque references and bits could be confined in Waterken. I'm not sure about this, but it is definitely consistent with the Waterken model.
| - More complex programming model: every dependent computation must handle | ||
| mid-process recovery |
What does "mid-process" mean?
| **Disadvantages:** | ||
| - More complex programming model: every dependent computation must handle | ||
| mid-process recovery | ||
| - Programs must reconstruct chains of computation defensively |
| locator) that weakly retains a capability on a peer and can be redeemed for a | ||
| live reference. In the Waterken model, these must be persisted indefinitely, or |
Waterken had no separate notion of a live reference. All references are sturdy. Because partitions are masked, you just send messages on these sturdy references and they are still delivered exactly once.
| confinement" without revealing cryptographic material to a confined program | ||
| with parts running on multiple peers. | ||
|
| **Disadvantages:** |
In E, a partition drops messages in flight, without E itself providing any bookkeeping to ascertain after the fact which messages were lost. Thus, in E, messages are delivered at most once. This creates way more application complexity than an exactly-once guarantee.
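To make the difference between the two guarantees concrete, here is a minimal sketch. It is not E's or Waterken's actual protocol: `Link`, `DedupingReceiver`, `RetryingSender`, and `deliveryDemo` are hypothetical names, and the in-memory retry log stands in for the durable storage a real exactly-once system would need on both sides.

```typescript
// Hypothetical sketch; none of these names come from E, Waterken, or Endo.
type Message = { seq: number; body: string };
// A receiver returns a cumulative ack: every seq below it has been delivered.
type Receive = (msg: Message) => number;

// A link that loses traffic while partitioned, and can also lose a single ack.
class Link {
  up = true;
  loseNextAck = false;
  constructor(private readonly receive: Receive) {}
  // Returns the ack, or undefined when the message or its ack was lost.
  transmit(msg: Message): number | undefined {
    if (!this.up) return undefined;
    const ack = this.receive(msg);
    if (this.loseNextAck) {
      this.loseNextAck = false;
      return undefined;
    }
    return ack;
  }
}

// Delivers each sequence number once, in order; retransmissions are ignored.
class DedupingReceiver {
  private nextExpected = 0;
  readonly delivered: string[] = [];
  receive(msg: Message): number {
    if (msg.seq === this.nextExpected) {
      this.delivered.push(msg.body);
      this.nextExpected += 1;
    }
    return this.nextExpected;
  }
}

// Keeps every unacknowledged message and retransmits in order until acked.
class RetryingSender {
  private nextSeq = 0;
  private unacked: Message[] = [];
  constructor(private readonly link: Link) {}
  send(body: string): void {
    this.unacked.push({ seq: this.nextSeq, body });
    this.nextSeq += 1;
    this.flush();
  }
  // Call again after connectivity returns.
  flush(): void {
    for (;;) {
      const head = this.unacked[0];
      if (head === undefined) return;
      const ack = this.link.transmit(head);
      if (ack === undefined) return; // lost: keep the log and retry later
      this.unacked = this.unacked.filter((m) => m.seq >= ack);
    }
  }
}

function deliveryDemo(): { atMostOnce: string[]; exactlyOnce: string[] } {
  // At most once: fire and forget, with no record of what was lost.
  const inbox: string[] = [];
  const lossy = new Link((msg) => {
    inbox.push(msg.body);
    return msg.seq + 1;
  });
  lossy.transmit({ seq: 0, body: "a" });
  lossy.up = false;
  lossy.transmit({ seq: 1, body: "b" }); // dropped by the partition
  lossy.up = true;
  lossy.transmit({ seq: 2, body: "c" });

  // Exactly once: retry log on the sender, deduplication on the receiver.
  const receiver = new DedupingReceiver();
  const link = new Link((msg) => receiver.receive(msg));
  const sender = new RetryingSender(link);
  sender.send("a");
  link.up = false;
  sender.send("b"); // stays in the retry log
  link.up = true;
  link.loseNextAck = true;
  sender.flush(); // "b" is delivered, but its ack is lost
  sender.send("c"); // retransmits "b" (ignored as a duplicate), then sends "c"
  return { atMostOnce: inbox, exactlyOnce: receiver.delivered };
}
```

In the at-most-once half, the application itself would have to notice that "b" never arrived; in the exactly-once half that bookkeeping lives in the transport.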
| Both models share the notion of a URL or URL-like reference (sturdy reference, | ||
| locator) that weakly retains a capability on a peer and can be redeemed for a | ||
| live reference. In the Waterken model, these must be persisted indefinitely, or | ||
| all dependent distributed processes are silently corrupted (they continue waiting |
Corruption is loss of integrity. The case you're describing is at most loss of availability.
| locator) that weakly retains a capability on a peer and can be redeemed for a | ||
| live reference. In the Waterken model, these must be persisted indefinitely, or | ||
| all dependent distributed processes are silently corrupted (they continue waiting | ||
| for references that will never return). In E, sturdy references and locators |
References don't return. They resolve/forward/settle/fulfill/break, or not.
| capabilities), patterns, and message passing. Other systems built on the same | ||
| Endo components make different choices along the entangled dimensions: | ||
|
| - The choice of **CapTP** determines message ordering. |
We've converged on the data model of ocapn whose order is point-to-point fifo, with the assumption that we will add an "after" operation eventually to express stronger orders. But in any case, the CapTP choices in Endo should not determine any message order other than point-to-point fifo (+ "after"), yes?
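As a rough illustration of what point-to-point FIFO does and does not pin down, the sketch below enumerates every arrival order at a receiving vat given the messages two senders emit. `arrivalOrders` is a hypothetical helper, not part of any CapTP or OCapN implementation, and it ignores any future "after" operation.

```typescript
// Hypothetical sketch: enumerate every arrival order that respects
// point-to-point FIFO from each of two senders to one receiver.
function arrivalOrders(fromA: string[], fromB: string[]): string[][] {
  if (fromA.length === 0) return [fromB.slice()];
  if (fromB.length === 0) return [fromA.slice()];
  const headA = fromA.slice(0, 1);
  const headB = fromB.slice(0, 1);
  return [
    // The next message to arrive is the oldest pending one from A...
    ...arrivalOrders(fromA.slice(1), fromB).map((tail) => [...headA, ...tail]),
    // ...or the oldest pending one from B.
    ...arrivalOrders(fromA, fromB.slice(1)).map((tail) => [...headB, ...tail]),
  ];
}

// vatA sends a1 then a2; vatB sends b1; both target vatC.
const ordersAtVatC = arrivalOrders(["a1", "a2"], ["b1"]);
```

Every order in `ordersAtVatC` keeps `a1` ahead of `a2`, but where `b1` lands among them is not determined by anything the senders did.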
| For example, the Agoric chain uses Endo components with orthogonal persistence to | ||
| ensure that all honest validators produce the same deterministic computation, | ||
| independent of whether they crashed and restarted or simply continued. Formula |
Agoric's use of Endo follows the Waterken model
| In both models, petname systems are expected to be built *on top of* these | ||
| reference mechanisms. | ||
|
| ## Formula Persistence: Inverting the Relationship |
Need to see a concrete example early with an actual concrete formula.
| The formula graph is acyclic across peers, but admits limited cycles among | ||
| certain groups of formulas that must present unique, unforgeable identifiers to | ||
| the network while being constructed as facets of a shared underlying capability. |
Run-on sentence. Too hard for me to parse even though I'm already more oriented than your readers.
| capabilities from their formulas, restoring the user's prior policy decisions | ||
| without requiring the user to re-confirm them. | ||
|
| ### Revocation by withdrawal of construction |
Is this selective revocation? If not, it fails to express the main use case for revocation: revoking the access I gave to Bob without revoking the access I gave to Carol.
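For reference, selective revocation in the sense asked about here is what the classic caretaker pattern provides: each grantee gets their own revocable forwarder to the same underlying object. The sketch below is generic and makes no claim about how formula persistence implements revocation; `Counter`, `makeCounter`, `makeCaretaker`, and `revocationDemo` are hypothetical names.

```typescript
// Hypothetical sketch of the caretaker pattern; not Endo's API.
interface Counter {
  increment(): number;
}

function makeCounter(): Counter {
  let count = 0;
  return {
    increment: () => {
      count += 1;
      return count;
    },
  };
}

// Wrap the target in a forwarder that can be switched off independently
// of any other forwarder to the same target.
function makeCaretaker(target: Counter): { proxy: Counter; revoke: () => void } {
  let current: Counter | undefined = target;
  const proxy: Counter = {
    increment: () => {
      if (current === undefined) throw new Error("revoked");
      return current.increment();
    },
  };
  return {
    proxy,
    revoke: () => {
      current = undefined;
    },
  };
}

function revocationDemo(): { bobBlocked: boolean; carolResult: number } {
  const counter = makeCounter();
  const bob = makeCaretaker(counter);
  const carol = makeCaretaker(counter);
  bob.proxy.increment(); // Bob's access works before revocation
  bob.revoke(); // revoke Bob only
  let bobBlocked = false;
  try {
    bob.proxy.increment();
  } catch {
    bobBlocked = true;
  }
  const carolResult = carol.proxy.increment(); // Carol is unaffected
  return { bobBlocked, carolResult };
}
```

The property to check against "revocation by withdrawal of construction" is the last line: revoking Bob leaves Carol's access to the same capability intact.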
| ## Why Not Orthogonal Persistence? | ||
|
| ### The upgrade problem dissolves the distinction |
I might put this as "whether or not persistence is orthogonal, upgrade cannot be orthogonal"
I do not suggest you go into any depth on the Agoric upgrade model. But just noting:
Agoric's current support for upgrade treats messages sent during an upgrade, under some circumstances, as "deliver at most once", creating all the application defensiveness burden Agoric tried to avoid by adopting the Waterken persistence and communication model. Instead, Agoric's next refinement of its upgrade model, based on durable promises, will preserve the "deliver exactly once" guarantee even across upgrades.
However, I do suggest that for formula persistence you do discuss what the message delivery guarantees are across distributed traumas.
| - **Determinism:** Reconstruction from formula may produce observably different | ||
| results from one incarnation to the next (e.g., if a dependency's behavior | ||
| has changed). | ||
| - **Ephemeral state:** Heap state that is not captured in a formula or in | ||
| manually persisted storage is lost across incarnations. |
This is true for Agoric as well, since incarnations are upgrade boundaries, not crash recovery boundaries.
| A formula is not a snapshot of state. It is a recipe for producing state. The | ||
| system persists *construction*, not *content*. | ||
|
| ### Destruction by cohort, reconstruction on demand |
I don't get cohorts yet. Probably will once there's an example with a concrete formula?
| | Persistence mechanism | Orthogonal | Manual + sturdy refs | Formula graph | | ||
| | Programming model | Simple (no partition code) | Defensive (per-reference) | Moderate (cohort-aware) | | ||
| | Restart cost | Snapshot restore | Reference re-establishment | Formula evaluation (lazy) | | ||
| | Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) | |
For E, "natural" is way too generous. It is correct wrt recovering connectivity. But E's persistence was manual, so upgrade was no additional burden. It was just the same manual revival from what was manually stored. But this burden of manual storage and revival was awkward and "unnatural".
| | Programming model | Simple (no partition code) | Defensive (per-reference) | Moderate (cohort-aware) | | ||
| | Restart cost | Snapshot restore | Reference re-establishment | Formula evaluation (lazy) | | ||
| | Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) | | ||
| | Retention: live references | Indefinite (partition masked) | Distributed acyclic GC | Scoped to cohort | |
In E, live refs also only retained up to partition.
| | Restart cost | Snapshot restore | Reference re-establishment | Formula evaluation (lazy) | | ||
| | Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) | | ||
| | Retention: live references | Indefinite (partition masked) | Distributed acyclic GC | Scoped to cohort | | ||
| | Retention: durable references | Indefinite (web-keys) | Weak (sturdyrefs) | Local reference counting (formula graph) | |
On second thought, for E, "Weak" is too weak. It is part of the application semantics, rather than the platform semantics, whether a host must honor a sturdy ref. It has to be a host/app choice. Otherwise it is impossible to write any correct distributed system. For example, an ERTP issuer must retain all purses for which there may be an outstanding sturdy ref.
"Weak" implies that it is not an app choice, but rather a platform choice based on strong reachability. Yes, one can build ERTP's obligations on top by arranging permanent strong reachability. But this goes against the connotations of "weak".
| | Upgrade story | Difficult (heap assumptions) | Natural (references re-resolve) | Natural (formulas re-evaluate) | | ||
| | Retention: live references | Indefinite (partition masked) | Distributed acyclic GC | Scoped to cohort | | ||
| | Retention: durable references | Indefinite (web-keys) | Weak (sturdyrefs) | Local reference counting (formula graph) | | ||
| | Availability | Sacrificed for consistency | Maintained per-reference | Maintained per-cohort | |
For E, I don't understand "availability" "Maintained per-reference".
By using timeouts where E uses partition, Waterken still provides extremely high availability.
I also don't yet understand "per cohort" because I don't yet understand "cohort". Will revisit once I do.
| market-based solutions. Formula Persistence sidesteps this obligation by keeping | ||
| the formula graph acyclic and locally reference-counted. |
For formula persistence, when is a host required to preserve state on behalf of remote clients? This answer cannot be "none" or "never". If there is such an obligation, as there always must be in any significant distributed app, then you're not sidestepping the need for incentives. You're ignoring the incentives, just as every other distributed system has always done. Including, ironically, Agoric.
|
| ### Petname systems | ||
|
| Marc Stiegler's petname systems describe a naming architecture with three |
Grumble. AFAIK Marc everywhere acknowledges that I invented petname systems. But he's certainly done much much more to explain, implement, and promulgate them than I have.
My first and only petname paper: https://erights.org/elib/capability/pnml.html
Has the notion of embedded cards with different terminology. Has paths. Worth a look when you have time ;)
Having gotten all that off my chest, please keep "Marc Stiegler's petname systems". No change suggested.
The Acknowledgements at the bottom of the PNML paper is a nice historical record of proper credit.
This change introduces a design rationale for formula persistence.