TOSQUASH consensus: refine sketch implementation plan for Ouroboros Genesis

nfrisby · nfrisby · commit ec331075572d · 2021-08-09T15:41:05.000-07:00
diff --git a/ouroboros-consensus/docs/GenesisDecomposition.md b/ouroboros-consensus/docs/GenesisDecomposition.md
@@ -187,3 +187,139 @@ I've also updated this file after a discussion with Duncan on 2021 Apr 13.
 
       * TODO testing etc - we'd _very much really_ like to use the ThreadNet
         rewrite for this
+
+-----
+
+Updated on 2021 August 9, after much additional thought and broader
+reconsiderations, kicked off by Javier Sagredo's observation of a stalling
+attack vector in the original sketch above.
+
+This new sketch updates much-but-not-all of the original sketch above.
+
+- Execution begins in the _Syncing_ state.
+
+- While we are Syncing:
+
+    - If our valency falls below some threshold, then BlockFetch stops sending
+      new fetch requests until sufficient valency is recovered.
+
+    - BlockFetch can only download blocks from the headers that the density
+      rule approves.
+
+        - The density rule is: compare header chains based on the number of
+          headers in the relevant Genesis window (the 3k/f slots after the
+          intersection), though if the headers do not span the Genesis window
+          and the peer claims to have more headers we must wait for them
+          (because they might also be in the window).
+
+        - The Ouroboros Genesis paper proves -- excepting only disastrous
+          intervals -- that the density rule will always strictly prefer the
+          honest chain over any possible alternative.
+
+    - Therefore, we require that each peer's highwater blockno is increasing
+      "fast enough on average" until we're at their tip, with the only
+      exceptional circumstance being when their latest header is beyond our
+      forecast range (since we don't even request a next header while that is
+      true).
+
+        - TODO Do we actually need that exception? Under what circumstances
+          would it be relevant, during Syncing?
+
+        - TODO I'm anticipating a token bucket for enforcing "fast enough on
+          average", but there remain plenty of details and thresholds to
+          consider.
+
+        - A possible refinement: if they can promise to send a specific k+1st
+          block (which the honest nodes would always do, up to their immutable
+          tip), then they're allowed to be somewhat slower, since we'll
+          disconnect from them if either they don't deliver that block or if
+          the eventual densest chain does not include that block.
+
+        - A possible refinement: each peer can offer _jump points_ that are
+          usefully ahead of their latest header. If some other peer has already
+          sent the jump point's header, then we can advance the slower peer's
+          ChainSync state accordingly. This can help a relatively slow
+          redundant peer remain connected.
+
+- Transition from Syncing to _CaughtUp_ whenever all of:
+
+    - No peer has sent a header binary-preferable to my selection.
+
+    - No peer has sent >k headers from an intersection with my selection.
+
+    - We see every peer to its tip.
+
+        - TODO To what extent can the adversary abuse this to prevent our
+          transition? Even supposing validated, uninterruptible ChainSync
+          switches?
+
+        - TODO Perhaps we don't need it, since we assume we'll have at least
+          one honest peer. Their stream of headers should race ahead of the
+          corresponding stream of blocks until we're CaughtUp, and so that'll
+          hold back at least one of the other conjuncts. On the other hand, it
+          seems fine if we do need this, because of the timeout discussed
+          above.
+
+- While we are CaughtUp:
+
+    - BlockFetch is free to download the blocks from any of our peers' headers.
+      It has two primary requirements, which are in tension.
+
+        - The ultimate goal of BlockFetch is to get the best blocks ASAP.
+          However, an imperfect best effort is tolerable, up to a point; we
+          consider the only consequences of the best effort's inefficiency to
+          be additional chain propagation delay.
+
+            - The Ouroboros protocol only considers chain length. Tiebreakers
+              are out of scope, so "best block" in the requirement above only
+              means greatest blockno. (BlockFetch is free to also consider
+              tiebreakers; the protocol does not care.)
+
+            - Note that the adversary claiming to have additional headers but
+              refusing to send them has no effect on BlockFetch while we are
+              CaughtUp. Only received headers matter. The worst the adversary
+              could do by withholding headers is intentionally time out in order
+              to decrement our valency (which we might choose to require stays
+              above some value, see below) -- but presumably they can't ensure
+              we reconnect to them, so they've revealed their nature, losing
+              access to us, in order to possibly create a short delay.
+
+        - BlockFetch should avoid unnecessary downloads (the same block more
+          than once or a block we'll never select).
+
+            - When CaughtUp, we have a high-priority design goal that
+              worst-case resource utilization is approximately the same as
+              average-case. If not, even well-meaning node operators will
+              eventually prune their node's allocated resources, thereby
+              creating a DoS attack vector.
+
+            - This is why we can't simply download "all blocks ASAP" or even the
+              same block from all peers currently offering it. Recall that the
+              adversary can forge arbitrarily many blocks whenever it is
+              elected, just not on the same chain.
+
+- Transition from CaughtUp to Syncing whenever any of:
+
+    - The wallclock is "too far ahead" of the latest "meaningful" peer
+      interaction.
+
+        - TODO Sketch: we transition as soon as N (?) of our peers' tips have a
+          time point that is more than LIM (?) behind our wallclock.
+
+        - TODO Our ChainSync timeouts will disconnect naturally, right? And so
+          maybe this is really just another valency limit, like that of Syncing
+          above.
+
+        - TODO It's safe to assume the computer has access to "inertial
+          reckoning" via real-time clock hardware, right? If so, we can
+          immediately detect this even upon eg the machine waking from a
+          hibernation state. IE instead of relying totally on an NTP connection,
+          which could also be compromised.
+
+    - Some peer sends >k headers from an intersection with my selection.
+
+        - This rule is a failsafe: We assume this shouldn't happen under
+          nominal circumstances (by the Common Prefix theorem in the Ouroboros
+          Praos paper; TODO Confirm with researchers), so we downgrade to the
+          more conservative state if we do observe it, since we must have
+          somehow fallen "too far" behind again without otherwise noticing.