consensus-testlib: add Test.CsjModel #1466

nfrisby · 2025-04-15T21:35:08Z

This PR model is the result of troubleshooting some test failures. I found a minor non-optimality in CSJ (recall that CSJ is fundamentally an optimization), and with the non-centralized definition of the existing CSJ implementation it was challenging for me to establish confidence that there weren't similar issues. I developed this centralized model for more legibility/clarity, and I intend for it to be used at least to test the non-centralized CSJ implementation.

nfrisby · 2025-04-15T21:39:10Z

ouroboros-consensus/src/unstable-consensus-testlib/Test/CsjModel/NonEmptySeq.hs

+import qualified Data.Sequence as Seq
+
+-- | A non-empty 'Seq'
+newtype NonEmptySeq a = UnsafeNonEmptySeq (Seq a)


This type doesn't use AnchoredFragment for two reasons.

As a model, I'm trying to keep it as simple as possible: so far, I don't need headers at all. And I can't put points in an AnchoredFragment because of the HasHeader constraint.

So far, the model deals with chains that reach all the way back to Origin, not fragments. As long as that's true, it seems misleading to use AnchoredFragment.

The code no longer reaches all the way back to Genesis. It still doesn't involve headers explicitly, but perhaps AnchoredSeq could still be reused?

This commit adds the CSJ model but does not yet use it to test the implementation.

This small alteration accomodate the fact that the ChainSync client cannot necessarily a MsgFindIntersect _immediately_ when the CSJ governor emits it.

Payloads and connecting peers forced me ot figure out how to not always persist chains all the way back to Origin. Which forced me to realize and handle some weird corner cases. I'm optimistic, but I do expect friction once this is finally connected to the actual tests.

…tion

…nd the MsgFindIntersect reaction

nfrisby · 2025-05-14T12:25:41Z

Overnight, I ran:

4 copies of -p\"/CSJ/&&/With some/&&/own thing/\" --quickcheck-tests 10000
28 copies of -p\"/CSJ/" --quickcheck-shrinks 1000 --quickcheck-tests 1000

The testsuite has a 10x factor, so three of the four CSJ tests ran 280000 times, and the fourth ran 680000 times.

All of those tests passed! 🙌 Encouraging.

Edit: .... but the Byron ThreadNet tests are failing 🤦
Edit 2: Nevermind, those Byron tests also fail (in my local build) for the intersection of this PR and origin/main as well as on origin/main itself.

...os-consensus/src/unstable-consensus-testlib/Test/CsjModel/UglyStylishHaskellCppWorkaround.hs

nfrisby · 2025-05-15T17:45:34Z

I think the tree of the current tip commit is in pretty good shape.; passing huge number of -p/CSJ/ tests locally. We discussed next steps in our call today.

I'll pause and ramp up on Leios Innovation work.
@amesgen will start reviewing this PR.
The nearest milestone seems to be: split out a smaller PR that merely abstracts the ChainSync client over the CSJ registerClient function and reorganize the tests in this PR just slightly so that CsjModel.hs is also used in the Genesis tests (maybe the TestSetup determines which governor, 50-50 and shrink to CsjModel.hs?).
Principal remaining work for a follow-up PR is to confirm that the gap between the CSJ implementation and CsjModel.hs is just the risk of unnecessary header downloads and nothing more than that. Secondary remaining work is the flaky "node restart during sync" test.

nfrisby · 2025-05-15T20:00:00Z

The ~275000th local iteration of a test failed because I forgot to change the rotateDynamo call from the CSJ implementation to CsjModel.hs. I just pushed up a commit that fixed that.

However, the test case still fails, because CsjModel.hs's Demoting/Demoted pair doesn't account for the fact that the Dynamo being demoted might be stuck on the forecast range 🤦. So that's a TODO

Repro: cabal run ouroboros-consensus-diffusion:consensus-test -- --quickcheck-replay="(SMGen 646821712187648738 7938881144433598775,0)" -p '/stalling leashing attack/'

nfrisby · 2025-05-16T12:28:54Z

ouroboros-consensus/src/unstable-consensus-testlib/Test/CsjModel.hs

+                Jumped{}   -> True
+                Jumping{}  -> False
+                Objector{} -> True
+        pure (x, if not shouldDemote then [] else [(pid, Demoting)])


This Demoting isn't enough. Specifically:

The Dynamo is adversarial, withholding blocks.

But the Dynamo is serving the honest chain, and all Jumpers have fully accepted the latest jump request.

The Dynamo has served some headers after that jump request, but not enough for another jump.

In this scenario, the Dynamo has the longest chain, so Devoted BlockFetch will favor them. Also, the Dynamo's ChainSync client is fully blocked until a new block arrives, which it won't. So it'll never send the Demoted stimulus---the demoted definition just below won't happen. So the CSJ state and the syncing node overall are now stuck.

I anticipate the smoothest way to fix this is to introduce a new NonDynamo which is FormerDynamo. The CSJ governor expects it to send MsgRollForward etc and also Demoted. Once it sends Demoted, it's just a Jumper again. The benefit is that this way a new Dynamo can be elected before this Dynamo being demoted raises Demoted.

nfrisby requested review from jasagredo, amesgen, fraser-iohk, dnadales and geo2a as code owners April 15, 2025 21:35

nfrisby added the Genesis PRs related to Genesis testing and implementation label Apr 15, 2025

nfrisby added this to Consensus Team Backlog Apr 15, 2025

nfrisby commented Apr 15, 2025

View reviewed changes

nfrisby requested a review from jorisdral as a code owner April 22, 2025 00:10

nfrisby force-pushed the nfrisby/csj-model branch 2 times, most recently from d0c427c to d408094 Compare April 22, 2025 00:24

nfrisby moved this to 🏗 In progress in Consensus Team Backlog Apr 22, 2025

nfrisby force-pushed the nfrisby/csj-model branch 5 times, most recently from e1444da to ebd743b Compare April 22, 2025 14:40

nfrisby force-pushed the nfrisby/csj-model branch 2 times, most recently from edc4bc0 to 27c8abe Compare April 30, 2025 13:53

nfrisby and others added 6 commits May 2, 2025 07:55

consensus-testlib: add Test.CsjModel

7d17013

This commit adds the CSJ model but does not yet use it to test the implementation.

TOSQUASH add Jumping123 and Offered

af3e355

This small alteration accomodate the fact that the ChainSync client cannot necessarily a MsgFindIntersect _immediately_ when the CSJ governor emits it.

TOSQUASH some style checks

d439a93

TOSQUASH stylish

534135f

TOSQUASH typos

0f01420

nfrisby force-pushed the nfrisby/csj-model branch from 27c8abe to 833d6bf Compare May 2, 2025 15:02

nfrisby added 4 commits May 2, 2025 08:05

TOSQUASH consolidate imm tip input and trimCandidate calls

c508600

TOSQUASH add TODO re AnchoredSeq

a42af79

TOSQUASH tidy up the queue maintenance

c3aab0c

TOSQUASH haddock typos

06ef1a3

nfrisby added 9 commits May 13, 2025 06:51

TOSQUASH FullTrim workaround for huge text output in CsjModel traces

b628b04

TOSQUASH bugfix: the CsjGov should not be maintaint the handle collec…

ee7277b

…tion

TOSQUASH bugfix: candidate and jump might have different anchors

1716d5d

TOSQUASH discard JumpInfo when pretty printing CsjState

ab06fff

TOSQUASH backfill Dynamo, then check jump, then backfill Objectors

59b49f8

TOSQUASH if a new request resets a Jumping NotYetSet to Jumped, resci…

ef3b8df

…nd the MsgFindIntersect reaction

TOSQUASH fixup pretty-printing commit

f00a02c

TOSQUASH o-c-d:consensus-test:Test.Util.NoThunks

a2b0cb2

TOSQUASH Dynamo must be younger than Objectors

eabecf7

nfrisby added 2 commits May 14, 2025 05:58

TOSQUASH fixup ChainSync client bench and tests

512c83c

TOSQUASH fixup Test.Util.NoThunks

d19b18a

nfrisby force-pushed the nfrisby/csj-model branch from e3c5b5b to 16e5d16 Compare May 14, 2025 18:12

nfrisby added 2 commits May 14, 2025 11:20

TOSQUASH small stylish diff

a84f66f

TOSQUASH remove redundant constraint

aeda17a

nfrisby force-pushed the nfrisby/csj-model branch 3 times, most recently from 5ae55bb to 4636dc7 Compare May 14, 2025 19:21

nfrisby commented May 14, 2025

View reviewed changes

...os-consensus/src/unstable-consensus-testlib/Test/CsjModel/UglyStylishHaskellCppWorkaround.hs Outdated Show resolved Hide resolved

nfrisby force-pushed the nfrisby/csj-model branch from 4636dc7 to 9e7ec69 Compare May 14, 2025 19:52

TOSQUASH annoying GHC version fixup

fb07869

nfrisby force-pushed the nfrisby/csj-model branch from 9e7ec69 to fb07869 Compare May 14, 2025 19:59

nfrisby added 2 commits May 14, 2025 13:39

TOSQUASH bugfix apply CsjModel messages in the correct order

4670e8a

TOSQUASH only pretty print the new state CsjState

7b6de0b

TOSQUASH I forgot to also abstract over rotateDynamo

b2ceb4f

nfrisby force-pushed the nfrisby/csj-model branch from 894b535 to b2ceb4f Compare May 15, 2025 20:02

nfrisby commented May 16, 2025

View reviewed changes

TOSQUASH add FormerDynamo

716e699

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

consensus-testlib: add Test.CsjModel #1466

consensus-testlib: add Test.CsjModel #1466

nfrisby commented Apr 15, 2025 •

edited

Loading

nfrisby Apr 15, 2025

nfrisby Apr 30, 2025

nfrisby commented May 14, 2025 •

edited

Loading

nfrisby commented May 15, 2025

nfrisby commented May 15, 2025 •

edited

Loading

nfrisby May 16, 2025

consensus-testlib: add Test.CsjModel #1466

Are you sure you want to change the base?

consensus-testlib: add Test.CsjModel #1466

Conversation

nfrisby commented Apr 15, 2025 • edited Loading

nfrisby Apr 15, 2025

Choose a reason for hiding this comment

nfrisby Apr 30, 2025

Choose a reason for hiding this comment

nfrisby commented May 14, 2025 • edited Loading

nfrisby commented May 15, 2025

nfrisby commented May 15, 2025 • edited Loading

nfrisby May 16, 2025

Choose a reason for hiding this comment

nfrisby commented Apr 15, 2025 •

edited

Loading

nfrisby commented May 14, 2025 •

edited

Loading

nfrisby commented May 15, 2025 •

edited

Loading