Skip to content

consensus-testlib: add Test.CsjModel #1466

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 40 commits into
base: main
Choose a base branch
from
Open

consensus-testlib: add Test.CsjModel #1466

wants to merge 40 commits into from

Conversation

nfrisby
Copy link
Contributor

@nfrisby nfrisby commented Apr 15, 2025

This PR model is the result of troubleshooting some test failures. I found a minor non-optimality in CSJ (recall that CSJ is fundamentally an optimization), and with the non-centralized definition of the existing CSJ implementation it was challenging for me to establish confidence that there weren't similar issues. I developed this centralized model for more legibility/clarity, and I intend for it to be used at least to test the non-centralized CSJ implementation.

@nfrisby nfrisby added the Genesis PRs related to Genesis testing and implementation label Apr 15, 2025
import qualified Data.Sequence as Seq

-- | A non-empty 'Seq'
newtype NonEmptySeq a = UnsafeNonEmptySeq (Seq a)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type doesn't use AnchoredFragment for two reasons.

  • As a model, I'm trying to keep it as simple as possible: so far, I don't need headers at all. And I can't put points in an AnchoredFragment because of the HasHeader constraint.

  • So far, the model deals with chains that reach all the way back to Origin, not fragments. As long as that's true, it seems misleading to use AnchoredFragment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code no longer reaches all the way back to Genesis. It still doesn't involve headers explicitly, but perhaps AnchoredSeq could still be reused?

@nfrisby nfrisby requested a review from jorisdral as a code owner April 22, 2025 00:10
@nfrisby nfrisby force-pushed the nfrisby/csj-model branch 2 times, most recently from d0c427c to d408094 Compare April 22, 2025 00:24
@nfrisby nfrisby moved this to 🏗 In progress in Consensus Team Backlog Apr 22, 2025
@nfrisby nfrisby force-pushed the nfrisby/csj-model branch 5 times, most recently from e1444da to ebd743b Compare April 22, 2025 14:40
@nfrisby nfrisby force-pushed the nfrisby/csj-model branch 2 times, most recently from edc4bc0 to 27c8abe Compare April 30, 2025 13:53
nfrisby and others added 6 commits May 2, 2025 07:55
This commit adds the CSJ model but does not yet use it to test the
implementation.
This small alteration accomodate the fact that the ChainSync client cannot necessarily a MsgFindIntersect _immediately_ when the CSJ governor emits it.
Payloads and connecting peers forced me ot figure out how to not always persist
chains all the way back to Origin. Which forced me to realize and handle some
weird corner cases.

I'm optimistic, but I do expect friction once this is finally connected to the
actual tests.
@nfrisby nfrisby force-pushed the nfrisby/csj-model branch from 27c8abe to 833d6bf Compare May 2, 2025 15:02
@nfrisby
Copy link
Contributor Author

nfrisby commented May 14, 2025

Overnight, I ran:

  • 4 copies of -p\"/CSJ/&&/With some/&&/own thing/\" --quickcheck-tests 10000
  • 28 copies of -p\"/CSJ/" --quickcheck-shrinks 1000 --quickcheck-tests 1000

The testsuite has a 10x factor, so three of the four CSJ tests ran 280000 times, and the fourth ran 680000 times.

All of those tests passed! 🙌 Encouraging.

Edit: .... but the Byron ThreadNet tests are failing 🤦
Edit 2: Nevermind, those Byron tests also fail (in my local build) for the intersection of this PR and origin/main as well as on origin/main itself.

@nfrisby nfrisby force-pushed the nfrisby/csj-model branch from e3c5b5b to 16e5d16 Compare May 14, 2025 18:12
@nfrisby nfrisby force-pushed the nfrisby/csj-model branch 3 times, most recently from 5ae55bb to 4636dc7 Compare May 14, 2025 19:21
@nfrisby nfrisby force-pushed the nfrisby/csj-model branch from 4636dc7 to 9e7ec69 Compare May 14, 2025 19:52
@nfrisby nfrisby force-pushed the nfrisby/csj-model branch from 9e7ec69 to fb07869 Compare May 14, 2025 19:59
@nfrisby
Copy link
Contributor Author

nfrisby commented May 15, 2025

I think the tree of the current tip commit is in pretty good shape.; passing huge number of -p/CSJ/ tests locally. We discussed next steps in our call today.

  • I'll pause and ramp up on Leios Innovation work.
  • @amesgen will start reviewing this PR.
  • The nearest milestone seems to be: split out a smaller PR that merely abstracts the ChainSync client over the CSJ registerClient function and reorganize the tests in this PR just slightly so that CsjModel.hs is also used in the Genesis tests (maybe the TestSetup determines which governor, 50-50 and shrink to CsjModel.hs?).
  • Principal remaining work for a follow-up PR is to confirm that the gap between the CSJ implementation and CsjModel.hs is just the risk of unnecessary header downloads and nothing more than that. Secondary remaining work is the flaky "node restart during sync" test.

@nfrisby
Copy link
Contributor Author

nfrisby commented May 15, 2025

The ~275000th local iteration of a test failed because I forgot to change the rotateDynamo call from the CSJ implementation to CsjModel.hs. I just pushed up a commit that fixed that.

However, the test case still fails, because CsjModel.hs's Demoting/Demoted pair doesn't account for the fact that the Dynamo being demoted might be stuck on the forecast range 🤦. So that's a TODO

Repro: cabal run ouroboros-consensus-diffusion:consensus-test -- --quickcheck-replay="(SMGen 646821712187648738 7938881144433598775,0)" -p '/stalling leashing attack/'

@nfrisby nfrisby force-pushed the nfrisby/csj-model branch from 894b535 to b2ceb4f Compare May 15, 2025 20:02
Jumped{} -> True
Jumping{} -> False
Objector{} -> True
pure (x, if not shouldDemote then [] else [(pid, Demoting)])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Demoting isn't enough. Specifically:

  • The Dynamo is adversarial, withholding blocks.
  • But the Dynamo is serving the honest chain, and all Jumpers have fully accepted the latest jump request.
  • The Dynamo has served some headers after that jump request, but not enough for another jump.

In this scenario, the Dynamo has the longest chain, so Devoted BlockFetch will favor them. Also, the Dynamo's ChainSync client is fully blocked until a new block arrives, which it won't. So it'll never send the Demoted stimulus---the demoted definition just below won't happen. So the CSJ state and the syncing node overall are now stuck.

I anticipate the smoothest way to fix this is to introduce a new NonDynamo which is FormerDynamo. The CSJ governor expects it to send MsgRollForward etc and also Demoted. Once it sends Demoted, it's just a Jumper again. The benefit is that this way a new Dynamo can be elected before this Dynamo being demoted raises Demoted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Genesis PRs related to Genesis testing and implementation
Projects
Status: 🏗 In progress
Development

Successfully merging this pull request may close these issues.

2 participants