feat(full/availability): adding different withholding options to tests #2697
base: main
Conversation
…s and extending test cases
I had a good "in person" review of this that was convincing. I'm curious about the effectiveness of the "2 light node subnet split brain" test, based on how the removed blocks are provisioned from Unrecoverable, but maybe I'm imagining a test that doesn't exist yet (i.e. a full node transitions from failing to reconstruct from 1 subnet to being able to with 2 subnets). This might just be my limited grasp.
```go
	}

	err := errg.Wait()
	require.NoError(t, err)
}

// TestShareAvailable_DisconnectedFullNodes asserts that two disconnected full
```
Random thought: just curious whether it would still work the same for a similar topology, but with more than 2 full nodes.
It should, yes
Good thought, we need to park this in an issue. @musalbas requested the ability to scale the test in such a way.
Codecov Report
```
@@           Coverage Diff            @@
##            main    #2697     +/-  ##
========================================
- Coverage  51.37%   51.33%   -0.05%
========================================
  Files        159      159
  Lines      10699    10709      +10
========================================
  Hits        5497     5497
- Misses      4723     4734      +11
+ Partials     479      478       -1
```
```go
// false, otherwise it acts the same but keeps the node at index (k + 1, k + 1).
func FillWithheldBS(t *testing.T, n int, bServ blockservice.BlockService, recoverable bool) *share.Root {
	shares := sharetest.RandShares(t, n*n)
	eds, err := AddAndWithholdShares(context.TODO(), shares, bServ, recoverable)
```
should we make this a context.WithCancel and then t.Cleanup the cancel
```go
{
	name:           "barely recoverable",
	origSquareSize: 16,
	lightNodes:     230, // 99% chance of recoverability
```
[question]
Where is 230 coming from?
Again, this case is not covered in the paper. Refer to my description, and I can send you the script so you can verify it for yourself
The description didn't describe where 230 comes from, hence the question.

> Again, this case is not covered in the paper

Okay, then I missed some discussion around it, because I don't remember such a fact. In my memory, the figures from the paper assume a withholding attack, as it is extensively described in the theorems in the same sections as all the plots I am referring to.
Please share the script; I am curious about how it's constructed and would like to play with it and verify it. The number 230 feels too high to me when 23 is enough to reconstruct the whole square when data isn't withheld. Did you verify the script with @musalbas?
23 is not enough to reconstruct the whole square with high probability in this case, and that matches my simulations, which you can verify by running them yourself. Will send the script over Slack.
I meant 23/24 when data is not withheld, updated the comment.
https://gist.github.com/distractedm1nd/d2b306da13581b97bbbcefaa5b050738
I apologize for the shitty code, didnt expect it to be read by others :)
Main loop flow in the Monte Carlo script looks familiar 🙂
If `ds.sample(s)` would act as a noop (for example, for a share that is withheld), it would mean that the given share is not sampled, but you count it as sampled here:

```go
for len(uniqueSamples) < shares {
	s := randomSample(k)
	uniqueSamples[s]++
	ds.sample(s)
}
```

Could you please clarify why you count it as sampled when it is not? It looks like it could lead to fewer than 20 samples being assigned to each light node and make the required amount of light nodes higher than it needs to be.
I have tried to modify the code as follows, to ensure 20 samples per light node:

```go
for len(uniqueSamples) < shares {
	s := randomSample(k)
	if ds.sample(s) {
		uniqueSamples[s]++
	}
}
```
With the adjusted calculation, the probability graph shows a decent chance of reconstruction in the withholding case with fewer than 230 light nodes:
Yes - I tried to match it with the current implementation that does not change the coordinates to sample after failure but just retries. Or is this incorrect?
In any case, seeing how much the numbers improve after this change indicates it should be what we are doing both in IPLD retrieval as well as here in the test. Thank you for looking deeper into this!
I've checked light availability, and it seems like it is acting like neither of the simulations. 😆 I'll drop my findings into a new issue to discuss further.
```go
	if err != nil {
		return nil, err
	}
	// commit the batch to ipfs
```
[uber nit]
Suggested change:
```diff
-// commit the batch to ipfs
+// commit the batch to ipld
```
Actually, this looks like a copy of `AddShares`. I'm curious why we copy, as the delete functionality in `AddAndWithholdShares` looks like it could be composed over the `AddShares` func.
It was only copied in the first case because the intention was to use some internal state, but the most recent version just iterates over the adder and ipld Translates the DAH.
So I will see if I can just call AddShares and do the iteration after
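The composition being discussed can be sketched abstractly with a toy in-memory store (all names here are hypothetical; the real helpers operate on a blockservice and an EDS):

```go
package main

import "fmt"

// store is a toy stand-in for the blockservice: keys map to share bytes.
type store map[string][]byte

// add plays the role of AddShares: it makes every share available.
func add(s store, shares map[string][]byte) {
	for k, v := range shares {
		s[k] = v
	}
}

// addAndWithhold composes over add: store everything first, then
// delete the withheld keys afterwards, instead of duplicating add's logic.
func addAndWithhold(s store, shares map[string][]byte, withheldKeys []string) {
	add(s, shares) // reuse the plain add...
	for _, k := range withheldKeys {
		delete(s, k) // ...then remove the withheld shares
	}
}

func main() {
	s := store{}
	addAndWithhold(s, map[string][]byte{"a": {1}, "b": {2}}, []string{"a"})
	fmt.Println(len(s)) // 1
}
```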
```go
	// FullyRecoverable makes all EDS shares available when filling the
	// blockservice.
	FullyRecoverable Recoverability = iota
	// BarelyRecoverable withholds the (k + 1)^2 subsquare of the 2k^2 EDS,
```
[doc uber nit]
re: subsquare. We call em quadrants and just to decrease confusion we should continue to call em so.
I do not mean quadrants; look at the image in the description to see what I am referring to.
```go
@@ -26,10 +26,39 @@ func GetterWithRandSquare(t *testing.T, n int) (share.Getter, *share.Root) {
	return getter, availability_test.RandFillBS(t, n, bServ)
}

type Recoverability int64
```
[discussion]
Dropping the alternative naming I have in mind. There is nothing actionable here.
DataWithholding (FullyWithheld, PartiallyWithheld, UnWithheld)
Agree with this
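For reference, the floated naming could look like this (hypothetical identifiers; the PR itself ships `Recoverability` with `FullyRecoverable`, `BarelyRecoverable`, and `Unrecoverable`):

```go
package main

import "fmt"

// DataWithholding is the alternative name proposed for the mode enum.
type DataWithholding int64

const (
	UnWithheld DataWithholding = iota
	PartiallyWithheld
	FullyWithheld
)

// String gives readable names for test output.
func (d DataWithholding) String() string {
	switch d {
	case UnWithheld:
		return "UnWithheld"
	case PartiallyWithheld:
		return "PartiallyWithheld"
	default:
		return "FullyWithheld"
	}
}

func main() {
	fmt.Println(UnWithheld, PartiallyWithheld, FullyWithheld)
}
```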
```go
		net.Connect(lights2[i].ID(), full2.ID())
		net.Disconnect(lights2[i].ID(), full1.ID())
	}
	// start connection lights with sources
```
Suggested change:
```diff
-// start connection lights with sources
+// connect lights with sources
```
```go
	sampleAmount:    20,
	recoverability:  BarelyRecoverable,
	expectedFailure: true,
},
```
Why do we omit unrecoverable case here?
```go
// NOTE: This test contains cases for barely recoverable but
// DisconnectedFullNodes does not. The reasoning for this is that
// DisconnectedFullNodes has the additional contstraint that the data
// should not be reconstructable from a single subnetwork, while this
// test only tests that the data is reconstructable once the subnetworks
// are connected.
```
Suggested change:
```diff
-// DisconnectedFullNodes has the additional contstraint that the data
-// should not be reconstructable from a single subnetwork, while this
-// test only tests that the data is reconstructable once the subnetworks
+// DisconnectedFullNodes has the additional constraint that the data
+// should not be reconstructible from a single subnetwork, while this
+// test only tests that the data is reconstructible once the subnetworks
```
As per Grammarly
I need help understanding the reasoning here. Why would `DisconnectedFullNodes` not have a `BarelyRecoverable` case? `BarelyReconstructible` is the worst-case success scenario that, IMO, we should test: two subnetworks sample the available minimum, and the fulls collectively reconstruct.
Please refer to my description again to see - we also discussed this internally in slack. In order to have 99% confidence with BarelyRecoverable in DisconnectedFullNodes, you cannot use 2 subnetworks, as each subnetwork has a too high probability of the data being reconstructible. This is fine, as it is exactly what ConnectedFullNodes intends to test - that the data is available once connecting the full nodes, WITHOUT the assumption that it is unrecoverable from the independent subnetworks.
Sorry if this is being repeated; I still would like to get to the bottom of this.

> In order to have 99% confidence with BarelyRecoverable in DisconnectedFullNodes, you cannot use 2 subnetworks, as each subnetwork has a too high probability of the data being reconstructible

Isn't that a sign that 230 is too big a number, if the 115-node subnetwork has a very high chance of reconstructing? In other words, is there a number that would give ~99% for the whole network but ~0% for a single subnetwork? Is there a way to modify the script to support that? Note that my intuition for probabilities is not good.
```go
{
	name:           "fully recoverable",
	origSquareSize: 16,
	lightNodes:     24, // ~99% chance of recoverability
```
So those tests are flaky by nature, which is not good for CI. I did quick math with the current suite, and it seems those tests combined have a ~3% chance to fail: 0.99 * 0.99 * (1 - 0.007) * (1 - 0.003) ≈ 0.9703.
We should do something about it, but I also think it could be resolved in a separate issue. One way to deal with it would be to separate all flaky tests into a separate CI runner that is not blocking merge.
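The quick math above can be checked in a few lines (the per-test success rates are the ones quoted in this thread):

```go
package main

import "fmt"

func main() {
	// Probability that all four tests pass in one CI run, assuming
	// independence: two tests at 99%, one at 99.3%, one at 99.7%.
	pass := 0.99 * 0.99 * (1 - 0.007) * (1 - 0.003)
	fmt.Printf("%.4f\n", pass) // 0.9703
}
```

So roughly 1 in 33 CI runs would fail on these tests alone, which supports moving them to a non-blocking runner or lowering the light-node counts further from the threshold.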
Good point, and we can adjust the LN count to be less at the threshold to decrease flakiness further
Closes #2592 by adding three different modes to full availability test nodes: FullyRecoverable, BarelyRecoverable, and Unrecoverable.
The BarelyRecoverable test case requires 230 light nodes for a 99% chance of recoverability:
[graph omitted: probability of recoverability vs. light node count]
BarelyRecoverable is not used in DisconnectedFullNodes, only in ConnectedFullNodes, because, as the graph shows, splitting the 230 node count in two does not guarantee that recoverability fails within each subnetwork.