Reduce allocator usage in Gossip implementation #4640
Conversation
```rust
    compressed,
};
let _ = rv.inflate()?;
```
Apparently this was a hack to validate that compression worked out. A crude but effective one, I guess.
This seems like it was a sanity check.
I don't know how often it fails, though.
Have you checked if it ever fails or not?
Can we keep this line as a debug check?

```rust
debug_assert_matches!(rv.inflate(), Ok(_));
```
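For reference, a minimal sketch of how such a debug-only check can be written on stable Rust, where `debug_assert_matches!` is not in stable std (it is provided by the `assert_matches` crate or a nightly feature). `inflate_stub` is a hypothetical stand-in, since `EpochSlots::inflate`'s signature is not reproduced here:

```rust
// Hypothetical stand-in for EpochSlots::inflate, which returns a Result.
fn inflate_stub() -> Result<usize, ()> {
    Ok(0)
}

fn main() {
    // Runs only in debug builds and is compiled out in release,
    // so the hot path pays nothing for the sanity check.
    debug_assert!(matches!(inflate_stub(), Ok(_)));
}
```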
Force-pushed from 7bb9fc5 to 1ef94fe.
Chatted on Slack.
Revert to bv is complete. We are even doing exactly the same number of allocations as we would get with bitvec.
Looks good, just a few questions, but then it should be good to go.
```diff
         min_slot - self.first_slot
     };
-    for i in start..self.num {
-        if i >= self.slots.len() as usize {
+    for i in start..self.num as u64 {
+        if i >= self.slots.len() {
             break;
         }
-        if self.slots.get(i as u64) {
+        if self.slots.get(i) {
             rv.push(self.first_slot + i as Slot);
```
are these just stylistic changes?
It's to keep clippy happy.
```diff
-        self.slots.set(*s - self.first_slot, true);
-        self.num = std::cmp::max(self.num, 1 + (*s - self.first_slot) as usize);
+        let offset = *s - self.first_slot;
+        if offset >= self.slots.len() {
+            return i;
+        }
+        self.slots.set(offset, true);
+        self.num = std::cmp::max(self.num, 1 + offset as usize);
```
same here.
This is purely to make the code more readable.
```rust
    compressed,
};
let _ = rv.inflate()?;
```
Are we worried about forcing a sanity check on the receiver of this EpochSlot? Looks like maybe not, since you added the check here:

```rust
if status != flate2::Status::StreamEnd {
    return Err(Error::DecompressError);
}
```
I think the original idea was to validate that compression worked (no idea why else it would be here). But since flate2 can tell us that "it worked", I see no point in running decompression to achieve the same result. The check you refer to is exactly that: StreamEnd, according to the docs, indicates that all input was consumed and all output fit into the provided buffer without reallocation.
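To illustrate the point, here is a minimal, self-contained sketch of a deflate path that relies on Status::StreamEnd instead of an inflate round-trip. The function name, error type, and headroom constant are made up for the example and are not this PR's actual code:

```rust
use flate2::{Compress, Compression, FlushCompress, Status};

/// Hypothetical error type standing in for the crate's own error enum.
#[derive(Debug)]
struct CompressFailed;

/// Compress `data` in one shot and treat anything other than
/// Status::StreamEnd as a failure: StreamEnd means all input was consumed
/// and all output fit into the buffer, so there is no need to run
/// decompression just to prove that compression worked.
fn deflate_checked(data: &[u8]) -> Result<Vec<u8>, CompressFailed> {
    let mut compressor = Compress::new(Compression::default(), true);
    // Headroom so that even incompressible input finishes in one call.
    let mut out = Vec::with_capacity(data.len() + 64);
    let status = compressor
        .compress_vec(data, &mut out, FlushCompress::Finish)
        .map_err(|_| CompressFailed)?;
    if status != Status::StreamEnd {
        return Err(CompressFailed);
    }
    Ok(out)
}
```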
The change looks fine.
But just to make sure, have you verified that this change is backward compatible?
That is, is the new (old) code able to ingest epoch slots generated by the old (new) code?
Separately, with the new code, does deflate/inflate return Err more often or less often than before?
```rust
    compressed,
};
let _ = rv.inflate()?;
```
This seems like it was a sanity check.
I don't know how often it fails, though.
Have you checked if it ever fails or not?
Did some tests on MNB. The compression branch never errored, including the old validation path that did decompression to check (it does not get called much, though, so maybe it is possible to trigger errors there if we try harder). The decompression path errored in 43104/6059035 packets, each time with BufError (i.e. not enough space for decompression, https://prisma.github.io/prisma-engines/doc/flate2/enum.Status.html). Previously, before we did any checking on the decompress path, all of these packets would be passed on to the validator and do something in there. I guess these could be formed by incompatible implementations?
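For context, a sketch of what such a receive-side check can look like with flate2's low-level API. The buffer size and error strings are illustrative, not the actual values used in gossip:

```rust
use flate2::{Decompress, FlushDecompress, Status};

// Illustrative bound on the inflated size, not the actual gossip limit.
const MAX_INFLATED_LEN: usize = 32_768;

fn inflate_checked(compressed: &[u8]) -> Result<Vec<u8>, &'static str> {
    let mut decompressor = Decompress::new(true);
    let mut out = Vec::with_capacity(MAX_INFLATED_LEN);
    match decompressor.decompress_vec(compressed, &mut out, FlushDecompress::Finish) {
        // Entire stream consumed and the output fit: the packet is well formed.
        Ok(Status::StreamEnd) => Ok(out),
        // BufError is the case measured above: the stream did not end before
        // the input (or the output buffer's spare capacity) ran out.
        Ok(_) => Err("decompression did not reach StreamEnd"),
        Err(_) => Err("corrupt deflate stream"),
    }
}
```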
Hmm, how often does this happen with the old code? Separately, can you please check whether the number of epoch-slots which are received from the cluster but fail to sigverify increases or not?
The receiver has to recover the exact data that was signed, otherwise the signatures do not verify. In other words, deflate/inflate should round-trip. So if the number of epoch-slots which don't sigverify increases, it means that the above does not hold.
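A quick property check along those lines might look like the following sketch, under the assumption that generous buffer headroom is acceptable in a test:

```rust
use flate2::{Compress, Compression, Decompress, FlushCompress, FlushDecompress, Status};

/// Returns true if `data` survives a deflate/inflate round-trip, i.e.
/// inflate(deflate(data)) == data with both calls ending in StreamEnd.
fn round_trips(data: &[u8]) -> bool {
    let mut compressed = Vec::with_capacity(data.len() + 64);
    let mut c = Compress::new(Compression::default(), true);
    if !matches!(
        c.compress_vec(data, &mut compressed, FlushCompress::Finish),
        Ok(Status::StreamEnd)
    ) {
        return false;
    }
    let mut inflated = Vec::with_capacity(data.len() + 64);
    let mut d = Decompress::new(true);
    if !matches!(
        d.decompress_vec(&compressed, &mut inflated, FlushDecompress::Finish),
        Ok(Status::StreamEnd)
    ) {
        return false;
    }
    inflated == data
}
```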
The decompression path is exactly the same as in the old code, except for the check, which did not exist before. So the old code would fail in exactly the same way; the failures were just ignored.
Pretty suspicious that it is giving BufError.
Increasing headroom to 128 bytes results in no notable reduction in the rate of packet-decode errors. I guess only some nodes produce invalid packets that cannot be decompressed without errors, and buffer size has nothing to do with it.
Attaching a list of validators that send invalid packets.
There is a newer version of flate2 than the one we are using: #3660
13436/897356, or 1.4%, of packets are reported as invalid by the inflate process with the new version, which is somewhat more than with the old version (0.8%). But the more suspicious issue is that the new version fails the roundtrip tests in CI while passing the same tests on a devbox, even with multiple reruns.
Problem
Gossip is torturing the memory allocator too much. A significant share of these allocations occurs in flate2 compression.
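As a general illustration of the direction (not necessarily the exact approach taken in this PR), allocator pressure from flate2 can be cut by reusing one compressor state and one scratch buffer across packets. The names here are hypothetical:

```rust
use flate2::{Compress, Compression, FlushCompress};

/// Reusable deflate state: one zlib stream object and one output buffer
/// are allocated once and recycled for every packet.
struct Deflater {
    compressor: Compress,
    scratch: Vec<u8>,
}

impl Deflater {
    fn new() -> Self {
        Self {
            compressor: Compress::new(Compression::default(), true),
            scratch: Vec::with_capacity(64 * 1024),
        }
    }

    /// Compress `data` into the reused scratch buffer. In steady state this
    /// performs no heap allocation (assuming the capacity suffices).
    fn deflate(&mut self, data: &[u8]) -> Result<&[u8], flate2::CompressError> {
        self.scratch.clear();
        self.compressor.reset();
        self.compressor
            .compress_vec(data, &mut self.scratch, FlushCompress::Finish)?;
        Ok(&self.scratch)
    }
}
```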
Summary of Changes
Fixes #
Gossip allocating too much.