[NEP 584]: Cross shard bandwidth scheduler #584
Conversation
neps/nep-0584.md
Outdated
> Additionally, the value that is the closest to `max_receipt_size` is set to to `max_receipt_size`:
two 'to'?
neps/nep-0584.md
Outdated
> The values are calculate using a linear interpolation between `base_bandwidth` and
'calculated'?
neps/nep-0584.md
Outdated
```rust
BandwidthSchedulerParams {
    base_bandwidth: 100000,
    max_shard_bandwidth: 4500000,
    max_receipt_size: 4194304,
    max_allowance: 4500000,
}
```
nit: clarify how these values are related to the fact that we are practicing the exercise with '4 shards'
Matej's comment about the NEP can be found here: 0ff27d1#commitcomment-151308824
As the moderator, I want to kickstart the review process for this NEP, as the change is part of the upcoming release. @jancionear, please comment once you believe this proposal is ready for SME review. @near/wg-protocol, could you help assign SMEs who can review the proposal? From an engineering perspective, we believe @shreyan-gupta and @wacban are good candidates. Thank you.
I think the NEP is ready for review. I addressed the issues found in Matej's review.
As a working group member, I nominate @shreyan-gupta and @wacban as SME reviewers.
Overall great design! I've left a couple of comments.
> There is already a rudimentary solution in place, added together with stateless validation in
> [NEP-509](https://github.com/near/NEPs/blob/master/neps/nep-0509.md) to limit witness size.
Is this the solution we implemented in congestion control, or does it pre-date that? I recall we do something similar in congestion control, where we round-robin the shards that we are allowed to send more than the usual limit to.
It was added after congestion control. The allowed shard from congestion control has to be the same as the shard that is allowed to send more receipts to make sure that there are no liveness issues. If they were different we could have a situation where congestion control allows one shard to send receipts, but it's not the shard that can send large receipts and the large receipts could get stuck.
The idea of "allowed shard" from original congestion control was extended to also mean the shard that can send more bytes of receipts.
> Let's take a look at an example. Let's say that the predefined list of values that can be requested
> is:
This setup works well when we are assuming 4 MB size limit. If in the future we would like to increase the size limit to some other number, how would we change the predefined list?
When changing the limits we would have to take another look at `max_shard_bandwidth`, `max_single_grant` and `base_bandwidth`, see if they still make sense, and adjust as necessary.
If we wanted to lower the receipt size limit to 2MB we could either keep `max_single_grant` as is or make it smaller to increase the `base_bandwidth`. There's no golden rule, it depends on each case.
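For a rough feel of the trade-off, here's a back-of-the-envelope sketch using the `base_bandwidth` formula quoted later in this thread; the 2MB value is purely illustrative:

```rust
// base_bandwidth = (max_shard_bandwidth - max_single_grant) / (num_shards - 1)
fn main() {
    let max_shard_bandwidth: u64 = 4_500_000;
    let num_shards: u64 = 6;
    for max_single_grant in [4_194_304u64, 2 * 1024 * 1024] {
        let base = (max_shard_bandwidth - max_single_grant) / (num_shards - 1);
        println!("max_single_grant = {max_single_grant} -> base_bandwidth = {base}");
    }
}
```

At 6 shards this prints ~61kB for the current 4MB grant and ~480kB for a 2MB grant, which is why shrinking `max_single_grant` frees up base bandwidth.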
neps/nep-0584.md
Outdated
```rust
max_shard_bandwidth = 4_500_000;
max_single_grant = 4194304
```
nit: 4_194_304 for readability
> of shards is low. There are some tests which have a low number of shards, and having a lower base
> bandwidth allows us to fully test the bandwidth scheduler in those tests.
Is the 100 KB limit only introduced for these tests or do they potentially hold some importance in mainnet as well?
Currently this only matters in tests. With the current parameters and 6 shards the base bandwidth is ~60kB, and it'll only get smaller as the number of shards increases.
Making `max_shard_bandwidth` larger or `max_single_grant` smaller in the future could make the base bandwidth larger than 100kB, in which case we'll have to reevaluate all the parameters.
neps/nep-0584.md
Outdated
> `max_single_grant`, like this:
>
> ```rust
> values[-1] = base_bandwidth
> ```
nit: Might help here to just add a quick comment saying `value[-1]` is the theoretical index -1. This means if `BandwidthRequestValues` are empty, we would be requesting `base_bandwidth`.
I initially mistook this for the python indexing where `value[-1]` means the last element of the array.
> I initially mistook this for the python indexing where value[-1] means the last element of the array.

Damn python strikes again x.x
Added a comment to make it clearer
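To spell out the semantics, here's a minimal sketch of the lookup being described; the function name and shape are illustrative assumptions, not the nearcore API:

```rust
/// Maps the bandwidth a shard wants to the value it will request.
/// The conceptual `values[-1]` case: if `base_bandwidth` already covers the
/// need, nothing is requested and the shard falls back to the free grant.
fn requested_bandwidth(values: &[u64], base_bandwidth: u64, wanted: u64) -> u64 {
    if wanted <= base_bandwidth {
        return base_bandwidth; // "index -1": no request necessary
    }
    // Otherwise pick the smallest predefined value that covers `wanted`,
    // or the largest one if even that is not enough.
    *values
        .iter()
        .find(|&&v| v >= wanted)
        .unwrap_or_else(|| values.last().expect("values is nonempty"))
}

fn main() {
    let values = [200_000u64, 400_000, 800_000];
    assert_eq!(requested_bandwidth(&values, 100_000, 50_000), 100_000);
    assert_eq!(requested_bandwidth(&values, 100_000, 300_000), 400_000);
}
```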
```rust
let mut sanity_check_bytes = Vec::new();
sanity_check_bytes.extend_from_slice(scheduler_state.sanity_check_hash.as_ref());
sanity_check_bytes.extend_from_slice(CryptoHash::hash_borsh(&all_shards).as_ref());
```
Is `all_shards` just the list of all shards in the current shard layout? Sounds like the `sanity_check_bytes` just basically confirms the hash as per the block_height? Shouldn't we try to include the `link_allowances` into the hash as well?
Yeah, ideally we would include the whole `BandwidthSchedulerState` in the hash, but that could potentially be large (~60kB?), which could affect performance. I didn't include the large field just to be safe. But maybe it wouldn't be that bad? 🤔
> tens of kilobytes of data, which could take a bit of cpu time, so it's not done. The sanity check
> still checks that all shards ran the algorithm the same number of times and with the same shards.
>
> A new trie column is introduced to keep the scheduler state:
nit: It'll be nice to explicitly define the new columns introduced with the groups as well in the section for groups, i.e. `BUFFERED_RECEIPT_GROUPS_QUEUE_DATA` and `BUFFERED_RECEIPT_GROUPS_QUEUE_ITEM`.
> because of the gas limit enforced by congestion control. This is not ideal, in the future we might
> consider merging these two algorithm into one better algorithm, but it is good enough for now.
It's great that this point is mentioned here. For now it seems like Congestion Control and Bandwidth Scheduler act independently, and both their restrictions are placed on the outgoing receipts. While Congestion Control currently deals with gas limits, Bandwidth Scheduler as of the current implementation is limited to the receipt size.
It definitely makes sense to try to merge the efforts from both these designs to provide a consistent view; a single way to manage outgoing receipts. Congestion Control can be extended to gas limits as well.
neps/nep-0584.md
Outdated
> scheduler will work quicker than that.
>
> The current version of the scheduler should work fine up to 50-100 shards, after that we'll probably
> need to some modifications. A quick solution would be to randomly choose half of the shards at every
nit: remove the `to` here
neps/nep-0584.md
Outdated
> - https://github.com/near/nearcore/pull/12728
> - https://github.com/near/nearcore/pull/12747
>
> TODO - am I supposed to copy the code here? I think that a proper "minimal reference implementation"
I guess in this case it should be fine to keep what we have here. Maybe including the title of the PR would help a lot.
> Maybe including the title of the PR would help a lot.

Great idea, will add
I'm halfway there, so far so good! I'll try to finish by eod tomorrow.
note to self - pick up at `BandwidthScheduler`
neps/nep-0584.md
Outdated
> heard NEAR DA was moving to a design that doesn't require a lot of cross-shard bandwidth.
> - High latency and bad scalability. A big receipt has to wait for up to `num_shards` heights before
>   it can be sent. This is much higher than it could be, with bandwidth scheduler a receipt never has
>   to wait more than one height (assuming that aren't shards aren't sending much). Even worse is that
> (assuming that aren't shards aren't sending much)

words are not wording right ;)
> It's important to keep the size of `BandwidthRequest` small because bandwidth requests are included
> in the chunk header, and the chunk header shouldn't be too large.
Just for my information, what would it sum up to for 10/50/100 shards?
Is there any value in wrapping the bitmap in an Option, or does that not affect the number of serialized bytes?
Did you check if the serialized size is what you expect? I could imagine borsh rounding things up to the nearest 32 bits for each field, in which case it may be worth customizing the serialization.
> Just for my information, what would it sum up to for 10/50/100 shards?

A single bandwidth request takes 6 bytes, so with 100 shards the worst case would be 600 bytes per chunk, or 60kB per block. But the worst case is unlikely, usually a shard doesn't make bandwidth requests to all other shards.

> Is there any value in wrapping the bitmap in an Option or does that not affect the number of serialized bytes?

A bandwidth request with a zeroed-out bitmap wouldn't be included in the list of the shard's bandwidth requests, so there's no point in using an `Option`. The bitmap is always nonzero.

> Did you check if the serialized size is what you expect? I could imagine borsh rounding things up to the nearest 32bits for each field, in this case it may be worth to customize the serialization.

I hope there's no rounding, that would be terrible, borsh is supposed to be a one-to-one mapping between structs and serialized data.
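As a sanity check on the no-padding claim, here's a minimal sketch; the struct layout is an assumption chosen to match the 6-byte figure above, not the exact nearcore definition:

```rust
use borsh::BorshSerialize;

// Hypothetical layout: one byte for the target shard plus a 40-bit bitmap
// with one bit per predefined bandwidth value.
#[derive(BorshSerialize)]
struct BandwidthRequestSketch {
    to_shard: u8,
    requested_values_bitmap: [u8; 5],
}

fn main() {
    let req = BandwidthRequestSketch { to_shard: 3, requested_values_bitmap: [1; 5] };
    // borsh writes integers and fixed-size arrays back to back with no
    // padding, so the serialized size is exactly 1 + 5 = 6 bytes.
    assert_eq!(borsh::to_vec(&req).unwrap().len(), 6);
}
```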
> `base_bandwidth` of receipts, it can just send them out immediately. Actual bandwidth grants based
> on bandwidth request happen after granting the base bandwidth.
>
> On current mainnet (with 6 shards) the base bandwidth is 61_139 (61kB)
Can you briefly explain if this number is constant or dependent on the number of shards? If the latter, at which point will it become too low to allow request-less traffic for some percentage of chunks?
Added a short note about the relation with the number of shards.

`base_bandwidth = (max_shard_bandwidth - max_single_grant) / (num_shards - 1)`
`base_bandwidth = (4500000 - 4*1024*1024) / (num_shards - 1)`

For 6 shards: `(4500000 - 4*1024*1024)/5 = 61139`
For 50 shards: `(4500000 - 4*1024*1024)/49 = 6238`
For 100 shards: `(4500000 - 4*1024*1024)/99 = 3087`

So with 100 shards a shard will be able to send at most 3kB of data to another shard without making a request. This isn't a lot, but it should be enough to send a few receipts.

An additional factor is that as the number of shards increases, the number of receipts sent on each link should become lower, so a smaller base bandwidth should become less of an issue; it shrinks at the same rate as the number of receipts per link.

For larger numbers of shards we could revisit the parameters. For example, decreasing `max_single_grant` and `max_receipt_size` to 2MB would greatly increase the budget for base bandwidth, same with increasing `max_shard_bandwidth`.
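The arithmetic above is easy to reproduce mechanically; a minimal sketch (the function name is illustrative):

```rust
// base_bandwidth as a function of the shard count, using the constants above.
fn base_bandwidth(num_shards: u64) -> u64 {
    const MAX_SHARD_BANDWIDTH: u64 = 4_500_000;
    const MAX_SINGLE_GRANT: u64 = 4 * 1024 * 1024; // 4_194_304
    (MAX_SHARD_BANDWIDTH - MAX_SINGLE_GRANT) / (num_shards - 1)
}

fn main() {
    assert_eq!(base_bandwidth(6), 61_139); // ~61kB on current mainnet
    assert_eq!(base_bandwidth(50), 6_238);
    assert_eq!(base_bandwidth(100), 3_087); // ~3kB with 100 shards
}
```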
```rust
/// The maximum amount of data that a shard can send or receive at a single height.
pub max_shard_bandwidth: Bandwidth,
```
Just out of curiosity is there any fundamental reason to have the "max send" equal to "max receive"? I'm not suggesting to split it into two, just wondering.
No particular reason, it was the easiest to do and there was no need for something more complicated. I guess everything that is sent has to be received, so the amount of data should be similar, assuming equal load.
> It's important to note that `size_upper_bound` is less than difference between two consecutive
> values in `BandwidthRequestValues`. Thanks to this the requests are just as good as they would be
> if they were generated directly using individual receipt sizes.
It seems like there is some upper bound for how many trie reads are necessary to compute the requests. It can be guaranteed by early return once we exceed the `max_shard_bandwidth` and some minimum on the size of a single group. The latter isn't strictly enforced but I believe some emergent reasonable minimum still exists. Did I get it right?
That's right, `BandwidthRequest::make_from_receipt_sizes` reads the groups until it reaches a point where this much bandwidth can't be requested with the predefined values, and then it stops. This means that generating a bandwidth request will do at most `max_single_grant / group_size` trie reads, which is about 42 trie reads per outgoing buffer.
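For concreteness, the bound works out roughly as below, assuming receipt groups of about 100kB (the group size here is an assumption for illustration):

```rust
fn main() {
    const MAX_SINGLE_GRANT: u64 = 4_194_304;
    const APPROX_GROUP_SIZE: u64 = 100_000; // assumed group size
    // One extra read past the last fully requestable group gives the "+ 1".
    let max_trie_reads = MAX_SINGLE_GRANT / APPROX_GROUP_SIZE + 1;
    assert_eq!(max_trie_reads, 42); // matches the ~42 reads mentioned above
}
```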
LGTM, thanks!
neps/nep-0584.md
Outdated
> [This technical section is required for Protocol proposals but optional for other categories. A
> draft implementation should demonstrate a minimal implementation that assists in understanding or
> implementing this proposal. Explain the design in sufficient detail that:
>
> - Its interaction with other features is clear.
> - Where possible, include a Minimum Viable Interface subsection expressing the required behavior and
>   types in a target programming language. (ie. traits and structs for rust, interfaces and classes
>   for javascript, function signatures and structs for c, etc.)
> - It is reasonably clear how the feature would be implemented.
> - Corner cases are dissected by example.
> - For protocol changes: A link to a draft PR on nearcore that shows how it can be integrated in the
>   current code. It should at least solve the key technical challenges.
>
> The section should return to the examples given in the previous section, and explain more fully how
> the detailed proposal makes those examples work.]
I think you can remove that