
[NEP 584]: Cross shard bandwidth scheduler #584

Open
wants to merge 22 commits into base: master
Conversation

jancionear
Contributor

No description provided.

@jancionear jancionear requested a review from a team as a code owner January 13, 2025 13:49
@jancionear jancionear changed the title NEP XXX - Cross shard bandwidth scheduler NEP 584 - Cross shard bandwidth scheduler Jan 13, 2025
@jancionear jancionear changed the title NEP 584 - Cross shard bandwidth scheduler [NEP 584]: Cross shard bandwidth scheduler Jan 13, 2025
neps/nep-0584.md Outdated
}
```

Additionally, the value that is the closest to `max_receipt_size` is set to to `max_receipt_size`:
Contributor

two 'to'?

neps/nep-0584.md Outdated
}
```

The values are calculate using a linear interpolation between `base_bandwidth` and
Contributor

'calculated'?

neps/nep-0584.md Outdated
Comment on lines 372 to 378
BandwidthSchedulerParams {
base_bandwidth: 100000,
max_shard_bandwidth: 4500000,
max_receipt_size: 4194304,
max_allowance: 4500000,
}
```
Contributor

nit: clarify how these values are related to the fact that we are practicing the exercise with '4 shards'

@jancionear
Contributor Author

Matej's comment about the NEP can be found here: 0ff27d1#commitcomment-151308824

@walnut-the-cat added the WG-protocol, A-NEP, and S-review/needs-sme-review labels Jan 14, 2025
@walnut-the-cat
Contributor

walnut-the-cat commented Jan 14, 2025

As the moderator,

I want to kickstart the review process for this NEP, as the change is part of the upcoming release.

@jancionear , please comment once you believe this proposal is ready for SME review.

@near/wg-protocol , could you help assign SMEs who can review the proposal?

From an engineering perspective, we believe @shreyan-gupta and @wacban are good candidates.

Thank you.

@jancionear
Contributor Author

@jancionear , please comment once you believe this proposal is ready for SME review.

I think the NEP is ready for review. I addressed the issues found in Matej's review.

@bowenwang1996
Collaborator

As a working group member, I nominate @shreyan-gupta and @wacban as SME reviewers.


@shreyan-gupta shreyan-gupta left a comment


Overall great design! I've left a couple of comments.

Comment on lines +83 to +84
There is already a rudimentary solution in place, added together with stateless validation in
[NEP-509](https://github.com/near/NEPs/blob/master/neps/nep-0509.md) to limit witness size.


Is this the solution we have implemented in congestion control, or does it pre-date that? I recall we do something similar in congestion control, where we round-robin the shards that are allowed to send more than the usual limit.

Contributor Author

@jancionear jancionear Feb 10, 2025


It was added after congestion control. The allowed shard from congestion control has to be the same as the shard that is allowed to send more receipts to make sure that there are no liveness issues. If they were different we could have a situation where congestion control allows one shard to send receipts, but it's not the shard that can send large receipts and the large receipts could get stuck.

The idea of "allowed shard" from original congestion control was extended to also mean the shard that can send more bytes of receipts.
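As a rough illustration (a hypothetical sketch, not the actual nearcore code; it only captures the idea that both mechanisms must agree on the same shard):

```rust
// Hypothetical sketch of the shared round-robin "allowed shard".
// Congestion control (gas) and the large-receipt rule (bytes) must both
// derive the same shard from the same inputs, otherwise a large receipt
// could wait on a shard that congestion control never lets send.
fn allowed_shard(block_height: u64, shard_ids: &[u64]) -> u64 {
    assert!(!shard_ids.is_empty());
    shard_ids[(block_height % shard_ids.len() as u64) as usize]
}
```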

Comment on lines +173 to +174
Let's take a look at an example. Let's say that the predefined list of values that can be requested
is:


This setup works well when we are assuming 4 MB size limit. If in the future we would like to increase the size limit to some other number, how would we change the predefined list?

Contributor Author

@jancionear jancionear Feb 10, 2025


When changing the limits we would have to take another look at max_shard_bandwidth, max_single_grant and base_bandwidth, see if they still make sense, and adjust as necessary.
If we wanted to lower the receipt size limit to 2MB, we could either keep max_single_grant as is or make it smaller to increase the base_bandwidth. There's no golden rule; it depends on each case.

neps/nep-0584.md Outdated

```rust
max_shard_bandwidth = 4_500_000;
max_single_grant = 4194304


nit: 4_194_304 for readability

Comment on lines +305 to +306
of shards is low. There are some tests which have a low number of shards, and having a lower base
bandwidth allows us to fully test the bandwidth scheduler in those tests.


Is the 100 KB limit only introduced for these tests or do they potentially hold some importance in mainnet as well?

Contributor Author


Currently this only matters in tests. With the current parameters and 6 shards the base bandwidth is ~60kB, and it'll only get smaller as the number of shards increases.

Making max_shard_bandwidth larger or max_single_grant smaller in the future could make the base bandwidth larger than 100kB, in which case we'll have to reevaluate all the parameters.

neps/nep-0584.md Outdated
`max_single_grant`, like this:

```rust
values[-1] = base_bandwidth


nit: Might help here to just add a quick comment saying, value[-1] is the theoretical index -1. This means if BandwidthRequestValues are empty, we would be requesting base_bandwidth.

I initially mistook this for the python indexing where value[-1] means the last element of the array.

Contributor Author

I initially mistook this for the python indexing where value[-1] means the last element of the array.

Damn python strikes again x.x
Added a comment to make it clearer
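For readers tripped up the same way, a small sketch of the convention (an illustrative helper, not the NEP's exact code):

```rust
// "values[-1]" is a theoretical index, not Python-style last-element indexing.
// `values` is the predefined list of requestable amounts; bit i of the request
// bitmap asks for values[i]. An empty bitmap (no bits set) still grants
// base_bandwidth, which is what "values[-1] = base_bandwidth" expresses.
fn granted_for_bit(values: &[u64], bit: Option<usize>, base_bandwidth: u64) -> u64 {
    match bit {
        Some(i) => values[i],   // a real predefined value was requested
        None => base_bandwidth, // the theoretical values[-1] case
    }
}
```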

```rust
let mut sanity_check_bytes = Vec::new();
sanity_check_bytes.extend_from_slice(scheduler_state.sanity_check_hash.as_ref());
sanity_check_bytes.extend_from_slice(CryptoHash::hash_borsh(&all_shards).as_ref());


Is all_shards just the list of all shards in the current shard layout? It sounds like sanity_check_bytes basically just confirms the hash as per the block_height? Shouldn't we try to include the link_allowances in the hash as well?

Contributor Author


Yeah ideally we should hash the whole BandwidthSchedulerState in the hash, but that could potentially be large (~60kB?), which could affect performance. I didn't include the large field just to be safe. But maybe it wouldn't be that bad? 🤔
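For reference, folding the allowances in would be a one-line extension of the snippet above (hypothetical; deliberately not done because of the size concern):

```rust
// Hypothetical: also include the (potentially ~60kB) allowances in the sanity
// check. Not done in the NEP to avoid hashing tens of kilobytes every height.
sanity_check_bytes.extend_from_slice(
    CryptoHash::hash_borsh(&scheduler_state.link_allowances).as_ref(),
);
```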

tens of kilobytes of data, which could take a bit of cpu time, so it's not done. The sanity check
still checks that all shards ran the algorithm the same number of times and with the same shards.

A new trie column is introduced to keep the scheduler state:


nit: It'll be nice to explicitly define the new columns introduced with the groups as well in the section for groups, i.e. BUFFERED_RECEIPT_GROUPS_QUEUE_DATA and BUFFERED_RECEIPT_GROUPS_QUEUE_ITEM

Comment on lines +862 to +863
because of the gas limit enforced by congestion control. This is not ideal, in the future we might
consider merging these two algorithm into one better algorithm, but it is good enough for now.


It's great that this point is mentioned here. For now it seems like Congestion Control and Bandwidth Scheduler act independently, and both place their restrictions on the outgoing receipts. While Congestion Control currently deals with gas limits, Bandwidth Scheduler in its current implementation is limited to receipt size.

It definitely makes sense to try to merge the efforts from both these designs to provide a consistent view; a single way to manage outgoing receipts. Bandwidth Scheduler could be extended to gas limits as well.

neps/nep-0584.md Outdated
scheduler will work quicker than that.

The current version of the scheduler should work fine up to 50-100 shards, after that we'll probably
need to some modifications. A quick solution would be to randomly choose half of the shards at every


nit: remove the to here

neps/nep-0584.md Outdated
- https://github.com/near/nearcore/pull/12728
- https://github.com/near/nearcore/pull/12747

TODO - am I supposed to copy the code here? I think that a proper "minimal reference implementation"


I guess in this case it should be fine to keep what we have here. Maybe including the title of the PR would help a lot.

Contributor Author


Maybe including the title of the PR would help a lot.

Great idea, will add

Contributor

@wacban wacban left a comment


I'm halfway there, so far so good! I'll try to finish by EOD tomorrow.

note to self - pick up at BandwidthScheduler

neps/nep-0584.md Outdated
heard NEAR DA was moving to a design that doesn't require a lot of cross-shard bandwidth.
- High latency and bad scalability. A big receipt has to wait for up to `num_shards` heights before
it can be sent. This is much higher than it could be, with bandwidth scheduler a receipt never has
to wait more than one height (assuming that aren't shards aren't sending much). Even worse is that
Contributor


(assuming that aren't shards aren't sending much)

words are not wording right ;)

Comment on lines +244 to +245
It's important to keep the size of `BandwidthRequest` small because bandwidth requests are included
in the chunk header, and the chunk header shouldn't be too large.
Contributor


Just for my information, what would it sum up to for 10/ 50/ 100 shards?

Is there any value in wrapping the bitmap in an Option or does that not affect the number of serialized bytes?

Did you check if the serialized size is what you expect? I could imagine borsh rounding things up to the nearest 32 bits for each field; in that case it may be worth customizing the serialization.

Contributor Author


Just for my information, what would it sum up to for 10/ 50/ 100 shards?

A single bandwidth request takes 6 bytes, so with 100 shards the worst case would be 600 bytes per chunk, or 60kB per block. But the worst case is unlikely; usually a shard doesn't make bandwidth requests to all other shards.
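The arithmetic behind those numbers, as a quick sketch (assuming the 6-byte request size above):

```rust
fn main() {
    const REQUEST_SIZE: u64 = 6; // bytes per BandwidthRequest, as noted above
    for num_shards in [10u64, 50, 100] {
        let per_chunk = num_shards * REQUEST_SIZE; // worst case: a request to every shard
        let per_block = num_shards * per_chunk;    // one chunk per shard
        println!("{num_shards} shards: {per_chunk} B/chunk, {per_block} B/block");
    }
    // Prints 60/600, 300/15000, and 600/60000 bytes, i.e. ~60kB per block at 100 shards.
}
```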

Is there any value in wrapping the bitmap in an Option or does that not affect the number of serialized bytes?

A bandwidth request with a zeroed-out bitmap wouldn't be included in the shard's list of bandwidth requests, so there's no point in using an Option; the bitmap is always nonzero.

Did you check if the serialized size is what you expect? I could imagine borsh rounding things up to the nearest 32bits for each field, in this case it may be worth to customize the serialization.

I hope there's no rounding; that would be terrible. Borsh is supposed to be a one-to-one mapping between structs and serialized data.
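A quick way to check would be something like this (the struct layout here, a u8 shard id plus a 5-byte bitmap, is an assumption matching the 6 bytes mentioned above):

```rust
use borsh::BorshSerialize;

// Assumed layout for illustration; the real BandwidthRequest fields may differ.
#[derive(BorshSerialize)]
struct BandwidthRequest {
    to_shard: u8,
    requested_values_bitmap: [u8; 5],
}

fn main() {
    let req = BandwidthRequest { to_shard: 3, requested_values_bitmap: [1; 5] };
    // Borsh writes fields back-to-back with no alignment padding,
    // so the serialized size should be exactly 1 + 5 = 6 bytes.
    assert_eq!(borsh::to_vec(&req).unwrap().len(), 6);
}
```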

`base_bandwidth` of receipts, it can just send them out immediately. Actual bandwidth grants based
on bandwidth request happen after granting the base bandwidth.

On current mainnet (with 6 shards) the base bandwidth is 61_139 (61kB)
Contributor


Can you briefly explain whether this number is constant or dependent on the number of shards? If the latter, at what point will it become too low to allow request-less traffic for some percentage of chunks?

Contributor Author


Added a short note about the relation with number of shards.

base_bandwidth = (max_shard_bandwidth - max_single_grant) / (num_shards - 1)
               = (4500000 - 4*1024*1024) / (num_shards - 1)

For 6 shards:   (4500000 - 4*1024*1024) / 5  = 61139
For 50 shards:  (4500000 - 4*1024*1024) / 49 = 6238
For 100 shards: (4500000 - 4*1024*1024) / 99 = 3087

So with 100 shards a shard will be able to send at most 3kB of data to another shard without making a request.
This isn't a lot, but it should be enough to send a few receipts.

An additional factor is that as the number of shards increases, the amount of receipts sent on each link should decrease, so a smaller base bandwidth should become less of an issue; it shrinks at the same rate as the number of receipts per link.

For larger numbers of shards we could revisit the parameters, for example decreasing max_single_grant and max_receipt_size to 2MB would greatly increase the budget for base bandwidth, same with increasing max_shard_bandwidth.
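The scaling is easy to sanity-check with the formula above (the 2MB variant is purely illustrative):

```rust
// base_bandwidth as a function of the scheduler parameters (formula from above).
fn base_bandwidth(max_shard_bandwidth: u64, max_single_grant: u64, num_shards: u64) -> u64 {
    (max_shard_bandwidth - max_single_grant) / (num_shards - 1)
}

fn main() {
    assert_eq!(base_bandwidth(4_500_000, 4 * 1024 * 1024, 6), 61_139);
    assert_eq!(base_bandwidth(4_500_000, 4 * 1024 * 1024, 100), 3_087);
    // Illustrative: lowering max_single_grant to 2MB frees up base bandwidth.
    assert_eq!(base_bandwidth(4_500_000, 2 * 1024 * 1024, 100), 24_271);
}
```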

Comment on lines +272 to +273
/// The maximum amount of data that a shard can send or receive at a single height.
pub max_shard_bandwidth: Bandwidth,
Contributor


Just out of curiosity, is there any fundamental reason to have the "max send" equal to "max receive"? I'm not suggesting splitting it into two, just wondering.

Contributor Author


No particular reason, it was the easiest to do and there was no need for something more complicated. I guess everything that is sent has to be received, so the amount of data should be similar, assuming equal load.

Comment on lines +474 to +476
It's important to note that `size_upper_bound` is less than difference between two consecutive
values in `BandwidthRequestValues` . Thanks to this the requests are just as good as they would be
if they were generated directly using individual receipt sizes.
Contributor


It seems like there is some upper bound for how many trie reads are necessary to compute the requests. It can be guaranteed by early return once we exceed the max_shard_bandwidth and some minimum on the size of a single group. The latter isn't strictly enforced but I believe some emergent reasonable minimum still exists. Did I get it right?

Contributor Author

@jancionear jancionear Feb 10, 2025


That's right, BandwidthRequest::make_from_receipt_sizes reads the groups until it reaches a point where that much bandwidth can't be requested with the predefined values, and then it stops. This means that generating a bandwidth request performs at most max_single_grant / group_size trie reads, which is about 42 trie reads per outgoing buffer.
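In numbers (the ~100kB group size here is an assumption inferred from the ~42 figure):

```rust
fn main() {
    let max_single_grant: u64 = 4_194_304; // 4 MiB
    let group_size: u64 = 100_000;         // assumed approximate size of one receipt group
    // Each trie read fetches one group, and reading stops once no larger
    // predefined value could be requested.
    assert_eq!(max_single_grant.div_ceil(group_size), 42);
}
```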

Contributor

@wacban wacban left a comment


LGTM, thanks!

neps/nep-0584.md Outdated
Comment on lines 1211 to 1225
[This technical section is required for Protocol proposals but optional for other categories. A
draft implementation should demonstrate a minimal implementation that assists in understanding or
implementing this proposal. Explain the design in sufficient detail that:

- Its interaction with other features is clear.
- Where possible, include a Minimum Viable Interface subsection expressing the required behavior and
types in a target programming language. (ie. traits and structs for rust, interfaces and classes
for javascript, function signatures and structs for c, etc.)
- It is reasonably clear how the feature would be implemented.
- Corner cases are dissected by example.
- For protocol changes: A link to a draft PR on nearcore that shows how it can be integrated in the
current code. It should at least solve the key technical challenges.

The section should return to the examples given in the previous section, and explain more fully how
the detailed proposal makes those examples work.]
Contributor


I think you can remove that

Labels
A-NEP: A NEAR Enhancement Proposal (NEP).
S-review/needs-sme-review: A NEP in the REVIEW stage is waiting for Subject Matter Expert review.
WG-protocol: Protocol Standards Work Group should be accountable.
Projects
Status: REVIEW

6 participants