@Sajjon Sajjon commented Oct 10, 2025

Fixes: #9977

On our Kusama Canary chain YAP-3392, the log entry

Collation wasn't advertised because it was built on a relay chain block that is now part of an old session

showed up 400+ times (2025-10-03 -- 2025-10-10).

Luckily, this condition (the relay parent belongs to an old session) is easy to detect, so we can avoid building the block in the first place.

This will (slightly) increase block confidence (more so on our Kusama Canary, where sessions last 1h instead of Polkadot's 4h).

N.B. We already have similar logic in fn build_relay_parent_ancestry in cumulus/client/consensus/common/src/parent_search.rs:

let session = relay_client.session_index_for_child(current_rp).await?;
if required_session.get_or_insert(session) != &session {
    // Respect the relay-chain rule not to cross session boundaries.
    break;
}
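
For readers who have not seen that loop in context, here is a minimal, self-contained sketch of the same idea: walk back through relay chain ancestors and stop at the first block whose session differs from the starting block's. The `MockRelayChain` type, its methods, and `same_session_ancestry` are hypothetical stand-ins made up for illustration; this is not the actual cumulus code or the real `RelayChainInterface`.

```rust
use std::collections::HashMap;

type Hash = u32;
type SessionIndex = u32;

/// Hypothetical in-memory stand-in for a relay chain client.
struct MockRelayChain {
    /// block hash -> (parent hash, session index for a child of this block)
    blocks: HashMap<Hash, (Hash, SessionIndex)>,
}

impl MockRelayChain {
    fn session_index_for_child(&self, hash: Hash) -> Option<SessionIndex> {
        self.blocks.get(&hash).map(|(_, session)| *session)
    }

    fn parent_of(&self, hash: Hash) -> Option<Hash> {
        self.blocks.get(&hash).map(|(parent, _)| *parent)
    }
}

/// Walk back from `start`, collecting at most `max_depth` blocks that all
/// share the session of `start`. The walk ends at the first session boundary
/// or at the first block the mock chain does not know about.
fn same_session_ancestry(chain: &MockRelayChain, start: Hash, max_depth: usize) -> Vec<Hash> {
    let mut ancestry = Vec::new();
    let mut required_session: Option<SessionIndex> = None;
    let mut current = start;

    while ancestry.len() < max_depth {
        let Some(session) = chain.session_index_for_child(current) else { break };
        // The first iteration records the session; later ones compare against it.
        if *required_session.get_or_insert(session) != session {
            // Do not cross the session boundary.
            break;
        }
        ancestry.push(current);
        let Some(parent) = chain.parent_of(current) else { break };
        current = parent;
    }
    ancestry
}
```

With blocks 1 and 2 in session 5 and blocks 3 and 4 in session 6, starting the walk at block 4 yields only blocks 4 and 3.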

@Sajjon Sajjon force-pushed the cyon/skip_building_blocks_on_relay_parents_in_old_session_issue_9977 branch from be812cb to 340fdfb Compare October 10, 2025 08:51
@Sajjon Sajjon added the T9-cumulus This PR/Issue is related to cumulus. label Oct 10, 2025
@Sajjon Sajjon requested a review from sandreim October 10, 2025 08:59
@Sajjon Sajjon requested a review from sandreim October 10, 2025 10:07
@sandreim sandreim left a comment

I think we can do better than this, with some separation of concerns and a more robust check. We should actually query the collator protocol and ask it whether we can use a specific relay parent.

Please look at distribute_collation as there are multiple checks there on the relay parent which we can do before we decide to create a collation on it.

Also, we should keep in mind that advertisement can actually happen later, and by that time the relay parent might not be valid anymore, if a new block was created in a new session. This means that what we do here will not always prevent the situation from happening.

where
    Client: RelayChainInterface,
{
    let Ok(relay_best_hash) = relay_client.best_block_hash().await else {

You already have relay_best_hash in the caller scope; you can just pass it to the fn.

Sajjon commented Oct 10, 2025

@sandreim can you expand a bit on your last message and explain more of what you had in mind?

I'll refer to checking whether the relay parent is in an old session as SessionBoundryCheck.

We should actually query the collator protocol and ask it if we can use a specific relay parent.

From where should we query the collator protocol? Later you say:

This means that what we do here will not always prevent the situation from happening.

Thus you make it sound like performing SessionBoundryCheck inside run_block_builder is wrong? Or do you mean it is wrong to perform SessionBoundryCheck only there? So perhaps do it in multiple places?

  • before building (my attempt of doing so is my current impl, inside: run_block_builder)
  • after building but before advertisement
  • some other place/time?

Please look at distribute_collation as there are multiple checks there on the relay parent which we can do before we decide to create a collation on it.

Hmm, did you mean I should perform SessionBoundryCheck inside distribute_collation? That feels much too late: the function is called with a candidate, so we have already built, but this issue is about not even building the block.

By "query the collator protocol", do you mean adding a method to the trait ServiceInterface and calling it as we do here in basic aura, like so:

collator.collator_service().some_function()

I feel properly confused now 😅

skunert commented Oct 10, 2025

I thought we wanted to fix this on the relay chain side? #9766

I have a check here which already checks whether there will be an epoch change in the relay parent ancestry. If so, I include the next epoch's authorities for verification. Currently these blocks are getting dropped anyway, so the implementation is already forward-looking, because I assumed that at some point we will not drop them anymore 😬.

Sajjon commented Oct 10, 2025

@skunert

I thought we wanted to fix this on the relay chain side?

We want to avoid even building the parablock in the first place, to not waste resources (degrading block confidence), so it must happen on parachain side then, right?

skunert commented Oct 10, 2025

@skunert

I thought we wanted to fix this on the relay chain side?

We want to avoid even building the parablock in the first place, to not waste resources (degrading block confidence), so it must happen on parachain side then, right?

Confidence is dropping because we drop candidates at session boundaries. If we didn't do that, confidence would not drop and parachains could keep producing blocks as they do now, right? #9766

But yeah I assume it takes too long or is not scheduled?

sandreim commented Oct 10, 2025

@skunert

I thought we wanted to fix this on the relay chain side?

We want to avoid even building the parablock in the first place, to not waste resources (degrading block confidence), so it must happen on parachain side then, right?

Confidence is dropping because we drop candidates at session boundaries. If we wouldn't do that, confidence would not reduce and parachains could keep producing blocks as they do now right? #9766

But yeah I assume it takes too long or is not scheduled?

#9766 is a different issue. If you read the ticket, it is about candidates that have already been backed on chain and are pending availability. To solve that one we need to fix availability.

The issue in #9977 is that these candidates are not even advertised by collator protocol because the relay parent is out of scope already. To properly fix it we need to allow candidates with relay parents from the previous session. The fix should require changes in collator protocol, backing and prospective-parachains.

We will need to do this to support low-latency parachains. IIRC we discussed decoupling the relay parent we use for execution context from the one we use for scheduling information with @eskimor.

The fix in this PR should be very easy but will not be perfect. The candidate could be fetched because a new session was not observed yet, but then dropped from prospective parachains as soon as the RC advances into a new session. What we can do for now is not build a collation on an older RP if we've already seen the RC best block in a new session.
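
A minimal sketch of that interim rule, assuming the caller has already looked up the session index of both the candidate relay parent and the current relay chain best block. The names `BuildDecision` and `decide` are hypothetical and exist only for illustration; as noted above, this check cannot catch a session change that is observed only after the block has been built.

```rust
type SessionIndex = u32;

/// Outcome of the pre-build check (hypothetical type, for illustration).
#[derive(Debug, PartialEq, Eq)]
enum BuildDecision {
    /// No newer session has been observed yet; go ahead and build.
    Build,
    /// The best relay chain block is already in a newer session than the
    /// relay parent, so a collation built on it would not be advertised.
    SkipOldSession,
}

/// Decide whether to build on a relay parent, given the session of that
/// relay parent and the session of the best relay chain block seen so far.
fn decide(relay_parent_session: SessionIndex, best_block_session: SessionIndex) -> BuildDecision {
    if best_block_session > relay_parent_session {
        BuildDecision::SkipOldSession
    } else {
        BuildDecision::Build
    }
}
```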

I have a check here which already checks whether there will be an epoch change in the relay parent ancestry. If yes, I am including the next epochs authorities for verification. Currently these blocks are getting dropped anyway, so the implementation is already forward looking, because I assumed at some point we will not drop anymore 😬.

I am not familiar with this code. Does it solve what I described above?

Also, I don't think this should be solved in the cumulus code, because of separation of concerns. That's why I am proposing to query the collator protocol subsystem for a sanity check on the relay parent before proceeding with block production. Once we allow RPs from the previous session, you won't need to change anything in cumulus.

@sandreim
Thus you make it sound like performing SessionBoundryCheck inside run_block_builder is wrong? Or do you mean it is wrong to only performing SessionBoundryCheck there? So perhaps do it in multiple places?

  • before building (my attempt of doing so is my current impl, inside: run_block_builder)

Yes, before. I propose you send a message to the collator-protocol subsystem and ask it whether the relay_parent is good to build on. Collator protocol already tracks sessions and contains more checks for the RP.
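
To make the shape of that proposal concrete, here is a rough sketch of a request/response round trip, using plain `std` channels and a thread in place of the real subsystem machinery. Everything in it is an assumption for illustration: `CollatorProtocolRequest::CanBuildOn`, `spawn_toy_subsystem`, and `can_build_on` are made-up names and do not correspond to an existing collator-protocol message or API.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread;

type Hash = u32;
type SessionIndex = u32;

/// Hypothetical request type; not an existing collator-protocol message.
enum CollatorProtocolRequest {
    CanBuildOn { relay_parent: Hash, reply: Sender<bool> },
}

/// Spawn a toy "subsystem" that knows the session of each relay parent it tracks.
fn spawn_toy_subsystem(
    current_session: SessionIndex,
    known: HashMap<Hash, SessionIndex>,
) -> Sender<CollatorProtocolRequest> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        // Serve requests until every sender has been dropped.
        for request in rx {
            match request {
                CollatorProtocolRequest::CanBuildOn { relay_parent, reply } => {
                    // Only relay parents in the current session are acceptable.
                    let ok = known.get(&relay_parent) == Some(&current_session);
                    let _ = reply.send(ok);
                }
            }
        }
    });
    tx
}

/// Ask the toy subsystem whether `relay_parent` may be built on.
/// Treating an unreachable subsystem as "no" is an arbitrary choice here.
fn can_build_on(subsystem: &Sender<CollatorProtocolRequest>, relay_parent: Hash) -> bool {
    let (reply_tx, reply_rx) = channel();
    let request = CollatorProtocolRequest::CanBuildOn { relay_parent, reply: reply_tx };
    if subsystem.send(request).is_err() {
        return false;
    }
    reply_rx.recv().unwrap_or(false)
}
```

The point of the indirection is that the block builder only asks a yes/no question, so the rules for what counts as a valid relay parent can change on the subsystem side without touching the caller.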

skunert commented Oct 10, 2025

The check I mentioned above tests whether any of the RP descendants we use to enforce the offset contain a session change digest. If they do, we include additional relay chain authorities in the inherent storage proof. I did this because I was assuming that we will allow relay parents from old sessions at some point. So it currently does not fix the issue you want to fix, but the check can be reused for this, because the condition is the same.
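
As a rough illustration of the condition being described (not the actual implementation), the check amounts to scanning the digests of the relevant descendant headers for a session change marker. The `Header` and `DigestItem` types below are simplified, hypothetical stand-ins for the real header and digest types.

```rust
/// Simplified, hypothetical digest item; real digests carry encoded data.
#[derive(Debug, Clone, PartialEq, Eq)]
enum DigestItem {
    SessionChange,
    Other,
}

/// Simplified, hypothetical header carrying only its digest items.
struct Header {
    digest: Vec<DigestItem>,
}

/// Returns true if any of the given descendant headers announces a session change.
fn contains_session_change(descendants: &[Header]) -> bool {
    descendants
        .iter()
        .any(|header| header.digest.iter().any(|item| *item == DigestItem::SessionChange))
}
```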

The concerns of finding the correct relay parent are currently not separated anyway. We have the parent_search which tries to find a suitable parent block which lives in the same session as the tip of the chain. While thinking about this issue here I realized that we pass the relay_parent into this instead of the tip of the chain.


So I think what we should do is:

  • Pass the relay_best_hash here instead of relay_parent. This should enforce that the relay parent ancestry contains only blocks which we have no parachain blocks for, essentially leading to skipping the block.
  • However, we need to ensure that the included_header_hash here corresponds to the included hash from the relay_parent, same as it is now. Otherwise the runtime will later perform its checks against a different included block than what we check during authoring.

Development

Successfully merging this pull request may close these issues.

Cumulus: skip building a block on relay parents in old session

3 participants