
Conversation

terencechain (Collaborator)

This is an alternative to #15609. It changes KZG proof batch verification from timer-based batching to a responsive worker pool model that processes messages immediately when idle and buffers them when busy.
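At a high level, the model looks like the following (an illustrative sketch only, not the PR's exact code; runKzgWorker is a hypothetical name, while kzgVerifier, kzgChan, and verifyKzgBatch mirror names that appear in the diff below):

// The worker blocks until a request arrives, drains whatever queued up
// while it was busy, then verifies the whole batch in one call.
func (s *Service) runKzgWorker() {
	for {
		select {
		case <-s.ctx.Done():
			return
		case req := <-s.kzgChan:
			batch := []*kzgVerifier{req}
		drain:
			for {
				select {
				case r := <-s.kzgChan:
					batch = append(batch, r)
				default:
					break drain
				}
			}
			verifyKzgBatch(batch)
		}
	}
}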

@nalepae (Contributor) left a comment

I did not run the code itself.

@terencechain force-pushed the verify-kzg-proofs1 branch 2 times, most recently from 0df4dde to 427d0ad on September 2, 2025 14:48
@terencechain marked this pull request as ready for review on September 2, 2025 14:51
verificationSet := &kzgVerifier{dataColumns: dataColumns, resChan: resChan}
s.kzgChan <- verificationSet

resErr := <-resChan
Collaborator

Should this be wrapped in a select statement with ctx.Done()?

Collaborator

I think so too, and actually I would suggest using context.WithTimeout here to limit how long this potentially blocks on batch verification. This would simplify the other side of the channel because it would no longer need to write ctx.Err() back to all the channels (which is racy anyway and can leave these goroutines wedged).
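A minimal sketch of that caller-side pattern (illustrative; waitTimeout is a placeholder for whatever bound is chosen, and the other names mirror the diff):

ctx, cancel := context.WithTimeout(s.ctx, waitTimeout)
defer cancel()

// Guard the send as well, in case the worker is busy and the channel is full.
select {
case s.kzgChan <- verificationSet:
case <-ctx.Done():
	return validationResult, ctx.Err()
}

select {
case resErr := <-resChan:
	// Handle the batch verification result as before.
	err = resErr
case <-ctx.Done():
	// Stop waiting; the worker no longer needs to fan ctx.Err()
	// out to every pending result channel.
	return validationResult, ctx.Err()
}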

return validationResult, err
}
// Mark KZG verification as satisfied since we did it via batch verifier
verifier.SatisfyRequirement(verification.RequireSidecarKzgProofVerified)
Contributor

Why not use

defer dv.recordResult(RequireSidecarKzgProofVerified, &err)

directly in s.validateWithKzgBatchVerifier as done for all other verification functions?

nalepae previously approved these changes Sep 7, 2025
verifyKzgBatch(kzgBatch)
}
}
}
@kasey (Collaborator) Sep 10, 2025

I think you could simplify the pair of methods kzgVerifierRoutine + pullKzgChan into a single method like the following. Also note that I have dropped this kzg.resChan <- s.ctx.Err() loop - I think this is not necessary if the callers use a select and a context.WithTimeout (which they should do anyway for safe concurrency).

// A routine that runs in the background to perform batch
// KZG verifications by draining the channel and processing all pending requests.
func (s *Service) kzgVerifierRoutine() {
	kzgBatch := make([]*kzgVerifier, 0, 1)
	for {
		select {
		case <-s.ctx.Done():
			return
		case kzg := <-s.kzgChan:
			kzgBatch = append(kzgBatch, kzg)
			continue
		default:
			if len(kzgBatch) == 0 {
				// Nothing pending: block for the next request rather
				// than spinning on an empty batch.
				select {
				case <-s.ctx.Done():
					return
				case kzg := <-s.kzgChan:
					kzgBatch = append(kzgBatch, kzg)
				}
				continue
			}
			verifyKzgBatch(kzgBatch)
			kzgBatch = make([]*kzgVerifier, 0, 1)
		}
	}
}

(edit: fixed a bug in the suggestion)

s.kzgChan <- verificationSet

resErr := <-resChan
close(resChan)
@kasey (Collaborator) Sep 10, 2025

You don't really need to close channels; they get cleaned up by the garbage collector. You should only close a channel on the sending side, to unblock any receivers who could be waiting on more values from a sender that isn't going to produce any more. Receiving from a closed channel is fine (it instantly unblocks any goroutine trying to receive), but sending to one causes a panic. This feedback will be more critical if you take the other piece of feedback to check ctx.Done() above, because in that case you won't have read from resChan, so the other side could easily send to it after it is closed here and panic.
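The semantics above can be demonstrated in isolation (a standalone snippet, separate from the PR's code):

package main

import "fmt"

func main() {
	ch := make(chan int)
	close(ch)

	// Receiving from a closed channel never blocks: it yields the
	// zero value and ok == false.
	v, ok := <-ch
	fmt.Println(v, ok) // 0 false

	// Sending to a closed channel panics.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // send on closed channel
		}
	}()
	ch <- 1
}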

// the individual data columns might still be valid.
err := peerdas.VerifyDataColumnsSidecarKZGProofs(dataColumns)
if err != nil {
verErr := errors.Wrapf(err, "Could not verify")
@kasey (Collaborator) Sep 10, 2025

nitpick: Why a new variable and not just err = errors.Wrapf(err, "Could not verify")?

s.newColumnsVerifier = newDataColumnsVerifierFromInitializer(v)

go s.verifierRoutine()
go s.kzgVerifierRoutine()
Collaborator

Should we have more than 1 goroutine doing this?

@terencechain force-pushed the verify-kzg-proofs1 branch 5 times, most recently from 593feda to 16110b5 on September 10, 2025 03:13
Before this commit:
After 100 ms, an un-batched verification is launched concurrently with the batched one. As a result, a stressed node could be stressed even further by the duplicate verifications. It is also hard to choose a correct timeout value: 100 ms may be fine for a given node with a given BPO version, but not for the same node with a BPO version carrying 10x more blobs.

However, we know this gossip validation is no longer useful after a full slot duration.

After this commit:
After a full slot duration, we simply ignore the incoming gossip message. It's important to ignore it rather than reject it, since rejecting would downscore the peer that sent the message.
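A sketch of that behavior, assuming a libp2p-pubsub-style validator; secondsPerSlot and resChan are stand-ins for the actual chain configuration and the batch verifier's result channel:

ctx, cancel := context.WithTimeout(s.ctx, time.Duration(secondsPerSlot)*time.Second)
defer cancel()

select {
case err := <-resChan:
	if err != nil {
		// A genuine verification failure: reject, which downscores the peer.
		return pubsub.ValidationReject, err
	}
	return pubsub.ValidationAccept, nil
case <-ctx.Done():
	// Past a full slot the result is no longer useful, but the sidecar
	// may still be valid: ignore rather than reject so the peer is not
	// penalized.
	return pubsub.ValidationIgnore, ctx.Err()
}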
nalepae previously approved these changes Sep 11, 2025
@terencechain added this pull request to the merge queue on Sep 11, 2025
Merged via the queue into develop with commit 9e40551 on Sep 11, 2025
17 checks passed
@terencechain deleted the verify-kzg-proofs1 branch on September 11, 2025 16:00
fernantho pushed a commit to fernantho/prysm that referenced this pull request Sep 26, 2025
…ffchainLabs#15617)

* Implement KZG proof batch verification for data column gossip validation

* Manu's feedback

* Add tests

* Update beacon-chain/sync/batch_verifier.go

Co-authored-by: Manu NALEPA <[email protected]>

* Update beacon-chain/sync/batch_verifier.go

Co-authored-by: Manu NALEPA <[email protected]>

* Update beacon-chain/sync/kzg_batch_verifier_test.go

Co-authored-by: Manu NALEPA <[email protected]>

* Update beacon-chain/sync/kzg_batch_verifier_test.go

Co-authored-by: Manu NALEPA <[email protected]>

* Update beacon-chain/sync/kzg_batch_verifier_test.go

Co-authored-by: Manu NALEPA <[email protected]>

* Update beacon-chain/sync/kzg_batch_verifier_test.go

Co-authored-by: Manu NALEPA <[email protected]>

* Update beacon-chain/sync/kzg_batch_verifier_test.go

Co-authored-by: Manu NALEPA <[email protected]>

* Fix tests

* Kasey's feedback

* `validateWithKzgBatchVerifier`: Give up after a full slot.

Before this commit:
After 100 ms, an un-batched verification is launched concurrently to the batched one.
As a result, a stressed node could start to be even more stressed by the multiple verifications.
Also, it is always hard to choose a correct timeout value.
100ms may be OK for a given node with a given BPO version, and not ok for the same node with a BPO version with 10x more blobs.

However, we know this gossip validation won't be useful after a full slot duration.

After this commit:
After a full slot duration, we just ignore the incoming gossip message.
It's important to ignore it and not to reject it, since rejecting it would downscore the peer sending this message.

---------

Co-authored-by: Manu NALEPA <[email protected]>