
Fix appendOrMergeRPC inefficiency in message size recalculation #582


Open

wants to merge 4 commits into master

Conversation

algorandskiy
Contributor

@algorandskiy algorandskiy commented Oct 1, 2024

Summary

As discussed in #581, there is an inefficiency in appendOrMergeRPC: it calls Size() more times than needed.

Fix

Instead of calling lastRPC.Size(), which iterates over all of RPC.Publish, save the last known size and add the current message size plus the protobuf upper-bound overhead.
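
A minimal sketch of the idea, with simplified stand-in types (names like `varintLen`, `rpc`, and `appendOrMerge` are illustrative, not the actual patch):

```go
package main

import "fmt"

// msg stands in for a pubsub message; Size() here is O(1), like the
// generated protobuf Size() of a single message.
type msg struct{ payload []byte }

func (m msg) Size() int { return len(m.payload) }

// varintLen returns the number of bytes protobuf uses to varint-encode n.
func varintLen(n int) int {
	l := 1
	for n >= 0x80 {
		n >>= 7
		l++
	}
	return l
}

// rpc carries its messages plus a cached running size, so deciding whether
// one more message fits never requires re-walking every message.
type rpc struct {
	msgs []msg
	size int // cached encoded size of this RPC so far
}

// appendOrMerge adds m to the last rpc if it still fits under limit,
// otherwise starts a new rpc. The cost added to the cache is an upper
// bound per entry: 1 byte for the Publish field key, plus the varint
// length prefix, plus the message body itself.
func appendOrMerge(rpcs []rpc, m msg, limit int) []rpc {
	entry := 1 + varintLen(m.Size()) + m.Size()
	if n := len(rpcs); n > 0 && rpcs[n-1].size+entry <= limit {
		last := &rpcs[n-1]
		last.msgs = append(last.msgs, m)
		last.size += entry
		return rpcs
	}
	return append(rpcs, rpc{msgs: []msg{m}, size: entry})
}

func main() {
	var out []rpc
	for i := 0; i < 5; i++ {
		out = appendOrMerge(out, msg{payload: make([]byte, 400)}, 1024)
	}
	fmt.Println(len(out), "RPCs") // 3 RPCs (2 + 2 + 1 messages)
}
```

The real code operates on the library's RPC type rather than these stand-ins; the point of the sketch is only the size-caching idea.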

Status and Evaluation

Benchmark results

[master●●] % go test ./ -run ^$ -bench BenchmarkAppendOrMergeRPC -benchmem -count=10 > old.txt
[pavel/append-or-merge●] % go test ./ -run ^$ -bench BenchmarkAppendOrMergeRPC -benchmem -count=10 > new.txt
[pavel/append-or-merge●] % benchstat old.txt new.txt
goos: darwin
goarch: arm64
pkg: github.com/libp2p/go-libp2p-pubsub
cpu: Apple M1 Pro
                          │   old.txt   │               new.txt               │
                          │   sec/op    │   sec/op     vs base                │
AppendOrMergeRPC/small-10   741.8n ± 0%   280.8n ± 0%  -62.15% (p=0.000 n=10)
AppendOrMergeRPC/large-10   36.88µ ± 0%   10.67µ ± 0%  -71.06% (p=0.000 n=10)
geomean                     5.231µ        1.731µ       -66.90%

                          │   old.txt    │                new.txt                │
                          │     B/op     │     B/op      vs base                 │
AppendOrMergeRPC/small-10     368.0 ± 0%     368.0 ± 0%       ~ (p=1.000 n=10) ¹
AppendOrMergeRPC/large-10   18.54Ki ± 0%   18.54Ki ± 0%       ~ (p=1.000 n=10) ¹
geomean                     2.581Ki        2.581Ki       +0.00%
¹ all samples are equal

                          │  old.txt   │               new.txt               │
                          │ allocs/op  │ allocs/op   vs base                 │
AppendOrMergeRPC/small-10   7.000 ± 0%   7.000 ± 0%       ~ (p=1.000 n=10) ¹
AppendOrMergeRPC/large-10   216.0 ± 0%   216.0 ± 0%       ~ (p=1.000 n=10) ¹
geomean                     38.88        38.88       +0.00%
¹ all samples are equal

@MarcoPolo
Contributor

Generally looks good. Make sure to run the Fuzz tests as well. A benchmark might also be helpful here.

@algorandskiy
Contributor Author

The fuzzer seems to work a little too well and finds a failing input even on the master branch (rerunning makes it pass, though):

go test -fuzz=FuzzAppendOrMergeRPC -run ^$ -v
=== RUN   FuzzAppendOrMergeRPC
fuzz: elapsed: 0s, gathering baseline coverage: 0/209 completed
fuzz: elapsed: 3s, gathering baseline coverage: 205/209 completed
fuzz: elapsed: 5s, gathering baseline coverage: 209/209 completed, now fuzzing with 10 workers
fuzz: elapsed: 6s, execs: 3044 (946/sec), new interesting: 0 (total: 209)
fuzz: elapsed: 9s, execs: 16393 (4450/sec), new interesting: 0 (total: 209)
fuzz: elapsed: 12s, execs: 24930 (2845/sec), new interesting: 0 (total: 209)
fuzz: elapsed: 15s, execs: 26633 (568/sec), new interesting: 0 (total: 209)
fuzz: elapsed: 17s, execs: 26897 (108/sec), new interesting: 0 (total: 209)
--- FAIL: FuzzAppendOrMergeRPC (17.46s)
    fuzzing process hung or terminated unexpectedly: exit status 2
    Failing input written to testdata/fuzz/FuzzAppendOrMergeRPC/fb45983a3dbe24d8
    To re-run:
    go test -run=FuzzAppendOrMergeRPC/fb45983a3dbe24d8
=== NAME
FAIL
exit status 1
FAIL	github.com/libp2p/go-libp2p-pubsub	17.895s

The same thing happens on my feature branch, so I can't tell whether it is a new or pre-existing issue.

I explained where the "1+" comes from (it is the field key size from the protobuf generator) and added a benchmark.
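
For reference, this is roughly the shape of the loop the protobuf generator emits for Size() over a repeated embedded field (paraphrased, not copied from this repo's generated pb code); the leading 1 is the field key byte for a field number below 16:

```go
// Paraphrased generated-code shape; sovRpc is the generator's varint-length
// helper (equivalent to varintLen in the sketch above).
for _, e := range m.Publish {
	l := e.Size()
	n += 1 + l + sovRpc(uint64(l))
}
```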

@algorandskiy algorandskiy marked this pull request as ready for review May 20, 2025 21:16
@algorandskiy algorandskiy requested a review from MarcoPolo May 20, 2025 21:16
@algorandskiy
Contributor Author

@MarcoPolo could you retrigger the testing job? TestMessageBatchPublish timed out, which has never happened on my local machines.

@algorandskiy
Contributor Author

Okay, here it is:

=== RUN   TestMessageBatchPublish
2025/05/22 00:14:32 failed to sufficiently increase receive buffer size (was: 1024 kiB, wanted: 7168 kiB, got: 2048 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details.
panic: test timed out after 10m0s
	running tests:
		TestMessageBatchPublish (10m0s)

@MarcoPolo
Contributor

A couple of things to respond to:

  • Thanks for the bump here. I've been working on a refactor of this code as well, and I think we can combine our efforts. I'll push a PR soon, could I ask you to review it?

  • The fuzz failure is concerning. I can't seem to reproduce it; could you reach out via email so we can discuss the details?

  • I'll debug the test timeout a bit. Our current testing strategy is pretty flaky in this repo, so it might just be that. We can solve it by migrating to synctest + simulated networks: feat(simconn): Simulated Networks go-libp2p#3262

@algorandskiy
Contributor Author

> could I ask you to review it?

Absolutely!

> can you reach out via email so we can discuss details here?

Sent an email with the details.

	}
	return RPC{
		RPC: pb.RPC{
			Publish: msgs,
Contributor


In your workload, do you see RPCs being split primarily due to many messages in a single RPC? I ask because we could add some optimizations if so.
