executor: improve startup time on machines with many CPUs #6482
I observed that on machines with many CPUs (480 on my setup), fuzzing with a handful of procs (8 on my setup) would consistently fail to start because the syz-executors did not respond within the default handshake timeout of 1 minute. Reducing procs to 4 worked around it, but that seems ridiculous on such a powerful machine.
As part of the default sandbox policy, each syz-executor creates a large number of virtual network interfaces (16 with my kernel config, probably more with other kernels). This step dominates the executor startup time and was clearly responsible for the handshake timeouts that prevented me from fuzzing.
When fuzzing or reproducing with procs > 1, all executors run their sandbox setup in parallel. Network interfaces are created through socket operations against the RTNL (routing netlink) subsystem. Unfortunately, all RTNL operations in the kernel are serialized by a single big lock, rtnl_mutex, so instead of creating the 8*16 interfaces in parallel, the operations are effectively serialized, and the time it takes to set up the default sandbox for one executor scales linearly with the number of executors started "in parallel". This is inherent to rtnl_mutex in the kernel and, as far as I can tell, there is nothing we can do about it.
However, it makes it very important that each critical section guarded by rtnl_mutex stays short and snappy, to avoid long waits on the lock. Unfortunately, the default behavior when creating a virtual network interface is to create one RX and one TX queue per CPU. Each queue is associated with a sysfs file whose creation is quite slow and goes through various code paths that take a long time. This means that each critical section scales linearly with the number of CPUs on the host.
For example, on my setup, starting fuzzing took 2 minutes 25 seconds. I found that I could bring this down to 10 seconds (a 15x faster startup!) by limiting the number of RX and TX queues created per virtual interface to 2 using the IFLA_NUM_*X_QUEUES RTNL attributes. I chose 2 to try to keep coverage of the code that exercises multiple queues, but I have no evidence that choosing 1 would actually reduce coverage.
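
To make the mechanism concrete, here is a rough sketch of what such a request looks like on the wire. This is not the actual executor code; `put_attr` and `create_link_with_few_queues` are illustrative names, and error handling is trimmed:

```c
#include <linux/if_link.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Append one netlink attribute to an RTM_* message being built in a flat buffer.
static void put_attr(struct nlmsghdr* nlh, int type, const void* data, int len)
{
	struct rtattr* rta = (struct rtattr*)((char*)nlh + NLMSG_ALIGN(nlh->nlmsg_len));
	rta->rta_type = type;
	rta->rta_len = RTA_LENGTH(len);
	if (len)
		memcpy(RTA_DATA(rta), data, len);
	nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + RTA_ALIGN(rta->rta_len);
}

// Create a link of the given kind (e.g. "dummy" or "bridge") with only
// 2 RX and 2 TX queues instead of the default one-per-CPU.
int create_link_with_few_queues(const char* name, const char* kind)
{
	char buf[512] __attribute__((aligned(8))) = {0};
	struct nlmsghdr* nlh = (struct nlmsghdr*)buf;
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
	nlh->nlmsg_type = RTM_NEWLINK;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL | NLM_F_ACK;

	unsigned int nqueues = 2; // cap the per-direction queue count
	put_attr(nlh, IFLA_IFNAME, name, strlen(name) + 1);
	put_attr(nlh, IFLA_NUM_TX_QUEUES, &nqueues, sizeof(nqueues));
	put_attr(nlh, IFLA_NUM_RX_QUEUES, &nqueues, sizeof(nqueues));

	// Nested IFLA_LINKINFO { IFLA_INFO_KIND = kind }; the nested attribute
	// length is patched up after its content has been appended.
	struct rtattr* linkinfo = (struct rtattr*)(buf + NLMSG_ALIGN(nlh->nlmsg_len));
	put_attr(nlh, IFLA_LINKINFO, NULL, 0);
	put_attr(nlh, IFLA_INFO_KIND, kind, strlen(kind));
	linkinfo->rta_len = buf + nlh->nlmsg_len - (char*)linkinfo;

	int sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	if (sock == -1)
		return -1;
	struct sockaddr_nl addr = {0};
	addr.nl_family = AF_NETLINK;
	ssize_t sent = sendto(sock, buf, nlh->nlmsg_len, 0, (struct sockaddr*)&addr, sizeof(addr));
	close(sock);
	return sent == (ssize_t)nlh->nlmsg_len ? 0 : -1;
}
```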
As far as I can tell, reducing the number of queues would only be problematic in a high-performance networking scenario, which doesn't matter for fuzzing in a namespace with a single process, so this seems like a fair trade-off to me. Ultimately, it lets me start many more parallel executors and take better advantage of my beefy machine.
Technical detail for review: creating a veth device actually creates two interfaces, one for each side of the virtual Ethernet link, so both sides need to be configured with a low number of queues.
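
As a sketch of what that means at the netlink level (again illustrative, not the actual change, reusing the hypothetical `put_attr` helper from the snippet above): the peer side is described in a nested IFLA_LINKINFO -> IFLA_INFO_DATA -> VETH_INFO_PEER attribute whose payload is a struct ifinfomsg followed by regular IFLA_* attributes, so the queue caps have to appear both in the outer message and inside the peer section:

```c
#include <linux/veth.h>

// Build the nested veth link info: IFLA_LINKINFO -> IFLA_INFO_DATA ->
// VETH_INFO_PEER. The queue caps are applied to the peer here; the outer
// RTM_NEWLINK message carries the same attributes for the first side.
static void put_veth_linkinfo(struct nlmsghdr* nlh, const char* peer_name, unsigned int nqueues)
{
	char* base = (char*)nlh;
	struct rtattr* linkinfo = (struct rtattr*)(base + NLMSG_ALIGN(nlh->nlmsg_len));
	put_attr(nlh, IFLA_LINKINFO, NULL, 0);
	put_attr(nlh, IFLA_INFO_KIND, "veth", strlen("veth"));

	struct rtattr* infodata = (struct rtattr*)(base + NLMSG_ALIGN(nlh->nlmsg_len));
	put_attr(nlh, IFLA_INFO_DATA, NULL, 0);
	struct rtattr* peer = (struct rtattr*)(base + NLMSG_ALIGN(nlh->nlmsg_len));
	put_attr(nlh, VETH_INFO_PEER, NULL, 0);

	// The peer payload starts with its own ifinfomsg header.
	struct ifinfomsg peer_ifi = {0};
	memcpy(base + nlh->nlmsg_len, &peer_ifi, sizeof(peer_ifi));
	nlh->nlmsg_len += sizeof(peer_ifi);

	// Peer-side attributes: its name and the same low queue counts.
	put_attr(nlh, IFLA_IFNAME, peer_name, strlen(peer_name) + 1);
	put_attr(nlh, IFLA_NUM_TX_QUEUES, &nqueues, sizeof(nqueues));
	put_attr(nlh, IFLA_NUM_RX_QUEUES, &nqueues, sizeof(nqueues));

	// Patch up the nested attribute lengths, innermost first.
	peer->rta_len = base + nlh->nlmsg_len - (char*)peer;
	infodata->rta_len = base + nlh->nlmsg_len - (char*)infodata;
	linkinfo->rta_len = base + nlh->nlmsg_len - (char*)linkinfo;
}
```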