RPC/Serialization Overhead/Delay #1850
Hey @L3tum 👋
Hey @rustatian! Here's a small (2MB zipped) repro. I stripped it down from our existing project, so the configuration is (mostly) identical, and there aren't any differences in the versions employed, so it's as 1:1 as I can give you. There's a Dockerfile with two targets included, as well as a docker-compose.yaml which also starts webgrind. The whole thing should be runnable locally as well, though. If you run it either with the Dockerfile's Dev target or locally, you'll need to install the composer dependencies manually; a simple `composer install` should do it.

I've played around a bit with non-blocking IO. It's a sore topic for PHP, obviously, and I didn't rip everything out and use a framework like AMPHP for it, but I did implement two multi-socket RPC variants, one of which (NiceMultiRPC) pre-connects its sockets.

I tested the ideal number of sockets each would create and noticed that 5-10 sockets is apparently the sweet spot (for that test, anyway). NiceMultiRPC pre-connects the sockets and can scale to more sockets than that (I usually used 50). I guess past about 10 sockets, reusing an already-connected socket outweighs the delay of having to connect one on demand. Anyway, with these two RPC implementations I've managed to cut the test time down from 1ms to 0.07ms :)

FYI I've also added a …
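For context, the gist of a pre-connected pool like that is roughly the following. This is a minimal sketch, not the actual NiceMultiRPC code: the class and method names are made up, and it assumes the stock `Spiral\Goridge\RPC\RPC` client underneath.

```php
<?php

use Spiral\Goridge\RPC\RPC;
use Spiral\Goridge\RPC\RPCInterface;

// Hypothetical round-robin pool over N pre-connected Goridge RPC connections.
// The idea is to pay the tcp:// or unix:// connect cost once, up front,
// instead of once per call.
final class RoundRobinRpcPool
{
    /** @var RPCInterface[] */
    private array $connections = [];
    private int $next = 0;

    public function __construct(string $address, int $size = 10)
    {
        for ($i = 0; $i < $size; $i++) {
            // The usual "tcp://127.0.0.1:6001" / "unix://..." DSNs.
            $this->connections[] = RPC::create($address);
        }
    }

    public function call(string $method, mixed $payload): mixed
    {
        // Plain round-robin; a real implementation would also have to track
        // in-flight sockets and reconnect on failure.
        $rpc = $this->connections[$this->next];
        $this->next = ($this->next + 1) % \count($this->connections);

        return $rpc->call($method, $payload);
    }
}
```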

Heya, I'm mainly looking for prior work to see if anybody has actually measured this properly, because I'm doubting my own numbers.
I've been on the hunt for some performance issues and noticed that our Symfony `kernel.terminate` EventListener takes ~2ms in prod (so with OPcache, JIT, cache warmer and what not). However, we only collect metrics there and don't do anything else; I even checked, and there aren't any other listeners or hidden things executed.

Curious to see why it takes so long, I thought I'd "profile" (I use that term very loosely here) the `Metrics` class, since it sends off some RPC calls and does some serialization. I've made a basic `MetricsProfiler` that I inject with a `CompilerPass`. The `MetricsProfiler` is very simple, just the following for each method:
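A minimal sketch of what that decorator looks like, with illustrative names; it wraps Spiral's `MetricsInterface` and just times each delegated call (only `add()` and `observe()` are shown, the remaining methods follow the same pattern):

```php
<?php

use Psr\Log\LoggerInterface;
use Spiral\RoadRunner\Metrics\MetricsInterface;

// Illustrative profiling decorator: every call is delegated to the real
// metrics service and the elapsed wall-clock time is logged in milliseconds.
final class MetricsProfiler
{
    public function __construct(
        private MetricsInterface $inner,
        private LoggerInterface $logger,
    ) {
    }

    public function add(string $name, float $value, array $labels = []): void
    {
        $start = \hrtime(true);
        $this->inner->add($name, $value, $labels);
        $this->logger->info(sprintf('Add took %.4g ms', (\hrtime(true) - $start) / 1e6));
    }

    public function observe(string $name, float $value, array $labels = []): void
    {
        $start = \hrtime(true);
        $this->inner->observe($name, $value, $labels);
        $this->logger->info(sprintf('Observe took %.4g ms', (\hrtime(true) - $start) / 1e6));
    }

    // ...same pattern for the remaining MetricsInterface methods.
}
```

The `CompilerPass` presumably only needs to register something like this as a decorator around the original metrics service.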
The resulting log entries for a single request:

- Matched route "api"
- Add took 0.42 ms
- Add took 0.406 ms
- Add took 0.3086 ms
- INFO http http log {"status": 200, "method": "POST", "URI": "/", "remote_address": "192.168.16.1:51092", "read_bytes": 646, "write_bytes": 216, "start": "2024-02-01T17:26:14+0000", "elapsed": "32.5789ms"}
- === The following is basically all kernel.terminate EventListener stuff ===
- Add took 0.9934 ms
- Observe took 0.2744 ms
- Add took 0.5614 ms
- Add took 0.516 ms
- Add took 0.7812 ms
- Add took 0.3773 ms
- Add took 0.6843 ms
- Add took 0.5864 ms
- Add took 0.3434 ms
- Add took 0.7013 ms
- Add took 0.6131 ms
- Add took 0.2735 ms
- Add took 0.3319 ms
- Add took 0.3253 ms
- Add took 0.3438 ms
- Add took 0.4394 ms
- Add took 0.5912 ms
- Add took 0.4409 ms
- Add took 0.5453 ms
- Add took 0.5968 ms
- Observe took 0.4165 ms

Obviously, added together this is a bit more than 2ms, but it's also collected locally (on a frankly anemic laptop) without JIT (but with OPcache, `APP_ENV=prod`, `pool.debug=false`, and the works). Either way, this is entirely too long IMO, which is why I think something must be wrong on my end. But it's also the only way I can explain our issues with the `kernel.terminate` listener, because it does little else but this.

I've also run `xdebug.mode=profile` through this, and while I can't share the cachegrind file, here are the relevant screenshots from QCachegrind. If I understand its interface correctly, each "time unit" is 10 ns here, so if a call took 44000 "units" it'd be around 440000 ns, or 440 microseconds, or 0.4 ms, which supports my measurements above.
`RPC->call` has these Callees:

`RPC->decodeResponse` has these Callees:

Stepping into the Protobuf call stack confirms it's using the extension, no pure-PHP bullshit.
The worst seems to be the KV cache, though:

I'm not sure why that one is so slow in particular.
I've tried to look through the code but haven't found anything obviously amiss. There's some protobuf stuff I don't really know, but I do have the protobuf extension installed and loaded.
The gRPC extension is currently misbehaving, so it isn't loaded, but I also haven't seen any reference to RoadRunner needing it. The sockets extension is installed as well, though.
I really want to use the Metrics plugin, but this basically ruins our performance. One idea, if the RPC overhead is the issue, would be to batch-send the metrics, but I'm not sure how easily or quickly that could be done. It could also be that prometheus-go is just particularly slow, but that would still make the plugin a non-starter.
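For what it's worth, the PHP half of batch-sending could look something like the sketch below: a hypothetical wrapper (none of this exists in the SDK) that queues the operations so that a single `flush()` owns all of them; at that point they could be handed to a batch RPC endpoint if one ever exists, or fanned out over a pre-connected socket pool like the one above. As written, the flush still performs one RPC per queued entry.

```php
<?php

use Spiral\RoadRunner\Metrics\MetricsInterface;

// Hypothetical buffering wrapper: metric operations are queued during the
// request and replayed in one go from flush(), e.g. in kernel.terminate.
final class BufferedMetrics
{
    /** @var list<array{string, string, float, array}> */
    private array $queue = [];

    public function __construct(private MetricsInterface $inner)
    {
    }

    public function add(string $name, float $value, array $labels = []): void
    {
        $this->queue[] = ['add', $name, $value, $labels];
    }

    public function observe(string $name, float $value, array $labels = []): void
    {
        $this->queue[] = ['observe', $name, $value, $labels];
    }

    public function flush(): void
    {
        // Still one RPC round-trip per entry underneath; the batching win
        // would come from handing this whole queue to a single call.
        foreach ($this->queue as [$op, $name, $value, $labels]) {
            $this->inner->{$op}($name, $value, $labels);
        }

        $this->queue = [];
    }
}
```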