WS: Hardware-accelerated WebSocket masking via SIMD #251
Open
Sh3llcod3 wants to merge 11 commits into lexiforest:main from
Author
@lexiforest Almost there with the pipeline; I think we are getting close.
Author
@lexiforest Haha, I think that's fixed the issue, though the pipeline needs another re-run. You know it's an interesting day when GNU is down...
Owner
GNU libunistring is so unique (annoying, actually); it also stands in the way of my migration to CMake. I was considering prebuilding a binary in another repository.
- The real goal is to re-run the pipeline
Author
Yeah, I think it looks good from my end. It would be good if you tested it as well, in case something crops up. I'll bring out the old Raspberry Pi 4 and make sure the NEON SIMD path works as well (an Apple M-series chip can probably test this better).
Owner
Thanks, I will review it in the next couple of days.
Author
I've improved the SIMD gate to support as many CPUs and OSes as I could.

Overview
@lexiforest When I took cProfile captures of WebSocket send benchmarks, I found a severe CPU bottleneck in libcurl's code. The `ws_enc_write_payload` function applies RFC 6455 XOR masking byte-by-byte and invokes `Curl_bufq_write`, capping transmit speeds.
Changes
- […] (`xbuf`) to 64-byte cache lines to reduce memory latency penalties.
- Increased `WS_CHUNK_SIZE` from 64KB to 128KB to reduce `recv()`/`send()` syscall overhead.
- Fixed a bug in `ws_flush` where partial-write progress was dropped if the socket returned `CURLE_AGAIN` on the same cycle, preventing frame corruption.
- Updated `ws_send_raw_blocking` so that it accurately reports partial bytes sent to the callback layer if a connection dies mid-stream.
- Updated `ws_cw_write` to ensure the decoder is cleanly reset before accepting stream termination.
This PR accompanies the lexiforest/curl_cffi#749 PR.
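To illustrate why byte-by-byte masking is costly, here is a minimal sketch that contrasts it with a wider scalar loop. It is not the code from this PR: the function names `ws_mask_bytewise` and `ws_mask_wordwise` and the `offset` parameter are hypothetical, chosen only for illustration. Both functions apply the RFC 6455 rule that payload byte `i` is XORed with mask byte `i mod 4`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference: RFC 6455 masking, one byte at a time.
   `offset` is the number of payload bytes already masked in this frame. */
static void ws_mask_bytewise(unsigned char *buf, size_t len,
                             const unsigned char mask[4], size_t offset)
{
  size_t i;
  for(i = 0; i < len; i++)
    buf[i] ^= mask[(offset + i) & 3];
}

/* Hypothetical wider scalar variant: XOR 8 bytes per iteration. */
static void ws_mask_wordwise(unsigned char *buf, size_t len,
                             const unsigned char mask[4], size_t offset)
{
  unsigned char rot[8];
  uint64_t key;
  size_t i;

  /* Rotate the key so rot[0] lines up with the current frame offset,
     then repeat it to fill 8 bytes. */
  for(i = 0; i < 8; i++)
    rot[i] = mask[(offset + i) & 3];
  memcpy(&key, rot, sizeof(key));

  /* memcpy keeps unaligned access well-defined; compilers typically
     lower it to a plain load/store. */
  while(len >= 8) {
    uint64_t word;
    memcpy(&word, buf, sizeof(word));
    word ^= key;
    memcpy(buf, &word, sizeof(word));
    buf += 8;
    len -= 8;
  }

  /* Tail: 8 is a multiple of 4, so rot[] is still in phase. */
  for(i = 0; i < len; i++)
    buf[i] ^= rot[i & 3];
}
```

Because XOR is its own inverse, calling either function twice with the same mask and offset restores the original payload, which makes the two variants easy to cross-check against each other.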
Impact
Once the patch is applied, WebSocket sends are no longer CPU-bound and transmit speeds improve substantially (more than 10x on my server). The send side now provides multi-gigabit throughput.
In fact, the bottleneck is now AIOHTTP!
When acting as the receiving server, AIOHTTP is pinned at 100% CPU.
AVX-512 can cause downclocking on some CPUs, but even with the clock drop performance should remain good. The best part is that the patches are written in a highly portable way: if your CPU does not support AVX-512, AVX2, or NEON SIMD instructions, the code falls back to a fast scalar loop that is still much faster than the original code. CPU feature detection is done via runtime dispatch, so the library can be statically linked and run as normal.
No additional compiler flags or build step changes are needed.
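For readers unfamiliar with runtime dispatch, the following is a rough sketch of the general pattern, not the gate used in this PR. It assumes GCC or Clang on x86 and covers only an AVX2 path plus a scalar fallback; the names `mask_fn`, `mask_scalar`, `mask_avx2`, `select_mask_fn`, `ws_mask`, and the `SKETCH_HAVE_AVX2` macro are all hypothetical. The per-function `target` attribute is one way to compile a SIMD kernel without passing `-mavx2` to the whole build.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX2 1
#endif

typedef void (*mask_fn)(unsigned char *buf, size_t len,
                        const unsigned char mask[4]);

/* Portable fallback. Assumes masking starts at frame offset 0. */
static void mask_scalar(unsigned char *buf, size_t len,
                        const unsigned char mask[4])
{
  size_t i;
  for(i = 0; i < len; i++)
    buf[i] ^= mask[i & 3];
}

#ifdef SKETCH_HAVE_AVX2
/* Compiled for AVX2 via a per-function target attribute, so the rest of
   the translation unit needs no -mavx2 flag. */
static __attribute__((target("avx2")))
void mask_avx2(unsigned char *buf, size_t len, const unsigned char mask[4])
{
  int32_t key32;
  __m256i key;
  size_t done = 0;

  memcpy(&key32, mask, sizeof(key32));
  key = _mm256_set1_epi32(key32);  /* 4-byte key repeated 8 times */

  for(; done + 32 <= len; done += 32) {
    __m256i v = _mm256_loadu_si256((const __m256i *)(buf + done));
    v = _mm256_xor_si256(v, key);
    _mm256_storeu_si256((__m256i *)(buf + done), v);
  }
  /* done is a multiple of 32, hence of 4: the key is still in phase. */
  mask_scalar(buf + done, len - done, mask);
}
#endif

/* Pick the best implementation the running CPU supports. */
static mask_fn select_mask_fn(void)
{
#ifdef SKETCH_HAVE_AVX2
  __builtin_cpu_init();
  if(__builtin_cpu_supports("avx2"))
    return mask_avx2;
#endif
  return mask_scalar;
}

static void ws_mask(unsigned char *buf, size_t len,
                    const unsigned char mask[4])
{
  static mask_fn impl;  /* resolved on first call; not thread-safe here */
  if(!impl)
    impl = select_mask_fn();
  impl(buf, len, mask);
}
```

A caller would only ever invoke `ws_mask(buf, len, mask)`; on a CPU or compiler without AVX2 support the same binary silently takes the scalar path. A production version would also need a thread-safe one-time initialisation and equivalent gates for AVX-512 and NEON.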