
WS: Hardware-accelerated WebSocket masking via SIMD #251

Open

Sh3llcod3 wants to merge 11 commits into lexiforest:main from Sh3llcod3:ws-send-patch

Conversation

Sh3llcod3 commented Apr 23, 2026

Overview

@lexiforest When I took cProfile captures of WebSocket send benchmarks, I found a severe CPU bottleneck in libcurl's WebSocket code: ws_enc_write_payload applies RFC 6455 XOR masking byte by byte before invoking Curl_bufq_write, which caps transmit speeds:

[image: cProfile capture of the WebSocket send benchmark]
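
For context, RFC 6455 masking XORs every payload byte with a rotating 4-byte key, so a scalar implementation pays one XOR and one key lookup per byte. A simplified loop with the same shape as the hot path (an illustration, not the actual libcurl code) looks roughly like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative byte-by-byte RFC 6455 masking loop (not libcurl's code).
 * `offset` tracks the position within the 4-byte mask across calls. */
static void ws_mask_scalar(uint8_t *buf, size_t len,
                           const uint8_t mask[4], size_t *offset)
{
  size_t i;
  for(i = 0; i < len; i++)
    buf[i] ^= mask[(*offset + i) & 3];
  *offset = (*offset + len) & 3;
}
```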

Changes

  • SIMD Hardware Acceleration: To fix this, I added AVX-512, AVX2, and ARM NEON vectorized XOR masking (a simplified sketch follows this list). Fallback scalar paths are optimized and gated safely behind macros for runtime dispatch, maintaining static-binary portability. I've also aligned the XOR buffer (xbuf) to 64-byte cache lines to reduce memory-latency penalties.
  • Increased Buffer Size: Increased WS_CHUNK_SIZE from 64KB to 128KB to reduce recv()/send() syscall overhead.
  • State Machine Hardening:
    • Fixed an issue in ws_flush where partial-write progress was dropped if the socket returned CURLE_AGAIN on the same cycle, preventing frame corruption.
    • Fixed ws_send_raw_blocking so that it accurately reports partial bytes sent to the callback layer if a connection dies mid-stream.
    • Improved End-of-Stream validation in ws_cw_write to ensure the decoder is cleanly reset before accepting stream termination.
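
To make the SIMD idea concrete, here is a minimal sketch of an AVX2 masking path with a scalar tail, compiled per-function via a GCC/Clang target attribute so no global build flags are required. The function name and the assumption that the mask offset is zero on entry are mine; the actual patch also covers AVX-512 and NEON and handles arbitrary offsets and the 64-byte-aligned xbuf:

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of an AVX2 masking path (hypothetical, not the patch itself).
 * Assumes the mask offset is zero on entry, so broadcasting the 4-byte
 * key across a 256-bit register lines up with the payload. */
#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("avx2")))
#endif
static void ws_mask_avx2(uint8_t *buf, size_t len, const uint8_t mask[4])
{
  uint32_t key;
  __m256i vmask;
  size_t i = 0;

  memcpy(&key, mask, 4);
  vmask = _mm256_set1_epi32((int32_t)key);

  /* XOR 32 payload bytes per iteration. */
  for(; i + 32 <= len; i += 32) {
    __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
    chunk = _mm256_xor_si256(chunk, vmask);
    _mm256_storeu_si256((__m256i *)(buf + i), chunk);
  }

  /* Scalar tail for the remaining bytes. */
  for(; i < len; i++)
    buf[i] ^= mask[i & 3];
}
```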

This PR accompanies the lexiforest/curl_cffi#749 PR.

Impact

Once the patch is applied, WebSocket transmit speeds improve dramatically.

Sending is no longer CPU-bound, and the gains are large (more than 10x on my server); the send side now reaches multi-gigabit throughput:

[image: send-side throughput benchmark after the patch]

In fact, the bottleneck is now AIOHTTP!

[image: benchmark with AIOHTTP as the receiving server]

When acting as the receiving server, AIOHTTP is pinned at 100% CPU.

AVX-512 can cause downclocking on some CPUs, but even with the clock drop, performance remains strong. The best part is that the patches are written in a highly portable way: if your CPU does not support AVX-512/AVX2/NEON SIMD instructions, the code falls back to a fast scalar loop that is still much faster than the original. CPU feature detection is done via runtime dispatch, so the library can still be statically linked and run as normal.

No additional compiler flags or build step changes are needed.
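
As a rough illustration of that runtime dispatch on x86 with GCC/Clang (helper names and signatures are hypothetical; the real gate in the patch is broader and also covers ARM/NEON and other toolchains and OSes):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-ISA helpers, selected once the CPU has been probed. */
void ws_mask_avx512(uint8_t *buf, size_t len, const uint8_t mask[4]);
void ws_mask_avx2(uint8_t *buf, size_t len, const uint8_t mask[4]);
void ws_mask_scalar(uint8_t *buf, size_t len, const uint8_t mask[4]);

/* Pick the widest SIMD path the running CPU supports, falling back to
 * the scalar loop.  Works in a statically linked binary because the
 * check happens at run time, not at build time. */
void ws_mask_dispatch(uint8_t *buf, size_t len, const uint8_t mask[4])
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  if(__builtin_cpu_supports("avx512f"))
    ws_mask_avx512(buf, len, mask);
  else if(__builtin_cpu_supports("avx2"))
    ws_mask_avx2(buf, len, mask);
  else
#endif
    ws_mask_scalar(buf, len, mask);
}
```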

Sh3llcod3 (Author)

@lexiforest Almost there with the pipeline, I think we are getting close.

Sh3llcod3 (Author) commented Apr 25, 2026

@lexiforest Haha - I think that's fixed the issue, though the pipeline needs another re-run:

curl: (28) Failed to connect to ftp.gnu.org port 443 after 136195 ms: Couldn't connect to server
make: *** [Makefile:422: libidn2-2.3.7.tar.gz] Error 28
Error: Process completed with exit code 2.

You know it's an interesting day when GNU is down...

Speeds are good, no regressions:
[image: benchmark after the pipeline fix, showing no regressions]

lexiforest (Owner) commented Apr 25, 2026

GNU libunistring is so unique (annoying, actually); it also stands in the way of my migration to CMake. I was considering prebuilding a binary in another repository.

  • The real goal is to re-run the pipeline.

Sh3llcod3 (Author) commented Apr 25, 2026

Yeah - I think it looks good from my end; it would be good if you test as well, in case something crops up. I'll bring out the old Raspberry Pi 4 and make sure the NEON SIMD path works too (an Apple M-series chip could probably test this better).

lexiforest (Owner)

Thanks, I will review it in the next couple of days.

Sh3llcod3 (Author) commented Apr 30, 2026

I've improved the SIMD gate to support as many CPUs and OSes as I could.
