TLDR: in a binary stream, `Hybi#emit_message` converts the data from a binary string to a byte array. This adds up to quite a lot of work when processing a very high-throughput stream, and it's especially wasteful if the API user actually wants to receive a string and needs to convert the array back to a string using `pack('C*')` (not sure if this is the best way). It also creates unnecessary temporary data, which adds work for the GC.

This PR adds an option `:binary_data_format => :array / :string` (defaulting to `:array`); if you set it to `:string`, then `Hybi#emit_message` just passes the string it has through to the message as is. There will be a second PR to faye-websocket related to this that I'll add in a moment.

In my case, this change can result in as much as a 200-300% speedup (!) (measured when doing literally nothing more than reading and discarding data from the websocket; in the actual app it will obviously be less, with database calls etc., since reading the streamed data is a smaller % of the total work).
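For illustration, here's a rough sketch of how the new option could be used at the websocket-driver level (the `binary_data_format:` key is the option this PR proposes; the socket wrapper and read loop are just a schematic version of the gem's usual client usage, not code from this PR):

```ruby
require 'socket'
require 'websocket/driver'

# websocket-driver expects a socket-like object responding to #url and #write
class SocketWrapper
  attr_reader :url

  def initialize(url, io)
    @url = url
    @io = io
  end

  def write(data)
    @io.write(data)
  end
end

io = TCPSocket.new('example.com', 80)
socket = SocketWrapper.new('ws://example.com/stream', io)

# With :binary_data_format => :string, binary frames are delivered as a BINARY-encoded
# String; the default :array keeps the current behaviour (an Array of byte integers)
driver = WebSocket::Driver.client(socket, binary_data_format: :string)

driver.on :message do |event|
  event.data  # => String, ready to hand to a decoder without a pack('C*') round trip
end

driver.start

loop do
  driver.parse(io.readpartial(4096))
end
```

Nothing else about the driver's usage changes; only the type of `event.data` for binary messages depends on the option.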
Backstory: I have a number of projects using the Bluesky API / AT Protocol, including the skyfall Ruby gem for streaming from their streaming APIs, built on Faye. The Bluesky "firehose" API emits all "events" happening anywhere in the network, i.e. any new/deleted post, like, follow and so on. It also includes quite a lot of overhead, because there's some low-level data there that might be needed by some clients, but in practice usually isn't.

The current traffic (which varies during the day) is around 500 events/second, or around 20 Mbit/s. But we already had sustained traffic of almost 2000 evt/s for a little bit back in November, and both Bluesky's and third-party devs' code was often struggling then. If the network keeps growing, the traffic is expected to grow again, so I've been trying to optimize my code to prepare for that (incl. migrating from SQLite to Postgres).

I was recently profiling the processing code and checking where I could make some more speed gains. At the moment I can push it to around 4000 events/second, potentially to 5-6k, so I have some room to grow.

When I ran a profiler, I noticed that most of the time was spent inside the websocket library, specifically on converting strings to arrays and back. I dug into the code and noticed that… the websocket driver has the data as a binary string, converts it to a byte array, passes that array to me, and I then need to convert it back to a binary string in order to pass it to the decoder (CBOR). How much could I save if the string was just passed to me without conversion? Turns out, a lot!
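As a concrete illustration of that round trip (schematic code, not taken from skyfall; `CBOR.decode`/`CBOR.encode` are from the cbor gem, the handler names and sample payload are made up):

```ruby
require 'cbor'

# Current behaviour (:array): the driver has already turned its binary String into
# an Array of byte integers, so the consumer immediately turns it back…
def handle_binary_message_as_array(data)
  string = data.pack('C*')   # Array of Integers back to a BINARY String
  CBOR.decode(string)        # the decoder wanted a String all along
end

# With :binary_data_format => :string, the intermediate Array disappears:
def handle_binary_message_as_string(data)
  CBOR.decode(data)          # data is already a BINARY String
end

frame = CBOR.encode({ 'op' => 1, 'seq' => 42 })  # sample binary payload

handle_binary_message_as_array(frame.bytes)   # what you get today
handle_binary_message_as_string(frame)        # what you get with :string
```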
In this PR, I've added an option `:binary_data_format => :array / :string` (defaulting to `:array`); if you set it to `:string`, then `Hybi#emit_message` just passes the string it has through to the message as is. There will be a second PR to faye-websocket related to this that I'll add in a moment, which just passes this option through from the API user down to the driver (a rough sketch of what that could look like is below, after the test instructions).

I made some simple test code using my library that you can run to test this: https://github.com/mackuba/websocket-fix/blob/master/websocket_test.rb. Try it with e.g. 100,000 or 200,000 iterations (first arg) and with and without `-f` (second arg).

Note: to see the difference in processing speed, you need a connection of at least ~200 Mbit/s, so ideally run this on some kind of server. If you max out your network connection's speed, then you will see a difference in CPU use instead.
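For completeness, here's what the faye-websocket side might look like once the companion PR lands (hedged sketch: the way the option is forwarded is my proposal and not merged code; the client API itself is standard faye-websocket usage):

```ruby
require 'eventmachine'
require 'faye/websocket'

EM.run do
  # Assumption: the second PR forwards :binary_data_format from the client
  # options straight down to the underlying websocket-driver instance.
  ws = Faye::WebSocket::Client.new('wss://example.com/stream', nil,
                                   binary_data_format: :string)

  ws.on :message do |event|
    # Binary frames arrive as a BINARY-encoded String instead of an Array of
    # byte integers, so they can go straight to the app's decoder.
    event.data
  end

  ws.on :close do |_event|
    EM.stop
  end
end
```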
Results of the test for me, first on my home connection:
```
% time ./websocket_test.rb 400000
Running in normal mode:
Connected (0.834173s to connect)
Estimated processing speed: 4394.1 evt/s
GC stats: {count: 681, time: 1145, marking_time: 80, sweeping_time: 1065, heap_allocated_pages: 93, heap_empty_pages: 0, heap_allocatable_slots: 0, heap_available_slots: 90335, heap_live_slots: 64919, heap_free_slots: 25416, heap_final_slots: 0, heap_marked_slots: 64877, heap_eden_pages: 93, total_allocated_pages: 93, total_freed_pages: 0, total_allocated_objects: 14701400, total_freed_objects: 14636481, malloc_increase_bytes: 976, malloc_increase_bytes_limit: 32225676, minor_gc_count: 672, major_gc_count: 9, compact_count: 0, read_barrier_faults: 0, total_moved_objects: 0, remembered_wb_unprotected_objects: 0, remembered_wb_unprotected_objects_limit: 647, old_objects: 64761, old_objects_limit: 129522, oldmalloc_increase_bytes: 976, oldmalloc_increase_bytes_limit: 40928508}
./websocket_test.rb 400000  50.64s user 1.89s system 57% cpu 1:32.03 total

% time ./websocket_test.rb 400000 -f
Running in FAST mode:
Connected (0.839828s to connect)
Estimated processing speed: 4502.4 evt/s
GC stats: {count: 517, time: 2025, marking_time: 222, sweeping_time: 1803, heap_allocated_pages: 99, heap_empty_pages: 0, heap_allocatable_slots: 0, heap_available_slots: 95660, heap_live_slots: 64920, heap_free_slots: 30740, heap_final_slots: 0, heap_marked_slots: 64878, heap_eden_pages: 99, total_allocated_pages: 99, total_freed_pages: 0, total_allocated_objects: 13500383, total_freed_objects: 13435463, malloc_increase_bytes: 976, malloc_increase_bytes_limit: 16777216, minor_gc_count: 512, major_gc_count: 5, compact_count: 0, read_barrier_faults: 0, total_moved_objects: 0, remembered_wb_unprotected_objects: 0, remembered_wb_unprotected_objects_limit: 647, old_objects: 64762, old_objects_limit: 129524, oldmalloc_increase_bytes: 976, oldmalloc_increase_bytes_limit: 19737900}
./websocket_test.rb 400000 -f  21.76s user 4.79s system 29% cpu 1:29.86 total
```
Results when running on a server:
```
$ time ./websocket_test.rb 400000
Running in normal mode:
Connected (0.685806801s to connect)
Estimated processing speed: 6559.5 evt/s
GC stats: {count: 674, time: 959, marking_time: 164, sweeping_time: 795, heap_allocated_pages: 84, heap_empty_pages: 0, heap_allocatable_slots: 0, heap_available_slots: 77650, heap_live_slots: 52010, heap_free_slots: 25640, heap_final_slots: 0, heap_marked_slots: 51968, heap_eden_pages: 84, total_allocated_pages: 84, total_freed_pages: 0, total_allocated_objects: 14677645, total_freed_objects: 14625635, malloc_increase_bytes: 976, malloc_increase_bytes_limit: 29129457, minor_gc_count: 666, major_gc_count: 8, compact_count: 0, read_barrier_faults: 0, total_moved_objects: 0, remembered_wb_unprotected_objects: 0, remembered_wb_unprotected_objects_limit: 518, old_objects: 51852, old_objects_limit: 103704, oldmalloc_increase_bytes: 976, oldmalloc_increase_bytes_limit: 34107091}

real    1m1.718s
user    0m57.342s
sys     0m2.746s

$ time ./websocket_test.rb 400000 -f
Running in FAST mode:
Connected (0.679570876s to connect)
Estimated processing speed: 26389.5 evt/s
GC stats: {count: 570, time: 618, marking_time: 121, sweeping_time: 496, heap_allocated_pages: 86, heap_empty_pages: 0, heap_allocatable_slots: 0, heap_available_slots: 79696, heap_live_slots: 52011, heap_free_slots: 27685, heap_final_slots: 0, heap_marked_slots: 51969, heap_eden_pages: 86, total_allocated_pages: 86, total_freed_pages: 0, total_allocated_objects: 13481274, total_freed_objects: 13429263, malloc_increase_bytes: 976, malloc_increase_bytes_limit: 16777216, minor_gc_count: 565, major_gc_count: 5, compact_count: 0, read_barrier_faults: 0, total_moved_objects: 0, remembered_wb_unprotected_objects: 0, remembered_wb_unprotected_objects_limit: 518, old_objects: 51853, old_objects_limit: 103706, oldmalloc_increase_bytes: 976, oldmalloc_increase_bytes_limit: 19737900}

real    0m15.885s
user    0m7.368s
sys     0m1.860s
```
(~25k events/s is comparable to what I got when testing a Rust library doing the same thing!)