[RFC] sub/sd_sbr: use instanced rendering for significant performance improvements #17187
+528
−352
This API is still a work-in-progress and requires afishhh/subrandr#125.
I can move the "split packer out from ass_mp" change into a separate PR if desired.
What the API looks like and why
The API I settled on here is not what I brought to IRC last time. It is more similar to what libass does, since I have accepted that blending on the CPU is hopeless (way too slow, as kasper93 pointed out on IRC; I still tried to make it faster, but CPUs are just not meant for this).
Thus this API assumes the user has access to a faster way of compositing bitmaps with bilinear interpolation.
Bilinear interpolation is assumed because GPUs have it for free in hardware and it allows the implementation to do tricks like drawing axis-and-pixel-aligned rectangles by drawing an interpolated single-pixel bitmap. This saves significant atlas space and CPU work if there are many backgrounds; see the Japanese subtitles of https://www.youtube.com/watch?v=ksdvNgqOToQ for a case where storing all backgrounds as bitmaps significantly impacts atlas size.
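To illustrate why the single-pixel trick works, here is a minimal CPU sketch of bilinear sampling with clamp-to-edge addressing. The struct bitmap type and all function names are made up for this example and are not part of subrandr or mpv; a GPU does the equivalent in its texture units. With a 1x1 source every neighbouring texel clamps to the same pixel, so stretching it over any integer-sized rectangle yields a solid fill.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-channel bitmap, for illustration only. */
struct bitmap {
    int w, h;
    const uint8_t *px; /* w * h values, row-major */
};

/* Floor for small floats without pulling in libm. */
static int floor_to_int(float v)
{
    int i = (int)v;
    return v < (float)i ? i - 1 : i;
}

/* Fetch a texel with clamp-to-edge addressing. */
static uint8_t texel(const struct bitmap *b, int x, int y)
{
    if (x < 0) x = 0;
    if (y < 0) y = 0;
    if (x >= b->w) x = b->w - 1;
    if (y >= b->h) y = b->h - 1;
    return b->px[y * b->w + x];
}

/* Bilinear sample at normalized coordinates (u, v) in [0, 1]. */
static float sample_bilinear(const struct bitmap *b, float u, float v)
{
    float x = u * (float)b->w - 0.5f;
    float y = v * (float)b->h - 0.5f;
    int x0 = floor_to_int(x), y0 = floor_to_int(y);
    float fx = x - (float)x0, fy = y - (float)y0;
    float top = texel(b, x0, y0) * (1.0f - fx) + texel(b, x0 + 1, y0) * fx;
    float bot = texel(b, x0, y0 + 1) * (1.0f - fx) + texel(b, x0 + 1, y0 + 1) * fx;
    return top * (1.0f - fy) + bot * fy;
}

/* Stretch src over a dw x dh destination, sampling at pixel centers. */
static void stretch(const struct bitmap *src, uint8_t *dst, int dw, int dh)
{
    assert(src->w > 0 && src->h > 0);
    for (int y = 0; y < dh; y++) {
        for (int x = 0; x < dw; x++) {
            float u = ((float)x + 0.5f) / (float)dw;
            float v = ((float)y + 0.5f) / (float)dh;
            dst[y * dw + x] = (uint8_t)(sample_bilinear(src, u, v) + 0.5f);
        }
    }
}
```

The atlas only has to hold one pixel per distinct background colour instead of one full bitmap per background.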
Bitmaps are also de-duplicated and output as "instances" of "images", with every instance referencing a single image. This again saves a lot of atlas space and CPU work in the (not uncommon) case where there are many instances of the same glyph in the frame (accounting for subpixel positioning and so on, of course). (Correct me if I'm wrong, but it doesn't seem like libass does this? Maybe ASS subbers just don't repeat the same things 100 times on the same frame.)
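A rough sketch of the image/instance split, under my own assumptions: the struct and function names are hypothetical and do not reflect subrandr's actual types, and the key fields (glyph id, quantized subpixel offset, baked-in colour) are only one plausible choice. A linear search keeps the example short; a real implementation would presumably use a hash map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical, for illustration only. */
struct glyph_key {
    uint32_t glyph_id;
    uint8_t subpixel_x; /* quantized subpixel offset bucket */
    uint32_t bgra;      /* colour baked into the bitmap */
};

struct instance {
    size_t image;  /* index into the list of unique images */
    int32_t x, y;  /* integer destination position */
};

#define MAX_IMAGES 64
#define MAX_INSTANCES 256

struct frame {
    struct glyph_key images[MAX_IMAGES];
    size_t n_images;
    struct instance instances[MAX_INSTANCES];
    size_t n_instances;
};

static int key_eq(struct glyph_key a, struct glyph_key b)
{
    return a.glyph_id == b.glyph_id && a.subpixel_x == b.subpixel_x &&
           a.bgra == b.bgra;
}

/* Return the index of the image for key, adding it if it is new. */
static size_t intern_image(struct frame *f, struct glyph_key key)
{
    for (size_t i = 0; i < f->n_images; i++) {
        if (key_eq(f->images[i], key))
            return i;
    }
    assert(f->n_images < MAX_IMAGES);
    f->images[f->n_images] = key;
    return f->n_images++;
}

/* Record one placement; the bitmap is only stored once per unique key. */
static void add_instance(struct frame *f, struct glyph_key key,
                         int32_t x, int32_t y)
{
    assert(f->n_instances < MAX_INSTANCES);
    struct instance inst = { intern_image(f, key), x, y };
    f->instances[f->n_instances++] = inst;
}
```

Only the unique images need to be rasterized and packed into the atlas; the instances are just positions plus an index.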
Also, since subrandr wants to be able to draw real bitmaps like emojis, it has to use the BGRA8 output format instead of the A8 that libass uses. This means that each color/alpha variant of a bitmap has to be stored separately, though this does not appear to be a huge problem in practice.
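To make the cost concrete, here is a small sketch of expanding an A8 coverage mask into BGRA8 for one colour. The function names are hypothetical, and premultiplied alpha is an assumption of this example, not something stated about subrandr's output. The same mask tinted with two different colours produces two different BGRA8 bitmaps, which is why each variant takes its own atlas slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a * b + 127) / 255);
}

/*
 * Expand n A8 coverage values into premultiplied BGRA8 (assumed layout:
 * byte 0 = B, 1 = G, 2 = R, 3 = A). The colour is given as straight
 * (non-premultiplied) components. out must hold 4 * n bytes.
 */
static void tint_a8_to_bgra8(const uint8_t *mask, size_t n,
                             uint8_t b, uint8_t g, uint8_t r, uint8_t a,
                             uint8_t *out)
{
    assert(mask && out);
    for (size_t i = 0; i < n; i++) {
        uint8_t alpha = mul8(mask[i], a);
        out[4 * i + 0] = mul8(b, alpha);
        out[4 * i + 1] = mul8(g, alpha);
        out[4 * i + 2] = mul8(r, alpha);
        out[4 * i + 3] = alpha;
    }
}
```

With A8 output the colour would instead be applied at composite time, so one mask could serve every colour.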
Currently the (simplified) API looks like this:
To explain some potentially non-obvious design decisions:
sbr_output_images may not hold a complete output image internally, this is why they are only exposed as a "draw into this buffer" function (this is used for non-pixel-aligned rectangles like underlines or strike-throughs which are drawn anti-aliased on the CPU (because instances need integer output dimensions)).
sbr_output_images hold an additional "user data" pointer to allow users like mpv to associate data with images in O(1) time without the complexity of their own hash map. In this PR this is used to associate a(n index of a) sub_bitmap with each sbr_output_image (this sub_bitmap is then accessed when constructing the real instanced output after packing).
Benchmark (singular)
Rasterization is no longer a 100 ms bottleneck, and my terrible font matching code in layout is probably more noticeable here.
Sometimes the rasterization stage still takes 30 ms for no apparent reason, though I am tempted to just blame this on scheduling and live happily.
TODO