Faster fblits() for multiple blits of the same Surface #2825
I don't really understand the proposal, but piping in on this:
It sounds like you want to store an arbitrary amount of data in SIMD registers, and that's not possible. There is a fixed number of these registers in each processor core, like 16. http://www.infophysics.net/amd64regs.png XMM are the SSE registers, YMM are the AVX registers, ZMM (not shown in that diagram) are the AVX512 registers. The intrinsics look like they're giving us registers, but they don't compile as 1-for-1 as it seems. I think some routines have more intrinsic variables than there are physical registers.
I'm aware of all of this, but it looked like it was working anyway. With further experimentation I found out that using memcpy directly yielded similar if not better results while not requiring any heap allocation. This is true for blitcopy alone, though, as I still need to check the other blend modes.
1e484e2 to ead00f1
…ns in favour of a single implementation with memcpy
…y per destination (20bytes -> 16)
d22721e to 5acd3a5
Had a little peek at this. Are you intending to add alpha blending support, or special blend mode support, in this PR or in a future one?
Optimizing Multi Image Blitting
In a typical blit sequence, we often need to draw the same surface at multiple positions. The current approach is:
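As a minimal sketch of that per-blit pattern, assuming pygame-ce's `Surface.fblits` is called with one `(source, dest)` pair per blit: the helper names `per_blit_sequence` and `draw_current` are hypothetical and only serve the illustration.

```python
def per_blit_sequence(surface, positions):
    """Build one (source, dest) pair per blit, repeating the same surface."""
    return [(surface, pos) for pos in positions]


def draw_current(screen, sprite, positions):
    """Draw `sprite` at every position with a single fblits call.

    `screen` is assumed to be a pygame-ce Surface (anything exposing
    fblits(blit_sequence) works for this sketch).
    """
    # The same surface appears in every pair, so any per-entry checks
    # are repeated for each position.
    screen.fblits(per_blit_sequence(sprite, positions))
```

With a real display this would be called as `draw_current(screen, sprite, [(0, 0), (32, 0), (64, 0)])`.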
This method is straightforward but not optimal. It runs the same checks for every entry separately, even when the surface is the same. These checks take time, and during that time the destination's pixels may drop out of the cache before the actual blit happens.
The current process checks which type of blit we want, then reads and writes the pixels. By the time we get to the next surface in the sequence, enough time has elapsed that the recently cached data may no longer be available, so temporal locality is lost.
I propose a new method for blitting multiple copies of the same Surface onto another Surface. This method allows for faster iteration, reduced checking overhead, and improved temporal coherency, resulting in higher performance.
The proposed approach involves a new format in fblits:
This format takes a tuple containing a reference to the surface and a list of positions we want to draw to.
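To make the shape of that input concrete, here is a small sketch that regroups ordinary `(surface, position)` pairs into `(surface, [positions])` tuples. The helper `group_by_surface` is hypothetical and is not part of pygame; whether a given `fblits` build accepts the grouped form depends on this proposal being merged.

```python
def group_by_surface(pairs):
    """Collapse (surface, position) pairs into (surface, [positions]) tuples.

    Surfaces are grouped by identity, in order of first appearance.
    """
    grouped = {}
    for surface, pos in pairs:
        # Key on id() so surfaces do not need to be hashable or comparable.
        entry = grouped.setdefault(id(surface), (surface, []))
        entry[1].append(pos)
    return list(grouped.values())
```

Note that grouping changes the draw order when different surfaces are interleaved and overlap on the destination, so this regrouping is only safe when order does not matter.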
Inside the function, we can run the checks only once and then use memcpy to copy the same surface to each position, which should reduce the per-blit overhead and improve performance.
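The idea can be sketched in pure Python, with slice assignment on a `memoryview` standing in for a per-row memcpy. This is an illustration under stated assumptions, not the PR's C implementation: both buffers are tightly packed with the same bytes per pixel, and copies that are not fully inside the destination are skipped rather than clipped. The function name `blit_copies` is hypothetical.

```python
def blit_copies(dst, dst_w, dst_h, src, src_w, src_h, positions, bpp=4):
    """Copy `src` into `dst` at each (x, y) position, row by row.

    `dst` must be writable (e.g. a bytearray); `src` can be bytes.
    """
    # Checks that depend only on the two buffers run once,
    # not once per position.
    if len(src) != src_w * src_h * bpp or len(dst) != dst_w * dst_h * bpp:
        raise ValueError("buffer size does not match dimensions")

    src_pitch = src_w * bpp  # bytes per source row
    dst_pitch = dst_w * bpp  # bytes per destination row
    src_view = memoryview(src)
    dst_view = memoryview(dst)

    for x, y in positions:
        # Simplification: skip anything not fully inside dst (no clipping).
        if x < 0 or y < 0 or x + src_w > dst_w or y + src_h > dst_h:
            continue
        for row in range(src_h):
            start = (y + row) * dst_pitch + x * bpp
            # One contiguous row copy, the analogue of a single memcpy.
            dst_view[start:start + src_pitch] = (
                src_view[row * src_pitch:(row + 1) * src_pitch]
            )
    return dst
```

A plain copy like this only covers the opaque, same-format case; blend modes would need per-pixel work instead of a row copy.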
Conclusions
How to use:
To use this, pass a tuple in the format (Surface_To_Draw, List_Of_Positions_To_Draw).
:TODO:
The results so far are very promising:
(TO BE UPDATED)
Here is a little program I've been experimenting with to get these numbers: