Blit Copy performance  #1641

Open
@GalacticEmperor1

Description

Issue №3308 opened by Starbuck5 at 2022-07-15 07:15:54

@MyreMylar @itzpr3d4t0r

Relevant to MyreMylar's question in libsdl-org/SDL#5918

I have procured an SDL bundle capable of testing this.
built-x64.zip

SDL = -Ibuilt-x64/include -Lbuilt-x64/new -lSDL2
COPYLIB_SDL2 -Lbuilt-x64/new/SDL2.dll

^ Emphasis on lines you will need to change in Setup.

It comes with 1 set of headers in include/, and then three SDL DLLs in new/, old/, and log/.

old/ is plain SDL built on my machine. log/ has an SDL_LOG (printf) statement inside SDL_BlitCopy, so you can pin down what actually resolves into calling this function internally. new/, of course, has the patched SDL version.

When you are changing SDL versions, pip uninstall pygame and delete the build/ folder between every pip install.
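
The switch-over routine above can be scripted. This is only a minimal sketch: it assumes the commands run from the pygame source checkout and that `pip install .` is how the build is triggered (the post only says "pip install"). The helper names `rebuild_commands` and `clean_rebuild` are made up for illustration.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def rebuild_commands(python=sys.executable):
    """Return the pip commands to run, in order, when swapping SDL DLLs."""
    return [
        [python, "-m", "pip", "uninstall", "-y", "pygame"],
        # Assumption: the source tree is installed with "pip install ."
        [python, "-m", "pip", "install", "."],
    ]


def clean_rebuild(source_dir=".", dry_run=True):
    """Uninstall pygame, delete build/, then reinstall from source_dir.

    With dry_run=True nothing is touched; the planned steps are only returned.
    """
    build_dir = Path(source_dir) / "build"
    uninstall, install = rebuild_commands()
    steps = [" ".join(uninstall), f"delete {build_dir}", " ".join(install)]
    if not dry_run:
        subprocess.run(uninstall, check=True, cwd=source_dir)
        # Remove the build folder so no stale build output is reused.
        shutil.rmtree(build_dir, ignore_errors=True)
        subprocess.run(install, check=True, cwd=source_dir)
    return steps
```

Calling `clean_rebuild(dry_run=True)` first shows the three steps without running them; pass `dry_run=False` after editing the Setup lines to point at the next DLL folder.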

The applied patch is just deleting this: https://github.com/libsdl-org/SDL/blob/120c76c84bbce4c1bfed4e9eb74e10678bd83120/src/video/SDL_blit_copy.c#L130-L153

I've done a small bit of testing and not found much of a significant result, so I'm putting it out for testing by others.

Interestingly, I think this is the answer to #3097. Running the patched code, i.e. the code without x86-specific instructions, yields the same failures.


Comments

## itzpr3d4t0r commented at 2022-07-15 08:51:00

I ran a quick test using this code:

from random import randint
from timeit import timeit

import pygame


def randcol():
    return randint(0, 255), randint(0, 255), randint(0, 255)


pygame.init()

screen = pygame.display.set_mode((800, 600))

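# Despite the names, the 133x133 "dst_surface" is drawn onto the 512x512 "src_surface".
src_surface = pygame.Surface((512, 512)).convert(24)
src_surface.fill(randcol())
dst_surface = pygame.Surface((133, 133)).convert(24)
dst_surface.fill(randcol())

# timeit runs the statement 1,000,000 times by default and returns the total in seconds.
print(timeit("src_surface.blit(dst_surface, (0, 0))",
             globals={"dst_surface": dst_surface, "src_surface": src_surface}))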

And I get:
(This is at 1 million calls per test)

  • "Old" SDL 2.23.1 without the change
    • 32.968203900000844 s
    • 32.833649799999876 s
    • 32.74673020000046 s
  • "New" SDL 2.23.1 with the change
    • 7.640265200000613 s
    • 7.694070499999725 s
    • 7.648391300000185 s

Or a 328.6% improvement, roughly 4.3x faster (hoping this is the right way to test it)
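
The 328.6% figure is consistent with comparing the fastest run of each build. A small sketch of that arithmetic, using the timings listed above (the variable names are only for illustration):

```python
old_runs = [32.968203900000844, 32.833649799999876, 32.74673020000046]
new_runs = [7.640265200000613, 7.694070499999725, 7.648391300000185]

# Compare best-of-three times; the minimum is the run least disturbed by noise.
speedup = min(old_runs) / min(new_runs)   # how many times faster the new DLL is
improvement_pct = (speedup - 1) * 100     # the same thing as a "percent improvement"

print(f"{speedup:.2f}x faster, {improvement_pct:.1f}% improvement")
# prints: 4.29x faster, 328.6% improvement
```

Using the mean of the three runs instead of the minimum gives almost the same answer (about 328.8%), so the choice does not change the conclusion here.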


## itzpr3d4t0r commented at 2022-07-15 09:55:01

Some other tests with a different number of blit calls per test (times in seconds):
Old 500000

  • 17.158649900000455
  • 16.998320700000477
  • 17.023041999998895

New 500000

  • 3.9034018000002106
  • 3.8267964999995456
  • 3.8662202000014076

Old 1

  • 5.889999920327682e-05
  • 5.720000081055332e-05
  • 5.450000026030466e-05

New 1

  • 1.1000000085914508e-05
  • 8.700000762473792e-06
  • 8.600000001024455e-06
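
Varying the call count like this maps onto the `number` argument of `timeit`. A minimal sketch, assuming the blit is wrapped in a zero-argument callable; `time_calls` and `per_call_microseconds` are hypothetical helpers, not part of the test script above:

```python
from timeit import repeat


def time_calls(func, counts=(1, 500_000, 1_000_000), runs=3):
    """Time func() for each call count, returning {count: [seconds, ...]}."""
    results = {}
    for number in counts:
        # repeat() returns one total (in seconds) per run of `number` calls.
        results[number] = repeat(func, number=number, repeat=runs)
    return results


def per_call_microseconds(total_seconds, number):
    """Convert a timeit total into the average cost of a single call."""
    return total_seconds / number * 1e6
```

With the figures above, `per_call_microseconds(17.158649900000455, 500_000)` comes to about 34 µs per blit on the old DLL and `per_call_microseconds(3.9034018000002106, 500_000)` to about 7.8 µs on the new one, in line with the 1 million call runs.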

## MyreMylar commented at 2022-07-16 07:56:24

Hmm I'm actually getting worse performance with the patched dll on my machine.

Almost the inverse of what was happening with my crappy patch.

current main:

tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 32, 32 :		7.006ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 31, 32 :		5.004ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 64, 64 :		14.013ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 63, 64 :		9.008ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 128, 128 :		41.038ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 127, 128 :		31.027ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 256, 256 :		198.18ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 255, 256 :		145.132ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 512, 512 :		600.89ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 511, 512 :		564.622ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 1024, 1024 :		2488.26ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 1023, 1024 :		2887.21ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 16, 16 :		21.022ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 15, 16 :		18.02ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 8, 8 :		15.014ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 7, 8 :		16.019ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 4, 4 :		14.014ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 3, 4 :		14.021ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 2, 2 :		15.01ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 1, 2 :		14.013ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 1, 1 :		14.015ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 0, 1 :		13.013ms
Total test time:  7145.550489425659

Using patched SDL dll:

tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 32, 32 :		12.01ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 31, 32 :		14.013ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 64, 64 :		36.032ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 63, 64 :		34.031ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 128, 128 :		140.127ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 127, 128 :		153.139ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 256, 256 :		399.362ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 255, 256 :		400.362ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 512, 512 :		1277.566ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 511, 512 :		1286.166ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 1024, 1024 :		4976.248ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 1023, 1024 :		5014.241ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 16, 16 :		21.019ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 15, 16 :		22.02ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 8, 8 :		16.015ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 7, 8 :		16.014ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 4, 4 :		15.014ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 3, 4 :		14.012ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 2, 2 :		14.015ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 1, 2 :		14.015ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 1, 1 :		14.016ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 0, 1 :		14.014ms
Total test time:  13903.450965881348

So I officially have no f'ing clue with this anymore.

It seemed pretty clear a couple of days ago that removing that SSE chunk was a clear win, and now it seems clear in the opposite direction on my machine. Perhaps some kind of optimisation compile option is confusing things with memcpy?
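
The script that produced these listings was not posted. For anyone wanting to reproduce the shape of the test, here is a rough sketch reconstructed from the output format alone: the function names, the way the surfaces are created, and the timing method are all assumptions.

```python
import time


def dimension_cases(sizes=(32, 64, 128, 256, 512, 1024), tries=10_000,
                    small_sizes=(16, 8, 4, 2, 1), small_tries=50_000):
    """Yield (tries, label, width, height) in the order seen in the output."""
    for group, count in ((sizes, tries), (small_sizes, small_tries)):
        for n in group:
            yield count, "even", n, n
            yield count, "odd", n - 1, n  # one pixel narrower than the even case


def run_benchmark():
    """Blit an opaque surface repeatedly for every case and print timings."""
    import pygame  # imported lazily so dimension_cases() works without pygame

    pygame.init()
    for tries, label, width, height in dimension_cases():
        # Assumption: plain surfaces with no alpha, same size for source and dest.
        source = pygame.Surface((width, height))
        dest = pygame.Surface((width, height))
        start = time.perf_counter()
        for _ in range(tries):
            dest.blit(source, (0, 0))
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"tested Blit NORMAL_NO_ALPHA - {tries} tries, "
              f"{label} dimensions {width}, {height} :\t\t{round(elapsed_ms, 3)}ms")
```

Calling `run_benchmark()` once per installed DLL gives two listings that can be compared line by line, as in the comment above.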


## MyreMylar commented at 2022-07-16 16:04:47

I missed that the old DLLs existed. The comparison is better, but the SSE-removed patch still seems to run worse. Results:

old:

tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 32, 32 :		9.01ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 31, 32 :		10.008ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 64, 64 :		16.015ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 63, 64 :		39.035ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 128, 128 :		43.041ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 127, 128 :		114.104ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 256, 256 :		161.146ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 255, 256 :		380.748ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 512, 512 :		641.169ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 511, 512 :		1324.0ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 1024, 1024 :		2806.796ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 1023, 1024 :		5259.176ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 16, 16 :		22.001ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 15, 16 :		21.994ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 8, 8 :		17.006ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 7, 8 :		16.997ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 4, 4 :		15.998ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 3, 4 :		15.003ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 2, 2 :		14.996ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 1, 2 :		15.004ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 1, 1 :		14.996ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 0, 1 :		14.004ms
Total test time:  10972.24473953247

new:

tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 32, 32 :		9.009ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 31, 32 :		8.008ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 64, 64 :		24.021ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 63, 64 :		23.021ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 128, 128 :		88.08ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 127, 128 :		89.08ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 256, 256 :		334.304ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 255, 256 :		344.315ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 512, 512 :		1288.051ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 511, 512 :		1296.366ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, even dimensions 1024, 1024 :		5109.74ms
tested Blit NORMAL_NO_ALPHA - 10000 tries, odd dimensions 1023, 1024 :		5058.144ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 16, 16 :		23.026ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 15, 16 :		22.549ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 8, 8 :		16.016ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 7, 8 :		17.37ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 4, 4 :		15.014ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 3, 4 :		15.013ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 2, 2 :		15.013ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 1, 2 :		14.013ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, even dimensions 1, 1 :		15.014ms
tested Blit NORMAL_NO_ALPHA - 50000 tries, odd dimensions 0, 1 :		13.011ms
Total test time:  13838.17744255066

## MyreMylar commented at 2022-07-16 23:33:06

There was some chat about this on the Discord; basically, it is a confusing situation. Removing the SSE path doesn't seem to be as easy a win as hoped, but there is some oddness about the current state of things. Here are the results of testing on 2.1.2 using prebuilt SDL (the normal one):

prebuilt, 32bit, odd : 2.2621544 
prebuilt, 32bit, even: 3.8878691

prebuilt, 24bit, odd : 38.818756699999994 
prebuilt, 24bit, even: 2.9077811000000002

Here you can see that 24bit, odd width blits are massively slower than the other types. Using the Starbuck5 brand DLLs from above I got these results:

old, 32bit, odd:  9.3382141
old, 32bit, even: 4.6988297

old, 24bit, odd:  38.7242775
old, 24bit, even: 2.7260882000000004

So clearly some different optimisation paths are being followed there for the 'untouched' DLL in the 32 bit blits case for Starbuck5's build. Now the even width blit is faster than the odd width blit, and both are slower than the prebuilt. Meanwhile, 24 bit seems unchanged.

new, 32bit, odd :  9.1433634
new, 32bit, even:  9.3573669

new, 24bit, odd :  7.484730000000001
new, 24bit, even:  7.4543045999999995

Moving on to the SSE-stripped version from Starbuck5, we can see it's slower than either of the previous two overall, except in the case of that odd width 24bit blit, where it is massively better.

I'd like to somehow see an SSE-stripped version built with whatever build process the prebuilts use, but failing that I'm a bit stumped on this for now.


## Starbuck5 commented at 2022-07-17 04:28:52

I found the optimization flags on the VS project; I'm already using the best (/O2).


## Starbuck5 commented at 2022-07-17 08:28:10

I found another option about SIMD instructions, so I raised that from default (SSE2) to AVX2. I also pulled in the SDL 2.23.1 prebuilt.

Results:

#  24 bit 127x128 blit
#  2.0.22 prebuilt   : 30.94
#  2.23.1 prebuilt   : 30.42
#  new               : 7.37
#  avx2              : 6.27

#  24 bit 128x128 blit
#  2.0.22 prebuilt   : 1.38
#  2.23.1 prebuilt   : 1.46
#  new               : 7.32
#  avx2              : 6.32

## itzpr3d4t0r commented at 2022-07-17 08:33:41

Ok that looks like another improvement but aren't we worried about that 5-6X slowdown with even blits?
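
To put numbers on that slowdown, here is a quick sketch that divides each patched build's time by a prebuilt baseline, using Starbuck5's figures from the previous comment. Picking the 2.23.1 prebuilt as the baseline is an arbitrary choice for illustration.

```python
# Times reported above for the 24-bit blits, keyed by SDL build.
odd_127x128 = {"2.0.22 prebuilt": 30.94, "2.23.1 prebuilt": 30.42,
               "new": 7.37, "avx2": 6.27}
even_128x128 = {"2.0.22 prebuilt": 1.38, "2.23.1 prebuilt": 1.46,
                "new": 7.32, "avx2": 6.32}

baseline = "2.23.1 prebuilt"
for build in ("new", "avx2"):
    # A ratio above 1 means the patched build is slower than the prebuilt.
    odd_ratio = odd_127x128[build] / odd_127x128[baseline]
    even_ratio = even_128x128[build] / even_128x128[baseline]
    print(f"{build:>4}: odd {odd_ratio:.2f}x, even {even_ratio:.2f}x of prebuilt time")
```

On these numbers the patched builds take roughly 4.3 to 5.3 times as long on the even-width blit (depending on the build and on which prebuilt is the baseline), and about a quarter to a fifth of the time on the odd-width one.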


## MyreMylar commented at 2022-07-19 19:42:58

> Ok that looks like another improvement but aren't we worried about that 5-6X slowdown with even blits?

Yes. This doesn't seem to work like I hoped.

Metadata

    Labels

    Performance (Related to the speed or resource usage of the project), Surface (pygame.Surface)
