Faster `fblits()` for multiple blits of the same Surface #2825

itzpr3d4t0r · 2024-04-26T15:27:11Z

Optimizing Multi Image Blitting

In a typical blit sequence, we often need to draw the same surface at multiple positions. The current approach is:

screen.fblits([ (surf, pos1), (surf, pos2), (surf, pos3), ... ])

This method is straightforward but not optimal. It runs the same checks for every entry in the sequence, even when each entry refers to the same surface. That checking takes time, and by the time the actual blit starts, the destination's pixels may no longer be in the CPU cache.

For each entry, the current process first determines which kind of blit is needed and only then reads and writes the pixels. By the time it reaches the next entry in the sequence, enough time has elapsed that the cached data is likely gone, so temporal locality is lost.

I propose a new method for blitting multiple copies of the same Surface onto another Surface. This method allows for faster iteration, reduced checking overhead, and improved temporal coherency, resulting in higher performance.

The proposed approach involves a new format in fblits:

screen.fblits([ (surf, [pos1, pos2, pos3]), ... ])

This format takes a tuple containing a reference to the surface and a list of positions we want to draw to.
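As a rough illustration of how existing call sites could move to this format, here is a minimal sketch that collapses a flat `(surf, pos)` sequence into `(surf, [positions])` groups. The helper name `group_blits` is hypothetical and not part of pygame or of this PR; it only reshapes the data and works on any objects, since surfaces are compared by identity.

```python
def group_blits(pairs):
    """Collapse [(surf, pos), ...] into [(surf, [pos, ...]), ...].

    Surfaces are compared by identity, and groups are emitted in order of
    first appearance. Hypothetical helper, for illustration only.
    """
    groups = {}  # id(surf) -> (surf, positions); dicts keep insertion order
    for surf, pos in pairs:
        entry = groups.get(id(surf))
        if entry is None:
            groups[id(surf)] = (surf, [pos])
        else:
            entry[1].append(pos)
    return list(groups.values())
```

Note that grouping reorders the draws: if different surfaces overlap on the destination, the grouped result can differ from the flat sequence, so this only fits cases where draw order between surfaces does not matter.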

Inside the function, we can use memcpy for blitting the same surface more efficiently and run the checks only once, leading to significant performance improvements.
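To make the "check once, copy many" idea concrete, here is a small pure-Python model of it. It is not the C implementation from this PR: the function name, the tightly packed 32-bit buffers, and the choice to skip out-of-bounds positions are all assumptions made for the sketch. Each row assignment stands in for one `memcpy` call.

```python
BPP = 4  # bytes per pixel; assumes the same 32-bit format on both sides


def blit_copy_multi(dst, dst_w, dst_h, src, src_w, src_h, positions):
    """Copy `src` into `dst` at every (x, y) in `positions`.

    Both buffers are tightly packed bytearrays (pitch == width * BPP).
    Positions that do not fit entirely inside `dst` are skipped; this
    sketch does no clipping.
    """
    # Checks that depend only on the surface pair run once, not per position.
    if len(src) != src_w * src_h * BPP or len(dst) != dst_w * dst_h * BPP:
        raise ValueError("buffer size does not match dimensions")

    row_bytes = src_w * BPP
    dst_pitch = dst_w * BPP

    for x, y in positions:
        # Per-position work is reduced to a bounds test and the row copies.
        if not (0 <= x <= dst_w - src_w and 0 <= y <= dst_h - src_h):
            continue
        d = y * dst_pitch + x * BPP
        for r in range(src_h):
            s = r * row_bytes
            dst[d:d + row_bytes] = src[s:s + row_bytes]  # one "memcpy" per row
            d += dst_pitch
```

For example, copying a 2x2 source into a 4x4 destination at `(0, 0)` and `(2, 2)` touches eight destination pixels, and the format check runs a single time for both copies.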

Conclusions

How to use:
To use the new format, pass a list of tuples of the form (Surface_To_Draw, List_Of_Positions_To_Draw):

import pygame
pygame.init()

screen = pygame.display.set_mode((500, 500))

surf = pygame.Surface((20, 20))
surf.fill("red")

positions = [(0, 0), (100, 100), (30, 30)]

screen.fblits([(surf, positions)], 0, True)  # sets the flag to 0 and cache=True to actually use the caching

TODO:

Respect destination clip rect
Respect blitting to a subsurface
Investigate self blitting (ensure it works)
Ability to partially draw a surface
Support for Rect/Frect and more sequences as blit positions
(FUTURE WORK) Support blend flags, alpha and colorkey.
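The clip-rect and partial-draw items both come down to intersecting each blit with the destination's clip rectangle. The following is a minimal sketch of that arithmetic, assuming integer positions and plain tuples rather than pygame Rects. The function name `clip_blit` and its return layout are hypothetical and do not reflect how the PR implements clipping.

```python
def clip_blit(pos, src_size, clip):
    """Return (src_x, src_y, dst_x, dst_y, w, h) for the visible part, or None.

    pos:      (x, y) top-left corner of the blit on the destination
    src_size: (w, h) of the source surface
    clip:     (x, y, w, h) clip rectangle of the destination
    """
    x, y = pos
    sw, sh = src_size
    cx, cy, cw, ch = clip

    # Intersect the blit rectangle with the clip rectangle.
    left = max(x, cx)
    top = max(y, cy)
    right = min(x + sw, cx + cw)
    bottom = min(y + sh, cy + ch)

    if right <= left or bottom <= top:
        return None  # nothing visible, the blit can be skipped

    # Offsets into the source tell the copy loop where to start reading.
    return (left - x, top - y, left, top, right - left, bottom - top)
```

With a 20x20 surface and a 100x100 clip rect at the origin, a blit at `(-5, -5)` keeps only the 15x15 region starting at source offset `(5, 5)`, while a blit at `(200, 0)` is rejected outright.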

The results so far are very promising:
(TO BE UPDATED)

Here is a little program I've been experimenting with to get these numbers:

from random import randint, random

import pygame

pygame.init()

screen_size = 1000

screen = pygame.display.set_mode((screen_size, screen_size))

size = 50
flag = 0

s = pygame.Surface((size, size))
pygame.draw.circle(s, (100, 0, 255), (size // 2, size // 2), size // 2)


class Particle:
    def __init__(self, x, y, vx, vy):
        self.x = x
        self.y = y
        self.vx = vx
        self.vy = vy

    def update(self, dt):
        self.x += self.vx * dt
        self.y += self.vy * dt

        if self.x < 0:
            self.x = 0
            self.vx *= -1
        elif self.x > screen_size - size - 1:
            self.x = screen_size - size - 1
            self.vx *= -1

        if self.y < 0:
            self.y = 0
            self.vy *= -1
        elif self.y > screen_size - size - 1:
            self.y = screen_size - size - 1
            self.vy *= -1

        return self.x, self.y


particles = [
    Particle(randint(0, screen_size - size), randint(0, screen_size - size),
             random() * 2 - 1, random() * 2 - 1)
    for _ in range(10000)
]

clock = pygame.time.Clock()

font = pygame.font.SysFont("Arial", 25, True)

modes = ["fblits", "blits", "fblits cached"]
mode_index = 0
mode = modes[mode_index]

while True:
    dt = clock.tick_busy_loop(1000) * 60 / 1000
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit()
            raise SystemExit  # does not rely on the site-provided exit()
        elif event.type == pygame.MOUSEWHEEL:
            mode_index = (mode_index + event.y) % len(modes)
            mode = modes[mode_index]
        elif event.type == pygame.KEYDOWN:
            if event.key == pygame.K_0:
                flag = 0

    screen.fill(0)

    if mode == "fblits":
        screen.fblits([(s, p.update(dt)) for p in particles], flag)
    elif mode == "blits":
        screen.blits([(s, p.update(dt), None, flag) for p in particles])
    elif mode == "fblits cached":
        screen.fblits(
            [
                (s, [p.update(dt) for p in particles])
            ],
            flag
        )

    fps_text = font.render(
        f"FPS: {int(clock.get_fps())} | N: {len(particles)}",
        True, (255, 255, 255))
    pygame.draw.rect(screen, (0, 0, 0),
                     fps_text.get_rect(center=(screen_size / 2, screen_size / 2)))
    screen.blit(fps_text, fps_text.get_rect(center=(screen_size / 2, screen_size / 2)))

    mode_text = font.render(f"{mode}", True, (255, 255, 255))
    pygame.draw.rect(screen, (0, 0, 0),
                     mode_text.get_rect(center=(screen_size / 2, screen_size / 2 + 25)))
    screen.blit(mode_text,
                mode_text.get_rect(center=(screen_size / 2, screen_size / 2 + 25)))

    pygame.display.flip()

Starbuck5 · 2024-04-29T06:19:17Z

I don't really understand the proposal, but piping in on this:

Create a register cache (of __m256i) and load the Surface’s pixels once.

It sounds like you want to store an arbitrary amount of data in SIMD registers-- that's not possible. There are a set amount of these registers in each processor core, like 16. http://www.infophysics.net/amd64regs.png

XMM are the SSE registers, YMM are the AVX registers, ZMM (not shown in that diagram) are the AVX512 registers.

The intrinsics look like they're giving us registers, but it doesn't compile as 1-for-1 as it seems. I think some routines have more intrinsic variables than there are physical registers.

itzpr3d4t0r · 2024-04-29T10:34:11Z

I'm aware of all of this, but it looked like it was working anyway. With further experimentation, though, I found that using memcpy directly gave similar if not better results without requiring any heap allocation. This holds only for blitcopy so far, as I still need to check the other blend modes.

MyreMylar · 2024-05-25T16:41:10Z

Had a little peek at this.

Are you intending to add alpha blending support, or special blend mode support, in this PR or in a future one?

itzpr3d4t0r added the Performance (related to the speed or resource usage of the project), SIMD, and Surface (pygame.Surface) labels Apr 26, 2024

itzpr3d4t0r changed the title ~~Surface.fblits() image caching for performace~~ Surface.fblits() caching for performace Apr 26, 2024

Starbuck5 changed the title ~~Surface.fblits() caching for performace~~ Surface.fblits() caching for performance Apr 29, 2024

itzpr3d4t0r force-pushed the fblits_cache_optimization branch from 1e484e2 to ead00f1 May 8, 2024 14:12

itzpr3d4t0r changed the title ~~Surface.fblits() caching for performance~~ Faster fblits() for multiple blits of the same Surface May 11, 2024

itzpr3d4t0r removed the SIMD label May 18, 2024

itzpr3d4t0r marked this pull request as ready for review May 18, 2024 14:57

itzpr3d4t0r requested a review from a team as a code owner May 18, 2024 14:57

itzpr3d4t0r added 18 commits May 21, 2024 12:11

First implementation of caching mechanism as blitcopy optimization.

0af8c37

fixes

5db5c25

more fixes and add missing stubs

6ff00d6

removed unused variable

69cbaf6

another fix

7c107e4

another fix

52c4844

remove unused variables and cast

5508da3

moved sequence setup to _surf_fblits_cached_item_check_and_blit

0bc580e

Added SSE2 version

0fba53e

fix

54dd0fa

tentative fix

261472b

Massively simplified code for cached blitcopy, removed avx/sse versions in favour of a single implementation with memcpy

f2fd2e9

forgot about that

a6ddfef

Can now partially blit surfaces onto the destination.

41b68ad

use SDL_HasColorKey

f24c4eb

remove unused variable

701be35

cleanup, always using realloc now, added proper error messages.

57e2aaa

function now respects the destination's clip rect

eada109

itzpr3d4t0r added 7 commits May 21, 2024 12:12

better support subssurfaces

065b3f8

now correctly draws onto subsurfaces

2d2a1c0

removed "cache" parameter

f91a080

fix

f8f78a8

rename "cache" -> "multi"

b7a7d78

Now properly supporting all rects/rectlike as positions, minor changes and renames

0325472

more changes to marginally improve performance + now using less memory per destination (20bytes -> 16)

5acd3a5

itzpr3d4t0r force-pushed the fblits_cache_optimization branch from d22721e to 5acd3a5 May 21, 2024 10:12

itzpr3d4t0r added 3 commits May 22, 2024 15:07

forgot a rename

7a5834b

removed unused declaration

d50fd98

fix and rename

662fa0e

itzpr3d4t0r marked this pull request as draft June 30, 2024 21:35
