ARM assembly to address performance of blit and fill routines

_Originally posted by @bavison in https://github.com/libsdl-org/SDL-1.2/issues/777#issuecomment-871726562_:

I think I can see what's happened. In SDL 2.0, the way `BlitRGBtoRGBPixelAlpha()` handled the alpha channel changed in an incompatible way. To be precise, in this commit:

https://github.com/libsdl-org/SDL/commit/89bc80f1ae5bd99db448eccbc19ba2981722da1f

There is no equivalent commit in SDL 1.2. My ARM assembly optimisations - both the SIMD and NEON versions - faithfully copied the SDL 1.2 behaviour (this was, after all, my primary target at the time). It looks like nobody, including myself, noticed the difference between the two branches.

First conclusion: I think the code can safely be re-enabled on SDL 1.2.

Having looked more closely at the SDL 2.0 code, I think the new handling of alpha is actually incorrect, according to its own specification. It's desirable that every colour component is treated identically (not least because it makes SIMD processing easier). Look at the least-significant byte of `d1`/`s1` (blue). We can ignore the AND mask, because the intermediate value can't overflow for any combination of input values, so we can rearrange

```
d1 + ((s1 - d1) * alpha >> 8)
```

to

```
(s1 * alpha + d1 * 0x100 - d1 * alpha) >> 8
```

If you substitute `s1` with `alpha` and `d1` with `dalpha`, this gives

```
(alpha * alpha + dalpha * 0x100 - dalpha * alpha) >> 8
```

By contrast, we can rearrange

```
alpha + (dalpha * (alpha ^ 0xFF) >> 8)
```

to

```
(0x100 * alpha + dalpha * 0xFF - dalpha * alpha) >> 8
```

These are **not** equivalent, except (almost) when `alpha == 0xFF` - but that's already been special-cased a few lines above!
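To make the mismatch concrete, here is a minimal C sketch of the two rearranged expressions applied to a single 8-bit channel. The function names (`blend_like_colour`, `blend_dest_alpha`, `count_mismatches`) are hypothetical and are not part of SDL. The sketch works on one isolated byte rather than the packed two-channel values the real routine uses.

```c
#include <assert.h>
#include <stdint.h>

/* Colour-style blend, d + ((s - d) * a >> 8), written in the rearranged
 * form (s * a + d * 0x100 - d * a) >> 8 so every intermediate value is
 * non-negative. Inputs are single 8-bit channel values. */
static uint32_t blend_like_colour(uint32_t s, uint32_t d, uint32_t a)
{
    assert(s <= 0xFF && d <= 0xFF && a <= 0xFF);
    return (s * a + d * 0x100 - d * a) >> 8;
}

/* Destination-alpha update: alpha + (dalpha * (alpha ^ 0xFF) >> 8). */
static uint32_t blend_dest_alpha(uint32_t alpha, uint32_t dalpha)
{
    assert(alpha <= 0xFF && dalpha <= 0xFF);
    return alpha + (dalpha * (alpha ^ 0xFF) >> 8);
}

/* Counts the (alpha, dalpha) pairs for which treating alpha like a colour
 * (s = alpha, d = dalpha) differs from the dedicated alpha expression. */
static unsigned count_mismatches(void)
{
    unsigned n = 0;
    for (uint32_t alpha = 0; alpha <= 0xFF; alpha++)
        for (uint32_t dalpha = 0; dalpha <= 0xFF; dalpha++)
            if (blend_like_colour(alpha, dalpha, alpha) !=
                blend_dest_alpha(alpha, dalpha))
                n++;
    return n;
}
```

For example, with `alpha == 0x0F` and `dalpha == 0xFF` the colour-style expression gives `0xF0` while the alpha expression gives `0xFE`. With `alpha == 0xFF` and `dalpha == 0x80` they give `0xFE` and `0xFF`, which is the "almost" case.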

Some worked examples:

| Source | Destination | SDL 1.2 result | SDL 2.0 result |
| --- | --- | --- | --- |
| 0x0f0f0f0f | 0x00000000 | 0x00000000 | 0x0f000000 |
| 0x0f0f0f0f | 0xffffffff | 0xfff0f0f0 | 0xfef0f0f0 |

Ideally, I'd have said that we should be aiming for every colour component to be treated the same, which would result in 0x00000000 and 0xf0f0f0f0 respectively for these examples. (There's also an argument that some rounding should be done rather than simple truncation of the 16-bit intermediate product, but I won't go into that now.)
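The worked examples can be reproduced with a small per-channel C sketch. It is an illustration only, not the actual SDL code. It assumes ARGB8888 pixels and that the SDL 1.2 path leaves destination alpha unchanged (which is consistent with the results quoted above). It also omits the `alpha == 0` and `alpha == 0xFF` special cases. All function names are hypothetical.

```c
#include <stdint.h>

/* One 8-bit channel, truncating blend: (s * a + d * 0x100 - d * a) >> 8. */
static uint32_t blend_channel(uint32_t s, uint32_t d, uint32_t a)
{
    return (s * a + d * 0x100 - d * a) >> 8;
}

/* Blend only the three colour channels; the alpha byte of the result is 0. */
static uint32_t blend_rgb(uint32_t src, uint32_t dst)
{
    uint32_t a = src >> 24;
    uint32_t out = 0;
    for (int shift = 0; shift <= 16; shift += 8)
        out |= blend_channel((src >> shift) & 0xFF,
                             (dst >> shift) & 0xFF, a) << shift;
    return out;
}

/* SDL 1.2 style (assumed): destination alpha is kept as it was. */
static uint32_t blend_sdl12(uint32_t src, uint32_t dst)
{
    return blend_rgb(src, dst) | (dst & 0xFF000000u);
}

/* SDL 2.0 style: dalpha = alpha + (dalpha * (alpha ^ 0xFF) >> 8). */
static uint32_t blend_sdl20(uint32_t src, uint32_t dst)
{
    uint32_t a = src >> 24, da = dst >> 24;
    return blend_rgb(src, dst) | ((a + (da * (a ^ 0xFF) >> 8)) << 24);
}

/* Proposed: the alpha channel is blended exactly like a colour channel. */
static uint32_t blend_uniform(uint32_t src, uint32_t dst)
{
    uint32_t a = src >> 24, da = dst >> 24;
    return blend_rgb(src, dst) | (blend_channel(a, da, a) << 24);
}
```

Calling the three variants with source `0x0f0f0f0f` against destinations `0x00000000` and `0xffffffff` gives the SDL 1.2 and SDL 2.0 results listed above, and `0x00000000` and `0xf0f0f0f0` for the uniform variant.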

If I were to rework the SDL 2.0 assembly, would it be acceptable to treat all components the same in this way? I'd rather not have to revisit it yet again if the equation is changed a year or two down the line. I can change `BlitRGBtoRGBPixelAlpha()` to match this behaviour easily enough, but I would need some assistance from someone familiar with MMX and 3DNow! to make those cases match.

