SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passed in fp32 regs for even more speed! (DSP-148) #95

### Is your feature request related to a problem?

ESP32(-S3) fp32 division is notoriously slow. It can be made several times faster by using a reciprocal asm sequence, which is accurate to 1 ULP, precise enough for most cases.
The ESP32(-S3) ABI specifies passing both a function's input args and its return value in general-purpose regs (A2-A15), even for floats. For inline assembly in an inlined C function that need not be the case: in the various scenarios I tested, both input and output stayed in fp32 regs (F0-F15) where possible, which avoids the extra register moves and surely speeds things up :)
This code was inspired by https://blog.llandsmeer.com/tech/2021/04/08/esp32-s2-fpu.html, which I significantly enhanced:
- 2 fewer asm instructions (no `wfr`/`rfr`)
- no fixed fp32 regs (gcc can freely choose which ones, which allows more optimizations)
- no `static` keyword for `recipsf2()` (so it is visible outside its source file)

Here it is, released into the **public domain**:

```
__attribute__((always_inline)) inline
float recipsf2(float input) {
    float result, temp;
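    asm(
        "recip0.s %0, %2\n"     // result = rough estimate of 1/input
        "const.s %1, 1\n"       // temp = 1.0
        "msub.s %1, %2, %0\n"   // temp = 1.0 - input * result (residual error)
        "madd.s %0, %0, %1\n"   // result += result * temp (1st refinement step)
        "const.s %1, 1\n"       // temp = 1.0
        "msub.s %1, %2, %0\n"   // temp = 1.0 - input * result
        "maddn.s %0, %0, %1\n"  // result += result * temp (2nd step, round-to-nearest variant)
        : "=&f"(result), "=&f"(temp) : "f"(input)
    );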
    return result;
}

// Fully parenthesized so the macro stays one unit inside larger expressions.
#define DIV(a, b) ((a) * recipsf2(b))
```

Cheers,

f4lc0n

- Fixed: Added `&` for the `temp` var so that it is mapped to a unique fp32 reg (in some `recipsf2()` usage cases it wasn't).
- Changed: Removed `volatile` after `asm`.
- Changed: Replaced the 1st `maddn.s` with `madd.s` so that it corresponds to the canonical reciprocal sequence in _Xtensa ISA Summary_ on p. 113.
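
For readers who want to see why two refinement steps are enough, here is a minimal portable sketch of the arithmetic behind `recipsf2()`. It is a model, not the hardware: `recip_seed_model()`, `recip_model()`, `recip_model_rel_err()` and `recip_model_selfcheck()` are hypothetical names, and the seed is faked by truncating the exact reciprocal to 7 mantissa bits, which is only an assumption about how coarse the `recip0.s` estimate is. If the estimate is `r = (1 - d) / x`, one step `r + r * (1 - x * r)` yields `(1 - d*d) / x`, so the relative error squares each time (roughly 2^-7, then 2^-14, then 2^-28).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for recip0.s. ASSUMPTION: the real instruction returns a
   low-precision estimate; this model fakes one by keeping only the
   sign, the exponent and the top 7 mantissa bits of 1.0f / x. */
static float recip_seed_model(float x) {
    float r = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF0000u;
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Two Newton-Raphson steps, mirroring the instruction pairs in the asm.
   Plain C arithmetic is used here, so the last bit may differ from what
   the hardware multiply-add instructions produce. */
static float recip_model(float x) {
    float r = recip_seed_model(x);   /* recip0.s         */
    float e = 1.0f - x * r;          /* const.s + msub.s */
    r = r + r * e;                   /* madd.s           */
    e = 1.0f - x * r;                /* const.s + msub.s */
    r = r + r * e;                   /* maddn.s          */
    return r;
}

/* Relative difference between the model and the compiler's 1.0f / x. */
static float recip_model_rel_err(float x) {
    float exact = 1.0f / x;
    float diff = (recip_model(x) - exact) / exact;
    return diff < 0.0f ? -diff : diff;
}

/* Checks a few sample inputs against a tolerance of a few ULP. */
static void recip_model_selfcheck(void) {
    const float samples[] = {0.001f, 0.3f, 1.0f, 3.0f, 7.5f, 1234.5f, -42.0f};
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(recip_model_rel_err(samples[i]) < 5e-7f);
    }
}
```

Calling `recip_model_selfcheck()` from a host-side test is enough to exercise the model; it says nothing about timing, which can only be measured on the target.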

### Describe the solution you'd like.

This solution can be added to your "math" section.
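
One possible way to package it is sketched below, under several assumptions. The names `fast_recipf()`, `fast_divf()` and `scale_down()` are hypothetical, the guard relies on the `__XTENSA__` and `__XTENSA_SOFT_FLOAT__` macros that GCC's Xtensa port is expected to predefine, and `static inline` is used only because this sketch is header-style (the version above deliberately omits `static`). On any other target it falls back to ordinary division, so the same code can be unit-tested on a host.

```c
#include <stddef.h>

#if defined(__XTENSA__) && !defined(__XTENSA_SOFT_FLOAT__)
/* Hardware path: same instruction sequence as recipsf2() above. */
static inline __attribute__((always_inline)) float fast_recipf(float x) {
    float result, temp;
    __asm__(
        "recip0.s %0, %2\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "madd.s %0, %0, %1\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "maddn.s %0, %0, %1\n"
        : "=&f"(result), "=&f"(temp) : "f"(x));
    return result;
}
#else
/* Fallback for hosts and for targets without a hardware FPU. */
static inline float fast_recipf(float x) {
    return 1.0f / x;
}
#endif

/* A function instead of a macro: each argument is evaluated exactly once
   and operator precedence at the call site cannot change the result. */
static inline float fast_divf(float a, float b) {
    return a * fast_recipf(b);
}

/* Example use: divide a whole buffer by one divisor with a single reciprocal. */
static void scale_down(float *v, size_t n, float divisor) {
    const float inv = fast_recipf(divisor);
    for (size_t i = 0; i < n; ++i) {
        v[i] *= inv;
    }
}
```

Hoisting the reciprocal out of the loop, as `scale_down()` does, replaces every per-element division with a multiplication, which is where this kind of helper tends to pay off most.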

### Describe alternatives you've considered.

Moving to the ESP32-P4, whose `fdiv.s` instruction takes just 3 cycles, is not always possible…

### Additional context.

_No response_
