Rounding other than round-to-nearest in floating point: Are these considered issues?

I've found a few issues in some of the floating point conversion functions that only occur in non-standard rounding modes (round to positive infinity, negative infinity, or zero - not the default and probably-sane round to nearest).

I've been working on "making SIMDe on x86 match ARMv8 exactly," and this shakes out some really weird bugs over time.

The core of the issue is that the use of `pow(2, n)` in the `simde_vcvth_n_f16_s16(int16_t a, const int n)` functions only works properly in round-to-nearest.  In round-to-zero or round-to-negative-infinity modes, this generates a result that is one LSB off the correct result.

The fix is simple enough:
```
-      HEDLEY_STATIC_CAST(simde_float64_t, a) / pow(2, n)));
+      HEDLEY_STATIC_CAST(simde_float64_t, a) / (UINT64_C(1) << n)));
```

Or for the 64-bit types, because 64 is a valid fixed point shift (and 1 << 64 is not a number), this matches hardware exactly.
```
-  return HEDLEY_STATIC_CAST(simde_float64_t, a) / pow(2, n);
+  return HEDLEY_STATIC_CAST(simde_float64_t, a) / ((n == 64) ? pow(2, n) : UINT64_C(1) << n);
```

The problem is that proving this in the test suite is... a challenge.  It starts getting into very hardware specific sort of incantations to set the floating point rounding modes, and I'm not sure a lot of x86-isms in the test suite are welcome.

This, however, is a simple test program on x86 that shows the nature of the problem and the subtle difference in results.  If you're not used to reading floating point as hex... sorry. :/  

MXCSR is set to round to negative infinity: https://help.totalview.io/current/HTML/index.html#page/TotalView/Intelx86MXSCRRegister.html


```
#include <xmmintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <cinttypes>

/*
g++ -o fcvtf_vfp fcvtf_vfp.cpp && ./fcvtf_vfp

a64_scvtf_vfp_s
fracbits: 1  mxcsr: 3f80
SIMDe result: 43d0000000000000 0000000000000000
XCVTF result: 43cfffffffffffff 0000000000000000
*/

int main()
{
	_mm_setcsr(0x3f80);
	int64_t foo = 0x7fffffffffffffff;
	uint8_t fbits = 1;
	double res1 = ((double)((int64_t)foo)) / ((uint64_t)1 << fbits);
	double res2 = (double)foo / pow(2, fbits);
	double res3 = (double)foo / (double)((uint64_t)1 << fbits);

	printf("((double)((int64_t)foo)) / ((uint64_t)1 << fbits): %lx\n", *((uint64_t*)&res1));
	printf("(double)foo / pow(2, fbits):                       %lx\n", *((uint64_t*)&res2));
	printf("(double)foo / (double)((uint64_t)1 << fbits):      %lx\n", *((uint64_t*)&res3));

	union { double d; uint64_t u; } buf;
	buf.d = pow(2, fbits);
	printf("pow(2, fbits) == %f / %016" PRIx64 "\n", buf.d, buf.u);
	buf.d = (double)(1 << fbits);
	printf("pow(2, fbits) == %f / %016" PRIx64 "\n", buf.d, buf.u);
}
```

The results ought be:
```
((double)((int64_t)foo)) / ((uint64_t)1 << fbits): 43cfffffffffffff
(double)foo / pow(2, fbits):                       43d0000000000000
(double)foo / (double)((uint64_t)1 << fbits):      43cfffffffffffff
pow(2, fbits) == 2.000000 / 3fffffffffffffff
pow(2, fbits) == 2.000000 / 4000000000000000
```

Again, this is only an issue in the non-standard rounding modes.  But the fix is also fairly straightforward to exactly match hardware.

If this is something of interest, I can push a fix for it.  But as SIMDe generally doesn't deal with non-standard rounding modes properly, I wanted to ask first.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Rounding other than round-to-nearest in floating point: Are these considered issues? #1260

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Rounding other than round-to-nearest in floating point: Are these considered issues? #1260

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions