Description
Motivation
Fused multiply–add (FMA) is a floating-point operation performed in one step, with a single rounding. FMA can speed up and improve the accuracy of many computations: dot product, matrix multiplication, convolutions and artificial neural networks, polynomial evaluation, Newton's method, Kahan summation, Veltkamp-Dekker algorithm. This instruction exist in languages like: C / C++, Rust, C#, Go, Java, Julia, Swift, OpenGL 4+.
Problem and existing solutions
In WASM there is no way to get speed improvement from this widely supported instruction. There are two way how to implement fallback if hardware is not support it: correctly rounded but slow and simple combination that fast but accumulate error.
For example OpenCL has both. In C/C++ you have to implement fma fallback by yourself but in Rust/Go/C#/Java/Julia fma implemented with correctly rounded fallback. It doesn't make much difference how it will be implemented because you always can detect fma feature in runtime initialisation with special floating point expression and implement conditional fallback as you wish in your software.
If at least first two basic instructions will be implemented it will be great step forward, because right now software that focused on precision need to implement correct fma fallback with more than 23 FLOPs instead of one. This can be finance application or arbitrary precision libraries or space flight simulator with orbital dynamic.
Proposed instructions
(f64.fma $x $y $z) ;; equivalent to fma(x, y, z) in C99
(f32.fma $x $y $z) ;; equivalent to fmaf(x, y, z) in C99
Usually languages compiles fma(x,y,-z) into fused multiply-subtract under the hood. Since .wasm is compilation target looks like all instruction set can be implemented.
(f64.fms $x $y $z) ;; RN(x * y - z) = fma(x,y,-z)
(f32.fms $x $y $z) ;;
(f64.fnma $x $y $z) ;; RN(-x * y + z)
(f32.fnma $x $y $z) ;;
(f64.fnms $x $y $z) ;; RN(-x * y - z)
(f32.fnms $x $y $z) ;;
But not everyone doing all of them because result with negation the same.
Implementation
Here a draft how to implement software fallback based on relatively new Boldo-Melquiond paper.
#include <math.h> //fma, FP_FAST_FMA
static inline double two_prod(const double x, const double y, double &err) {
double splitter = 0x8000001p0; //2^27+1
//float splitter = 0x2001p0; //2^13+1
double t = splitter * a;
double ah = t + (a - t);
double al = a - ah;
t = splitter * b;
double bh = t + (b - t);
double bl = b - bh;
t = a * b;
err = ((ah * bh - t) + ah * bl + al * bh) + al * bl;
return t;
}
static inline double two_sum(const double a, const double b, double &err){
double s = a + b;
double a1 = s - b;
err = (a - a1) + (b - (s - a1));
return s;
}
static inline double fma_correct_fallback(const double x, const double y, const double z) {
// Check overflows
// Veltkamp-Dekker multiplication: x * y -> (mul, err)
// Moller-Knuth summation: mul + z -> (sum, err2)
// Boldo-Melquiond ternary summation: sum + err + err2 -> fma
}
static inline double f64_fma(const double x, const double y, const double z) {
#ifdef FP_FAST_FMA
return fma(x, y, z);
#else
return fma_correct_fallback(x, y, z);
#endif
}