Skip to content

FMA instruction #1391

Open
Open
@munrocket

Description

@munrocket

Motivation

Fused multiply–add (FMA) is a floating-point operation performed in one step, with a single rounding. FMA can speed up and improve the accuracy of many computations: dot product, matrix multiplication, convolutions and artificial neural networks, polynomial evaluation, Newton's method, Kahan summation, Veltkamp-Dekker algorithm. This instruction exist in languages like: C / C++, Rust, C#, Go, Java, Julia, Swift, OpenGL 4+.

Problem and existing solutions

In WASM there is no way to get speed improvement from this widely supported instruction. There are two way how to implement fallback if hardware is not support it: correctly rounded but slow and simple combination that fast but accumulate error.

For example OpenCL has both. In C/C++ you have to implement fma fallback by yourself but in Rust/Go/C#/Java/Julia fma implemented with correctly rounded fallback. It doesn't make much difference how it will be implemented because you always can detect fma feature in runtime initialisation with special floating point expression and implement conditional fallback as you wish in your software.

If at least first two basic instructions will be implemented it will be great step forward, because right now software that focused on precision need to implement correct fma fallback with more than 23 FLOPs instead of one. This can be finance application or arbitrary precision libraries or space flight simulator with orbital dynamic.

Proposed instructions

(f64.fma $x $y $z) ;; equivalent to fma(x, y, z) in C99
(f32.fma $x $y $z) ;; equivalent to fmaf(x, y, z) in C99

Usually languages compiles fma(x,y,-z) into fused multiply-subtract under the hood. Since .wasm is compilation target looks like all instruction set can be implemented.

(f64.fms $x $y $z) ;; RN(x * y - z) = fma(x,y,-z)
(f32.fms $x $y $z) ;;
(f64.fnma $x $y $z) ;; RN(-x * y + z)
(f32.fnma $x $y $z) ;;
(f64.fnms $x $y $z) ;; RN(-x * y - z)
(f32.fnms $x $y $z) ;;

But not everyone doing all of them because result with negation the same.

Implementation

Here a draft how to implement software fallback based on relatively new Boldo-Melquiond paper.

#include <math.h> //fma, FP_FAST_FMA

static inline double two_prod(const double x, const double y, double &err) {
  double splitter = 0x8000001p0; //2^27+1
  //float splitter = 0x2001p0; //2^13+1

  double t = splitter * a;
  double ah = t + (a - t);
  double al = a - ah;

  t = splitter * b;
  double bh = t + (b - t);
  double bl = b - bh;
  t = a * b;
  
  err = ((ah * bh - t) + ah * bl + al * bh) + al * bl;
  return t;
}

static inline double two_sum(const double a, const double b, double &err){
  double s = a + b;
  double a1  = s - b;

  err = (a - a1) + (b - (s - a1));
  return s;
}

static inline double fma_correct_fallback(const double x, const double y, const double z) {
  // Check overflows
  // Veltkamp-Dekker multiplication: x * y -> (mul, err)
  // Moller-Knuth summation: mul + z -> (sum, err2)
  // Boldo-Melquiond ternary summation: sum + err + err2 -> fma
}

static inline double f64_fma(const double x, const double y, const double z) {
  #ifdef FP_FAST_FMA
    return fma(x, y, z);
  #else
    return fma_correct_fallback(x, y, z);
  #endif
}

But chromium and apple already had some own implementation.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions