This proc
@proc
def sgemv(
    alpha: f32,
    beta: f32,
    m: size,
    n: size,
    a: f32[m, n],
    x: f32[n],
    y: f32[m],
):
    for i in seq(0, m):
        y[i] = beta * y[i]
        for j in seq(0, n):
            y[i] += alpha * x[j] * a[i, j]
compiles to the following code:
void sgemv(void *ctxt, const float *alpha, const float *beta, int_fast32_t m,
           int_fast32_t n, const float *a, const float *x, float *y) {
  for (int i = 0; i < m; i++) {
    y[(i) * (1)] = *beta * y[(i) * (1)];
    for (int j = 0; j < n; j++) {
      y[(i) * (1)] += *alpha * x[(j) * (1)] * a[(i) * (n) + (j) * (1)];
    }
  }
}
However, because alpha is dereferenced inside the inner loop, it is reloaded on every iteration (and beta is reloaded on every iteration of the outer loop). The C compiler cannot hoist these loads on its own: y is not restrict-qualified, so as far as the compiler can tell, each store to y[i] might modify the float that alpha or beta points to. The code we should generate looks more like this:
void sgemv(void *ctxt, const float *alpha, const float *beta, int_fast32_t m,
           int_fast32_t n, const float *a, const float *x, float *y) {
  const float alpha_ = *alpha;
  const float beta_ = *beta;
  for (int i = 0; i < m; i++) {
    y[(i) * (1)] = beta_ * y[(i) * (1)];
    for (int j = 0; j < n; j++) {
      y[(i) * (1)] += alpha_ * x[(j) * (1)] * a[(i) * (n) + (j) * (1)];
    }
  }
}
Here, each scalar is loaded exactly once, at the start of the function, and the loops use the local copies.
See this Godbolt link for the assembly diff: https://gcc.godbolt.org/z/WExnhEd68
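To make the aliasing concern concrete, here is a minimal sketch (not the generated code itself) that runs both shapes of the kernel with alpha deliberately pointing at y[0]. The names sgemv_reload, sgemv_hoisted, run_aliased, and alias_demo are hypothetical; the ctxt parameter is dropped, size_t replaces int_fast32_t, and the index expressions are simplified. The two versions produce different results under this aliasing, which is why a C compiler must keep the reload unless it is told the pointers cannot overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Dereferences alpha and beta at each use, like the currently generated code. */
void sgemv_reload(const float *alpha, const float *beta, size_t m, size_t n,
                  const float *a, const float *x, float *y) {
  for (size_t i = 0; i < m; i++) {
    y[i] = *beta * y[i];
    for (size_t j = 0; j < n; j++) {
      y[i] += *alpha * x[j] * a[i * n + j];
    }
  }
}

/* Copies the scalars into locals once, like the proposed generated code. */
void sgemv_hoisted(const float *alpha, const float *beta, size_t m, size_t n,
                   const float *a, const float *x, float *y) {
  const float alpha_ = *alpha;
  const float beta_ = *beta;
  for (size_t i = 0; i < m; i++) {
    y[i] = beta_ * y[i];
    for (size_t j = 0; j < n; j++) {
      y[i] += alpha_ * x[j] * a[i * n + j];
    }
  }
}

/* Runs one variant on a 1x2 problem with alpha aliasing y[0]; returns y[0]. */
float run_aliased(int hoisted) {
  const float beta = 1.0f;
  const float a[2] = {1.0f, 1.0f};
  const float x[2] = {1.0f, 1.0f};
  float y[1] = {2.0f};
  if (hoisted) {
    sgemv_hoisted(&y[0], &beta, 1, 2, a, x, y);
  } else {
    sgemv_reload(&y[0], &beta, 1, 2, a, x, y);
  }
  return y[0];
}

/* The reloading version sees each update to y[0] through alpha: 2 -> 4 -> 8.
   The hoisted version keeps using the initial value 2:          2 -> 4 -> 6. */
void alias_demo(void) {
  assert(run_aliased(0) == 8.0f);
  assert(run_aliased(1) == 6.0f);
}
```

Calling alias_demo() from a main function exercises both variants. Hoisting the loads in the generated code is only a behavior-preserving change if scalar arguments are assumed never to overlap with output buffers, which seems a reasonable assumption for this proc but is an assumption nonetheless.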