Description
While comparing some Chapel code with Fortran code, it was observed that many lines of promoted operations was slower than the equivalent code written with a single forall and no promoted operations. The following code snippet is a complete reproducer
use Time;
use Math;
config const n = 50_000;
config const ns = 5000;
config const print = false;
config const explicitForall = false;
var A, B, C, D, E, F: [1..n] real;
var RES1, RES2: [1..n] real;
writeln("Using numTasks = ", here.maxTaskPar);
var t = new stopwatch();
t.start();
if explicitForall {
doExplicitForall();
} else {
doPromotion();
}
t.stop();
writeln("Execution time: ", t.elapsed(), " seconds");
if print {
writeln("RHO: ", RES1);
writeln("RE: ", RES2);
}
proc doExplicitForall() {
for 1..ns {
forall ii in 1..n {
RES1(ii) = A(ii) * B(ii) + (1.0-A(ii)) * C(ii);
RES2(ii) = RES1(ii) * D(ii) * E(ii) / F(ii);
}
}
}
proc doPromotion() {
for 1..ns {
RES1 = A * B + (1.0-A) * C;
RES2 = RES1 * D * E / F;
}
}
Running this on a linux64 system with 32 cores, I see the following
> ./example --ns 10000 --n 1_000_000
Using numTasks = 32
Execution time: 0.820101 seconds
> ./example --ns 10000 --n 1_000_000 --explicitForall
Using numTasks = 32
Execution time: 0.637187 seconds
Essentially, the promoted version is resulting in two parallel loops, so there is some extra overhead and work done. The promoted version could be notionally written as follows
proc doPromotionExplicit() {
for 1..ns {
[(r, a, b, c) in zip(RES1, A, B, C)] r = a * b + (1.0-a) * c;
[(r1, r2, d, e, f) in zip(RES1, RES2, D, E, F)] r2 = r1 * d * e / f;
}
}
Combining these loops into one results doExplicitForall
function in the original code. Note that I have opted to iterate over the domain rather than the zip, but the result should be the same.
Should the Chapel compiler do this automatically, essentially fusing parallel loops together? We could restrict this kind of optimization to "do these arrays all share the same domain", but I would go a step further and say "can we prove these arrays have the same loop trip count". This would allow users to write the much cleaner and concise array operations, but get similar performance as the explicit forall case.
For the full context on the original discussion, see https://chapel.discourse.group/t/chapel-performance-test-on-wsl-ubuntu/42122