
Can/should the Chapel compiler do loop fusing of whole-array statements / promotions? #27076

Open
@jabraham17

Description

@jabraham17

While comparing some Chapel code with Fortran code, it was observed that many lines of promoted operations were slower than the equivalent code written with a single forall and no promoted operations. The following code snippet is a complete reproducer:

use Time;
use Math;

config const n = 50_000;
config const ns = 5000;
config const print = false;
config const explicitForall = false;

var A, B, C, D, E, F: [1..n] real;
var RES1, RES2: [1..n] real;

writeln("Using numTasks = ", here.maxTaskPar);

var t = new stopwatch();
t.start();
if explicitForall {
  doExplicitForall();
} else {
  doPromotion();
}
t.stop();
writeln("Execution time: ", t.elapsed(), " seconds");

if print {
  writeln("RHO: ", RES1);
  writeln("RE: ", RES2);
}

proc doExplicitForall() {
  for 1..ns {
    forall ii in 1..n {
      RES1(ii) = A(ii) * B(ii) + (1.0-A(ii)) * C(ii);
      RES2(ii) = RES1(ii) * D(ii) * E(ii) / F(ii);
    }
  }
}

proc doPromotion() {
  for 1..ns {
    RES1 = A * B + (1.0-A) * C;
    RES2 = RES1 * D * E / F;
  }
}

Running this on a linux64 system with 32 cores, I see the following

> ./example --ns 10000 --n 1_000_000 
Using numTasks = 32
Execution time: 0.820101 seconds

> ./example --ns 10000 --n 1_000_000 --explicitForall
Using numTasks = 32
Execution time: 0.637187 seconds

Essentially, the promoted version results in two parallel loops per iteration, so there is extra overhead and extra work done. The promoted version could be notionally written as follows:

proc doPromotionExplicit() {
  for 1..ns {
    [(r, a, b, c) in zip(RES1, A, B, C)] r = a * b + (1.0-a) * c;
    [(r1, r2, d, e, f) in zip(RES1, RES2, D, E, F)] r2 = r1 * d * e / f;
  }
}

Combining these loops into one yields the doExplicitForall function in the original code. Note that there I have opted to iterate over the domain rather than the zip, but the result should be the same.

Should the Chapel compiler do this automatically, essentially fusing parallel loops together? We could restrict this kind of optimization to "do these arrays all share the same domain", but I would go a step further and say "can we prove these arrays have the same loop trip count". This would allow users to write the much cleaner and more concise array operations while getting performance similar to the explicit forall case.
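For illustration, a fused version of the two promoted statements above could notionally look like the following (this is a sketch of what such an optimization might produce, not compiler output; it is roughly equivalent to the body of doExplicitForall, written as a single zippered bracket loop):

```chapel
// Hypothetical fused form: one parallel loop over all eight arrays,
// instead of two separate parallel loops (one per promoted statement).
[(r1, r2, a, b, c, d, e, f) in zip(RES1, RES2, A, B, C, D, E, F)] {
  r1 = a * b + (1.0 - a) * c;
  r2 = r1 * d * e / f;
}
```

Note the fused form reads r1 directly rather than re-loading RES1 in a second loop, which is where some of the saved work would come from.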

For the full context on the original discussion, see https://chapel.discourse.group/t/chapel-performance-test-on-wsl-ubuntu/42122
