Can/should the Chapel compiler do loop fusing of whole-array statements / promotions?

While comparing some Chapel code with Fortran code, it was observed that many lines of promoted operations was slower than the equivalent code written with a single forall and no promoted operations. The following code snippet is a complete reproducer

```chapel
use Time;
use Math;

config const n = 50_000;
config const ns = 5000;
config const print = false;
config const explicitForall = false;

var A, B, C, D, E, F: [1..n] real;
var RES1, RES2: [1..n] real;

writeln("Using numTasks = ", here.maxTaskPar);

var t = new stopwatch();
t.start();
if explicitForall {
  doExplicitForall();
} else {
  doPromotion();
}
t.stop();
writeln("Execution time: ", t.elapsed(), " seconds");

if print {
  writeln("RHO: ", RES1);
  writeln("RE: ", RES2);
}

proc doExplicitForall() {
  for 1..ns {
    forall ii in 1..n {
      RES1(ii) = A(ii) * B(ii) + (1.0-A(ii)) * C(ii);
      RES2(ii) = RES1(ii) * D(ii) * E(ii) / F(ii);
    }
  }
}

proc doPromotion() {
  for 1..ns {
    RES1 = A * B + (1.0-A) * C;
    RES2 = RES1 * D * E / F;
  }
}
```

Running this on a linux64 system with 32 cores, I see the following

```bash
> ./example --ns 10000 --n 1_000_000 
Using numTasks = 32
Execution time: 0.820101 seconds

> ./example --ns 10000 --n 1_000_000 --explicitForall
Using numTasks = 32
Execution time: 0.637187 seconds
```

Essentially, the promoted version is resulting in two parallel loops, so there is some extra overhead and work done. The promoted version could be notionally written as follows

```chapel
proc doPromotionExplicit() {
  for 1..ns {
    [(r, a, b, c) in zip(RES1, A, B, C)] r = a * b + (1.0-a) * c;
    [(r1, r2, d, e, f) in zip(RES1, RES2, D, E, F)] r2 = r1 * d * e / f;
  }
}
```

Combining these loops into one results `doExplicitForall` function in the original code. Note that I have opted to iterate over the domain rather than the zip, but the result *should* be the same.

Should the Chapel compiler do this automatically, essentially fusing parallel loops together? We could restrict this kind of optimization to "do these arrays all share the same domain", but I would go a step further and say "can we prove these arrays have the same loop trip count". This would allow users to write the much cleaner and concise array operations, but get similar performance as the explicit forall case.

For the full context on the original discussion, see https://chapel.discourse.group/t/chapel-performance-test-on-wsl-ubuntu/42122

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Can/should the Chapel compiler do loop fusing of whole-array statements / promotions? #27076

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Can/should the Chapel compiler do loop fusing of whole-array statements / promotions? #27076

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions