Improve "vector" task spawn performance #47

Open
@ronawho

It seems like there should be some room for improving the performance of spawning a bunch of tasks at once. Currently, for coforalls, Chapel basically calls qthread_fork in a for loop; i.e., we turn something like:

coforall i in 1..10 { noop(i); }

into something roughly like:

var endCount: atomic int;
endCount.add((1..10).size);
for i in 1..10 {
  args.i = i;
  args.endCount = endCount;
  qthread_fork(taskWrapper, args, NULL);     
}
endCount.waitFor(0);

proc taskWrapper(args) {
  var i = args.i;
  var endCount = args.endCount;
  noop(i);
  endCount.sub(1);
}

This works really well for us (performance on par with gcc's OpenMP, though behind intel/cce), but it seems like there should be a way to improve performance by enqueuing multiple tasks at once.


There is qt_loop, which does tree spawning, but it seems to perform worse than the for+fork idiom.

For a simple noop test, I see the Chapel-like scheme being about 2X faster than qt_loop_dc (and all the other qt_loop variants).

#include <stdint.h>
#include <stdio.h>

#include <qthread/qthread.h>
#include <qthread/qloop.h>
#include <qthread/qtimer.h>

/* Task body: decrement the done count so the parent can detect completion. */
static aligned_t decTask(void* arg) {
  aligned_t *doneCount = (aligned_t *)arg;
  qthread_incr(doneCount, -1);

  return 0;
}
/* Chapel-style spawn: one qthread_fork per task, then spin until all finish. */
static void qtChplLikeTaskSpawn(int64_t trials, int64_t numTasks) {
  int i, j;

  for (i = 0; i < trials; i++) {
    aligned_t doneCount = 0;
    qthread_incr(&doneCount, numTasks);
    for (j = 0; j < numTasks; j++) {
      qthread_fork(decTask, &doneCount, NULL);
    }
    while (qthread_incr(&doneCount, 0)) {  /* incr by 0 == atomic read */
      qthread_yield();
    }
  }
}

static void emptyFunction(size_t start, size_t stop, void* arg) { }

/* qt_loop-style spawn: let qt_loop_dc tree-spawn numTasks empty iterations. */
static void qtLoopTaskSpawn(int64_t trials, int64_t numTasks) {
  int i;
  for (i = 0; i < trials; i++) {
    qt_loop_dc(0, numTasks, emptyFunction, NULL);
  }
}

int main(void) {
  qthread_initialize();
  int64_t numTasks = qthread_num_workers();
  int64_t numTrials = 500000;

  qtimer_t t = qtimer_create();
  qtimer_start(t);
  {
    qtLoopTaskSpawn    (numTrials, numTasks);
    //qtChplLikeTaskSpawn(numTrials, numTasks);
  }
  qtimer_stop(t);
  printf("Elapsed time for %ld workers: %f\n", (long)numTasks, qtimer_secs(t));

  return 0;
}

On a 24 core (dual 12-core) haswell machine:

spawn scheme           time (s)
for+fork (chpl-like)   ~3.25
qt_loop                ~7.15

Similar results on an older 4 core (single socket) core 2 quad machine:

spawn scheme           time (s)
for+fork (chpl-like)   ~1.15
qt_loop                ~2.45

Those numbers are with nemesis. With distrib, absolute times were slower, but the trend between for+fork and qt_loop was similar.

qt_loop probably isn't the exact interface we'd want anyway, but I figured it's a good test of "optimized" spawn performance. We'd probably want an interface that takes the number of tasks to create, the function to call, and an array of args, instead of qt_loop's single arg shared by all tasks.

It looks like qt_loop does a fair number of unpooled allocations, which could be a cause of the overhead, but I haven't dug in enough to know for sure.
