Improve "vector" task spawn performance #47

Open
@ronawho

It seems like there should be some room for improving the performance of spawning a bunch of tasks at once. Currently, for coforalls, Chapel basically calls qthread_fork in a for loop; i.e., we turn something like:

coforall i in 1..10 { noop(i); }

into something roughly like:

var endCount: atomic int;
endCount.add((1..10).size);
for i in 1..10 {
  args.i = i;
  args.endCount = endCount;
  qthread_fork(taskWrapper, args, NULL);     
}
endCount.waitFor(0);

proc taskWrapper(args) {
  var i = args.i;
  var endCount = args.endCount;
  noop(i);
  endCount.sub(1);
}

This works really well for us (performance on par with gcc's OpenMP, though behind intel/cce), but it seems like there should be a way to improve performance by enqueuing multiple tasks at once.


There is qt_loop, which does tree spawning, but it seems to perform worse than the for+fork idiom.

For a simple noop test, I see the Chapel-like scheme being about 2X faster than qt_loop_dc (and all the other qt_loop variants).

#include <stdint.h>
#include <stdio.h>

#include <qthread/qthread.h>
#include <qthread/qloop.h>
#include <qthread/qtimer.h>

/* Task body: decrement the done count so the parent can detect completion. */
static aligned_t decTask(void* arg) {
  aligned_t *doneCount = (aligned_t *)arg;
  qthread_incr(doneCount, -1);

  return 0;
}
/* Chapel-style spawn: one qthread_fork per task, then spin until all finish. */
static void qtChplLikeTaskSpawn(int64_t trials, int64_t numTasks) {
  int i, j;

  for (i = 0; i < trials; i++) {
    aligned_t doneCount = 0;
    qthread_incr(&doneCount, numTasks);
    for (j = 0; j < numTasks; j++) {
      qthread_fork(decTask, &doneCount, NULL);
    }
    while (qthread_incr(&doneCount, 0)) {  /* incr by 0 == atomic read */
      qthread_yield();
    }
  }
}

static void emptyFunction(size_t start, size_t stop, void* arg) { }

/* qt_loop-style spawn: let qt_loop_dc tree-spawn numTasks empty iterations. */
static void qtLoopTaskSpawn(int64_t trials, int64_t numTasks) {
  int i;
  for (i = 0; i < trials; i++) {
    qt_loop_dc(0, numTasks, emptyFunction, NULL);
  }
}

int main(void) {
  qthread_initialize();
  int64_t numTasks = qthread_num_workers();
  int64_t numTrials = 500000;

  qtimer_t t = qtimer_create();
  qtimer_start(t);
  {
    qtLoopTaskSpawn    (numTrials, numTasks);
    //qtChplLikeTaskSpawn(numTrials, numTasks);
  }
  qtimer_stop(t);
  printf("Elapsed time for %ld workers: %f\n", (long)numTasks, qtimer_secs(t));

  return 0;
}

On a 24 core (dual 12-core) haswell machine:

spawn scheme           time (s)
for+fork (chpl-like)   ~3.25
qt_loop                ~7.15

Similar results on an older 4 core (single socket) core 2 quad machine:

spawn scheme           time (s)
for+fork (chpl-like)   ~1.15
qt_loop                ~2.45

Those numbers are with nemesis. With distrib, absolute times were slower, but the trend between for+fork and qt_loop was similar.

qt_loop probably isn't the exact interface we'd want anyway, but I figured it's a good test of "optimized" spawn performance. We'd probably want an interface that takes the number of tasks to create, the function to call, and an array of args, instead of qt_loop's single arg shared by all tasks.

It looks like qt_loop does a fair number of unpooled allocations, which could be a cause of the overhead, but I haven't dug in enough to know for sure.
