Description
Ideas
Segmented Flat Array - Custom Domain
Use the segmented idea from FlatObjectPool to allow concurrent operations during resizing. Could potentially work with distributions as well as layouts.
FCHLock - Localization
Localize the data when we obtain combiner status, perform the combined operations on the local copy, then write it back later.
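A minimal sketch of the localization idea, assuming a hypothetical FCHLock API in which acquiring the lock reports combiner status ('acquire', 'requests', and 'apply' are illustrative names, not an actual API):

```chapel
// Sketch only: all names here are hypothetical.
if lock.acquire() == Status.Combiner {
  var localData = sharedData;            // localize: one remote read
  for req in lock.requests() do          // serve pending requests locally
    req.result = apply(req.op, localData);
  sharedData = localData;                // write back: one remote write
  lock.release();
}
```

The point is that the combiner pays the communication cost once per combining pass, rather than once per request.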
MPMC List - Per Task Blocks
Maintain here.maxTaskPar blocks of data, one block per task.
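This could look roughly as follows, assuming each task can derive a stable slot index (the task-id parameter is an assumption for illustration):

```chapel
use List;

// Sketch: one block of data per task slot on this locale.
record Block {
  var data: list(int);                 // element type is illustrative
}

var blocks: [0..#here.maxTaskPar] Block;

proc add(taskId: int, x: int) {
  // Each task touches only its own block, avoiding contention.
  blocks[taskId % here.maxTaskPar].data.pushBack(x);
}
```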
Global Queue using GDT
Idea: Enqueue 'promises' like normal, but this time perform an exchange operation on the tail.
// Inside 'add' function...
var node = new unmanaged Node();   // assumes a 'Node' class with a 'next' field
var prev = tail.exchange(node);    // wait-free: claim the old tail
if prev == nil then
  head.write(node);                // queue was empty; node is also the head
else
  prev.next = node;                // link the old tail to the new node
Of course, using GDT for global objects!
Active Request Bias
Have separate active-request lists so that a locale serves only its own tasks. Fairness or starvation-freedom can be ensured by giving the combiner thread access to another locale's requests when that locale has pending work.
FIFO Extension
Use the above algorithm for obtaining the head index. This eliminates contention and ensures good system throughput (aside from communication cost, which is necessary anyway). Where the number of contending compare-and-swap operations would otherwise be O(numLocales), here it is zero, with the drawback of requiring communication (which is unavoidable).
Benchmark Framework Tool
Idea: Convert this from Go to Chapel so that benchmarks and tests can be run with more ease.
Queue
Global wait free queue + FCHLock
Idea: Wait-freedom guarantees a bounded number of remote operations, which ensures forward progress and (eventual) scalability. By using global atomics, we can make use of a bounded global queue that is proven to scale very well. Since the queue is only SPMC (without MCAS), we can use the flat-combining strategy to ensure scalability even with a single producer. Consumers use the normal wait-free dequeue.
Overall this ensures global wait-freedom! The queue itself is wait-free, and because the operations to be combined are wait-free, the wait for a task to have its operation completed is bounded, assuming that scheduling inside FCHLock is adequate. Overall, this can guarantee scalability for bounded, globally FIFO queues.
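The shape of the combination might be sketched like this ('fcLock', 'EnqueueRequest', and 'waitFreeDequeue' are illustrative names, not an actual API):

```chapel
// Producers funnel through flat combining, so the underlying SPMC queue
// only ever sees a single producer: the current combiner.
proc enqueue(x: int) {
  fcLock.combine(new EnqueueRequest(x));
}

// Consumers bypass the lock and use the queue's wait-free dequeue directly.
proc dequeue(): int {
  return queue.waitFreeDequeue();
}
```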
Round Robin + Work Stealing
Idea: Reduce the amount of work stealing by performing a round-robin enqueue like FIFO does.
Rationale: The queue has more consumers than producers, meaning it won't need to steal work too often. This might increase the overall throughput of the queue.
Drawback: Enqueue does scale but is orders of magnitude slower than a local enqueue. Good only if enqueue is not a high priority.
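A round-robin enqueue could be sketched as follows ('localQueue' is an assumed per-locale queue, not an actual API):

```chapel
var rr: atomic int;

proc enqueue(x: int) {
  // Spread enqueues across locales like FIFO does, so that consumers
  // find local work and rarely need to steal.
  const loc = Locales[rr.fetchAdd(1) % numLocales];
  on loc do localQueue.enqueue(x);
}
```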
Optimization
FIFO
Allow enqueues and dequeues to skip indexes...
Idea: Have enqueue and dequeue rely solely on fetchAdd to claim an index, then have them update another set of global arrays using fetchAdd for enqueue or fetchSub for dequeue. If during an enqueue a fetchAdd yields a negative value, skip that index. If during a dequeue a fetchSub yields a negative value, skip it.
Rationale: Enqueues proceed unhindered with only one additional wait-free atomic operation, and dequeues rely on just two wait-free atomic operations.
Drawback: A large number of dequeues means an equally large number of enqueues must retry, but the scheme is fair and should scale.
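The skip logic might be sketched like this, with 'slots[i]' counting pending elements at index i (all names and the wraparound scheme are illustrative assumptions):

```chapel
var slots: [0..#capacity] atomic int;   // per-index element counters
var buffer: [0..#capacity] int;
var enqIdx, deqIdx: atomic int;

proc enqueue(x: int) {
  do {
    const i = enqIdx.fetchAdd(1) % capacity;   // claim an index, wait-free
    // A negative result means a dequeuer already gave up on this slot: skip.
    if slots[i].fetchAdd(1) >= 0 {
      buffer[i] = x;
      return;
    }
  } while true;
}

proc dequeue(): (bool, int) {
  const i = deqIdx.fetchAdd(1) % capacity;
  // A fetchSub yielding a non-positive old value means the slot is empty: skip.
  if slots[i].fetchSub(1) > 0 then
    return (true, buffer[i]);
  return (false, 0);
}
```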
Abstractions
QueueSet
Idea: Have a set of queues linked together that allow work stealing. This is the MPMC queue, but in a more reusable form that allows multiple queues, even multiple local queues, all linked together.
RCU With Descriptors
Idea: Have descriptors into a locale's address space which adds one more step of indirection.
Rationale: Allows atomic updates of variables, especially with RCU for record types. A compare-and-swap becomes a single operation on descriptors; reads and writes likewise operate only on descriptors. RCU can read the descriptor, copy and modify what was read into a new descriptor, and swap the pointer to the new one (with reference counting to reclaim the old one).
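One way to picture a descriptor-based RCU update for a record type. This is a sketch only: Chapel atomics do not directly support class references, so the atomic descriptor pointer here is purely for illustration.

```chapel
record Data { var x, y: int; }

class Descriptor {
  var payload: Data;
  var refCount: atomic int;            // reclaim old descriptors by refcount
}

var current: atomic unmanaged Descriptor;   // one extra level of indirection

proc rcuUpdate(modify) {
  do {
    const old = current.read();                        // read the descriptor
    var copy = new unmanaged Descriptor(old.payload);  // copy...
    modify(copy.payload);                              // ...and modify
    // Replace: the compare-and-swap touches only the descriptor pointer.
  } while !current.compareAndSwap(old, copy);
  // The old descriptor is dropped via its reference count (not shown).
}
```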