Description
This is of importance both to the project and to Chapel as a whole (or so I believe).
This early prototype of the Global Descriptor Table (GDT), formerly referred to as
GlobalAtomicObject, demonstrates the potential power of global atomics. It also
corrects a few key misconceptions that I have held, and been told, since the beginning...
Communication Is NOT Bad
Communication, even if it happens per-operation, is not inherently bad (though it isn't free either);
the true bottleneck is contention. I've tested time and time again, and each time,
an algorithm built around a contention-causing operation that can fail an unbounded number
of times, and hence issue an unbounded number of communications, has performed horribly.
For example, a loop like this is bad...
while true {
  var _x = x.read();   // remote read
  var _y = y.read();   // remote read
  // the CAS can fail repeatedly under contention, forcing both reads to be redone
  if _x < _y && x.compareExchangeStrong(_x, _x + 1) {
    break;
  }
}
A loop like this will cause an unbounded number of communications. Reading x and y is relatively
expensive, and the CAS operation not only causes remote contention, but also forces both x and y
to be read again after every failed attempt, so performance drops.
Now, algorithms whose operations are guaranteed to succeed are also the ones guaranteed to scale
very well; that is, only wait-free algorithms show real scalability here. Code that uses
fetchAdd and exchange works wonderfully, which gives me hope that a global data
structure with very loose guarantees is possible and useful.
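As a concrete (hypothetical) contrast to the CAS loop above, here is a sketch of the wait-free
pattern that does scale: every operation is a single network atomic that always succeeds, so there
is no retry loop and no repeated remote reads. The names and counts are illustrative only.

// Sketch only: one network atomic per operation, no retries needed.
config const opsPerLocale = 1000;
var counter : atomic int;

coforall loc in Locales do on loc {
  for 1..opsPerLocale {
    counter.fetchAdd(1);   // always succeeds in a single communication
  }
}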
Global Descriptor Table
Next, is the new and novel idea of the GDT
, the Global Descriptor Table. A 64-bit word
is used over a 128-bit wide pointer allowing us to take advantage of the benefits of
network atomics. In essence, we encode the locale number in the upper 32 bits and
the actual index in the lower 32 bits. Currently, an array is used, but in reality (perhaps
with runtime support if not already available) its possible that a descriptor can be directly
used like a normal pointer. Currently there is a large amount of overhead in needing
to keep a bitmap of usable memory and it cannot be resized without needing to synchronize
all accesses to it (as Chapel domains and arrays cannot be resized in a lock-free way).
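To make the encoding concrete, here is a minimal sketch of how such a descriptor could be packed
and unpacked; the proc names are mine, not the prototype's.

// Pack a (locale, index) pair into one 64-bit word that network atomics can
// operate on directly; unpack it to recover the locale and table index.
proc encodeDescr(localeId : uint(32), idx : uint(32)) : uint(64) {
  return ((localeId : uint(64)) << 32) | (idx : uint(64));
}

proc decodeDescr(descr : uint(64)) : (uint(32), uint(32)) {
  return ((descr >> 32) : uint(32), (descr & 0xFFFFFFFF) : uint(32));
}

// The packed descriptor can then live in a plain 64-bit atomic:
var slot : atomic uint(64);
slot.write(encodeDescr(here.id : uint(32), 42));
const (ownerLocale, tableIdx) = decodeDescr(slot.read());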
So far, the GDT has been tested with simple exchange operations on class instances, performed
remotely across all locales, versus needing to acquire a sync variable to do the same.
As stated above, a compareExchangeStrong kills performance, but that has nothing to do
with the GDT itself and everything to do with network atomic contention, so the test is kept
simple. It works and it scales.
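For reference, here is a rough sketch of the shape of the two approaches being compared; it is
illustrative only, not the actual benchmark code, and the names are mine.

// sync-based: a class reference guarded by a sync variable (lock-based swap).
class Obj { var payload : int; }

var lock$ : sync bool = true;                  // full means "unlocked"
var current : unmanaged Obj = new unmanaged Obj(0);

proc syncSwap(newObj : unmanaged Obj) : unmanaged Obj {
  lock$.readFE();                              // acquire (full -> empty)
  const old = current;
  current = newObj;
  lock$.writeEF(true);                         // release (empty -> full)
  return old;
}

// GDT-style: a 64-bit descriptor swapped with a single network atomic.
var gdtSlot : atomic uint(64);

proc gdtSwap(newDescr : uint(64)) : uint(64) {
  return gdtSlot.exchange(newDescr);           // one remote atomic, no locking
}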
The graph below shows the time to complete the same number of operations
(100,000 in this case). It shows the overall runtime, averaged over 4 trials
(discarding the first warm-up run): while sync grows at a near-exponential rate,
the GDT remains linear.
Now, to get a better look at it, here are the same results in Operations per Second.
Implications
I believe this is some very valuable information, which is why I include Michael as well.
The implementation is still very immature (there are so many optimizations yet to be made
that there's no telling how much further this can scale), and yet it already surpasses
the only existing way to perform atomics on global class instances. As well, this opens up some
very unconventional means of concurrency, the first being DSMLock (WIP), which is
based on DSM-Synch (Distributed Shared Memory Combined Synchronization)
from the publication that introduced CC-Synch (Cache-Coherent Combined Synchronization). Now that
I can confirm that this approach scales, I may even be able to make DSMLock scale well enough
that global lock-based algorithms can scale too (not just wait-free ones). Extremely exciting!
Edit:
If this is extended to the runtime, I can imagine the entire global address space being chunked up into 4GB (2^32 byte) zones, with 2^8 zones per locale and 2^24 locales. With 256 zones of 4GB each, that's 1TB of address space per locale, across 16M+ locales.
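For illustration, here is a purely hypothetical sketch of how a 64-bit descriptor might be laid out
under that scheme; none of this exists in the prototype.

// Hypothetical layout: [ 24 bits locale | 8 bits zone | 32 bits offset ]
// 2^24 locales * 2^8 zones * 2^32 bytes per zone covers the full 64-bit space,
// i.e. 256 zones * 4GB = 1TB directly addressable per locale.
proc encodeAddress(localeId : uint(64), zone : uint(64), offset : uint(64)) : uint(64) {
  return (localeId << 40) | (zone << 32) | (offset & 0xFFFFFFFF);
}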