Description
This is of importance both to the project and to Chapel as a whole (or so I believe).
This early prototype of the Global Descriptor Table (GDT), formerly referred to as
GlobalAtomicObject, demonstrates the potential power of global atomics. It also
corrects a few key misconceptions that I have held, and been told, since the beginning...
Communication Is NOT Bad
Communication, even if it happens per-operation, is not inherently bad (though it isn't free either);
the true bottleneck is contention. I've tested time and time again, and each time,
an algorithm built around a contention-causing operation that can fail an unbounded number
of times, and hence issue an unbounded number of communications, has performed horribly.
For example, a loop like this is bad...
while true {
  var _x = x.read();   // remote read
  var _y = y.read();   // remote read
  // the CAS can fail repeatedly under contention, forcing both reads to be redone
  if _x < _y && x.compareExchangeStrong(_x, _x + 1) {
    break;
  }
}
A loop like this will cause an unbounded number of communications. Reading x and y is relatively
expensive, and the CAS operation not only causes remote contention, but also forces both x and y
to be read again after every failed attempt, so performance drops.
Now, algorithms whose operations are guaranteed to succeed are also the ones guaranteed to scale
very well; that is, only wait-free algorithms show real scalability here. Code that uses
fetchAdd and exchange works wonderfully, which gives me hope that a global data
structure with very loose guarantees is possible and useful.
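As a concrete (hypothetical) contrast to the CAS loop above, here is a sketch of the wait-free
pattern that does scale: every operation is a single network atomic that always succeeds, so there
is no retry loop and no repeated remote reads. The names and counts are illustrative only.

// Sketch only: one network atomic per operation, no retries needed.
config const opsPerLocale = 1000;
var counter : atomic int;

coforall loc in Locales do on loc {
  for 1..opsPerLocale {
    counter.fetchAdd(1);   // always succeeds in a single communication
  }
}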
Global Descriptor Table
Next, is the new and novel idea of the GDT
, the Global Descriptor Table. A 64-bit word
is used over a 128-bit wide pointer allowing us to take advantage of the benefits of
network atomics. In essence, we encode the locale number in the upper 32 bits and
the actual index in the lower 32 bits. Currently, an array is used, but in reality (perhaps
with runtime support if not already available) its possible that a descriptor can be directly
used like a normal pointer. Currently there is a large amount of overhead in needing
to keep a bitmap of usable memory and it cannot be resized without needing to synchronize
all accesses to it (as Chapel domains and arrays cannot be resized in a lock-free way).
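To make the encoding concrete, here is a minimal sketch of how such a descriptor could be packed
and unpacked; the proc names are mine, not the prototype's.

// Pack a (locale, index) pair into one 64-bit word that network atomics can
// operate on directly; unpack it to recover the locale and table index.
proc encodeDescr(localeId : uint(32), idx : uint(32)) : uint(64) {
  return ((localeId : uint(64)) << 32) | (idx : uint(64));
}

proc decodeDescr(descr : uint(64)) : (uint(32), uint(32)) {
  return ((descr >> 32) : uint(32), (descr & 0xFFFFFFFF) : uint(32));
}

// The packed descriptor can then live in a plain 64-bit atomic:
var slot : atomic uint(64);
slot.write(encodeDescr(here.id : uint(32), 42));
const (ownerLocale, tableIdx) = decodeDescr(slot.read());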
So far, the GDT has been tested with simple exchange operations on class instances, performed
remotely across all locales, versus needing to acquire a sync variable to do the same.
As stated above, a compareExchangeStrong kills performance, but that has nothing to do
with the GDT itself and everything to do with network atomic contention, so the test is kept
simple. It works and it scales.
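For reference, here is a rough sketch of the shape of the two approaches being compared; it is
illustrative only, not the actual benchmark code, and the names are mine.

// sync-based: a class reference guarded by a sync variable (lock-based swap).
class Obj { var payload : int; }

var lock$ : sync bool = true;                  // full means "unlocked"
var current : unmanaged Obj = new unmanaged Obj(0);

proc syncSwap(newObj : unmanaged Obj) : unmanaged Obj {
  lock$.readFE();                              // acquire (full -> empty)
  const old = current;
  current = newObj;
  lock$.writeEF(true);                         // release (empty -> full)
  return old;
}

// GDT-style: a 64-bit descriptor swapped with a single network atomic.
var gdtSlot : atomic uint(64);

proc gdtSwap(newDescr : uint(64)) : uint(64) {
  return gdtSlot.exchange(newDescr);           // one remote atomic, no locking
}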
The graph below shows the time to complete the same number of operations
(100,000 in this case). It shows the overall runtime, averaged over 4 trials
(discarding the first warm-up run): while sync grows at a near-exponential rate,
the GDT remains linear.
Now, to get a better look at it, here are the same results in Operations per Second.
Implications
I believe this is some very valuable information, which is why I include Michael as well.
The implementation is still very immature (there are so many optimizations yet to be made
that there's no telling how much further this can scale), and yet it already surpasses
the only existing way to perform atomics on global class instances. As well, this opens up some
very unconventional means of concurrency, the first being DSMLock (WIP), which is
based on DSM-Synch (Distributed Shared Memory Combined Synchronization)
from the publication that introduced CC-Synch (Cache-Coherent Combined Synchronization). Now that
I can confirm that this approach scales, I may even be able to make DSMLock scale well enough
that global lock-based algorithms can scale too (not just wait-free ones). Extremely exciting!
Edit:
If this is extended to the runtime, I can imagine the entire global address space being chunked up into 4GB (2^32 byte) zones, with 2^8 zones per locale and 2^24 locales. With 256 zones of 4GB each, that's 1TB of address space per locale, across 16M+ locales.
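For illustration, here is a purely hypothetical sketch of how a 64-bit descriptor might be laid out
under that scheme; none of this exists in the prototype.

// Hypothetical layout: [ 24 bits locale | 8 bits zone | 32 bits offset ]
// 2^24 locales * 2^8 zones * 2^32 bytes per zone covers the full 64-bit space,
// i.e. 256 zones * 4GB = 1TB directly addressable per locale.
proc encodeAddress(localeId : uint(64), zone : uint(64), offset : uint(64)) : uint(64) {
  return (localeId << 40) | (zone << 32) | (offset & 0xFFFFFFFF);
}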