Global Arrays provide one-sided, noncollective communication operations that allow to access data in global arrays without cooperation with the process or processes that hold the referenced data. These processes do not know what data items in their own memory are being accessed or updated by remote processes. Moreover, since the GA interface uses global array indices to reference nonlocal data, the calling process does not even have to know process ids and location in memory where the refernenced data resides.
The one-sided operations that Global Arrays provide can be summarized into three categories:
| Operation | Process |
|---|---|
| Remote blockwise write/read | ga_put, ga_get |
| Remote atomic update | ga_acc, ga_read_inc, ga_scatter_acc |
| Remote elementwise write/read | ga_scatter, ga_gather |
Put and get are two powerful operations for interprocess communication, performing remote write and read. Because of their one-sided nature, they don’t need cooperation from the process(es) that owns the data. The semantics of these operations do not require the user to specify which remote process or processes own the accessed portion of a global array. The data is simply accessed as if it were in shared memory.
Put copies data from the local array to the global array section, which is:
- n-D Fortran subroutine: nga_put(g_a, lo, hi, buf, ld)
- 2-D Fortran subroutine: ga_put(g_a, ilo, ihi, jlo, jhi, buf, ld)
- C: void NGA_Put(int g_a, int lo[], int hi[], void *buf, int ld[])
- C++: void GA::GlobalArray::put(int lo[], int hi[], void *buf, int ld[])
All the arguments are
provided in one call: lo and hi specify where the data should go
in the global array; ld specifies the stride information of the
local array buf. The local array should have the same number of
dimensions as the global array; however, it is really required to
present the n-dimensional view of the local memory buffer, that by
itself might be one-dimensional.
The operation is transparent to the user, which means the user doesn’t
have to worry about where the region defined by lo and hi is
located. It can be in the memory of one or many remote processes, owned
by the local process, or even mixed (part of it belongs to remote
processes and part of it belongs to a local process).
Get is the reverse operation of put. It copies data from a global array section to the local array. It is:
- n-D Fortran subroutine: nga_get(g_a, lo, hi, buf, ld)
- 2-D Fortran subroutine: ga_get(g_a, ilo, ihi, jlo, jhi, buf, ld)
- C: void NGA_get(int g_a, int lo[], int hi[], void *buf, int ld[])
- C++: void GA::GlobalArray::get(int lo[], int hi[], void *buf, int ld[])
Similar to put, lo and hi specify where the data should come from in the global array,
and ld specifies the stride information of the local array buf.
The local array is assumed to have the same number of dimensions as the
global array. Users don't need to worry about where the region defined
by lo and hi is physically located.
For a ga_get operation transferring data from the (11:15,1:5)
section of a 2-dimensional 15x10 global array into a local buffer 5x10
array we have: (In Fortran notation)
lo={11,1}, hi={15,5}, ld={10}
It is often useful in a put operation to combine the data moved to the target process with the data that resides at that process, rather then replacing the data there. Accumulate and read_inc perform an atomic remote update to a patch (a section of the global array) in the global array and an element in the global array, respectively. They don’t need the cooperation of the process(es) who owns the data. Since the operations are atomic, the same portion of a global array can be referenced by these operations issued by multiple processes and the GA will assure the correct and consistent result of the updates.
Accumulate combines the data from the local array with data in the global array section, which is
- n-D Fortran subroutine: nga_acc(g_a, lo, hi, buf, ld, alpha)
- 2-D Fortran subroutine: ga_acc(g_a, ilo, ihi, jlo, jhi, buf, ld, alpha)
- C: void NGA_Acc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha)
- C++: void NGA::GlobalArray::acc(int lo[], int hi[], void *buf, int ld[], void *alpha)
The local array is assumed to have the same number of dimensions as the global array. Users don’t need to worry about where the region defined by lo and hi is physically located. The function performs
global array section (lo[], hi[]) += alpha * buf
Read_inc remotely updates a particular element in the global array, which is
- n-D Fortran subroutine: nga_read_inc(g_a, subscript, inc)
- 2-D Fortran subroutine: ga_read_inc(g_a, i, j, inc)
- C: long NGA_Read_inc(int g_a, int subscript[], long inc)
- C++: long GA::GlobalArray::readInc(int subscript[], long inc)
This function applies to integer arrays only. It atomically reads and increments an element in an integer array. It performs
a(subsripts) += inc
and returns the original value (before the update) of a(subscript).
Scatter and gather transfer a specified set of elements to and from global arrays. They are one-sided: that is they don’t need the cooperation of the process(es) who own the referenced elements in the global array.
Scatter puts array elements into a global array, which is
- n-D Fortran subroutine: nga_scatter(g_a, v, subsarray, n)
- 2-D Fortran subroutine: ga_scatter(g_a, v, i, j, n)
- C: void NGA_Scatter(int g_a, void *v, int *subsarray[], int n)
- C++: void GA::GlobalArray::scatter(void *v, int *subsarray[], int n)
It performs (in C notation)
for(k=0; k<=n; k++) {
a[subsArray[k][0]][subsArray[k][1]][subsArray[k][2]] ... = v[k];
}
Example: Scatter the 5 elements into a 10x10 global array
Element 1: v[0]=5; subsArray[0][0]=2; subsArray[0][1]=3; Element 2: v[1]=3; subsArray[1][0]=3; subsArray[1][1]=4; Element 3: v[2]=8; subsArray[2][0]=8; subsArray[2][1]=5; Element 4: v[3]=7; subsArray[3][0]=3; subsArray[3][1]=7; Element 5: v[4]=2; subsArray[4][0]=6; subsArray[4][1]=3;
After the scatter operation, the five elements would be scattered into the global array as shown in the following figure.
Gather is the reverse operation of scatter. It gets the array elements from a global array into a local array.
- n-D Fortran subroutine: nga_gather(g_a, v, subsarray, n)
- 2-D Fortran subroutine: ga_gatherga_gather(g_a, v, i, j, n)
- C: void NGA_Gather(int g_a, void *v, int *subsarray[], int n)
- C++: void GA::GlobalArray::gather(void *v, int *subsarray[], int n)
It performs (in C notation)
for(k=0; k<=n; k++) {
v[k] = a[subsArray[k][0]][subsArray[k][1]][subsArray[k][2]] ...;
}
Periodic interfaces to the one-sided operations have been added to Global Arrays in version 3.1 to support some computational fluid dynamics problems on multidimensional grids. They provide an index translation layer that allows you to use put, get, and accumulate operations, possibly extending beyond the boundaries of a global array. The references that are outside of the boundaries are wrapped up inside the global array. To better illustrate these operations, look at the following example:
Example: Assume a two dimensional global array g_a with dimensions 5 X 5.
To access a patch [2:4,-1:3], one can assume that the array is wrapped over in the second dimension, as shown in the following figure
Therefore the patch [2:4, -1:3] is
17 22 2 7 12 18 23 3 8 13 19 24 4 9 14
Periodic operations extend the boudary of each dimension in two directions, toward the lower bound and toward the upper bound. For any dimension with lo(i) to hi(i), where 1 < i < ndim, it extends the range from
[lo(i) : hi(i)] to [(lo(i)-1-(hi(i)-lo(i)+1)) : (lo(i)-1)], [lo(i) : hi(i)], and [(hi(i)+1) : (hi(i)+1+(hi(i)-lo(i)+1))], or [(lo(i)-1-(hi(i)-lo(i)+1)) : (hi(i)+1+(hi(i)-lo(i)+1))].
Even though the patch spans in a much large range, the length must always be less, or equal to (hi(i)-lo(i)+1)).
Example: For a 2 x 2 array as shown in the following figure, where the dimensions are [1:2, 1:2], periodic operations would look at the range of each of the dimensions as [-1:4, -1:4].
Current version of GA supports three periodic operations. They are
- periodic get,
- periodic put, and
- periodic accumulate
Periodic Get copies data from a global array section to a local array, which is almost the same as regular get, except the indices of the patch can be outside the boundaries of each dimension.
- Fortran subroutine: nga_periodic_get(g_a, lo, hi, buf, ld)
- C: void NGA_Periodic_get(int g_a, int lo[], int hi[], void *buf, int ld[])
- C++: void GA::GlobalArray::periodicGet(int lo[], int hi[], void *buf, int ld[])
Similar to regular get, lo and hi specify where the data
should come from in the global array, and ld specifies the stride
information of the local array buf.
Example: Let us look at the first example in this section. It is 5 x 5 two dimensional global array. Assume that the local buffer is an 4x3 array.
Also assume that
1o[0] = -1, hi[0] = 2, lo[1] = 4, hi[1] = 6, and ld[0] = 4.
The local buffer buf is
19 24 4 20 25 5 16 21 1 17 22 2
Periodic Put is the reverse operations of Periodic Get. It copies data from the local array to the global array section, which is
- Fortran subroutine: nga_periodic_put(g_a, lo, hi, buf, ld)
- C: void NGA_Periodic_put(int g_a, int lo[], int hi[], void *buf, int ld[])
- C++: void GA::GlobalArray::periodicPut(int lo[], int hi[], void *buf, int ld[])
Similar to regular put, lo and hi specify where the data
should go in the global array; ld specifies the stride information
of the local array buf.
Periodic Put/Get (also include the Accumulate, which will be discussed later in this section) divide the patch into several smaller patches. For those smaller patches that are outside the global aray, adjust the indices so that they rotate back to the original array. After that call the regular Put/Get/Accumulate, for each patch, to complete the operations.
Example: Look at the example for periodic get. Because it is a 5 x 5 global array, the valid indices for each dimension are
dimension 0: [1 : 5] dimension 1: [1 : 5]
The specified lo and hi are apparently out of the range of each dimension:
dimemsion 0: [-1 : 2] --> [-1 : 0] -- wrap back --> [4 : 5] [ 1 : 2] ok dimension 1: [ 4 : 6] --> [ 4 : 5] ok [ 6 : 6] -- wrap back --> [1 : 1]
Hence, there will be four smaller patches after the adjustment. They are
patch 0: [4 : 5, 4 : 5] patch 1: [4 : 5, 1 : 1] patch 2: [1 : 2, 4 : 5] patch 3: [1 : 2, 1 : 1]
as shown in the following figure
Of course the destination addresses of each samller patch in the local buffer also need to be calculated.
Similar to regular Accumulate, Periodic Accumulate combines the data from the local array with data in the global array section, which is
- Fortran subroutine: nga_periodic_acc(g_a, lo, hi, buf, ld, alpha)
- C: void NGA_Periodic_acc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha)
- C++: void GA::GlobalArray::periodicAcc(int lo[], int hi[], void *buf, int ld[], void *alpha)
The local array is assumed to have the same number of dimensions as the
global array. Users don’t need to worry about where the region defined
by lo and hi is physically located. The function performs
global array section (lo[], hi[]) += alpha * buf
Example: Let us look at the same example as above. There is a 5 x 5 two dimensional global array. Assume that the local buffer is an 4x3 array.
Also assume that
1o[0] = -1, hi[0] = 2, lo[1] = 4, hi[1] = 6, and ld[0] = 4.
The local buffer buf is
1 5 9 4 6 5 3 2 1 7 8 2
and alpha = 2.
After the Periodic Accumulate operation, the global array will be
The non-blocking operations (get/put/accumulate) are derived from the blocking interface by adding a handle argument that identifies an instance of the non-blocking request. Nonblocking operations initiate a communication call and then return control to the application. A return from a nonblocking operation call indicates a mere initiation of the data transfer process and the operation can be completed locally by making a call to the wait (e.g. nga_nbwait) routine.
The wait function completes a non-blocking one-sided operation locally. Waiting on a nonblocking put or an accumulate operation assures that data was injected into the network and the user buffer can be now be reused. Completing a get operation assures data has arrived into the user memory and is ready for use. Wait operation ensures only local completion. Unlike their blocking counterparts, the nonblocking operations are not ordered with respect to the destination. Performance being one reason, the other reason is that by ensuring ordering we incur additional and possibly unnecessary overhead on applications that do not require their operations to be ordered. For cases where ordering is necessary, it can be done by calling a fence operation. The fence operation is provided to the user to confirm remote completion if needed.
Example: Let us take a simple case for illustration. Say, there are two global arrays i.e. one array stores pressure and the other stores temperature. If there are two computation phases (first phase computes pressure and second phase computes temperature), then we can overlap communication with computation, thus hiding latency.
. . . . . . . . .
nga_get (get_pressure_array)
nga_nbget(initiates data transfer to get temperature_array, and returns immediately)
/* hiding latency - communication is overlapped with computation */
compute_pressure()
nga_nbwait(temperature_array - completes data transfer)
compute_temperature()
. . . . . . . .The non-blocking APIs are derived from the blocking interface by adding a handle argument that identifies an instance of the non-blocking request.
- n-D Fortran subroutine: nga_nbput(g_a, lo, hi, buf, ld, nbhandle)
- n-D Fortran subroutine: nga_nbget(g_a, lo, hi, buf, ld, nbhandle)
- n-D Fortran subroutine: nga_nbacc(g_a, lo, hi, buf, ld, alpha, nbhandle)
- n-D Fortran subroutine: nga_nbwait(nbhandle)
- 2-D Fortran subroutine: ga_nbput(g_a, ilo, ihi, jlo, jhi, buf, ld, nbhandle)
- 2-D Fortran subroutine: ga_nbget(g_a, ilo, ihi, jlo, jhi, buf, ld, nbhandle)
- 2-D Fortran subroutine: ga_nbacc(g_a, ilo, ihi, jlo, jhi, buf, ld, alpha, nbhandle)
- 2-D Fortran subroutine: ga_nbwait(nbhandle)
- C: void NGA_NbPut(int g_a, int lo[], int hi[], void *buf, int ld[], ga_nbhdl_t* nbhandle)
- C: void NGA_NbGet(int g_a, int lo[], int hi[], void *buf, int ld[], ga_nbhdl_t* nbhandle)
- C: void NGA_NbAcc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha, ga_nbhdl_t* nbhandle)
- C int NGA_NbWait(ga_nbhdl_t* nbhandle)
- C++: void GA::GlobalArray::nbPut(int lo[], int hi[], void *buf, int ld[], ga_nbhdl_t* nbhandle)
- C++: void GA::GlobalArray::nbGet(int lo[], int hi[], void *buf, int ld[], ga_nbhdl_t* nbhandle)
- C++: void GA::GlobalArray::nbAcc(int lo[], int hi[], void *buf, int ld[], void *alpha, ga_nbhdl_t* nbhandle)
- C++ int GA::GlobalArray::NbWait(ga_nbhdl_t* nbhandle)






