Commit 01ddd35

Merge pull request #589 from gbtitus/doc-slurm-ugni-mem-registration-issue-1.10
Add a note about ugni memory registration and concurrency with slurm. (cherry picked from commit e05a5b8)
2 parents e5c2c86 + 397e343 commit 01ddd35

1 file changed, +22 -1 lines changed

Diff for: doc/release/platforms/README.cray

@@ -296,7 +296,9 @@ program heap will grow to during execution:
 
 By default the heap will occupy as much of the free memory on the locale
 (compute node) as the runtime can acquire, less a certain amount to
-allow for demands from other (system) programs running there. Advanced
+allow for demands from other (system) programs running there. (Note
+that the default with slurm job placement is 16 GiB; see "Communication
+Layer Concurrency and Slurm", below, for more information.) Advanced
 users may want to make the heap smaller than this. Programs start more
 quickly with a smaller heap, and in the unfortunate event that you need
 to produce core files, those will be written more quickly if the heap is
@@ -540,6 +542,25 @@ Parameters associated with the ugni communication layer:
 silently increased or reduced so as to fall within it.
 
 
+Communication Layer Concurrency and Slurm
+-----------------------------------------
+
+When slurm is used for job placement on Cray systems, it limits the
+total NIC memory registration in order to allow for job sharing on
+the compute nodes. In our experience this limit is approximately
+240 GiB. The product of CHPL_RT_MAX_HEAP_SIZE and the communication
+layer concurrency discussed above must be less than this. The ugni
+communication layer adjusts its heap size and concurrency defaults to
+reflect this limit when slurm is responsible for job placement. The
+default heap size is reduced to 16 GiB. The concurrency is computed
+such that the product of heap size and concurrency is below 240 GiB.
+Thus under slurm, the ugni communication layer can support programs
+with very large heaps or programs that need a lot of communication
+concurrency, but not programs that need both simultaneously. Such
+programs need to be run using ALPS for job placement instead of
+slurm.
+
+
 Network Atomics
 ---------------
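
The trade-off described in the added "Communication Layer Concurrency and
Slurm" section is simple arithmetic, and a small sketch may make it easier
to check a planned configuration. The Python below is only an illustration
and is not the ugni layer's actual computation, which this commit does not
show. The names GIB, REG_LIMIT_BYTES, max_concurrency, and fits are
hypothetical, and the 240 GiB figure is treated as exact even though the
note describes it as approximate.

```python
GIB = 1 << 30

# Approximate slurm NIC memory registration limit quoted in the note.
# Assumption: treated as an exact constant for illustration only.
REG_LIMIT_BYTES = 240 * GIB


def max_concurrency(heap_bytes, limit_bytes=REG_LIMIT_BYTES):
    """Return the largest whole c such that heap_bytes * c < limit_bytes."""
    if heap_bytes <= 0:
        raise ValueError("heap size must be positive")
    # Subtracting 1 before the floor division enforces the strict "<".
    return (limit_bytes - 1) // heap_bytes


def fits(heap_bytes, concurrency, limit_bytes=REG_LIMIT_BYTES):
    """Check whether heap size times concurrency stays below the limit."""
    return heap_bytes * concurrency < limit_bytes


if __name__ == "__main__":
    print(max_concurrency(16 * GIB))  # 14, for the 16 GiB slurm default
    print(max_concurrency(64 * GIB))  # 3
    print(fits(16 * GIB, 15))         # False: 16 * 15 equals the limit
```

Under these assumptions a 16 GiB heap leaves room for a concurrency of at
most 14, while a 64 GiB heap leaves room for only 3, which is the
"large heap or high concurrency, but not both" behavior the note describes.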
