
[Performance]: Possible caching issue #27092

Open
@Guillaume-Helbecque

Description


Summary of Problem

This report follows a discussion that we started on Gitter. It concerns a potential performance issue, possibly related to caching. To recap, I have two independent programs, prog1 and prog2, and I want to measure their execution times one after the other, independently, within the same main. To do so, I have the following code structure:

use Time;  // provides the stopwatch type

proc main()
{
  {
    var t1: stopwatch;

    t1.start();
    prog1();
    t1.stop();

    writeln("t1 = ", t1.elapsed());
  }

  {
    var t2: stopwatch;

    t2.start();
    prog2();
    t2.stop();

    writeln("t2 = ", t2.elapsed());
  }

  return 0;
}

When I execute both blocks (as shown above), the execution time of the first block is 2x to 3x larger than when it is executed alone (with the second block commented out). It is worth noting that I am using (and would like to keep) Chapel 2.1.0 for this code, and that both programs involve a single CPU task that performs computation on a GPU device.

The full code is attached to this report (as .txt because .chpl is not accepted). I did not manage to write a simpler reproducer, but the code should be relatively easy to understand. On a system equipped with an AMD EPYC 7513 (Zen 3, x86_64) CPU and an NVIDIA A100-SXM4-40GB (40 GiB) GPU, this gives me (in seconds):

t1 = 14.8412
t2 = 14.0312

when both blocks are executed, and t1 = 4.79889 when the second one is commented out.
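
Commenting out the second block changes the compiled program, so the two measurements above come from different binaries. A possible way to check whether the slowdown depends on what is compiled or on what is executed is to select the blocks at launch time instead. The following is only a minimal sketch under stated assumptions: the config constants runProg1 and runProg2 are hypothetical names that do not appear in the attached code, and prog1 and prog2 are empty placeholders standing in for the two solver versions.

```chapel
use Time;

// Hypothetical switches (not in the attached code). They are set on the
// command line, so the same binary is used for every measurement.
config const runProg1 = true,
             runProg2 = true;

// Placeholders standing in for the two solver versions in nqueensGpu.chpl.
proc prog1() { }
proc prog2() { }

proc main() {
  if runProg1 {
    var t1: stopwatch;

    t1.start();
    prog1();
    t1.stop();

    writeln("t1 = ", t1.elapsed());
  }

  if runProg2 {
    var t2: stopwatch;

    t2.start();
    prog2();
    t2.stop();

    writeln("t2 = ", t2.elapsed());
  }
}
```

With this structure, running ./nqueensGpu.out --runProg2=false times prog1 alone without recompiling. If t1 stays near the larger value in that case, the difference would more likely come from compilation than from the second block actually running.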

The programs are two versions of a GPU-accelerated N-Queens solver, in which tree nodes are managed in a pool data structure and many data exchanges occur between the CPU and the GPU. In prog1, the arrays are (de)allocated at each iteration, while in prog2 I use class wrappers to create "permanent" arrays in GPU memory (inspired by https://github.com/chapel-lang/chapel/blob/main/test/gpu/native/basics/outOfOnArr.chpl). @bradcray suggested a third version using Chapel's on gpuLocale var …; to allocate memory independently of iterations/scopes, but this produces a segfault with Chapel 2.1.0.

nqueensGpu.txt

Is this issue currently blocking your progress?
Might be

Steps to Reproduce

Source Code:

The code is given in the attached file.

Compile command:
chpl nqueensGpu.chpl -o nqueensGpu.out --fast

--fast optimization flag enabled?
'yes'

Execution command:
./nqueensGpu.out

Configuration Information

  • Output of chpl --version: 2.1.0
  • Output of $CHPL_HOME/util/printchplenv --anonymize:
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: gpu *
  CHPL_GPU: nvidia *
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none
  • Back-end compiler and version, e.g. gcc --version or clang --version: gcc (Spack GCC) 12.2.0
  • (For Cray systems only) Output of module list:
