[Performance]: Wide pointers being required in one function result in reduced performance for unrelated functions #27092

### Summary of Problem

This report follows a discussion that we started on Gitter. It concerns a potential performance issue, possibly related to caching. To recap, I have two independent programs, `prog1` and `prog2`, and I want to measure their execution times successively and independently within the same `main`. To do so, I use the following code structure:

```chapel
use Time;  // provides the stopwatch type

proc main()
{
  {
    var t1: stopwatch;

    t1.start();
    prog1();
    t1.stop();

    writeln("t1 = ", t1.elapsed());
  }

  {
    var t2: stopwatch;

    t2.start();
    prog2();
    t2.stop();

    writeln("t2 = ", t2.elapsed());
  }

  return 0;
}
```

When I execute both blocks (as shown above), the execution time of the first block is 2x to 3x larger than when it is executed alone (with the second block commented out). Note that I'm using (and would like to keep) Chapel 2.1.0 for this code, and that both programs involve a single CPU task that performs computation on a GPU device.

The full real code is attached to this report (as `.txt` because `.chpl` is not accepted). I did not manage to write a simpler reproducer, but the code should be relatively easy to understand. On a system equipped with an AMD EPYC 7513 (Zen 3, x86_64) and an NVIDIA A100-SXM4-40GB (40 GiB), this gives me (in seconds):
```
t1 = 14.8412
t2 = 14.0312
```
when both blocks are executed, and `t1 = 4.79889` when the second one is commented out.

The programs are two versions of a GPU-accelerated N-Queens solver, in which tree nodes are managed in a pool data structure and many data exchanges occur between the CPU and the GPU. In `prog1`, the arrays are (de)allocated at each iteration, while in `prog2` I use class wrappers to create "permanent" arrays in GPU memory (inspired by https://github.com/chapel-lang/chapel/blob/main/test/gpu/native/basics/outOfOnArr.chpl). @bradcray suggested a third version using Chapel's `on gpuLocale var …;` to allocate memory independently of iterations/scopes, but this produces a segfault with Chapel 2.1.0.
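For readers who do not want to open the attachment, the two allocation strategies can be sketched roughly as follows. This is a simplified illustration and not the attached code: `DeviceArray`, `increment`, `perIterationAlloc`, `persistentAlloc`, `n` and `iters` are hypothetical names, and the kernel is a placeholder for the actual N-Queens evaluation.

```chapel
// Illustrative sketch only; not taken from the attached file.
// All names and sizes below are hypothetical.
config const n = 1024;     // array size
config const iters = 100;  // number of CPU <-> GPU round trips

// Placeholder kernel: the foreach loop is intended to run as a GPU kernel
// when this is called from an `on` block targeting a GPU sublocale.
proc increment(ref a: [] int) {
  foreach i in 0..#n do a[i] += 1;
}

// Pattern of prog1: the device array is declared inside the loop,
// so it is allocated and freed at every iteration.
proc perIterationAlloc() {
  var hostData: [0..#n] int;
  on here.gpus[0] {
    for it in 1..iters {
      var devData: [0..#n] int;  // lives in GPU memory for this iteration only
      devData = hostData;        // host-to-device copy
      increment(devData);
      hostData = devData;        // device-to-host copy
    }
  }
  return hostData[0];
}

// Pattern of prog2: a class wraps the array so that it can be
// allocated once on the GPU and reused across iterations.
class DeviceArray {
  var data: [0..#n] int;
}

proc persistentAlloc() {
  var hostData: [0..#n] int;
  var dev: owned DeviceArray?;
  on here.gpus[0] {
    dev = new DeviceArray();     // single allocation, made on the GPU sublocale
    for it in 1..iters {
      dev!.data = hostData;      // host-to-device copy
      increment(dev!.data);
      hostData = dev!.data;      // device-to-host copy
    }
  }
  return hostData[0];
}

proc main() {
  // Both variants perform the same computation; only the allocation differs.
  writeln(perIterationAlloc() == persistentAlloc());
}
```

Both variants do the same work per iteration, so any timing difference between them should come from where and how often the device arrays are allocated.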

[nqueensGpu.txt](https://github.com/user-attachments/files/19688991/nqueensGpu.txt)

**Is this issue currently blocking your progress?**
No

### Steps to Reproduce

**Source Code:**

The code is given in the attached file.

**Compile command:**
`chpl nqueensGpu.chpl -o nqueensGpu.out --fast`

`--fast` optimization flag enabled?
'yes'

**Execution command:**
`./nqueensGpu.out`

### Configuration Information

- Output of `chpl --version`: 2.1.0
- Output of `$CHPL_HOME/util/printchplenv --anonymize`:
```
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: gpu *
  CHPL_GPU: nvidia *
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none
```
- Back-end compiler and version, e.g. `gcc --version` or `clang --version`: `gcc (Spack GCC) 12.2.0`
- (For Cray systems only) Output of `module list`:

