4 changes: 2 additions & 2 deletions doc/rst/technotes/gpu.rst
@@ -165,7 +165,7 @@ The following are further requirements for GPU support:

* Specifically for targeting AMD GPUs:

- * ROCm version between 5.0 and 5.4 or between ROCm 6.0 and 6.2 must be
+ * ROCm version between 5.0 and 5.4 or between ROCm 6.0 and 6.3 must be
installed.

* For ROCm 5.x, ``CHPL_LLVM`` must be set to ``system``. Note that, ROCm
@@ -710,7 +710,7 @@ marked with * are covered in our nightly testing configurations.

* Hardware: MI60, MI100 and MI250X*

- * Software:ROCm 5.4, 6.0, 6.1, 6.2*
+ * Software:ROCm 5.4, 6.0, 6.1, 6.2*, 6.3*


GPU Support on Windows Subsystem for Linux
5 changes: 5 additions & 0 deletions test/gpu/native/assertOnGpuVarRuntime.prediff
@@ -0,0 +1,5 @@
#!/usr/bin/env bash

# asserts in a GPU kernel may segfault on exit
grep -v 'Segmentation fault' $2 >$2.tmp
mv $2.tmp $2
5 changes: 5 additions & 0 deletions test/gpu/native/gpuWritelnAndAssertOnGpu.prediff
@@ -0,0 +1,5 @@
#!/usr/bin/env bash

# asserts in a GPU kernel may segfault on exit
grep -v 'Segmentation fault' $2 >$2.tmp
mv $2.tmp $2
5 changes: 5 additions & 0 deletions test/gpu/native/halting/gpuhalt.prediff
@@ -0,0 +1,5 @@
#!/usr/bin/env bash

# asserts in a GPU kernel may segfault on exit
grep -v 'Segmentation fault' $2 >$2.tmp
mv $2.tmp $2
5 changes: 5 additions & 0 deletions test/gpu/native/studies/sort/nonGpuError.prediff
@@ -0,0 +1,5 @@
#!/usr/bin/env bash

# asserts in a GPU kernel may segfault on exit
grep -v 'Segmentation fault' $2 >$2.tmp
mv $2.tmp $2
5 changes: 5 additions & 0 deletions test/gpu/native/studies/sort/scanTest.prediff
@@ -0,0 +1,5 @@
#!/usr/bin/env bash

# asserts in a GPU kernel may segfault on exit
grep -v 'Segmentation fault' $2 >$2.tmp
mv $2.tmp $2
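
The five prediff scripts above apply the same filter: drop every line containing `Segmentation fault` from the file passed as the second argument, writing to a temporary file and then moving it over the original. For readers more at home in Python, the same logic might be sketched as follows. The function name `drop_matching_lines` and its `needle` parameter are hypothetical and are not part of the PR.

```python
import os
import tempfile


def drop_matching_lines(path, needle="Segmentation fault"):
    """Rewrite `path` in place, omitting every line that contains `needle`.

    Hypothetical helper mirroring `grep -v ... $2 > $2.tmp; mv $2.tmp $2`.
    Returns the number of lines that were kept.
    """
    with open(path, "r", errors="replace") as src:
        kept = [line for line in src if needle not in line]

    # Write to a temporary file in the same directory, then swap it in,
    # so the original file is never left half-written.
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.writelines(kept)
    os.replace(tmp.name, path)
    return len(kept)
```

As with the shell version, this hides the message wherever it appears in the output, so a genuine segfault unrelated to GPU assertions would also be filtered out of the comparison.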
7 changes: 4 additions & 3 deletions util/chplenv/chpl_gpu.py
@@ -475,6 +475,7 @@ def _validate_rocm_llvm_version_impl(gpu: gpu_type):
# 'CHPL_LLVM_65188_PATCH' is a temporary escape hatch to enable system
# llvm with rocm 6.x if the user knows what they are doing
error("Cannot target AMD GPUs with ROCm 6.x without CHPL_LLVM=bundled")
+ # TODO: can we use LLVM 20 with ROCm 6.3?
elif (
major_version == "6"
and chpl_llvm.get() == "system"
@@ -549,8 +550,8 @@ def _validate_rocm_version_impl():
MAX_REQ_VERSION_NICE = "5.4.x"

MIN_ROCM6_REQ_VERSION = "6.0"
- MAX_ROCM6_REQ_VERSION = "6.3" # upper bound non-inclusive
- MAX_ROCM6_REQ_VERSION_NICE = "6.2.x"
+ MAX_ROCM6_REQ_VERSION = "6.4" # upper bound non-inclusive
Contributor

Semi-serious question: should we drop the upper bounds here and let people try whatever ROCm version they want? Looking at the diff, the only real "fix" appears to be this bump. Are we creating unnecessary work for ourselves by bumping this for each ROCm release?

Member Author

I would not be strongly opposed to removing the upper bound. I think it is a nicer user experience when users can only use Chapel with known-to-work versions, rather than hitting build or runtime errors. However, it's a maintenance burden and can also frustrate users.

Maybe a good compromise is to turn it into a warning?

Member Author

I've changed the error into a warning; let me know what you think.

+ MAX_ROCM6_REQ_VERSION_NICE = "6.3.x"

rocm_version = get_sdk_version()

@@ -559,7 +560,7 @@ def _validate_rocm_version_impl():
return False

if not is_ver_in_range(rocm_version, MIN_REQ_VERSION, MAX_REQ_VERSION) and not is_ver_in_range(rocm_version, MIN_ROCM6_REQ_VERSION, MAX_ROCM6_REQ_VERSION):
- _reportMissingGpuReq(
+ warning(
"Chapel requires ROCm versions %s to %s or %s to %s, "
"detected version %s on system." %
(MIN_REQ_VERSION, MAX_REQ_VERSION_NICE, MIN_ROCM6_REQ_VERSION, MAX_ROCM6_REQ_VERSION_NICE, rocm_version))
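
The change above keeps the range check but downgrades an out-of-range ROCm version from a hard failure to a warning. A self-contained sketch of that behaviour, with a non-inclusive upper bound, could look like the following. The helpers `parse_version`, `in_half_open_range`, and `check_rocm_version` are hypothetical stand-ins rather than the real `chplenv` functions, and the 5.x bounds (`"5.0"`, `"5.5"`) are assumptions inferred from the documented 5.0 to 5.4 range, since the diff does not show them.

```python
import warnings

# The 6.x bounds come from the diff; the 5.x bounds are assumed.
# Upper bounds are non-inclusive, so ("6.0", "6.4") admits every 6.3.x.
SUPPORTED_RANGES = [("5.0", "5.5"), ("6.0", "6.4")]


def parse_version(text):
    """Convert a dotted version such as '6.3.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_half_open_range(version, lower, upper):
    """Return True when lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


def check_rocm_version(version):
    """Warn, rather than fail, when `version` is outside every known range."""
    supported = any(
        in_half_open_range(version, lower, upper)
        for lower, upper in SUPPORTED_RANGES
    )
    if not supported:
        warnings.warn(
            "ROCm %s is outside the tested ranges; continuing anyway." % version
        )
    return supported
```

With these assumed bounds, `check_rocm_version("6.3.0")` returns `True` silently, while `check_rocm_version("6.4.0")` returns `False` and emits a warning instead of stopping the build. That is the compromise reached in the review thread: untested versions remain usable, but the user is told they are on their own.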
21 changes: 21 additions & 0 deletions util/cron/test-gpu-ex-rocm-63.bash
@@ -0,0 +1,21 @@
#!/usr/bin/env bash
#
# GPU native testing on a Cray EX (using none for CHPL_COMM)

UTIL_CRON_DIR=$(cd $(dirname ${BASH_SOURCE[0]}) ; pwd)
source $UTIL_CRON_DIR/common-native-gpu.bash
source $UTIL_CRON_DIR/common-hpe-cray-ex.bash

module load rocm/6.3.0 # pin to rocm 6.3.0

export CHPL_COMM=none
export CHPL_LLVM=bundled
unset CHPL_LLVM_CONFIG # we need this to avoid warnings
export CHPL_LAUNCHER_PARTITION=bardpeak # bardpeak is the default queue
export CHPL_GPU=amd # also detected by default
export CHPL_GPU_ARCH=gfx90a

module list

export CHPL_NIGHTLY_TEST_CONFIG_NAME="gpu-ex-rocm-63"
$UTIL_CRON_DIR/nightly -cron ${nightly_args}
21 changes: 21 additions & 0 deletions util/cron/test-gpu-ex-rocm-63.ofi.bash
@@ -0,0 +1,21 @@
#!/usr/bin/env bash
#
# GPU native testing on a Cray EX

UTIL_CRON_DIR=$(cd $(dirname ${BASH_SOURCE[0]}) ; pwd)
source $UTIL_CRON_DIR/common-native-gpu.bash
source $UTIL_CRON_DIR/common-hpe-cray-ex.bash

module load rocm/6.3 # pin to rocm 6.3

export CHPL_COMM=ofi
export CHPL_LLVM=bundled
unset CHPL_LLVM_CONFIG # we need this to avoid warnings
export CHPL_LAUNCHER_PARTITION=bardpeak # bardpeak is the default queue
export CHPL_GPU=amd # also detected by default
export CHPL_GPU_ARCH=gfx90a

module list

export CHPL_NIGHTLY_TEST_CONFIG_NAME="gpu-ex-rocm-63.ofi"
$UTIL_CRON_DIR/nightly -cron ${nightly_args}
4 changes: 2 additions & 2 deletions util/cron/test-perf.gpu-ex-rocm.bash
@@ -11,8 +11,8 @@ source $UTIL_CRON_DIR/common-native-gpu-perf.bash
source $UTIL_CRON_DIR/common-perf.bash

# module load rocm # load the default version of ROCm
- # pin to rocm 6.2 for now, https://github.com/chapel-lang/chapel/issues/26934
- module load rocm/6.2.0
+ # load rocm 6.3, because 6.4 has issues with libc++
+ module load rocm/6.3.0

export CHPL_COMM=none
export CHPL_LLVM=bundled