Commit b616580

docs(research): Update multi-GPU HeFFTe research documentation and examples
- Add support for HeFFTe build without GPU-aware MPI (for testing)
- Update CMakeLists.txt to support HEFFTE_NO_GPU_AWARE environment variable
- Document successful multi-GPU testing without GPU-aware MPI
- Add detailed instructions for building and testing multi-GPU setup
- Update cuda_fft_example.cpp with GPU device verification and memory reporting
- Add GPU UUID and device property information to output
- Document that multi-GPU works via CPU transfer fallback (slower but functional)
- Add performance notes about GPU-aware MPI benefits
1 parent 3361e82 commit b616580

3 files changed: +116 -5 lines changed

research/multigpu_heffte/CMakeLists.txt

Lines changed: 9 additions & 2 deletions
@@ -10,8 +10,15 @@ set(CMAKE_CXX_STANDARD_REQUIRED ON)
 find_package(MPI REQUIRED)
 
 # Find HeFFTe from our custom installation
-# The HeffteConfig.cmake is located at ~/opt/heffte/2.4.1-cuda/lib64/cmake/Heffte/
-set(Heffte_DIR "$ENV{HOME}/opt/heffte/2.4.1-cuda/lib64/cmake/Heffte")
+# Option to use version without GPU-aware MPI for testing
+# Set environment variable: export HEFFTE_NO_GPU_AWARE=1
+if(DEFINED ENV{HEFFTE_NO_GPU_AWARE} AND $ENV{HEFFTE_NO_GPU_AWARE} EQUAL 1)
+    set(Heffte_DIR "$ENV{HOME}/opt/heffte/2.4.1-cuda-no-gpuaware/lib64/cmake/Heffte")
+    message(STATUS "Using HeFFTe WITHOUT GPU-aware MPI (for testing)")
+else()
+    set(Heffte_DIR "$ENV{HOME}/opt/heffte/2.4.1-cuda/lib64/cmake/Heffte")
+    message(STATUS "Using HeFFTe WITH GPU-aware MPI (default)")
+endif()
 find_package(Heffte 2.4.1 REQUIRED PATHS ${Heffte_DIR})
 
 # Print information about found HeFFTe

research/multigpu_heffte/README.md

Lines changed: 69 additions & 2 deletions
@@ -105,7 +105,10 @@ mpirun -np 2 ./cuda_fft_example
 
 **Note:**
 - **Single GPU (1 MPI rank) works fine** - verified locally ✓
-- **Multi-GPU (2+ MPI ranks) fails** - requires GPU-aware MPI support which is not available in the current OpenMPI installation
+- **Multi-GPU (2+ MPI ranks) works** - verified on cluster with GPU-aware MPI disabled ✓
+- Without GPU-aware MPI: HeFFTe automatically transfers data to CPU for MPI communication, then back to GPU
+- This works but is slower than GPU-aware MPI
+- To use this mode, rebuild with `HEFFTE_NO_GPU_AWARE=1` (see instructions below)
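The transfer-through-CPU fallback mentioned in the note works, in outline, like the sketch below. This is an illustration of the pattern only, not HeFFTe's internal reshape code; the function name `staged_alltoall` and the assumption of symmetric, precomputed counts and displacements are hypothetical.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <complex>
#include <vector>

// Exchange GPU-resident data by staging it through host memory:
// device -> host copy, MPI exchange on host buffers, host -> device copy.
// This is the pattern a library falls back to when the MPI implementation
// cannot accept CUDA device pointers (i.e. no GPU-aware MPI).
void staged_alltoall(const std::complex<double>* d_send,
                     std::complex<double>* d_recv,
                     const std::vector<int>& counts,
                     const std::vector<int>& displs,
                     int total_count, MPI_Comm comm) {
    std::vector<std::complex<double>> h_send(total_count), h_recv(total_count);

    // 1) Pull the send buffer off the GPU.
    cudaMemcpy(h_send.data(), d_send,
               total_count * sizeof(std::complex<double>), cudaMemcpyDeviceToHost);

    // 2) Exchange on the host; no device pointers ever reach MPI.
    MPI_Alltoallv(h_send.data(), counts.data(), displs.data(), MPI_C_DOUBLE_COMPLEX,
                  h_recv.data(), counts.data(), displs.data(), MPI_C_DOUBLE_COMPLEX,
                  comm);

    // 3) Push the received data back onto the GPU.
    cudaMemcpy(d_recv, h_recv.data(),
               total_count * sizeof(std::complex<double>), cudaMemcpyHostToDevice);
}
```

The two extra `cudaMemcpy` calls per exchange are exactly the overhead referred to above; with GPU-aware MPI the device pointers would be handed to `MPI_Alltoallv` directly.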
 
 ### Implementation Details
 
@@ -228,8 +231,72 @@ grep -i "GPU_AWARE" ~/dev/heffte/build_2.4.1_cuda/configured/summary.txt
 - This causes segmentation faults in multi-GPU scenarios when HeFFTe tries to use GPU-aware MPI for inter-GPU communication
 - Single GPU operations work fine, but multi-GPU requires GPU-aware MPI support
 
+### Testing Multi-GPU Without GPU-Aware MPI
+
+**Status: ✓ CONFIRMED - Multi-GPU works without GPU-aware MPI!**
+
+We successfully tested HeFFTe with 2 GPUs on the cluster. Without GPU-aware MPI, HeFFTe automatically transfers data to CPU for MPI communication, then back to GPU. This works but is slower than GPU-aware MPI.
+
+**To test multi-GPU (already done, but here's how to repeat):**
+
+1. **Rebuild HeFFTe without GPU-aware MPI** (if not already done):
+```bash
+cd ~/dev/heffte
+mkdir -p build_2.4.1_cuda_no_gpuaware
+cd build_2.4.1_cuda_no_gpuaware
+module load cuda
+cmake .. \
+  -DCMAKE_INSTALL_PREFIX=~/opt/heffte/2.4.1-cuda-no-gpuaware \
+  -DHeffte_ENABLE_FFTW=ON \
+  -DHeffte_ENABLE_CUDA=ON \
+  -DHeffte_ENABLE_GPU_AWARE_MPI=OFF \
+  -DCMAKE_BUILD_TYPE=Release
+make -j$(nproc)
+make install
+```
+
+2. **Rebuild the example with the non-GPU-aware version:**
+```bash
+cd research/multigpu_heffte/build
+export HEFFTE_NO_GPU_AWARE=1
+cmake ..
+make
+```
+
+3. **Submit job for 2 GPU test:**
+```bash
+cd research/multigpu_heffte
+sbatch run_2gpu_test.sh
+```
+
+**Test Results:**
+- ✓ Job completed successfully (ExitCode: 0:0)
+- ✓ 2 MPI ranks running on 2 different GPUs (GPU 0 and GPU 1)
+- ✓ Domain decomposition working: Rank 0 handles z=[0,31], Rank 1 handles z=[32,63]
+- ✓ Forward FFT completed
+- ✓ Laplacian operator applied in Fourier domain
+- ✓ No segmentation faults
+
+**Performance Note:** Without GPU-aware MPI, data is transferred CPU↔GPU for MPI communication, which adds overhead. For production use, GPU-aware MPI would be preferred for better performance.
+
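To put a number on that overhead, a collective timer such as the sketch below can be wrapped around the forward transform in both builds (with and without GPU-aware MPI) and the two timings compared. The helper name `time_collective` is hypothetical; only standard MPI calls are used, and the usage line assumes the `fft`, `gpu_input`, and `gpu_output` names from cuda_fft_example.cpp.

```cpp
#include <mpi.h>
#include <iostream>

// Time a code region collectively and report the slowest rank's duration on rank 0.
template <typename F>
void time_collective(const char* label, F&& region, MPI_Comm comm = MPI_COMM_WORLD) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Barrier(comm);                  // start all ranks together
    double t0 = MPI_Wtime();
    region();                           // e.g. the forward FFT
    MPI_Barrier(comm);                  // wait for the slowest rank
    double elapsed = MPI_Wtime() - t0;

    double max_elapsed = 0.0;
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    if (rank == 0)
        std::cout << label << ": " << max_elapsed << " s (max over ranks)" << std::endl;
}

// Usage (names from the example, assumed):
// time_collective("forward FFT", [&]{ fft.forward(gpu_input.data(), gpu_output.data()); });
```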
+### Checking GPU-Aware MPI Status
+
+**Check OpenMPI GPU-aware support:**
+```bash
+ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
+```
+
+**Check HeFFTe GPU-aware MPI configuration:**
+```bash
+# Current build (with GPU-aware MPI)
+grep -i "GPU_AWARE" ~/dev/heffte/build_2.4.1_cuda/CMakeCache.txt
+
+# Test build (without GPU-aware MPI)
+grep -i "GPU_AWARE" ~/dev/heffte/build_2.4.1_cuda_no_gpuaware/CMakeCache.txt
+```
+
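Besides the `ompi_info` query above, Open MPI also exposes a compile-time macro and a run-time function for the same check. The standalone program below is a sketch of that check; `MPIX_CUDA_AWARE_SUPPORT` and `MPIX_Query_cuda_support()` are Open MPI extensions declared in `mpi-ext.h` and are not available in other MPI implementations.

```cpp
#include <mpi.h>
#include <iostream>
#if defined(OPEN_MPI)
#include <mpi-ext.h>   // Open MPI extensions; defines MPIX_CUDA_AWARE_SUPPORT
#endif

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Compile-time answer: was this MPI library built with CUDA support?
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    std::cout << "Compile time: CUDA-aware support present\n";
#elif defined(MPIX_CUDA_AWARE_SUPPORT)
    std::cout << "Compile time: no CUDA-aware support\n";
#else
    std::cout << "Compile time: cannot determine CUDA-aware support\n";
#endif

    // Run-time answer (Open MPI only): is CUDA support actually enabled?
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    std::cout << "Run time: "
              << (MPIX_Query_cuda_support() ? "CUDA-aware support enabled"
                                            : "CUDA-aware support disabled")
              << std::endl;
#endif

    MPI_Finalize();
    return 0;
}
```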
 ### Next Steps
 
 - Create a more complex demo that uses multiple GPU cards
 - Reference examples from `~/dev/heffte/examples/` to understand how to implement multi-GPU usage in practice
-- Verify GPU-aware MPI support before attempting multi-GPU runs
+- Test multi-GPU performance with and without GPU-aware MPI

research/multigpu_heffte/cuda_fft_example.cpp

Lines changed: 38 additions & 1 deletion
@@ -90,8 +90,27 @@ int main(int argc, char **argv) {
     if (heffte::gpu::device_count() > 0) {
         int device_id = my_rank % heffte::gpu::device_count();
         heffte::gpu::device_set(device_id);
+
+        // Verify we're on the correct device and get device properties
+        int current_device;
+        cudaGetDevice(&current_device);
+        cudaDeviceProp prop;
+        cudaGetDeviceProperties(&prop, current_device);
+
+        // Format UUID
+        char uuid_str[64];
+        snprintf(uuid_str, sizeof(uuid_str),
+                 "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
+                 prop.uuid.bytes[0], prop.uuid.bytes[1], prop.uuid.bytes[2],
+                 prop.uuid.bytes[3], prop.uuid.bytes[4], prop.uuid.bytes[5],
+                 prop.uuid.bytes[6], prop.uuid.bytes[7], prop.uuid.bytes[8],
+                 prop.uuid.bytes[9], prop.uuid.bytes[10], prop.uuid.bytes[11],
+                 prop.uuid.bytes[12], prop.uuid.bytes[13], prop.uuid.bytes[14],
+                 prop.uuid.bytes[15]);
+
         std::cout << "Rank " << my_rank << " using GPU device " << device_id
-                  << std::endl;
+                  << " (verified: " << current_device << ")"
+                  << " - " << prop.name << " [UUID: " << uuid_str << "]" << std::endl;
     } else {
         if (my_rank == 0) {
             std::cerr << "ERROR: No CUDA devices found!" << std::endl;
@@ -153,9 +172,27 @@ int main(int argc, char **argv) {
     }
 
     // Transfer input to GPU
+    // Verify GPU memory allocation on correct device
+    int device_before;
+    cudaGetDevice(&device_before);
+    size_t free_mem_before, total_mem_before;
+    cudaMemGetInfo(&free_mem_before, &total_mem_before);
+
     heffte::gpu::vector<std::complex<double>> gpu_input =
         heffte::gpu::transfer().load(input);
 
+    // Verify memory was allocated on correct device
+    int device_after;
+    cudaGetDevice(&device_after);
+    size_t free_mem_after, total_mem_after;
+    cudaMemGetInfo(&free_mem_after, &total_mem_after);
+    size_t mem_used = free_mem_before - free_mem_after;
+
+    std::cout << "Rank " << my_rank << " GPU " << device_after << " memory: used "
+              << mem_used / (1024 * 1024) << " MB"
+              << " (free: " << free_mem_after / (1024 * 1024) << " MB / "
+              << total_mem_after / (1024 * 1024) << " MB total)" << std::endl;
+
     // Allocate GPU memory for FFT output
     heffte::gpu::vector<std::complex<double>> gpu_output(fft.size_outbox());
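One detail the added verification code does not cover is error handling: the cudaGetDevice, cudaGetDeviceProperties, and cudaMemGetInfo calls all return a cudaError_t that is currently ignored. A small checking macro along the lines of the sketch below (the name `CUDA_CHECK` is a common convention, not part of the example) would surface a failing call immediately instead of letting it produce misleading device or memory numbers.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                     \
    do {                                                                     \
        cudaError_t err_ = (call);                                           \
        if (err_ != cudaSuccess) {                                           \
            std::fprintf(stderr, "CUDA error %s at %s:%d: %s\n",             \
                         cudaGetErrorName(err_), __FILE__, __LINE__,         \
                         cudaGetErrorString(err_));                          \
            std::abort();                                                    \
        }                                                                    \
    } while (0)

// Example: wrap the calls added in this commit, e.g.
//   CUDA_CHECK(cudaGetDevice(&current_device));
//   CUDA_CHECK(cudaMemGetInfo(&free_mem_after, &total_mem_after));
```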
