Skip to content

Add rapidsmpf to manifest #497

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: branch-25.06
Choose a base branch
from

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Apr 30, 2025

No description provided.

@wence- wence- requested a review from a team as a code owner April 30, 2025 15:27
@wence- wence- requested review from AyodeAwe and removed request for a team April 30, 2025 15:27
@wence-
Copy link
Contributor Author

wence- commented Apr 30, 2025

I guess this also needs the bind mounts...

@AyodeAwe
Copy link
Contributor

cc @trxcllnt

@AyodeAwe AyodeAwe requested a review from trxcllnt April 30, 2025 15:52
@trxcllnt
Copy link
Collaborator

@wence- are the rapidsmpf compiler errors expected?

@wence-
Copy link
Contributor Author

wence- commented Apr 30, 2025

I had rapidsai/rapidsmpf#233 in flight which it probably didn't pick up

@wence-
Copy link
Contributor Author

wence- commented May 1, 2025

That is now in, but I don't have the powers to restart the builds

@trxcllnt
Copy link
Collaborator

trxcllnt commented May 1, 2025

@wence- I've re-run CI a few times now, still seems to be failing with this error. Is the pip container missing any dependencies, or using an incompatible gcc version? Should we add a devcontainers job to rapidsmpf CI?

Click to expand cpp/tests/test_shuffler.cpp error
/usr/bin/sccache /usr/bin/g++ -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DRMM_LOG_ACTIVE_LEVEL=RAPIDS_LOGGER_LEVEL_INFO -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUCXX_ENABLE_RMM -D_LIBCUDACXX_ATOMIC_UNSAFE_AUTOMATIC_STORAGE=1 -I/home/coder/rapidsmpf/cpp/include -I/home/coder/rapidsmpf/cpp/build/pip/cuda-12.8/release/_deps/cccl-src/lib/cmake/thrust/../../../thrust -I/home/coder/rapidsmpf/cpp/build/pip/cuda-12.8/release/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include -I/home/coder/rapidsmpf/cpp/build/pip/cuda-12.8/release/_deps/cccl-src/lib/cmake/cub/../../../cub -I/home/coder/rapidsmpf/cpp/build/pip/cuda-12.8/release/_deps/nvtx3-src/c/include -isystem /home/coder/rmm/cpp/include -isystem /home/coder/rmm/cpp/build/pip/cuda-12.8/release/include -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /home/coder/rmm/cpp/build/pip/cuda-12.8/release/_deps/rapids_logger-src/include -isystem /home/coder/cudf/cpp/build/pip/cuda-12.8/release/_deps/dlpack-src/include -isystem /home/coder/cudf/cpp/build/pip/cuda-12.8/release/_deps/jitify-src -isystem /home/coder/cudf/cpp/include -isystem /home/coder/cudf/cpp/build/pip/cuda-12.8/release/include -isystem /home/coder/ucxx/cpp/include -isystem /home/coder/cudf/cpp -isystem /home/coder/cudf/cpp/src -isystem /home/coder/cudf/cpp/build/pip/cuda-12.8/release/_deps/cuco-src/include -O3 -DNDEBUG -std=gnu++17 -fPIE -Wall -Werror -Wextra -Wsign-conversion -Wno-unknown-pragmas -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wno-sign-conversion -MD -MT tests/CMakeFiles/mpi_tests.dir/test_shuffler.cpp.o -MF tests/CMakeFiles/mpi_tests.dir/test_shuffler.cpp.o.d -o tests/CMakeFiles/mpi_tests.dir/test_shuffler.cpp.o -c /home/coder/rapidsmpf/cpp/tests/test_shuffler.cpp
In file included from /home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/resource.hpp:16,
                 from /home/coder/rapidsmpf/cpp/tests/test_shuffler.cpp:17:
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp:58:5: error: deduction guide ‘rapidsmpf::Buffer::overloaded(Ts ...) -> rapidsmpf::Buffer::overloaded<Ts ...>’ must be declared at namespace scope
   58 |     overloaded(Ts...) -> overloaded<Ts...>;
      |     ^~~~~~~~~~
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp: In member function ‘constexpr rapidsmpf::MemoryType rapidsmpf::Buffer::mem_type() const’:
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp:120:13: error: class template argument deduction failed:
  120 |             },
      |             ^
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp:120:13: error: no matching function for call to ‘overloaded(rapidsmpf::Buffer::mem_type() const::<lambda(const HostStorageT&)>, rapidsmpf::Buffer::mem_type() const::<lambda(const DeviceStorageT&)>)’
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp:53:12: note: candidate: ‘template<class ... Ts> overloaded()-> rapidsmpf::Buffer::overloaded<Ts>’
   53 |     struct overloaded : Ts... {
      |            ^~~~~~~~~~
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp:53:12: note:   template argument deduction/substitution failed:
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp:120:13: note:   candidate expects 0 arguments, 2 provided
  120 |             },
      |             ^
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp:53:12: note: candidate: ‘template<class ... Ts> overloaded(rapidsmpf::Buffer::overloaded<Ts>)-> rapidsmpf::Buffer::overloaded<Ts>’
   53 |     struct overloaded : Ts... {
      |            ^~~~~~~~~~
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp:53:12: note:   template argument deduction/substitution failed:
/home/coder/rapidsmpf/cpp/include/rapidsmpf/buffer/buffer.hpp:120:13: note:   ‘rapidsmpf::Buffer::mem_type() const::<lambda(const HostStorageT&)>’ is not derived from ‘rapidsmpf::Buffer::overloaded<Ts>’
  120 |             },
      |             ^

@wence-
Copy link
Contributor Author

wence- commented May 2, 2025

I've re-run CI a few times now, still seems to be failing with this error. Is the pip container missing any dependencies, or using an incompatible gcc version? Should we add a devcontainers job to rapidsmpf CI?

Ah let me check, this is probably because the pip containers use g++11 (conda uses g++13).

Yes, would be good to add an individual run for rapidsmpf. I will also submit a fix for this issue.

@wence-
Copy link
Contributor Author

wence- commented May 2, 2025

Yeah, this is a gcc 11 bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79501 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100983) which I will work around

@wence-
Copy link
Contributor Author

wence- commented May 2, 2025

rapidsai/rapidsmpf#238

@wence-
Copy link
Contributor Author

wence- commented May 2, 2025

@trxcllnt the gcc 11 WAR is now merged. But as Peter notes, we don't want to build for cuda 11.8. Is there a way in the manifest to conditionally remove rapidsmpf for the cuda 11.8 part of the matrix. Or is it sufficient that the 11.8 devcontainer.json files now don't mention rapidsmpf?

@trxcllnt
Copy link
Collaborator

trxcllnt commented May 2, 2025

@wence- Or is it sufficient that the 11.8 devcontainer.json files now don't mention rapidsmpf?

That just ensures the source isn't mounted from the host... the clone-rapidsmpf script would still clone the repo but it would only exist in that container's overlayfs.

Is there a way in the manifest to conditionally remove rapidsmpf for the cuda 11.8 part of the matrix

There is not, because we haven't encountered this before. We don't test 11.8 in CI, so this would only affect someone using the 11.8 combined devcontainer. That's (probably?) an uncommon use-case, and given that we're about to drop 11.8 across the board, I'd say it's fine to leave it.

If someone is using the combined 11.8 devcontainer and re-runs clone-all, they'll get rapidsmpf. When it fails, they can rm -rf ~/rapidsmpf, reopen the devcontainer, and build-all will work again.

@pentschev
Copy link
Member

That's (probably?) an uncommon use-case, and given that we're about to drop 11.8 across the board, I'd say it's fine to leave it.

For context, that is exactly why we dropped CUDA 11 support in RAPIDSMPF, we were planning to release it only around the time when CUDA 11 would be dropped and thus there was no point in even dealing with it.

@trxcllnt
Copy link
Collaborator

trxcllnt commented May 2, 2025

Something different about the cython-3.0.12 in the pip build vs. the conda build?

@trxcllnt
Copy link
Collaborator

trxcllnt commented May 3, 2025

Looks like some dependencies aren't included when generating requirements.txt. PR 245 fixes it for me locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants