<!-- .github/pull_request_template.md -->
## 📌 Description
Unit tests are currently failing with:
```
==========================================
Running: pytest --continue-on-collection-errors -s --junitxml=/junit/tests/comm/test_trtllm_mnnvl_allreduce.py.xml "tests/comm/test_trtllm_mnnvl_allreduce.py"
==========================================
Abort(1090447) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(192)........:
MPID_Init(1665)..............:
MPIDI_OFI_mpi_init_hook(1586):
(unknown)(): Unknown error class
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090447
:
system msg for write_line failure : Bad file descriptor
Abort(1090447) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Unknown error class, error stack:
MPIR_Init_thread(192)........:
MPID_Init(1665)..............:
MPIDI_OFI_mpi_init_hook(1586):
...
Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, tvm_ffi.core, markupsafe._speedups, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, mpi4py.MPI (total: 22)
!!!!!!! Segfault encountered !!!!!!!
...
❌ FAILED: tests/comm/test_trtllm_mnnvl_allreduce.py
```
These tests should be skipped in a single-GPU environment, but they fail
instead, which indicates that the failure occurs at MPI module load time.
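
By default, `mpi4py` initializes MPI when `mpi4py.MPI` is imported, so an abort inside `PMPI_Init_thread` kills the interpreter before any pytest skip logic can run. One way to confirm this is to attempt the import in a child process and inspect its exit code. The following is a minimal sketch under that assumption; the helper name `import_survives` is hypothetical and not part of FlashInfer.

```python
import subprocess
import sys


def import_survives(statement: str, timeout: float = 60.0) -> bool:
    """Run `statement` in a fresh interpreter and report whether it exited cleanly.

    A crash at import time (abort, segfault, uncaught exception) only takes
    down the child process, so the caller can still decide to skip.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", statement],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a hung initialization the same as a failed one.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    ok = import_survives("from mpi4py import MPI")
    print("mpi4py import OK" if ok else "mpi4py import failed or aborted")
```

Running this inside the affected image would distinguish a broken MPI runtime from a problem in the tests themselves, since the probe does not depend on any test code.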
The current `dockerfile.cuXXX` installs MPI via `RUN conda install -n py312 -y mpi4py`.
According to the Docker build logs,
[a month ago (Nov. 4)](https://github.com/flashinfer-ai/flashinfer/actions/runs/19084098717/job/54520197904#step:6:802),
```
flashinfer-ai#17 13.68 mpi-1.0.1 | mpich 6 KB conda-forge
flashinfer-ai#17 13.68 mpi4py-4.1.1 |py312hd0af0b3_100 866 KB conda-forge
flashinfer-ai#17 13.68 mpich-4.3.2 | h79b1c89_100 5.4 MB conda-forge
```
was being installed, [but
yesterday](https://github.com/flashinfer-ai/flashinfer/actions/runs/19960576464/job/57239792717#step:6:673):
```
flashinfer-ai#17 13.59 impi_rt-2021.13.1 | ha770c72_769 41.7 MB conda-forge
flashinfer-ai#17 13.59 mpi-1.0 | impi 6 KB conda-forge
flashinfer-ai#17 13.59 mpi4py-4.1.1 |py312h18f78f0_102 864 KB conda-forge
```
is being installed.
`mpich` and `impi` refer to two different implementations of MPI: MPICH and
Intel MPI. This change of implementation is the suspected cause of the MPI
load failures.
This PR pins the MPI implementation via `RUN conda install -n py312 -y mpi4py mpich`.
With this change, the build ([build
log](https://github.com/flashinfer-ai/flashinfer/actions/runs/19976372640/job/57293423165?pr=2182#step:6:436))
installs:
```
flashinfer-ai#15 14.63 mpi-1.0.1 | mpich 6 KB conda-forge
flashinfer-ai#15 14.63 mpi4py-4.1.1 |py312hd0af0b3_102 865 KB conda-forge
flashinfer-ai#15 14.63 mpich-4.3.2 | h79b1c89_100 5.4 MB conda-forge
```
which now matches what was installed before; only the `mpi4py` build number
differs (`_100` vs. `_102`).
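
The comparison of the build logs above can be automated. In those logs, the build string of the `mpi` metapackage (`mpich` or `impi`) names the selected implementation. The sketch below parses conda package-plan lines of the shape shown in the logs and reports that build string. The function names are hypothetical, and the line format is assumed from the excerpts above rather than from any documented conda output contract.

```python
def parse_conda_plan_line(line: str):
    """Return (name, version, build) from one package-plan line, or None."""
    if "|" not in line:
        return None
    left, right = line.split("|", 1)
    left_tokens = left.split()
    right_tokens = right.split()
    if not left_tokens or not right_tokens:
        return None
    # The last token before "|" looks like "<name>-<version>".
    name, sep, version = left_tokens[-1].rpartition("-")
    if not sep:
        return None
    # The first token after "|" is the build string.
    return name, version, right_tokens[0]


def mpi_implementation(log_lines):
    """Return the build string of the `mpi` metapackage, or None if absent."""
    for line in log_lines:
        parsed = parse_conda_plan_line(line)
        if parsed and parsed[0] == "mpi":
            return parsed[2]
    return None


if __name__ == "__main__":
    before = [
        "flashinfer-ai#17 13.68 mpi-1.0.1 | mpich 6 KB conda-forge",
        "flashinfer-ai#17 13.68 mpi4py-4.1.1 |py312hd0af0b3_100 866 KB conda-forge",
    ]
    after = [
        "flashinfer-ai#17 13.59 impi_rt-2021.13.1 | ha770c72_769 41.7 MB conda-forge",
        "flashinfer-ai#17 13.59 mpi-1.0 | impi 6 KB conda-forge",
    ]
    print(mpi_implementation(before))  # mpich
    print(mpi_implementation(after))   # impi
```

A check like this could fail an image build early if the resolved implementation is not the expected one, although pinning `mpich` in the install command already addresses the cause directly.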
<!-- What does this PR do? Briefly describe the changes and why they're
needed. -->
## 🔍 Related Issues
<!-- Link any related issues here -->
## 🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.
> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).
## 🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).
## Reviewer Notes
<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->