Question and suggestions about custom allreduce #2918
leizhao1234 asked this question in Q&A (Unanswered)
Replies: 3 comments 5 replies
-
Block 0 is always the first block started, and each rank's data is already in global memory (via cudaMemcpy).
-
I think when copy_mode is true, it still needs a block barrier without a fence, but if copy_mode is false, a multi_gpu_barrier without a memory fence suffices.
-
What is the meaning of copy_mode?
-
In trtllm's twoshot allreduce, there is one multi_gpu_barrier and one block_barrier, whereas in SGLang's twoshot allreduce, there appear to be two block_barriers.
I noticed that the first block barrier does not use a memory fence, similar to vllm. However, does the first barrier need to be at the block level? I believe a multi_gpu_barrier without a memory fence would suffice. Here is my code:
```cuda
inline __device__ void multi_gpu_barrier(uint32_t** signals, uint32_t const flag, size_t const local_rank,
    size_t const world_size, int const tidx, int const bidx)
{
    // After this function, at least one block in each GPU has reached the barrier
    if (tidx < world_size)
    {
        // we can think of signals having the shape [world_size, world_size]
        // Dimension 0 is the "listening" dimension, dimension 1 is the "emitting" dimension
    }
}

inline __device__ void block_barrier(uint32_t** signals, uint32_t const flag, size_t const local_rank,
    size_t const world_size, int const tidx, int const bidx, int const grid_size)
{
    __syncthreads();
    // After this function, the block of id == bidx of each GPU has reached the barrier
    if (tidx < world_size)
    {
        // we can think of signals having the shape [world_size, 2, num_blocks, world_size]
        // (+ an offset on dim 2 to account for flags used in multi_gpu_barrier)
        // Dimension 0 is the "listening" dimension, dimension 3 is the "emitting" dimension
    }
}
```
I think oneshot only requires one multi_gpu_barrier, while twoshot requires both a multi_gpu_barrier and a block_barrier.