GB200 support: SendRecv DSL collective and per-channel executor connections #810
Open
Binyang2014 wants to merge 151 commits into main from binyli/GB200
Changes from 146 commits
Commits
e711b62 Initial plan (Copilot)
c881bc5 Replace gtest/gtest.h with framework.hpp in all unit tests (Copilot)
e227fdc Convert mp_unit tests from gtest to framework.hpp (Copilot)
1e32e17 Address code review comments (Copilot)
eafa6fb Add custom test framework and code coverage support (Copilot)
3d8a2e7 Add --gtest_filter support to framework (Copilot)
a10aff5 Address code review feedback (Copilot)
1818709 Fix CodeQL workflow by disabling test builds (Copilot)
5657e4a Initial plan for fixing test build with GPU bypass (Copilot)
0eae34c Fix test framework for building with Docker (Copilot)
4823583 Move FailHelper and SkipHelper into mscclpp::test namespace (Copilot)
403b2fb Remove unnecessary CMake build artifacts from PR (Copilot)
305d157 Remove PerfTestResult and reuse TestResult directly (Copilot)
b1f458e Convert test framework identifiers from snake_case to camelCase (Copilot)
6da12fa Comprehensive plan for refactoring (Copilot)
7e4365f Add performance test filtering and remove HTML coverage (Copilot)
b59196b Integrate perf tests into unit_tests and add CI coverage step (Copilot)
ba0451a Remove build2 CMake artifacts from repository (Copilot)
50f6a24 Remove test/perf/ directory completely (Copilot)
e26f8ab Address PR review comments (Copilot)
7003fec Simplify filter matching to use substring matching (Copilot)
30b9891 simplifying (chhwang)
b6ce0f2 simplify (chhwang)
d2efc2f coverage update (chhwang)
4afbf78 minor (chhwang)
e40c72b license text update (chhwang)
bed85b5 codecov upload (chhwang)
4d9acea badge (chhwang)
b693d1b lint issue (chhwang)
2b4adcc fix lint (chhwang)
b64536f Merge branch 'main' into copilot/remove-gtest-use-custom-framework (chhwang)
dcdd3fe update UT CI (chhwang)
caeec75 updates (chhwang)
b9609f8 add coverage flags (chhwang)
41695ba Merge branch 'main' into copilot/remove-gtest-use-custom-framework (chhwang)
febdbf9 WIP; need amd fix (chhwang)
c4afbe1 Merge branch 'main' into copilot/remove-gtest-use-custom-framework (chhwang)
04ebd9b fix coverage file path (chhwang)
54e46ba rocm fix wip (chhwang)
6c2bc8f coverage fix (chhwang)
d0c709e Fix Codecov token usage in coverage upload step (chhwang)
edda25d Merge branch 'main' into copilot/remove-gtest-use-custom-framework (chhwang)
2f02d38 Merge branch 'main' into copilot/remove-gtest-use-custom-framework (chhwang)
2adf4a4 use variable group (chhwang)
98b023a rocm fixes (chhwang)
22e5efb gdrcopy install in container (chhwang)
2f27d7d Update coverage report to exclude additional directories in lcov command (chhwang)
d88ee8d Refine coverage report to include only mscclpp source and include dir… (chhwang)
11e27e2 Update coverage report commands to handle errors and adjust paths (chhwang)
25f31b4 updates (chhwang)
75dfdd9 Merge branch 'main' into chhwang/fix-ib-no-atomic (chhwang)
ac4d713 updates (chhwang)
ac022c3 a few updates (chhwang)
72407af License (chhwang)
8effd97 License (chhwang)
fd7358d License, lint (chhwang)
67d1706 optimized recv loop (chhwang)
060982d updates (chhwang)
6b2f819 Merge branch 'main' into chhwang/fix-ib-no-atomic (chhwang)
eb99a26 Merge branch 'main' into copilot/remove-gtest-use-custom-framework (chhwang)
8c3a436 update CI (chhwang)
f4b8574 Merge branch 'main' into copilot/remove-gtest-use-custom-framework (chhwang)
3b56b08 data direct (chhwang)
448ceb6 updates (chhwang)
7ce841b Updates (chhwang)
bbb9c10 Update Docker image (chhwang)
60ff32c updates (chhwang)
00583da separate pipeline for codecov (chhwang)
c699b8a az pipeline refactoring (chhwang)
284d913 Merge branch 'main' into copilot/remove-gtest-use-custom-framework (chhwang)
75ac8be fix (chhwang)
e0c7ddb fix (chhwang)
c40a233 fix (chhwang)
375bc13 fix (chhwang)
bcb392f updates (chhwang)
ea1dd65 fix (chhwang)
d6a6fa2 simplified (chhwang)
a9cf938 fix (chhwang)
6647338 debugging (chhwang)
7a87c2c debugging (chhwang)
cf505d7 debugging (chhwang)
757c0ec debugging (chhwang)
e2a5be4 debugging (chhwang)
2a705f5 fix merge (chhwang)
a38bd9d Merge branch 'main' into copilot/remove-gtest-use-custom-framework (chhwang)
e2a9692 fix merge (chhwang)
2c4bab8 fix (chhwang)
a937ce4 debugging (chhwang)
d66d7e4 debugging (chhwang)
5a65cc7 debugging (chhwang)
2297a3d updates (chhwang)
2756221 update (chhwang)
bff76d5 Fix TearDown() handling and replace assert() in perf tests (Copilot)
6082648 fix for npkit (chhwang)
79a0149 updates (chhwang)
0200532 Merge branch 'copilot/remove-gtest-use-custom-framework' into chhwang… (chhwang)
80f554e Merge branch 'main' into chhwang/fix-ib-no-atomic (chhwang)
67f9933 fix data direct (chhwang)
d1124fb revert (chhwang)
144046b revert (chhwang)
f8e94d9 disable mlx5dv_reg_dmabuf_mr (chhwang)
4cf5332 updates (chhwang)
848b89b 64-bit token reconstruction (chhwang)
ff4d825 Merge branch 'main' into chhwang/fix-ib-no-atomic (chhwang)
94d0508 prerequisites update (chhwang)
553fd3b lint (chhwang)
53099a7 Merge branch 'main' into chhwang/fix-ib-no-atomic (chhwang)
f62633a mlx5dv bug fixes & enhanced unit tests perf reporting (chhwang)
b04fa2d lint (chhwang)
a4bb8fb add debugging code (mahdiehghazim)
194a79f add sendrecv correctness check (mahdiehghazim)
49979e5 tune #instances and remoce extra barriers (mahdiehghazim)
27fbddb update the executor so we have message size range (mahdiehghazim)
d07a1ba show scale in output (mahdiehghazim)
a191f16 add scripts (mahdiehghazim)
b1cc649 re-format output (mahdiehghazim)
a4118ea update the number of instances (mahdiehghazim)
289f89d update (Binyang2014)
1e6d493 update (Binyang2014)
251873c update (Binyang2014)
07d97f6 Unique QP per channel and env-controlled GID index (Binyang2014)
8cecfee debug (Binyang2014)
ad56728 fix (Binyang2014)
e487f83 debug (Binyang2014)
2c3f125 add changes from ib and connection (mahdiehghazim)
1a065dd add help scripts (mahdiehghazim)
812f6cf fix hang on 4 ranks and make send/recv test more like nccl-test (mahdiehghazim)
3f2ade2 add barrier (mahdiehghazim)
6d8fb00 add extra signal/wait and avoid local flush (mahdiehghazim)
96defbd add executor for testing (mahdiehghazim)
68690ec revert dsl (mahdiehghazim)
54c2f50 merge main
f83a557 Add sendrecv support with double-buffer to executor_test
76fdd1d WIP
57f7be6 WIP
65139d6 WIP (mahdiehghazim)
456ef7e fix (mahdiehghazim)
36abcbe WIP (mahdiehghazim)
a2a1b89 for 4 nodes (Binyang2014)
1fd5ed8 update the script (mahdiehghazim)
4a17b64 update (Binyang2014)
3a1e2d4 clean (Binyang2014)
8a42fe2 revert (Binyang2014)
7784407 WIP (Binyang2014)
e600520 WIP (Binyang2014)
4e09967 Merge branch 'main' into binyli/GB200 (Binyang2014)
3bd24e1 WIP (Binyang2014)
142e794 WIP (Binyang2014)
fd27fa0 Simplify executor_test: unify single/double-buffer paths via lists (Binyang2014)
4463595 Merge branch 'main' into binyli/GB200 (Binyang2014)
bde8d45 WIP (Binyang2014)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| # Copyright (c) Microsoft Corporation. | ||
| # Licensed under the MIT License. | ||
| | ||
| import argparse | ||
| from mscclpp.language.channel import * | ||
| from mscclpp.language.rank import * | ||
| from mscclpp.language.general import * | ||
| from mscclpp.language.program import * | ||
| from mscclpp.language.collectives import * | ||
| | ||
| | ||
| def send_recv(name, nnodes, gpus_per_node, split_mask, instances): | ||
|     gpu_size = nnodes * gpus_per_node | ||
|     collective = SendRecv(gpu_size, 1, False) | ||
|     with CollectiveProgram( | ||
|         name, | ||
|         collective, | ||
|         gpu_size, | ||
|         protocol="Simple", | ||
|         num_threads_per_block=1024, | ||
|         use_double_scratch_buffer=False, | ||
|         min_message_size=0, | ||
|         max_message_size=2**64 - 1, | ||
|         instances=instances, | ||
|     ): | ||
|         # Creating separate port channels for next and prev directions. | ||
|         # When prev and next are the same peer (e.g., 2-node ring), both channels go to the same peer | ||
|         # and get distinct tags. To ensure cross-rank tag matching (rank A's prev_channel signal | ||
|         # arrives at rank B's next_channel wait), we create channels in opposite order for the | ||
|         # "higher" rank so that tags cross-match: | ||
|         # Lower rank: [next(tag0), prev(tag1)] | ||
|         # Higher rank: [prev(tag0), next(tag1)] | ||
|         # Then lower.prev(tag1) == higher.next(tag1) and higher.prev(tag0) == lower.next(tag0) | ||
|         # When prev != next (3+ nodes), each channel targets a different peer so each gets tag 0 | ||
|         # and this ordering doesn't matter. | ||
|         group_size = split_mask + 1 | ||
|         num_groups = gpu_size // group_size | ||
|         next_channels = {} # channel for sending to next rank | ||
|         prev_channels = {} # channel for receiving from prev rank | ||
|         prev_next_ids = {} | ||
|         for node in range(nnodes): | ||
|             for gpu in range(gpus_per_node): | ||
|                 global_rank_id = gpu + gpus_per_node * node | ||
|                 position_in_group = global_rank_id & split_mask | ||
|                 group_id = global_rank_id // group_size | ||
|                 next_group_id = (group_id + 1) % num_groups | ||
|                 next_global_rank_id = next_group_id * group_size + position_in_group | ||
|                 prev_group_id = (group_id - 1 + num_groups) % num_groups | ||
|                 prev_global_rank_id = prev_group_id * group_size + position_in_group | ||
| | ||
|                 if prev_global_rank_id == next_global_rank_id and global_rank_id > prev_global_rank_id: | ||
|                     # Higher rank: create prev first, then next (swapped order) | ||
|                     prev_channels[global_rank_id] = PortChannel(prev_global_rank_id, global_rank_id) | ||
|                     next_channels[global_rank_id] = PortChannel(next_global_rank_id, global_rank_id) | ||
|                 else: | ||
|                     # Lower rank or different peers: create next first, then prev | ||
|                     next_channels[global_rank_id] = PortChannel(next_global_rank_id, global_rank_id) | ||
|                     prev_channels[global_rank_id] = PortChannel(prev_global_rank_id, global_rank_id) | ||
|                 prev_next_ids[global_rank_id] = (prev_global_rank_id, next_global_rank_id) | ||
| | ||
|         # sync with the next rank and the previous rank in the group | ||
|         for node in range(nnodes): | ||
|             for gpu in range(gpus_per_node): | ||
|                 global_rank_id = gpu + gpus_per_node * node | ||
|                 prev_global_rank_id, next_global_rank_id = prev_next_ids[global_rank_id] | ||
|                 prev_channels[global_rank_id].signal(tb=0, data_sync=SyncType.none) | ||
|                 next_channels[global_rank_id].wait(tb=0, data_sync=SyncType.after) | ||
| | ||
|                 src_rank = Rank(global_rank_id) | ||
|                 src_buffer = src_rank.get_input_buffer() | ||
|                 dst_rank = Rank(next_global_rank_id) | ||
|                 dst_buffer = dst_rank.get_output_buffer() | ||
| | ||
|                 next_channels[global_rank_id].put_with_signal(dst_buffer[:], src_buffer[:], tb=0) | ||
|                 prev_channels[global_rank_id].wait(tb=0, data_sync=SyncType.none) | ||
| | ||
|         print(JSON()) | ||
| | ||
| | ||
| parser = argparse.ArgumentParser() | ||
| | ||
| parser.add_argument("--name", type=str, help="name of the program") | ||
| parser.add_argument("--nnodes", type=int, default=1, help="number of nodes") | ||
| parser.add_argument("--gpus_per_node", type=int, help="number of gpus per node") | ||
| parser.add_argument("--split_mask", type=lambda x: int(x, 0), default=0x3, help="split mask (e.g. 0x3)") | ||
| parser.add_argument("--instances", type=int, default=4, help="number of instances") | ||
| | ||
| args = parser.parse_args() | ||
| | ||
| send_recv(args.name, args.nnodes, args.gpus_per_node, args.split_mask, args.instances) | ||
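
The neighbor arithmetic and the channel creation order in the diff above can be checked without the mscclpp DSL. The following is a minimal standalone sketch, not part of the PR: `ring_neighbors` repeats the `split_mask` arithmetic from `send_recv`, while `channel_tags` is a hypothetical model of the tag assignment that the code comment describes (channels to the same peer are assumed to get tags 0, 1, ... in creation order). The real tag assignment lives inside the DSL and may differ.

```python
def ring_neighbors(rank, gpu_size, split_mask):
    """Return (prev_rank, next_rank) using the same arithmetic as send_recv."""
    group_size = split_mask + 1
    num_groups = gpu_size // group_size
    position_in_group = rank & split_mask
    group_id = rank // group_size
    next_group_id = (group_id + 1) % num_groups
    prev_group_id = (group_id - 1 + num_groups) % num_groups
    next_rank = next_group_id * group_size + position_in_group
    prev_rank = prev_group_id * group_size + position_in_group
    return prev_rank, next_rank


def channel_tags(rank, gpu_size, split_mask):
    """Hypothetical tag model (an assumption, not the DSL's code).

    Channels created toward the same peer receive increasing tags in
    creation order, starting from 0 for each peer.
    """
    prev_rank, next_rank = ring_neighbors(rank, gpu_size, split_mask)
    if prev_rank == next_rank and rank > prev_rank:
        order = ["prev", "next"]  # higher rank: swapped creation order
    else:
        order = ["next", "prev"]  # lower rank, or two different peers
    peers = {"prev": prev_rank, "next": next_rank}
    next_tag_for_peer = {}
    tags = {}
    for direction in order:
        peer = peers[direction]
        tags[direction] = next_tag_for_peer.get(peer, 0)
        next_tag_for_peer[peer] = tags[direction] + 1
    return tags


if __name__ == "__main__":
    # Two groups of four ranks: prev and next are the same peer.
    print(ring_neighbors(1, 8, 0x3), channel_tags(1, 8, 0x3))
    print(ring_neighbors(5, 8, 0x3), channel_tags(5, 8, 0x3))
    # Three groups of four ranks: prev and next are different peers.
    print(ring_neighbors(1, 12, 0x3), channel_tags(1, 12, 0x3))
```

With 8 ranks and `split_mask=0x3`, ranks 1 and 5 are each other's only neighbor. Under the assumed tag model, rank 1's `prev` channel and rank 5's `next` channel both get tag 1, and rank 5's `prev` and rank 1's `next` both get tag 0, which is the cross-matching the comment asks for. With 12 ranks the two neighbors differ, so both channels get tag 0 and the creation order has no effect. Because `--split_mask` is parsed with `int(x, 0)`, passing `0x3` or `3` on the command line yields the same mask.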