Commit 2ac012e
Add intra_group_size to topology (meta-pytorch#3696)
Summary:
Context
---------
Every GB200 node has 2 B200 GPUs attached to it, but NVLink allows up to 72 B200 GPUs to be connected. The planner needs to know how large the intra-topology group is going to be.
As a result, `local_world_size` can differ from `intra_group_size`.
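To make the distinction concrete, here is a minimal sketch of how the two sizes partition ranks differently. It assumes ranks are assigned contiguously to hosts and to NVLink domains; the helper names (`group_index`, `same_host`, `same_nvlink_domain`) are hypothetical and not part of torchrec. The sizes 2 and 72 come from the description above.

```python
LOCAL_WORLD_SIZE = 2   # GPUs per host, per the description above
INTRA_GROUP_SIZE = 72  # GPUs per NVLink domain, per the description above


def group_index(rank: int, group_size: int) -> int:
    """Index of the group a rank falls in, assuming contiguous rank assignment."""
    return rank // group_size


def same_host(a: int, b: int) -> bool:
    """True when both ranks share a host (hypothetical helper)."""
    return group_index(a, LOCAL_WORLD_SIZE) == group_index(b, LOCAL_WORLD_SIZE)


def same_nvlink_domain(a: int, b: int) -> bool:
    """True when both ranks share an NVLink domain (hypothetical helper)."""
    return group_index(a, INTRA_GROUP_SIZE) == group_index(b, INTRA_GROUP_SIZE)
```

Under these assumptions, ranks 0 and 2 sit on different hosts yet still share a high-bandwidth link, which is the case a planner keyed only on `local_world_size` would miss.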
Implementation
------------------
- Topology class:
  - Adds `pod_size` to the `Topology` class and uses it to calculate `intra_group_size` (the maximum number of processes linked with high intra-group bandwidth). If `pod_size` isn't given, `intra_group_size` defaults to `local_world_size`.
- `shard_estimators.py`:
  - The shard estimators now use `intra_group_size` instead of `local_world_size`, which allows RW/TW/CW to properly account for the larger NVLink domain that comes with the pods.
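The two changes above can be sketched roughly as follows. This is an illustration, not the code from this commit: `SketchTopology` and `comms_bandwidth` are hypothetical names, and the formula `pod_size * local_world_size` is an assumption (it treats `pod_size` as the number of hosts per pod). Only the fallback to `local_world_size` is stated in the summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTopology:
    """Hypothetical stand-in for the planner's Topology; not the torchrec class."""

    world_size: int
    local_world_size: int
    pod_size: Optional[int] = None  # assumed meaning: hosts per NVLink-connected pod

    @property
    def intra_group_size(self) -> int:
        # Fallback described in the summary: default to local_world_size.
        if self.pod_size is None:
            return self.local_world_size
        # Assumed formula: hosts per pod times GPUs per host.
        return self.pod_size * self.local_world_size


def comms_bandwidth(topology: SketchTopology, intra_bw: float, inter_bw: float) -> float:
    """Pick a bandwidth for a collective that spans the whole world (hypothetical)."""
    # If every rank fits inside one high-bandwidth group, the fast link applies.
    if topology.world_size <= topology.intra_group_size:
        return intra_bw
    return inter_bw
```

With 72 ranks and 2 GPUs per host, an estimator keyed on `local_world_size` would fall back to the slower inter-host bandwidth, while one keyed on `intra_group_size` (with `pod_size=36` under the assumed formula) would use the intra-group bandwidth.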
Reviewed By: isururanawaka
Differential Revision: D916178871
Parent 06d0acb, commit 2ac012e
4 files changed in torchrec/distributed/planner (+28, -5 lines)