generated from amazon-archives/__template_MIT-0
Pull requests: awslabs/awsome-distributed-training
Pull requests list
Add NVRx resiliency testing for distributed training on Amazon EKS
#1023
opened Mar 17, 2026 by
aravneelaws
Updating hyperpod-elastic-agent (HPEA) to v1.1.2 to support torch v2.6+
#1022
opened Mar 13, 2026 by
aravneelaws
7 tasks done
Slinky Slurm on HyperPod EKS — Deployment Automation & Infrastructure Updates
#1020
opened Mar 12, 2026 by
bluecrayon52
docs: add Instance Compatibility Guide with per-test-case configuration tables
#1017
opened Mar 11, 2026 by
nkumaraws
Add NCCL send/recv ring benchmark for multi-GPU testing
#1013
opened Mar 10, 2026 by
paulogallotti
Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS
#1010
opened Mar 9, 2026 by
dmvevents
6 tasks
Add optional Training Plan support for HyperPod instance groups
#1004
opened Feb 26, 2026 by
newabdosheham
Syntax improvements and code quality enhancements for EFA node exporter
#966
opened Feb 17, 2026 by
KeitaW