Merged
Changes from 175 commits
177 commits
5f42426
inital creation of test files
Empyreus Mar 16, 2026
431234f
inital pipeline test
Empyreus Mar 17, 2026
371dfb3
fix pip
Empyreus Mar 17, 2026
8686d81
testing
Empyreus Mar 17, 2026
ffa120f
rework template
Empyreus Mar 17, 2026
343c367
fix sudo
Empyreus Mar 18, 2026
4742dfe
fix sudo issue
Empyreus Mar 18, 2026
b7ede93
move from apt-get to pip
Empyreus Mar 18, 2026
0809265
install pip systemwide
Empyreus Mar 18, 2026
c38c351
attempting to gix az cli
Empyreus Mar 18, 2026
b7adec0
create sglang docker image
Empyreus Mar 19, 2026
fa24653
update docker image
Empyreus Mar 25, 2026
ee771ec
fix dockerfile
Empyreus Mar 25, 2026
dd3c3ed
change image
Empyreus Mar 25, 2026
c883994
add git clone msccl
Empyreus Mar 26, 2026
d33458d
moidfy to setup sglang
Empyreus Mar 26, 2026
12d935f
fix docker name
Empyreus Mar 26, 2026
4cf9cd7
change cuda version
Empyreus Mar 26, 2026
8b00789
check hostname
Empyreus Mar 26, 2026
257735c
setup sglang python venv
Empyreus Mar 26, 2026
7f016cb
fix venv
Empyreus Mar 26, 2026
57eedc9
Merge branch 'main' into rjsouza/sglang-tests
Empyreus Mar 26, 2026
01b7af9
sanity check
Empyreus Mar 26, 2026
99f2fac
fixes
Empyreus Mar 26, 2026
c541f27
update template
Empyreus Mar 26, 2026
8d64263
update template
Empyreus Mar 26, 2026
e423ca8
rename files
Empyreus Mar 26, 2026
0a6d329
add sshke
Empyreus Mar 26, 2026
fa30289
update for new remote run
Empyreus Mar 26, 2026
3514899
fix cmake
Empyreus Mar 27, 2026
4107fa9
fix run remote
Empyreus Mar 27, 2026
324254d
finish adding sglang steps
Empyreus Mar 27, 2026
38552a6
fix remote run and clean up files
Empyreus Mar 27, 2026
f171663
fix batch size
Empyreus Mar 27, 2026
a9d7bd8
fix
Empyreus Mar 27, 2026
6ac12fa
comment out to fix pipeline
Empyreus Mar 27, 2026
83d9301
full run
Empyreus Mar 27, 2026
a22104c
add remaining tests
Empyreus Mar 30, 2026
f938f60
update sglang-test
Empyreus Mar 30, 2026
48a6a2e
add sglang all_reduce
Empyreus Mar 31, 2026
80194b2
fix directory
Empyreus Mar 31, 2026
0c8f4fd
find directory
Empyreus Mar 31, 2026
49aeea0
fix container deletion
Empyreus Mar 31, 2026
36c496d
readd tests
Empyreus Mar 31, 2026
503647c
fix missing quote
Empyreus Mar 31, 2026
f5159b0
check if pipeline needs creation
Empyreus Apr 1, 2026
d7b0dd6
trying to rework image pull
Empyreus Apr 1, 2026
131f128
comment out old docker pull
Empyreus Apr 1, 2026
4f37637
fixes
Empyreus Apr 1, 2026
2080baa
try new removal
Empyreus Apr 1, 2026
ea8e6af
fix missing container
Empyreus Apr 1, 2026
3196758
remove container
Empyreus Apr 1, 2026
d6dd64f
fix copy
Empyreus Apr 1, 2026
e4244c4
fix clone
Empyreus Apr 1, 2026
7948682
fix
Empyreus Apr 1, 2026
566cf93
remove mscl build
Empyreus Apr 2, 2026
8e4ddc1
remove msccl tests
Empyreus Apr 2, 2026
827ca09
fully remove msccl isntall step
Empyreus Apr 2, 2026
7fddcaf
reset msccl
Empyreus Apr 2, 2026
dba5841
comment out tests
Empyreus Apr 2, 2026
faa600c
Retry
Empyreus Apr 2, 2026
32e2b64
try new msccl
Empyreus Apr 2, 2026
6fd8b18
change cmake version
Empyreus Apr 2, 2026
61e0540
update for new cmake
Empyreus Apr 2, 2026
376a6a2
rmobe build
Empyreus Apr 2, 2026
149be8e
fix -
Empyreus Apr 2, 2026
10648a4
add --priveldged
Empyreus Apr 2, 2026
7b03ece
add prints
Empyreus Apr 3, 2026
53d6f76
simplify container
Empyreus Apr 3, 2026
e8266a1
running on a100
Empyreus Apr 3, 2026
e68125f
change to h100 machine
Empyreus Apr 3, 2026
58c5234
ignore version mismatch
Empyreus Apr 3, 2026
68cf67d
unit test
Empyreus Apr 6, 2026
ea97444
change path
Empyreus Apr 6, 2026
88e1ac7
fix paths
Empyreus Apr 6, 2026
8fb7514
add multi node
Empyreus Apr 7, 2026
0bf5998
try multi-pipeline
Empyreus Apr 7, 2026
d8e1de7
fix formatting
Empyreus Apr 7, 2026
1fbcbfd
fix formatting
Empyreus Apr 7, 2026
a1bc727
fix deploy step
Empyreus Apr 7, 2026
512416e
add resourcegroup
Empyreus Apr 7, 2026
4f677b6
host entries
Empyreus Apr 7, 2026
1ad0f1c
hostentries
Empyreus Apr 7, 2026
e1687e8
update to h100 multinode
Empyreus Apr 8, 2026
14f75d8
fix pool
Empyreus Apr 8, 2026
e091f65
Merge branch 'main' into rjsouza/sglang-tests
Empyreus May 4, 2026
a8b9599
Inital new test
Empyreus May 4, 2026
cb430b3
clean up deploy
Empyreus May 4, 2026
97a4b1a
remove duplicate stop
Empyreus May 4, 2026
de244e5
update sglang bench
Empyreus May 4, 2026
eaa611f
split multi node test
Empyreus May 4, 2026
dfdc9f7
update pool
Empyreus May 4, 2026
21197f7
change directory
Empyreus May 4, 2026
f6637cc
attempt to print nvidia-smi for cuda drivers
Empyreus May 4, 2026
812d43d
return failed result for new test
Empyreus May 4, 2026
3f86480
add container name to deploy
Empyreus May 4, 2026
1f3f81b
make fixes
Empyreus May 4, 2026
3e655e5
remove cd
Empyreus May 4, 2026
7ca6343
trying without python
Empyreus May 4, 2026
77cb367
fix eof
Empyreus May 5, 2026
f61407e
try to install everything not just python
Empyreus May 5, 2026
22a0953
fix vmss name
Empyreus May 5, 2026
1fe41e0
sglang
Empyreus May 5, 2026
47494ea
check sglang versions
Empyreus May 5, 2026
783c73b
try to resolve single
Empyreus May 5, 2026
1ca7b65
host file
Empyreus May 5, 2026
3b96b5a
disable flashinfer version
Empyreus May 6, 2026
97dda7b
clean up for PR
Empyreus May 11, 2026
e28ef44
small clean up for multi
Empyreus May 11, 2026
c8c55ea
final commit for testing
Empyreus May 11, 2026
a44613e
revert to original pipeline files
Empyreus May 11, 2026
64d7913
Merge branch 'main' into rjsouza/sglang-tests
Empyreus May 11, 2026
445d32d
update multinode to run more batches
Empyreus May 12, 2026
678ba8d
combine single node to one benchmark
Empyreus May 12, 2026
7ae184f
fix parameters
Empyreus May 12, 2026
64beee4
test removal of manual mscclpp install
Empyreus May 12, 2026
a81ba07
readd mscclpp build
Empyreus May 12, 2026
dd5f175
fix world size
Empyreus May 12, 2026
57bf8cc
remove bench
Empyreus May 12, 2026
00e5d89
improve container name handling in deploy.sh
Empyreus May 12, 2026
40d5b27
Potential fix for pull request finding
Empyreus May 12, 2026
c910ec6
fix python env
Empyreus May 12, 2026
b58b047
Merge branch 'rjsouza/sglang-tests' of https://github.com/microsoft/m…
Empyreus May 12, 2026
cadfc2f
remove unneeded install
Empyreus May 12, 2026
405619b
fix some issues with single node all reduce
Empyreus May 13, 2026
596635f
removed uneeded single node values
Empyreus May 13, 2026
6195b31
remove redundent mscclpp builds
Empyreus May 13, 2026
9064ca0
Merge branch 'main' into rjsouza/sglang-tests
Empyreus May 13, 2026
9db1733
remove unneeded file
Empyreus May 13, 2026
b1bc5fa
update sglang branch
Empyreus May 14, 2026
27ed16c
rename test now that it runs all batch sizes
Empyreus May 14, 2026
e4c4d15
update bench one batch comand
Empyreus May 14, 2026
c3c1e1a
revert unneeded change
Empyreus May 14, 2026
d5cb5e3
update docker run to remove container
Empyreus May 14, 2026
c52f310
update to variable names for docker run
Empyreus May 14, 2026
011f9e4
test breaking change
Empyreus May 14, 2026
ebf5db2
reconfigure container
Empyreus May 14, 2026
157e438
readd basic container image
Empyreus May 14, 2026
c1227d0
fix
Empyreus May 14, 2026
5af3867
clean up comment
Empyreus May 14, 2026
50974e5
fix typo
Empyreus May 14, 2026
5cc72f8
Merge branch 'rjsouza/sglang-tests' of https://github.com/microsoft/m…
Empyreus May 14, 2026
d76b142
check dlpack fix
Empyreus May 14, 2026
2b0aab3
cleanup old image
Empyreus May 14, 2026
d3df69a
revert changes
Empyreus May 14, 2026
c1361c2
Fix spelling
Empyreus May 14, 2026
311274c
Merge branch 'main' into rjsouza/sglang-tests
Empyreus May 15, 2026
dafc2d1
custom image no longer needed
Empyreus May 15, 2026
0ea1103
Check if pip flag needed
Empyreus May 15, 2026
ca2289a
remove bench one batch from single node
Empyreus May 15, 2026
ba83f3f
add all reduce to multi-node
Empyreus May 15, 2026
aef8137
fixing missing containerImage
Empyreus May 15, 2026
d652a1f
temp to delete old image
Empyreus May 15, 2026
02dec16
remove temp fix
Empyreus May 15, 2026
fe09089
readd bench one batch
Empyreus May 15, 2026
2a44522
remove unneeded comments
Empyreus May 15, 2026
be64cca
fix docker image
Empyreus May 18, 2026
7915f46
remove test
Empyreus May 18, 2026
51d1aa0
remove install cmake
Empyreus May 18, 2026
20a8936
fxi naming
Empyreus May 18, 2026
fba0416
remove sglang from docker build
Empyreus May 18, 2026
8a5adc6
readd missing vmss
Empyreus May 18, 2026
afb0c43
revert docker image changes
Empyreus May 19, 2026
0ec5218
update sglang image
Empyreus May 19, 2026
3b881c0
change sglang install multi-node
Empyreus May 19, 2026
45dc77e
specifiy cuda runtime
Empyreus May 20, 2026
136f764
Merge branch 'main' into rjsouza/sglang-tests
Empyreus May 20, 2026
de13566
update sglang docker image
Empyreus May 20, 2026
377dc56
attempt to use specific branch
Empyreus May 22, 2026
0fc5452
update base image
Empyreus May 22, 2026
5818410
revet branch change
Empyreus May 22, 2026
5879bc0
revert container image version
Empyreus May 22, 2026
7eb0f34
Merge branch 'main' into rjsouza/sglang-tests
Empyreus May 26, 2026
2dfae6b
add defaults to deployArgs
Empyreus May 26, 2026
c06e86a
Merge branch 'main' into rjsouza/sglang-tests
Empyreus May 26, 2026
dad9ceb
remove added sudo update
Empyreus May 27, 2026
150b035
remove comments
Empyreus May 27, 2026
22 changes: 11 additions & 11 deletions .azure-pipelines/integration-test.yml
@@ -19,11 +19,11 @@ pr:
drafts: false
paths:
exclude:
- .devcontainer/**
- .github/**
- docker/**
- docs/**
- '**/*.md'
- .devcontainer/**
- .github/**
- docker/**
- docs/**
- '**/*.md'

jobs:
- job: IntegrationTestA100
@@ -43,9 +43,9 @@ jobs:
steps:
- template: templates/integration-test.yml
parameters:
subscription: mscclpp-ci
vmssName: mscclpp-ci
gpuArch: '80'
subscription: mscclpp-ci
vmssName: mscclpp-ci
gpuArch: '80'

- job: IntegrationTestH100
displayName: Integration test H100
@@ -62,7 +62,7 @@ jobs:
steps:
- template: templates/integration-test.yml
parameters:
subscription: mscclpp-ci-h100
vmssName: mscclpp-h100-ci
subscription: mscclpp-ci-h100
vmssName: mscclpp-h100-ci
perfBaselineFile: test/deploy/perf_ndmv5.jsonl
gpuArch: '90'
gpuArch: '90'
11 changes: 5 additions & 6 deletions .azure-pipelines/multi-nodes-test.yml
@@ -14,7 +14,6 @@ trigger:
# Do not run multi-nodes-test for PR, we can trigger it manually
pr: none


parameters:
- name: vmssName
type: string
@@ -79,10 +78,10 @@ jobs:

- template: templates/deploy.yml
parameters:
subscription: mscclpp-ci-h100
vmssName: ${{ parameters.vmssName }}
subscription: mscclpp-ci-h100
vmssName: ${{ parameters.vmssName }}
resourceGroup: mscclpp
gpuArch: '90'
gpuArch: '90'

- template: templates/run-remote-task.yml
parameters:
@@ -119,6 +118,6 @@ jobs:

- template: templates/stop.yml
parameters:
subscription: mscclpp-ci-h100
vmssName: ${{ parameters.vmssName }}
subscription: mscclpp-ci-h100
vmssName: ${{ parameters.vmssName }}
resourceGroup: mscclpp
141 changes: 141 additions & 0 deletions .azure-pipelines/sglang-multi-node-test.yml
@@ -0,0 +1,141 @@
# =============================================================================
# Multi-node SGLang integration test pipeline.
#
# This pipeline runs MSCCL++ SGLang tests across two H100 VMSS GPU nodes.
# High-level flow:
#   1. The pipeline agent runs inside a container on the `mscclpp-multi-node`
#      pool. The agent itself has no GPUs.
#   2. SSH/host configuration is generated so the agent can reach the two
#      pre-provisioned VMSS GPU nodes.
#   3. `templates/deploy.yml` builds and ships MSCCL++ to the GPU nodes.
#   4. `templates/sglang-multi-test.yml` runs the SGLang multi-node tests.
#   5. `templates/stop.yml` tears down / stops the VMSS nodes.
#
# Docs / non-code changes are excluded from triggering this pipeline.
# =============================================================================

trigger:
  branches:
    include:
      - main
      - release/*
  paths:
    exclude:
      - .devcontainer/**
      - .github/**
      - docker/**
      - docs/**
      - '**/*.md'

pr:
  branches:
    include:
      - main
      - release/*
  drafts: false
  paths:
    exclude:
      - .devcontainer/**
      - .github/**
      - docker/**
      - docs/**
      - '**/*.md'

parameters:
  # Name of the pre-provisioned Azure VMSS that hosts the GPU test nodes.
  # Node hostnames are derived as "${vmssName}000000" and "${vmssName}000001".
  - name: vmssName
    type: string
    default: mscclpp-h100-multinode-ci
  # Static /etc/hosts entries mapping VMSS node hostnames to their private IPs.
  # These IPs are tied to the specific VMSS above; update both together if the
  # VMSS is reprovisioned or renamed.
  - name: hostEntries
    type: string
    default: |
      10.0.0.5 mscclpp-h100-multinode-ci000000
      10.0.0.4 mscclpp-h100-multinode-ci000001
  # Docker image used for the SGLang test container on the GPU nodes.
  - name: sglangImage
    type: string
    default: lmsysorg/sglang:latest-cu129

jobs:
  - job: SGLangTestMultiNode
    displayName: SGLang Test Multi Node
    strategy:
      matrix:
        cuda12:
          containerImage: ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.9
    pool:
      name: mscclpp-multi-node
    container:
      image: $(containerImage)

    steps:
      # Ensure the VMSS node hostnames resolve from the pipeline agent container.
      # Idempotent: only appends lines that are not already present in /etc/hosts.
      - task: Bash@3
        displayName: Add HostEntry
        inputs:
          targetType: 'inline'
          script: |
            while IFS= read -r line; do
              [ -z "$line" ] && continue
              if ! grep -qxF "$line" /etc/hosts; then
                echo "Adding to /etc/hosts: $line"
                echo "$line" | sudo tee -a /etc/hosts
              else
                echo "Entry already exists: $line"
              fi
            done <<< "${{ parameters.hostEntries }}"

      # Generate the SSH config and hostfile consumed by the deploy / test
      # templates below:
      #   - config   : SSH client config (custom port + key) for each node
      #   - hostfile : user@host list used by deploy / test scripts (parallel-ssh)
      - task: Bash@3
        displayName: Generate deploy files
        inputs:
          targetType: 'inline'
          script: |
            set -e
            VMSS="${{ parameters.vmssName }}"
            DEPLOY_DIR="$(System.DefaultWorkingDirectory)/test/deploy"
            NODE0="${VMSS}000000"
            NODE1="${VMSS}000001"

            echo "Host ${NODE0}
              Port 22345
              IdentityFile /root/mscclpp/sshkey
              StrictHostKeyChecking no
            Host ${NODE1}
              Port 22345
              IdentityFile /root/mscclpp/sshkey
              StrictHostKeyChecking no" > "${DEPLOY_DIR}/config"

            printf '%s\n%s\n' "azureuser@${NODE0}" "azureuser@${NODE1}" > "${DEPLOY_DIR}/hostfile"

      # Build MSCCL++ and deploy it onto the VMSS GPU nodes.
      - template: templates/deploy.yml
        parameters:
          subscription: mscclpp-ci-h100
          vmssName: ${{ parameters.vmssName }}
          resourceGroup: mscclpp
          gpuArch: '90'
          deployArgs: 'multi-node-test true cuda'
          containerName: 'sglang-mscclpp-test'
          sglangImage: ${{ parameters.sglangImage }}

      # Run the SGLang multi-node tests across the two GPU nodes.
      - template: templates/sglang-multi-test.yml
        parameters:
          subscription: mscclpp-ci-h100
          vmssName: ${{ parameters.vmssName }}

      # Stop/deallocate the VMSS GPU nodes to release resources.
      - template: templates/stop.yml
        parameters:
          subscription: mscclpp-ci-h100
          vmssName: ${{ parameters.vmssName }}
          resourceGroup: mscclpp
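
The `Add HostEntry` step in this file appends each line of `hostEntries` to `/etc/hosts` only when an identical line is not already present, so re-running the job does not duplicate entries. Below is a minimal sketch of the same pattern, written as a function so it can be tried against a scratch file rather than `/etc/hosts`. The function name `add_host_entries`, the placeholder node names, and the temp file are illustrative assumptions and are not part of this PR.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: same append-if-missing logic as the pipeline step,
# but the target file is a parameter so it can be exercised safely.
# Usage: add_host_entries <hosts_file> <newline-separated entries>
add_host_entries() {
  local hosts_file="$1"
  local entries="$2"
  local line
  while IFS= read -r line; do
    # Skip blank lines (a YAML block scalar usually ends with one).
    [ -z "$line" ] && continue
    # -x: match the whole line, -F: fixed string, -q: quiet.
    if ! grep -qxF -- "$line" "$hosts_file"; then
      printf '%s\n' "$line" >> "$hosts_file"
    fi
  done <<< "$entries"
}

# Demo against a scratch file (hypothetical node names).
scratch="$(mktemp)"
printf '127.0.0.1 localhost\n' > "$scratch"
demo_entries=$'10.0.0.5 node000000\n10.0.0.4 node000001\n'
add_host_entries "$scratch" "$demo_entries"
add_host_entries "$scratch" "$demo_entries"   # second call adds nothing
cat "$scratch"
rm -f "$scratch"
```

The second call leaves the file unchanged because `grep -qxF` treats each entry as a fixed string that must match an entire existing line.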
63 changes: 63 additions & 0 deletions .azure-pipelines/sglang-test.yml
@@ -0,0 +1,63 @@
# =============================================================================
# Single-node SGLang integration test pipeline.
#
# Runs MSCCL++ SGLang tests on a single H100 GPU node from the `msccl-ci-h100`
# pool. All deploy / run / teardown logic is delegated to
# `templates/sglang-test.yml`.
#
# Docs / non-code changes are excluded from triggering this pipeline.
# =============================================================================

trigger:
  branches:
    include:
      - main
      - release/*
  paths:
    exclude:
      - .devcontainer/**
      - .github/**
      - docker/**
      - docs/**
      - '**/*.md'

pr:
  branches:
    include:
      - main
      - release/*
  drafts: false
  paths:
    exclude:
      - .devcontainer/**
      - .github/**
      - docker/**
      - docs/**
      - '**/*.md'

parameters:
  # Docker image used for the SGLang test container on the GPU node.
  - name: sglangImage
    type: string
    default: lmsysorg/sglang:latest-cu129

jobs:
  - job: SGLangTest
    displayName: SGLang Test
    strategy:
      matrix:
        cuda12:
          containerImage: ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.9
    pool:
      name: msccl-ci-h100
    container:
      image: $(containerImage)

    steps:
      # Deploy MSCCL++ to the GPU node and run the SGLang single-node tests.
      - template: templates/sglang-test.yml
        parameters:
          subscription: mscclpp-ci-h100
          vmssName: mscclpp-h100-ci
          gpuArch: '90'
          sglangImage: ${{ parameters.sglangImage }}
9 changes: 8 additions & 1 deletion .azure-pipelines/templates/deploy.yml
@@ -32,6 +32,12 @@ parameters:
   - name: deployArgs
     type: string
     default: ''
+  - name: containerName
+    type: string
+    default: 'mscclpp-test'
+  - name: sglangImage
+    type: string
+    default: ''

 steps:
   # 0. Ensure Azure CLI exists before running AzureCLI@2 tasks.
@@ -56,6 +62,7 @@ steps:
         targetType: 'inline'
         script: |
           set -e
+          sudo apt-get update -y
           rm -rf build
           mkdir -p build && cd build
           BUILD_TESTS_ARG=""
@@ -147,5 +154,5 @@ steps:
       inputs:
         targetType: filePath
         filePath: test/deploy/deploy.sh
-        arguments: ${{ parameters.deployArgs }}
+        arguments: ${{ parameters.deployArgs }} ${{ parameters.containerName }} ${{ parameters.sglangImage }}
         workingDirectory: '$(System.DefaultWorkingDirectory)'
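
With this change, `deploy.sh` receives `containerName` and `sglangImage` as trailing positional arguments after whatever `deployArgs` expands to. Because the values are positional, a caller whose `deployArgs` holds fewer than three words would shift the container name into an earlier slot, which is presumably why the calling templates in this PR now pass `true cuda` explicitly. A rough sketch of that effect follows. `deploy.sh` itself is not shown in this diff, so the function `parse_deploy_args` and its slot names are hypothetical; only the argument ordering is taken from the `arguments:` line.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for positional argument handling; NOT the real deploy.sh.
parse_deploy_args() {
  local mode="${1:-}"            # first word of deployArgs, e.g. single-node-test
  local arg2="${2:-}"            # second word of deployArgs (meaning not shown in the diff)
  local arg3="${3:-}"            # third word of deployArgs (meaning not shown in the diff)
  local container_name="${4:-}"  # from parameters.containerName
  local sglang_image="${5:-}"    # from parameters.sglangImage (may be empty)
  printf 'mode=%s arg2=%s arg3=%s container=%s image=%s\n' \
    "$mode" "$arg2" "$arg3" "$container_name" "$sglang_image"
}

# Three-word deployArgs: the container name lands in slot 4 as intended.
parse_deploy_args single-node-test true cuda mscclpp-test

# One-word deployArgs: the container name shifts into slot 2.
parse_deploy_args single-node-test mscclpp-test
```

If the script ever needs optional values in the middle of the list, named flags would avoid this shifting, at the cost of changing every caller.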
2 changes: 1 addition & 1 deletion .azure-pipelines/templates/integration-test.yml
@@ -15,7 +15,7 @@ steps:
       subscription: ${{ parameters.subscription }}
       vmssName: ${{ parameters.vmssName }}
       gpuArch: ${{ parameters.gpuArch }}
-      deployArgs: 'single-node-test'
+      deployArgs: 'single-node-test true cuda'

   - template: run-remote-task.yml
     parameters:
2 changes: 1 addition & 1 deletion .azure-pipelines/templates/nccl-test.yml
@@ -23,7 +23,7 @@ steps:
       subscription: ${{ parameters.subscription }}
       vmssName: ${{ parameters.vmssName }}
       gpuArch: ${{ parameters.gpuArch }}
-      deployArgs: 'nccltest-single-node'
+      deployArgs: 'nccltest-single-node true cuda'

   - template: run-remote-task.yml
     parameters: