
Improve vLLM workflows#5480

Merged
jinyan-li1 merged 9 commits into main from improve-workflow
Nov 17, 2025

@jinyan-li1 jinyan-li1 commented Nov 15, 2025

GitHub Issue #, if available:

Note:

  • If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.

  • All PRs are checked weekly for staleness. This PR will be closed if not updated in 30 days.

Description

Tests Run

vLLM build and test: d5cb001

rayserve build and test: e73992d

Both tested after removing old workflows and renaming new workflow: ccbfc03

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

  1. Using dlc_developer_config.toml
  2. Using this PR description (currently only supported for PyTorch, TensorFlow, vLLM, and base images)

How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

  • Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

  • Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

  • Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please make sure the local, standard, rc, and efa SageMaker tests pass by enabling them in the dlc_developer_config.toml file:

  • sagemaker_remote_tests = true
  • sagemaker_efa_tests = true
  • sagemaker_rc_tests = true
  • sagemaker_local_tests = true

How to use PR description

Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
  • # /buildspec <buildspec_path>
    • e.g.: # /buildspec pytorch/training/buildspec.yml
    • If this line is commented out, dlc_developer_config.toml will be used.
  • # /tests <test_list>
    • e.g.: # /tests sanity security ec2
    • If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
# /buildspec <buildspec_path>
# /tests <test_list>
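
The behavior described above (commented-out commands are ignored, and defaults apply when a command is absent) can be sketched as follows. This is a minimal illustration only: `parse_pr_commands` is a hypothetical function written for this example, not the parser the CI actually uses, and the return shape is an assumption.

```python
# Default test set listed in the PR template above.
DEFAULT_TESTS = ["sanity", "security", "ec2", "ecs", "eks", "sagemaker", "sagemaker-local"]


def parse_pr_commands(description: str) -> dict:
    """Extract /buildspec and /tests commands from a PR description.

    Hypothetical parser for illustration. A line that still begins with "#"
    is treated as commented out and ignored.
    """
    # buildspec=None stands for "fall back to dlc_developer_config.toml".
    result = {"buildspec": None, "tests": list(DEFAULT_TESTS)}
    for raw_line in description.splitlines():
        parts = raw_line.split()
        if len(parts) < 2:
            continue  # blank line, or a command with no argument
        command, args = parts[0], parts[1:]
        if command == "/buildspec":
            result["buildspec"] = args[0]
        elif command == "/tests":
            result["tests"] = args
    return result


if __name__ == "__main__":
    body = "/buildspec pytorch/training/buildspec.yml\n# /tests <test_list>"
    # The buildspec is overridden; the tests stay at the default set.
    print(parse_pr_commands(body))
```

Returning `None` for the buildspec keeps the "use dlc_developer_config.toml" fallback explicit rather than guessing at a path.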

Formatting

PR Checklist

  • I've prepended the PR tag with the frameworks/job this applies to: [mxnet, tensorflow, pytorch] | [ei/neuron/graviton] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
  • If the PR changes affect SM tests, I've modified dlc_developer_config.toml in my PR branch by setting sagemaker_tests = true and efa_tests = true
  • If this PR changes existing code, the change is fully backward compatible with pre-existing code. (Non backward-compatible changes need special approval.)
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

  • (If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
  • (If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
  • (If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
  • (If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Signed-off-by: Jinyan Li <jinyali@amazon.com>
Signed-off-by: Jinyan Li <jinyali@amazon.com>
Signed-off-by: Jinyan Li <jinyali@amazon.com>
@aws-deep-learning-containers-ci aws-deep-learning-containers-ci bot added the authorized and Size:XL labels Nov 15, 2025
Signed-off-by: Jinyan Li <jinyali@amazon.com>
Signed-off-by: Jinyan Li <jinyali@amazon.com>
Signed-off-by: Jinyan Li <jinyali@amazon.com>
Signed-off-by: Jinyan Li <jinyali@amazon.com>
Signed-off-by: Jinyan Li <jinyali@amazon.com>
@jinyan-li1 jinyan-li1 marked this pull request as ready for review November 17, 2025 18:16
@junpuf junpuf left a comment

Try making a small dummy change to both Dockerfiles to check whether the CI workflows behave the way we expect. We can revert those changes afterwards.

Signed-off-by: Jinyan Li <jinyali@amazon.com>
@jinyan-li1 jinyan-li1 merged commit 3e82f3c into main Nov 17, 2025
3 checks passed
@jinyan-li1 jinyan-li1 deleted the improve-workflow branch November 17, 2025 20:44
sirutBuasai pushed a commit to sirutBuasai/deep-learning-containers that referenced this pull request Nov 18, 2025
* combinr workflow for vllm

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
sirutBuasai added a commit that referenced this pull request Nov 21, 2025
* Improve vLLM workflows (#5480)

* combinr workflow for vllm

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* formatting

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix region

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix artifacts path

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* isntall dependencies

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* reverse order

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add port

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* move scripts into their separate dir

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* change port

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* style: format pre-commit check

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* chore: run test choices

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove commitizen

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use main branch

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix dir

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* run benchmark sglang

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* ac run container id instead

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add port and host

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use composite action

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* input secrets

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use shell bash

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove interactive

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* logs tail 200

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add sleep

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add cleanup

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove -it

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix names

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use container cleanup action

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove unused step

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* try env.containerid

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use env

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add -it

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* full run

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use output

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use vars expose

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use input

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* comment artifacts name

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* set output

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* echo

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add outputs id

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* iamge uri output

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* test using hardcoded string

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* no echo

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* set my output

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use image uri

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use secret image uri file

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix inputs

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* correct docker pull

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use non screte var

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use steps image_uri

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove unused steps

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* change uri var name

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* change step name

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* run regression test

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* change container_pull to ecr_authenticate

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove }}

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use 12xlarge fleet

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use base sglang

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* revert dockerfile

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* run using g6e

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove unnecessary wait

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* rename tests

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use matrix srt test

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove srt

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add pytest requirements

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add add conftest

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* test_utils

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove dup isort

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* run test exampels

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix runner name

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* force rich terminal

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove console width

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* disable rich tracebacks

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add columns

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix intall

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove rich logger

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove rich logger

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add sm endpoint

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* source and install dependencies

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* reduce sleep time

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use serve cmd instead

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* activate venev

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add pytest cache

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* enable debug

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* input image uri

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use assert in test instead

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove f

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* show test output

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use ack test direction

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove predictor

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix self error

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix input

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove model_id yield

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix get endpoint status

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add predictory class

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* change predictor

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* format tests

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* change pytest cmd

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* move aws_session to global conftest

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* move aws_session to global conftest

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix comments

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* fix comments

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove force color from pytest workflow and move to global

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* rename workflow

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* test remove paths

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* add runner setup script

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* uv venv project name

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* use json serializers

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* sagemaker requirements

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* change conccurency name

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

* remove concurrency

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>

---------

Signed-off-by: sirutBuasai <sirutbuasai27@outlook.com>
Co-authored-by: Jinyan Li <97153458+jinyan-li1@users.noreply.github.com>
