Commit ca85575

Merge branch 'master' into vllm-arm64-release
2 parents b933d73 + 28231cf commit ca85575

36 files changed: +2003 -143 lines changed

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 21 additions & 45 deletions
@@ -7,13 +7,13 @@
 
 ### Description
 
-### Tests run
-
-**NOTE: By default, docker builds are disabled. In order to build your container, please update dlc_developer_config.toml and specify the framework to build in "build_frameworks"**
-- [ ] I have run builds/tests on commit <INSERT COMMIT ID> for my changes.
+### Tests Run
+By default, docker image builds and tests are disabled. Two ways to run builds and tests:
+1. Using dlc_developer_config.toml
+2. Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)
 
 <details>
-<summary>Confused on how to run tests? Try using the helper utility...</summary>
+<summary>How to use the helper utility for updating dlc_developer_config.toml</summary>
 
 Assuming your remote is called `origin` (you can find out more with `git remote -v`)...
 
@@ -28,50 +28,34 @@ Assuming your remote is called `origin` (you can find out more with `git remote
 - Restore TOML file when ready to merge
 
 `python src/prepare_dlc_dev_environment.py -rcp origin`
-</details>
-
-**NOTE: If you are creating a PR for a new framework version, please ensure success of the standard, rc, and efa sagemaker remote tests by updating the dlc_developer_config.toml file:**
-<details>
-<summary>Expand</summary>
 
+**NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:**
 - [ ] `sagemaker_remote_tests = true`
 - [ ] `sagemaker_efa_tests = true`
 - [ ] `sagemaker_rc_tests = true`
-
-**Additionally, please run the sagemaker local tests in at least one revision:**
 - [ ] `sagemaker_local_tests = true`
-
 </details>
 
-### Formatting
-- [ ] I have run `black -l 100` on my code (formatting tool: https://black.readthedocs.io/en/stable/getting_started.html)
-
-### DLC image/dockerfile
-
-#### Builds to Execute
 <details>
-<summary>Expand</summary>
-
-Fill out the template and click the checkbox of the builds you'd like to execute
-
-*Note: Replace with <X.Y> with the major.minor framework version (i.e. 2.2) you would like to start.*
-
-- [ ] build_pytorch_training_<X.Y>_sm
-- [ ] build_pytorch_training_<X.Y>_ec2
-
-- [ ] build_pytorch_inference_<X.Y>_sm
-- [ ] build_pytorch_inference_<X.Y>_ec2
-- [ ] build_pytorch_inference_<X.Y>_graviton
+<summary>How to use PR description</summary>
+Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
 
-- [ ] build_tensorflow_training_<X.Y>_sm
-- [ ] build_tensorflow_training_<X.Y>_ec2
+- `# /buildspec <buildspec_path>`
+  - e.g.: `# /buildspec pytorch/training/buildspec.yml`
+  - If this line is commented out, dlc_developer_config.toml will be used.
+- `# /tests <test_list>`
+  - e.g.: `# /tests sanity security ec2`
+  - If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): `sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local`.
 
-- [ ] build_tensorflow_inference_<X.Y>_sm
-- [ ] build_tensorflow_inference_<X.Y>_ec2
-- [ ] build_tensorflow_inference_<X.Y>_graviton
 </details>
 
-### Additional context
+```
+# /buildspec <buildspec_path>
+# /tests <test_list>
+```
+
+### Formatting
+- [ ] I have run `black -l 100` on my code (formatting tool: https://black.readthedocs.io/en/stable/getting_started.html)
 
 ### PR Checklist
 <details>
@@ -84,14 +68,6 @@ Fill out the template and click the checkbox of the builds you'd like to execute
 - [ ] (If applicable) I've documented below the tests I've run on the DLC image
 - [ ] (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See [https://www.apache.org/legal/resolved.html](https://www.apache.org/legal/resolved.html).
 - [ ] (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.
-
-#### NEURON/GRAVITON Testing Checklist
-* When creating a PR:
-- [ ] I've modified `dlc_developer_config.toml` in my PR branch by setting `neuron_mode = true` or `graviton_mode = true`
-
-#### Benchmark Testing Checklist
-* When creating a PR:
-- [ ] I've modified `dlc_developer_config.toml` in my PR branch by setting `ec2_benchmark_tests = true` or `sagemaker_benchmark_tests = true`
 </details>
 
 ### Pytest Marker Checklist
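
The new template text describes two PR-description commands, `/buildspec` and `/tests`, that take effect only when uncommented. The sketch below illustrates that rule in Python. It is not the repository's actual parser: `parse_pr_commands`, `PrCommands`, and `DEFAULT_TESTS` are hypothetical names, and the only facts taken from the diff are the two command names and the default test list.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Default test list quoted in the PR template text.
DEFAULT_TESTS = ["sanity", "security", "ec2", "ecs", "eks", "sagemaker", "sagemaker-local"]


@dataclass
class PrCommands:
    # None means: fall back to dlc_developer_config.toml (per the template text).
    buildspec: Optional[str] = None
    tests: List[str] = field(default_factory=lambda: list(DEFAULT_TESTS))


def parse_pr_commands(description: str) -> PrCommands:
    """Hypothetical parser: pick up uncommented /buildspec and /tests lines."""
    commands = PrCommands()
    for raw_line in description.splitlines():
        line = raw_line.strip()
        if line.startswith("#"):
            continue  # still commented out, so the default applies
        if line.startswith("/buildspec "):
            commands.buildspec = line.split(maxsplit=1)[1].strip()
        elif line.startswith("/tests "):
            commands.tests = line.split()[1:]
    return commands
```

With the template's code block left untouched, `parse_pr_commands` returns no buildspec and the default tests; removing the leading `# ` from a line activates that command.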

base/buildspec-cu128-ubuntu24.yml

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+prod_account_id: &PROD_ACCOUNT_ID 763104351884
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK base
+version: &VERSION 12.8.1
+short_version: &SHORT_VERSION "12.8"
+arch_type: &ARCH_TYPE x86_64
+autopatch_build: "False"
+
+repository_info:
+  base_repository: &BASE_REPOSITORY
+    image_type: &IMAGE_TYPE gpu
+    root: .
+    repository_name: &REPOSITORY_NAME !join [ pr, "-", *FRAMEWORK ]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK ]
+    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *RELEASE_REPOSITORY_NAME ]
+
+context:
+  base_context: &BASE_CONTEXT
+    deep_learning_container:
+      source: src/deep_learning_container.py
+      target: deep_learning_container.py
+    install_python:
+      source: scripts/install_python.sh
+      target: install_python.sh
+    install_cuda:
+      source: scripts/install_cuda.sh
+      target: install_cuda.sh
+    install_efa:
+      source: scripts/install_efa.sh
+      target: install_efa.sh
+
+images:
+  base_x86_64_gpu_cuda128:
+    <<: *BASE_REPOSITORY
+    context:
+      <<: *BASE_CONTEXT
+    image_size_baseline: 11000
+    device_type: &DEVICE_TYPE gpu
+    cuda_version: &CUDA_VERSION cu128
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    os_version: &OS_VERSION ubuntu24.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    docker_file: !join [ *FRAMEWORK, /, *ARCH_TYPE, /, *DEVICE_TYPE, /, *CUDA_VERSION, /, *OS_VERSION, /Dockerfile ]
+    target: final
+    build: true
+    enable_common_stage_build: false
+    test_configs:
+      test_platforms:
+        - sanity
+        - security
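
The `tag` and `docker_file` entries in this buildspec are assembled from anchors with a custom `!join` tag. The Python sketch below shows what those entries would resolve to, assuming `!join` simply concatenates its list items as strings; that assumption is inferred from how the tag is used here, not from the repository's loader. The `join` helper is hypothetical, and the constant values are copied from the buildspec.

```python
def join(parts):
    """Assumed semantics of the buildspec's !join tag: concatenate the items as strings."""
    return "".join(str(part) for part in parts)


# Anchor values copied from base/buildspec-cu128-ubuntu24.yml.
FRAMEWORK = "base"
VERSION = "12.8.1"
ARCH_TYPE = "x86_64"
DEVICE_TYPE = "gpu"
TAG_PYTHON_VERSION = "py312"
CUDA_VERSION = "cu128"
OS_VERSION = "ubuntu24.04"

tag = join([VERSION, "-", DEVICE_TYPE, "-", TAG_PYTHON_VERSION, "-",
            CUDA_VERSION, "-", OS_VERSION, "-ec2"])
docker_file = join([FRAMEWORK, "/", ARCH_TYPE, "/", DEVICE_TYPE, "/",
                    CUDA_VERSION, "/", OS_VERSION, "/Dockerfile"])
repository_name = join(["pr", "-", FRAMEWORK])

print(tag)              # 12.8.1-gpu-py312-cu128-ubuntu24.04-ec2
print(docker_file)      # base/x86_64/gpu/cu128/ubuntu24.04/Dockerfile
print(repository_name)  # pr-base
```

Under that assumption, the `docker_file` path gains an OS-version directory compared with the buildspec this commit deletes, whose `docker_file` entry omits `*OS_VERSION`.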

base/buildspec-cu129-ubuntu22.yml

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+prod_account_id: &PROD_ACCOUNT_ID 763104351884
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK base
+version: &VERSION 12.9.1
+short_version: &SHORT_VERSION "12.9"
+arch_type: &ARCH_TYPE x86_64
+autopatch_build: "False"
+
+repository_info:
+  base_repository: &BASE_REPOSITORY
+    image_type: &IMAGE_TYPE gpu
+    root: .
+    repository_name: &REPOSITORY_NAME !join [ pr, "-", *FRAMEWORK ]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK ]
+    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *RELEASE_REPOSITORY_NAME ]
+
+context:
+  base_context: &BASE_CONTEXT
+    deep_learning_container:
+      source: src/deep_learning_container.py
+      target: deep_learning_container.py
+    install_python:
+      source: scripts/install_python.sh
+      target: install_python.sh
+    install_cuda:
+      source: scripts/install_cuda.sh
+      target: install_cuda.sh
+    install_efa:
+      source: scripts/install_efa.sh
+      target: install_efa.sh
+
+images:
+  base_x86_64_gpu_cuda129_ubuntu22:
+    <<: *BASE_REPOSITORY
+    context:
+      <<: *BASE_CONTEXT
+    image_size_baseline: 11000
+    device_type: &DEVICE_TYPE gpu
+    cuda_version: &CUDA_VERSION cu129
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    docker_file: !join [ *FRAMEWORK, /, *ARCH_TYPE, /, *DEVICE_TYPE, /, *CUDA_VERSION, /, *OS_VERSION, /Dockerfile ]
+    target: final
+    build: true
+    enable_common_stage_build: false
+    test_configs:
+      test_platforms:
+        - sanity
+        - security

base/buildspec.yml

Lines changed: 1 addition & 54 deletions
@@ -1,54 +1 @@
-account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
-prod_account_id: &PROD_ACCOUNT_ID 763104351884
-region: &REGION <set-$REGION-in-environment>
-framework: &FRAMEWORK base
-version: &VERSION 12.8.1
-short_version: &SHORT_VERSION "12.8"
-arch_type: &ARCH_TYPE x86_64
-autopatch_build: "False"
-
-repository_info:
-  base_repository: &BASE_REPOSITORY
-    image_type: &IMAGE_TYPE gpu
-    root: .
-    repository_name: &REPOSITORY_NAME !join [ pr, "-", *FRAMEWORK ]
-    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
-    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK ]
-    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *RELEASE_REPOSITORY_NAME ]
-
-context:
-  base_context: &BASE_CONTEXT
-    deep_learning_container:
-      source: src/deep_learning_container.py
-      target: deep_learning_container.py
-    install_python:
-      source: scripts/install_python.sh
-      target: install_python.sh
-    install_cuda:
-      source: scripts/install_cuda.sh
-      target: install_cuda.sh
-    install_efa:
-      source: scripts/install_efa.sh
-      target: install_efa.sh
-
-images:
-  base_x86_64_gpu_cuda128:
-    <<: *BASE_REPOSITORY
-    context:
-      <<: *BASE_CONTEXT
-    image_size_baseline: 11000
-    device_type: &DEVICE_TYPE gpu
-    cuda_version: &CUDA_VERSION cu128
-    python_version: &DOCKER_PYTHON_VERSION py3
-    tag_python_version: &TAG_PYTHON_VERSION py312
-    os_version: &OS_VERSION ubuntu24.04
-    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
-    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
-    docker_file: !join [ *FRAMEWORK, /, *ARCH_TYPE, /, *DEVICE_TYPE, /, *CUDA_VERSION, /Dockerfile ]
-    target: final
-    build: true
-    enable_common_stage_build: false
-    test_configs:
-      test_platforms:
-        - sanity
-        - security
+buildspec_pointer: buildspec-cu129-ubuntu22.yml
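
After this change `base/buildspec.yml` holds only a `buildspec_pointer` entry naming another buildspec. The diff does not show how the build tooling consumes that key, so the Python sketch below is only one plausible reading: the pointer is treated as a file name relative to the pointing file's directory and followed until a file without a pointer is reached. `resolve_buildspec` and its `max_hops` guard are hypothetical.

```python
from pathlib import Path


def resolve_buildspec(path: Path, max_hops: int = 5) -> Path:
    """Follow buildspec_pointer entries until a real buildspec is reached (hypothetical helper)."""
    for _ in range(max_hops):
        pointer = None
        for line in path.read_text().splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() == "buildspec_pointer":
                pointer = value.strip()
                break
        if pointer is None:
            return path  # no pointer key: this file is the buildspec itself
        # Assumption: the pointer is relative to the directory of the pointing file.
        path = path.parent / pointer
    raise RuntimeError("too many buildspec_pointer hops")
```

Under these assumptions, resolving `base/buildspec.yml` would yield `base/buildspec-cu129-ubuntu22.yml`. The hop limit is there only to stop a pointer cycle from looping forever.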
File renamed without changes.
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
+ARG PYTHON="python3"
+ARG PYTHON_VERSION="3.12.10"
+ARG PYTHON_SHORT_VERSION="3.12"
+ARG CUDA_MAJOR="12"
+ARG CUDA_MINOR="9"
+ARG EFA_VERSION="1.43.1"
+FROM nvidia/cuda:12.9.1-base-ubuntu22.04 AS base-builder
+
+
+RUN mv /usr/local/cuda/compat /usr/local \
+    && apt-get update \
+    && apt-get -y upgrade --only-upgrade systemd \
+    && apt-get install -y --allow-change-held-packages --no-install-recommends \
+    automake \
+    build-essential \
+    ca-certificates \
+    cmake \
+    curl \
+    emacs \
+    git \
+    jq \
+    libcurl4-openssl-dev \
+    libglib2.0-0 \
+    libegl1 \
+    libgl1 \
+    libsm6 \
+    libssl-dev \
+    libxext6 \
+    libxrender-dev \
+    zlib1g-dev \
+    unzip \
+    vim \
+    wget \
+    libhwloc-dev \
+    libgomp1 \
+    libibverbs-dev \
+    libnuma1 \
+    libnuma-dev \
+    libtool \
+    openssl \
+    python3-dev \
+    autoconf \
+    pkg-config \
+    check \
+    libsubunit0 \
+    libsubunit-dev \
+    libffi-dev \
+    libbz2-dev \
+    liblzma-dev \
+    && apt-get autoremove -y \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+
+##############################################################################
+FROM base-builder AS python-builder
+ARG PYTHON_VERSION
+COPY install_python.sh install_python.sh
+RUN bash install_python.sh ${PYTHON_VERSION} && rm install_python.sh
+
+##############################################################################
+FROM base-builder AS cuda-builder
+ARG CUDA_MAJOR
+ARG CUDA_MINOR
+COPY install_cuda.sh install_cuda.sh
+RUN bash install_cuda.sh "${CUDA_MAJOR}.${CUDA_MINOR}" && rm install_cuda.sh
+
+##############################################################################
+FROM nvidia/cuda:12.9.1-base-ubuntu22.04 AS final
+ARG PYTHON
+ARG PYTHON_SHORT_VERSION
+ARG CUDA_MAJOR
+ARG CUDA_MINOR
+ARG EFA_VERSION
+LABEL maintainer="Amazon AI"
+LABEL dlc_major_version="1"
+ENV DEBIAN_FRONTEND=noninteractive \
+    LANG=C.UTF-8 \
+    LC_ALL=C.UTF-8 \
+    DLC_CONTAINER_TYPE=base \
+    # Python won’t try to write .pyc or .pyo files on the import of source modules
+    # Force stdin, stdout and stderr to be totally unbuffered. Good for logging
+    PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PYTHONIOENCODING=UTF-8 \
+    CUDA_HOME="/usr/local/cuda" \
+    PATH="/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/usr/local/cuda/bin:${PATH}" \
+    LD_LIBRARY_PATH="/usr/local/lib:/usr/local/cuda/lib64:/opt/amazon/ofi-nccl/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:${LD_LIBRARY_PATH}"
+
+WORKDIR /
+
+# + python and pip packages (awscli, boto3, requests)
+COPY --from=python-builder /usr/local/lib/python${PYTHON_SHORT_VERSION} /usr/local/lib/python${PYTHON_SHORT_VERSION}
+COPY --from=python-builder /usr/local/include/python${PYTHON_SHORT_VERSION} /usr/local/include/python${PYTHON_SHORT_VERSION}
+COPY --from=python-builder /usr/local/bin /usr/local/bin
+# + cuda-toolkit, cudnn, nccl
+COPY --from=cuda-builder /usr/local/cuda-${CUDA_MAJOR}.${CUDA_MINOR} /usr/local/cuda-${CUDA_MAJOR}.${CUDA_MINOR}
+COPY install_efa.sh install_efa.sh
+COPY deep_learning_container.py /usr/local/bin/deep_learning_container.py
+COPY bash_telemetry.sh /usr/local/bin/bash_telemetry.sh
+RUN chmod +x /usr/local/bin/deep_learning_container.py && \
+    chmod +x /usr/local/bin/bash_telemetry.sh && \
+    echo 'source /usr/local/bin/bash_telemetry.sh' >> /etc/bash.bashrc && \
+    # Install EFA
+    bash install_efa.sh ${EFA_VERSION} && \
+    rm install_efa.sh && \
+    # OSS compliance
+    apt-get update && \
+    apt-get upgrade -y && \
+    apt-get install -y --allow-change-held-packages --no-install-recommends \
+    unzip \
+    wget && \
+    apt-get clean && \
+    HOME_DIR=/root && \
+    curl -o ${HOME_DIR}/oss_compliance.zip https://aws-dlinfra-utilities.s3.amazonaws.com/oss_compliance.zip && \
+    unzip ${HOME_DIR}/oss_compliance.zip -d ${HOME_DIR}/ && \
+    cp ${HOME_DIR}/oss_compliance/test/testOSSCompliance /usr/local/bin/testOSSCompliance && \
+    chmod +x /usr/local/bin/testOSSCompliance && \
+    chmod +x ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh && \
+    ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh ${HOME_DIR} ${PYTHON} && \
+    rm -rf ${HOME_DIR}/oss_compliance* && \
+    rm -rf /tmp/tmp* && \
+    rm -rf /var/lib/apt/lists/* && \
+    rm -rf /root/.cache | true
+
+CMD ["/bin/bash"]
