Commit 968b967

Update baseline image size for PyTorch 2.10

Author: Bhanu Teja Goshikonda
Parent: fc868e1

4 files changed, +188 -2 lines changed

docs/BASE_IMAGE_BUILD_GUIDE.md

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
# DLC Base Image Build Guide

This guide documents the process for building AWS Deep Learning Container (DLC) base images, using base image 13.0.2 as a reference.

## Overview

Base images provide the foundational layer for DLC framework images (PyTorch, TensorFlow, etc.). They include:
- CUDA toolkit, cuDNN, and NCCL
- Python with essential packages
- EFA (Elastic Fabric Adapter) for distributed training
- OSS compliance tooling

## Directory Structure

```
deep-learning-containers/
├── base/
│   ├── buildspec-cu1302-ubuntu22.yml    # Build configuration
│   └── x86_64/gpu/cu130/ubuntu22.04/
│       ├── py312/Dockerfile             # Python 3.12 variant
│       └── py313/Dockerfile             # Python 3.13 variant
├── scripts/
│   ├── install_python.sh                # Python installation
│   ├── install_cuda.sh                  # CUDA/cuDNN/NCCL installation
│   └── install_efa.sh                   # EFA installation
└── src/
    └── deep_learning_container.py       # DLC metadata script
```

## Version Components

For base image 13.0.2 with Python 3.13:

| Component | Version | Notes |
|-----------|---------|-------|
| CUDA | 13.0.2 | CUDA toolkit |
| cuDNN | 9.15.1.9 | Deep neural network library |
| NCCL | v2.28.9-1 | Multi-GPU communication |
| Python | 3.13.11 | Latest Python 3.13 |
| EFA | 1.47.0 | Elastic Fabric Adapter |
| Ubuntu | 22.04 | Base OS |

## Step 1: Create/Update Dockerfile

Location: `base/x86_64/gpu/cu130/ubuntu22.04/py313/Dockerfile`

### Key ARG Variables

```dockerfile
ARG PYTHON="python3"
ARG PYTHON_VERSION="3.13.11"
ARG PYTHON_SHORT_VERSION="3.13"
ARG CUDA_MAJOR="13"
ARG CUDA_MINOR="0"
ARG CUDA_PATCH="2"
ARG EFA_VERSION="1.47.0"
ARG OS_VERSION="ubuntu22.04"
```
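
The three CUDA ARGs combine into the identifiers used elsewhere in this guide. The sketch below assumes the short label is `cu` followed by major and minor (as in `cu130`) and the full version is the three parts joined with dots. The helper names are hypothetical and not part of the repository.

```bash
#!/bin/sh
# Hypothetical helpers: compose CUDA identifiers from the Dockerfile ARG values.
CUDA_MAJOR="13"
CUDA_MINOR="0"
CUDA_PATCH="2"

# Full toolkit version, e.g. 13.0.2
cuda_full_version() {
    printf '%s.%s.%s\n' "$1" "$2" "$3"
}

# Short label used in directory names and image tags, e.g. cu130
cuda_tag_label() {
    printf 'cu%s%s\n' "$1" "$2"
}

cuda_full_version "$CUDA_MAJOR" "$CUDA_MINOR" "$CUDA_PATCH"
cuda_tag_label "$CUDA_MAJOR" "$CUDA_MINOR"
```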

### Multi-stage Build Structure

1. **base-builder**: Installs system dependencies
2. **python-builder**: Compiles Python from source
3. **cuda-builder**: Installs CUDA stack
4. **final**: Combines all components
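
Because each stage is named, a single stage can be built in isolation when debugging by using Docker's `--target` flag. This is a minimal sketch that assumes the build context is the repository root; the `build_stage` helper and the `dlc-base` image name are hypothetical.

```bash
#!/bin/sh
# Hypothetical helper: build the Dockerfile up to one named stage.
# Assumes it is run from the repository root; "dlc-base" is a placeholder image name.
DOCKERFILE="base/x86_64/gpu/cu130/ubuntu22.04/py313/Dockerfile"

build_stage() {
    docker build --target "$1" -f "$DOCKERFILE" -t "dlc-base:$1" .
}
```

For example, `build_stage python-builder` would stop once Python is compiled, while `build_stage final` would produce the complete image.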

## Step 2: Update install_cuda.sh

Add a new function for the CUDA version in `scripts/install_cuda.sh`:

```bash
function install_cuda1302_stack_ul22 {
    CUDNN_VERSION="9.15.1.9"
    NCCL_VERSION="v2.28.9-1"
    CUDA_HOME="/usr/local/cuda"

    # Remove existing CUDA
    rm -rf /usr/local/cuda-*
    rm -rf /usr/local/cuda

    # Install CUDA (toolkit only, no driver)
    wget -q https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda_13.0.2_580.95.05_linux.run
    chmod +x cuda_13.0.2_580.95.05_linux.run
    ./cuda_13.0.2_580.95.05_linux.run --toolkit --silent
    rm -f cuda_13.0.2_580.95.05_linux.run
    ln -s /usr/local/cuda-13.0 /usr/local/cuda
    # Requires /usr/local/compat to exist already; this snippet does not create it
    mv /usr/local/compat /usr/local/cuda/compat

    # Install cuDNN
    mkdir -p /tmp/cudnn && cd /tmp/cudnn
    wget -q "https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda13-archive.tar.xz"
    tar xf "cudnn-linux-x86_64-${CUDNN_VERSION}_cuda13-archive.tar.xz"
    cp -a "cudnn-linux-x86_64-${CUDNN_VERSION}_cuda13-archive/include/"* /usr/local/cuda/include/
    cp -a "cudnn-linux-x86_64-${CUDNN_VERSION}_cuda13-archive/lib/"* /usr/local/cuda/lib64/

    # Install NCCL (built from source at the pinned tag)
    mkdir -p /tmp/nccl && cd /tmp/nccl
    git clone -b "$NCCL_VERSION" --depth 1 https://github.com/NVIDIA/nccl.git
    cd nccl && make -j src.build
    cp -a build/include/* /usr/local/cuda/include/
    cp -a build/lib/* /usr/local/cuda/lib64/

    # prune_cuda is a helper that is not shown in this snippet
    prune_cuda
    ldconfig
}
```

Add case handling:

```bash
13.0.2)
    case "$2" in
        "ubuntu22.04") install_cuda1302_stack_ul22 ;;
        *) echo "bad OS version $2"; exit 1 ;;
    esac
    ;;
```
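
To see how this branch behaves, the dispatch can be exercised with a stub in place of the real installer. This is a sketch under assumptions: the outer `case` is assumed to switch on `$1` (the CUDA version), the `dispatch_cuda_install` wrapper is hypothetical, and it uses `return 1` instead of `exit 1` so it can be called without ending the shell.

```bash
#!/bin/sh
# Stub standing in for the real installer so the dispatch logic can run anywhere.
install_cuda1302_stack_ul22() {
    echo "installing CUDA 13.0.2 stack for ubuntu22.04"
}

# Hypothetical wrapper: $1 = CUDA version (assumed), $2 = OS version.
dispatch_cuda_install() {
    case "$1" in
        13.0.2)
            case "$2" in
                "ubuntu22.04") install_cuda1302_stack_ul22 ;;
                *) echo "bad OS version $2"; return 1 ;;
            esac
            ;;
        *) echo "bad CUDA version $1"; return 1 ;;
    esac
}

dispatch_cuda_install "13.0.2" "ubuntu22.04"
```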

## Step 3: Create Buildspec

Location: `base/buildspec-cu1302-ubuntu22.yml`

```yaml
account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
prod_account_id: &PROD_ACCOUNT_ID 763104351884
region: &REGION <set-$REGION-in-environment>
framework: &FRAMEWORK base
version: &VERSION 13.0.2
short_version: &SHORT_VERSION "13.0"
arch_type: &ARCH_TYPE x86_64
autopatch_build: "False"

images:
  base_x86_64_gpu_cuda1302_ubuntu22:
    # ... repository config ...
    device_type: &DEVICE_TYPE gpu
    cuda_version: &CUDA_VERSION cu130
    tag_python_version: &TAG_PYTHON_VERSION py313
    os_version: &OS_VERSION ubuntu22.04
    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
    docker_file: !join [ *FRAMEWORK, /, *ARCH_TYPE, /, *DEVICE_TYPE, /, *CUDA_VERSION, /, *OS_VERSION, /, *TAG_PYTHON_VERSION, /Dockerfile ]
    target: final
    build: true
```
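
The `tag` and `docker_file` entries are assembled from the anchors defined earlier in the file. The sketch below reproduces that assembly in shell, assuming `!join` concatenates its list items in order with no separator; the variable names simply mirror the YAML anchors.

```bash
#!/bin/sh
# Values copied from the buildspec anchors.
FRAMEWORK="base"
VERSION="13.0.2"
ARCH_TYPE="x86_64"
DEVICE_TYPE="gpu"
CUDA_VERSION="cu130"
TAG_PYTHON_VERSION="py313"
OS_VERSION="ubuntu22.04"

# Assumed !join semantics: concatenate the list items in order, no separator.
TAG="${VERSION}-${DEVICE_TYPE}-${TAG_PYTHON_VERSION}-${CUDA_VERSION}-${OS_VERSION}-ec2"
DOCKER_FILE="${FRAMEWORK}/${ARCH_TYPE}/${DEVICE_TYPE}/${CUDA_VERSION}/${OS_VERSION}/${TAG_PYTHON_VERSION}/Dockerfile"

echo "tag:         $TAG"
echo "docker_file: $DOCKER_FILE"
```

Under that assumption the results match the tag example and the Dockerfile location given elsewhere in this guide.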

## Step 4: Build and Test

Push changes to trigger the build pipeline. The build will:
1. Build the Docker image
2. Run sanity tests
3. Run security scans
4. Push to ECR if all tests pass

## Step 5: Documentation Update

Update `available_images.md` with the new base image information.

## Finding Latest Versions

- **CUDA**: https://developer.nvidia.com/cuda-downloads
- **cuDNN**: https://developer.nvidia.com/cudnn-downloads
- **NCCL**: https://github.com/NVIDIA/nccl/releases
- **EFA**: https://github.com/aws/aws-efa-installer/releases
- **Python**: https://www.python.org/downloads/

## Image Tag Naming Convention

```
{version}-{device_type}-{python_version}-{cuda_version}-{os_version}-{platform}
```

Example: `13.0.2-gpu-py313-cu130-ubuntu22.04-ec2`

- **version**: Base image version (e.g., 13.0.2)
- **device_type**: gpu or cpu
- **python_version**: py312, py313, etc.
- **cuda_version**: cu128, cu129, cu130, etc.
- **os_version**: ubuntu22.04, ubuntu24.04
- **platform**: ec2 or sagemaker
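
A tag in this format can be split back into its fields by breaking on hyphens. The `parse_tag` helper below is hypothetical and assumes no individual field contains a hyphen, which holds for the example values listed above.

```bash
#!/bin/sh
# Hypothetical helper: split an image tag into its six named fields.
# The body runs in a subshell so the IFS and globbing changes do not leak to the caller.
parse_tag() (
    IFS=-
    set -f            # disable globbing while the tag is word-split
    set -- $1         # split on '-' (assumes no field contains a hyphen)
    if [ "$#" -ne 6 ]; then
        echo "expected 6 fields, got $#" >&2
        exit 1
    fi
    printf 'version=%s\ndevice_type=%s\npython_version=%s\ncuda_version=%s\nos_version=%s\nplatform=%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
)

parse_tag "13.0.2-gpu-py313-cu130-ubuntu22.04-ec2"
```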

## Checklist for New Base Image

- [ ] Create/update Dockerfile in `base/x86_64/gpu/{cuda_version}/{os_version}/{python_version}/`
- [ ] Add CUDA install function in `scripts/install_cuda.sh`
- [ ] Create buildspec YAML in `base/`
- [ ] Test build via CI
- [ ] Verify security scans pass
- [ ] Update documentation (`available_images.md`)
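
The file-related items in this checklist can be verified locally before pushing. The helper below is a hypothetical sketch, not part of the repository: it only checks that the expected files exist, and does not confirm that the install function or the buildspec contents are correct.

```bash
#!/bin/sh
# Hypothetical helper: report whether the files named in the checklist exist.
# Arguments: repo root, CUDA label (e.g. cu130), OS version, Python tag, buildspec file name.
check_base_image_files() (
    repo_root=$1
    status=0
    for path in \
        "base/x86_64/gpu/$2/$3/$4/Dockerfile" \
        "scripts/install_cuda.sh" \
        "base/$5"
    do
        if [ -f "$repo_root/$path" ]; then
            echo "ok      $path"
        else
            echo "missing $path"
            status=1
        fi
    done
    exit "$status"
)
```

From the repository root, a call such as `check_base_image_files . cu130 ubuntu22.04 py313 buildspec-cu1302-ubuntu22.yml` would list each path and return a non-zero status if any is missing.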

docs/tutorials

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Subproject commit 09690aab2e9786fb3adc3de7bc5553820fa76b86

pytorch/training/buildspec-2-10-ec2.yml

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ images:
   BuildEC2CPUPTTrainPy3DockerImage:
     <<: *TRAINING_REPOSITORY
     build: &PYTORCH_CPU_TRAINING_PY3 false
-    image_size_baseline: 7200
+    image_size_baseline: 12000
     device_type: &DEVICE_TYPE cpu
     python_version: &DOCKER_PYTHON_VERSION py3
     tag_python_version: &TAG_PYTHON_VERSION py313

pytorch/training/buildspec-2-10-sm.yml

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ images:
   BuildSageMakerCPUPTTrainPy3DockerImage:
     <<: *TRAINING_REPOSITORY
     build: &PYTORCH_CPU_TRAINING_PY3 false
-    image_size_baseline: 7200
+    image_size_baseline: 12000
     device_type: &DEVICE_TYPE cpu
     python_version: &DOCKER_PYTHON_VERSION py3
     tag_python_version: &TAG_PYTHON_VERSION py313
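
Both buildspec hunks raise `image_size_baseline` for the CPU training image from 7200 to 12000. The diff does not show how the pipeline consumes this value. The sketch below assumes it is a size in MB that a sanity check compares against the built image, with a percentage tolerance; the function name, the unit, and the 20% tolerance are all assumptions for illustration.

```bash
#!/bin/sh
# Hypothetical check: succeed when actual_mb <= baseline_mb * (100 + tolerance_pct) / 100.
# The unit (MB) and the tolerance are assumptions, not taken from the repository.
size_within_baseline() {
    actual_mb=$1
    baseline_mb=$2
    tolerance_pct=$3
    limit_mb=$(( baseline_mb * (100 + tolerance_pct) / 100 ))
    [ "$actual_mb" -le "$limit_mb" ]
}

# A 9000 MB image fails against the old baseline and passes against the new one.
size_within_baseline 9000 7200 20 || echo "9000 exceeds baseline 7200 (+20%)"
size_within_baseline 9000 12000 20 && echo "9000 is within baseline 12000 (+20%)"
```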
