|
| 1 | +# DLC Base Image Build Guide |
| 2 | + |
| 3 | +This guide documents the process for building AWS Deep Learning Container (DLC) base images, using base image 13.0.2 as a reference. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +Base images provide the foundational layer for DLC framework images (PyTorch, TensorFlow, etc.). They include: |
| 8 | +- CUDA toolkit, cuDNN, and NCCL |
| 9 | +- Python with essential packages |
| 10 | +- EFA (Elastic Fabric Adapter) for distributed training |
| 11 | +- OSS compliance tooling |
| 12 | + |
| 13 | +## Directory Structure |
| 14 | + |
| 15 | +``` |
| 16 | +deep-learning-containers/ |
| 17 | +├── base/ |
| 18 | +│ ├── buildspec-cu1302-ubuntu22.yml # Build configuration |
| 19 | +│ └── x86_64/gpu/cu130/ubuntu22.04/ |
| 20 | +│ ├── py312/Dockerfile # Python 3.12 variant |
| 21 | +│ └── py313/Dockerfile # Python 3.13 variant |
| 22 | +├── scripts/ |
| 23 | +│ ├── install_python.sh # Python installation |
| 24 | +│ ├── install_cuda.sh # CUDA/cuDNN/NCCL installation |
| 25 | +│ └── install_efa.sh # EFA installation |
| 26 | +└── src/ |
| 27 | + └── deep_learning_container.py # DLC metadata script |
| 28 | +``` |
| 29 | + |
| 30 | +## Version Components |
| 31 | + |
| 32 | +For base image 13.0.2 with Python 3.13: |
| 33 | + |
| 34 | +| Component | Version | Notes | |
| 35 | +|-----------|---------|-------| |
| 36 | +| CUDA | 13.0.2 | CUDA toolkit | |
| 37 | +| cuDNN | 9.15.1.9 | Deep neural network library | |
| 38 | +| NCCL | v2.28.9-1 | Multi-GPU communication | |
| 39 | +| Python | 3.13.11 | Python 3.13 patch release used by this image | |
| 40 | +| EFA | 1.47.0 | Elastic Fabric Adapter | |
| 41 | +| Ubuntu | 22.04 | Base OS | |
| 42 | + |
| 43 | +## Step 1: Create/Update Dockerfile |
| 44 | + |
| 45 | +Location: `base/x86_64/gpu/cu130/ubuntu22.04/py313/Dockerfile` |
| 46 | + |
| 47 | +### Key ARG Variables |
| 48 | +```dockerfile |
| 49 | +ARG PYTHON="python3" |
| 50 | +ARG PYTHON_VERSION="3.13.11" |
| 51 | +ARG PYTHON_SHORT_VERSION="3.13" |
| 52 | +ARG CUDA_MAJOR="13" |
| 53 | +ARG CUDA_MINOR="0" |
| 54 | +ARG CUDA_PATCH="2" |
| 55 | +ARG EFA_VERSION="1.47.0" |
| 56 | +ARG OS_VERSION="ubuntu22.04" |
| 57 | +``` |
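The three CUDA ARG values also determine several names used elsewhere in this guide. The sketch below only shows how those strings relate to each other. The shell variable names are hypothetical, and the actual Dockerfile may not compute them this way.

```bash
#!/usr/bin/env bash
# Values copied from the ARG block above.
CUDA_MAJOR="13"
CUDA_MINOR="0"
CUDA_PATCH="2"

# Hypothetical variable names; the derived strings match those used in this guide.
cuda_full_version="${CUDA_MAJOR}.${CUDA_MINOR}.${CUDA_PATCH}"   # buildspec "version"
cuda_short_version="${CUDA_MAJOR}.${CUDA_MINOR}"                # buildspec "short_version"
cuda_dir_tag="cu${CUDA_MAJOR}${CUDA_MINOR}"                     # Dockerfile directory and image tag
cuda_buildspec_tag="cu${CUDA_MAJOR}${CUDA_MINOR}${CUDA_PATCH}"  # buildspec file name

echo "$cuda_full_version $cuda_short_version $cuda_dir_tag $cuda_buildspec_tag"
# prints: 13.0.2 13.0 cu130 cu1302
```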
| 58 | + |
| 59 | +### Multi-stage Build Structure |
| 60 | +1. **base-builder**: Installs system dependencies |
| 61 | +2. **python-builder**: Compiles Python from source |
| 62 | +3. **cuda-builder**: Installs CUDA stack |
| 63 | +4. **final**: Combines all components |
| 64 | + |
| 65 | +## Step 2: Update install_cuda.sh |
| 66 | + |
| 67 | +Add a new function for the CUDA version in `scripts/install_cuda.sh`: |
| 68 | + |
| 69 | +```bash |
| 70 | +function install_cuda1302_stack_ul22 { |
| 71 | + CUDNN_VERSION="9.15.1.9" |
| 72 | + NCCL_VERSION="v2.28.9-1" |
| 73 | + CUDA_HOME="/usr/local/cuda" |
| 74 | + |
| 75 | + # Remove existing CUDA |
| 76 | + rm -rf /usr/local/cuda-* |
| 77 | + rm -rf /usr/local/cuda |
| 78 | + |
| 79 | + # Install CUDA |
| 80 | + wget -q https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda_13.0.2_580.95.05_linux.run |
| 81 | + chmod +x cuda_13.0.2_580.95.05_linux.run |
| 82 | + ./cuda_13.0.2_580.95.05_linux.run --toolkit --silent |
| 83 | + rm -f cuda_13.0.2_580.95.05_linux.run |
| 84 | + ln -s /usr/local/cuda-13.0 /usr/local/cuda |
| 85 | + mv /usr/local/compat /usr/local/cuda/compat |
| 86 | + |
| 87 | + # Install cuDNN |
| 88 | + mkdir -p /tmp/cudnn && cd /tmp/cudnn |
| 89 | + wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda13-archive.tar.xz |
| 90 | + tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda13-archive.tar.xz |
| 91 | + cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda13-archive/include/* /usr/local/cuda/include/ |
| 92 | + cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda13-archive/lib/* /usr/local/cuda/lib64/ |
| 93 | + |
| 94 | + # Install NCCL |
| 95 | + mkdir -p /tmp/nccl && cd /tmp/nccl |
| 96 | + git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git |
| 97 | + cd nccl && make -j src.build |
| 98 | + cp -a build/include/* /usr/local/cuda/include/ |
| 99 | + cp -a build/lib/* /usr/local/cuda/lib64/ |
| 100 | + |
| 101 | + prune_cuda |
| 102 | + ldconfig |
| 103 | +} |
| 104 | +``` |
| 105 | + |
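| 106 | +Then add a matching branch to the script's version `case` statement; the inner `case` selects the install function by OS version (`$2`): |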
| 107 | +```bash |
| 108 | +13.0.2) |
| 109 | + case "$2" in |
| 110 | + "ubuntu22.04") install_cuda1302_stack_ul22 ;; |
| 111 | + *) echo "bad OS version $2"; exit 1 ;; |
| 112 | + esac |
| 113 | + ;; |
| 114 | +``` |
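To try the dispatch logic without installing anything, the branch can be wrapped in a small function with the install step stubbed out. This is a sketch under assumptions: `dispatch_cuda_install` is a hypothetical name, the outer `case "$1"` is inferred from the fragment above, and `return` replaces `exit` so the function can be called from an interactive shell.

```bash
#!/usr/bin/env bash
# Stub standing in for the real install function so the dispatch can be run safely.
install_cuda1302_stack_ul22() {
    echo "would install the CUDA 13.0.2 stack for ubuntu22.04"
}

# Hypothetical wrapper; the real script may structure its top level differently.
dispatch_cuda_install() {
    case "$1" in
        13.0.2)
            case "$2" in
                "ubuntu22.04") install_cuda1302_stack_ul22 ;;
                *) echo "bad OS version $2"; return 1 ;;
            esac
            ;;
        *) echo "bad CUDA version $1"; return 1 ;;
    esac
}

dispatch_cuda_install 13.0.2 ubuntu22.04
# prints: would install the CUDA 13.0.2 stack for ubuntu22.04
```

An unsupported OS string such as `ubuntu20.04` falls through to the inner default branch and returns a non-zero status.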
| 115 | + |
| 116 | +## Step 3: Create Buildspec |
| 117 | + |
| 118 | +Location: `base/buildspec-cu1302-ubuntu22.yml` |
| 119 | + |
| 120 | +```yaml |
| 121 | +account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment> |
| 122 | +prod_account_id: &PROD_ACCOUNT_ID 763104351884 |
| 123 | +region: &REGION <set-$REGION-in-environment> |
| 124 | +framework: &FRAMEWORK base |
| 125 | +version: &VERSION 13.0.2 |
| 126 | +short_version: &SHORT_VERSION "13.0" |
| 127 | +arch_type: &ARCH_TYPE x86_64 |
| 128 | +autopatch_build: "False" |
| 129 | + |
| 130 | +images: |
| 131 | + base_x86_64_gpu_cuda1302_ubuntu22: |
| 132 | + # ... repository config ... |
| 133 | + device_type: &DEVICE_TYPE gpu |
| 134 | + cuda_version: &CUDA_VERSION cu130 |
| 135 | + tag_python_version: &TAG_PYTHON_VERSION py313 |
| 136 | + os_version: &OS_VERSION ubuntu22.04 |
| 137 | + tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ] |
| 138 | + docker_file: !join [ *FRAMEWORK, /, *ARCH_TYPE, /, *DEVICE_TYPE, /, *CUDA_VERSION, /, *OS_VERSION, /, *TAG_PYTHON_VERSION, /Dockerfile ] |
| 139 | + target: final |
| 140 | + build: true |
| 141 | +``` |
| 142 | + |
| 143 | +## Step 4: Build and Test |
| 144 | + |
| 145 | +Push changes to trigger the build pipeline. The build will: |
| 146 | +1. Build the Docker image |
| 147 | +2. Run sanity tests |
| 148 | +3. Run security scans |
| 149 | +4. Push to ECR if all tests pass |
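Before pushing, it may help to check that the Dockerfile builds locally. The sketch below only assembles and prints a `docker build` command. It assumes the repository root is the build context, `dlc-base` is a made-up local image name, and the CI pipeline may pass additional arguments that are not shown in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical local pre-check; the CI pipeline remains the source of truth.
DOCKERFILE="base/x86_64/gpu/cu130/ubuntu22.04/py313/Dockerfile"
LOCAL_TAG="dlc-base:13.0.2-gpu-py313-cu130-ubuntu22.04-ec2"   # assumed local image name

# Assemble the command as an array so each argument stays a single word.
# "--target final" matches the "target: final" entry in the buildspec.
build_cmd=(docker build --target final -f "$DOCKERFILE" -t "$LOCAL_TAG" .)

# Print the command instead of running it; run "${build_cmd[@]}" to build for real.
echo "${build_cmd[*]}"
```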
| 150 | + |
| 151 | +## Step 5: Documentation Update |
| 152 | + |
| 153 | +Update `available_images.md` with the new base image information. |
| 154 | + |
| 155 | +## Finding Latest Versions |
| 156 | + |
| 157 | +- **CUDA**: https://developer.nvidia.com/cuda-downloads |
| 158 | +- **cuDNN**: https://developer.nvidia.com/cudnn-downloads |
| 159 | +- **NCCL**: https://github.com/NVIDIA/nccl/releases |
| 160 | +- **EFA**: https://github.com/aws/aws-efa-installer/releases |
| 161 | +- **Python**: https://www.python.org/downloads/ |
| 162 | + |
| 163 | +## Image Tag Naming Convention |
| 164 | + |
| 165 | +``` |
| 166 | +{version}-{device_type}-{python_version}-{cuda_version}-{os_version}-{platform} |
| 167 | +``` |
| 168 | +
|
| 169 | +Example: `13.0.2-gpu-py313-cu130-ubuntu22.04-ec2` |
| 170 | +
|
| 171 | +- **version**: Base image version (e.g., 13.0.2) |
| 172 | +- **device_type**: gpu or cpu |
| 173 | +- **python_version**: py312, py313, etc. |
| 174 | +- **cuda_version**: cu128, cu129, cu130, etc. |
| 175 | +- **os_version**: ubuntu22.04, ubuntu24.04 |
| 176 | +- **platform**: ec2 or sagemaker |
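The convention above amounts to joining the six fields with hyphens, in the same order as the `!join` expression in the buildspec. A minimal sketch, using a hypothetical helper `compose_tag` that is not part of the repository:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): joins the six tag fields with "-".
compose_tag() {
    local version="$1" device_type="$2" python_version="$3"
    local cuda_version="$4" os_version="$5" platform="$6"
    printf '%s-%s-%s-%s-%s-%s\n' \
        "$version" "$device_type" "$python_version" \
        "$cuda_version" "$os_version" "$platform"
}

compose_tag 13.0.2 gpu py313 cu130 ubuntu22.04 ec2
# prints: 13.0.2-gpu-py313-cu130-ubuntu22.04-ec2
```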
| 177 | + |
| 178 | +## Checklist for New Base Image |
| 179 | + |
| 180 | +- [ ] Create/update Dockerfile in `base/x86_64/gpu/{cuda_version}/{os_version}/{python_version}/` |
| 181 | +- [ ] Add CUDA install function and its `case` branch in `scripts/install_cuda.sh` |
| 182 | +- [ ] Create buildspec YAML in `base/` |
| 183 | +- [ ] Test build via CI |
| 184 | +- [ ] Verify security scans pass |
| 185 | +- [ ] Update documentation (`available_images.md`) |