Summary
Got Step-Audio-2-mini serving on AMD Radeon 8060S (gfx1151, Strix Halo APU, 128GB unified memory) using the step-audio2-mini branch of stepfun-ai/vllm. The vLLM fork had two build bugs and two runtime issues on this hardware. Full root cause analysis and fixes below.
Related vLLM upstream issue: vllm-project/vllm#35642
Build Issues
1. Mangled compiler flags from offload-arch in Docker
When building with HIP in Docker (no GPU at build time), offload-arch fails and its stderr gets captured into CMAKE_HIP_FLAGS through two paths:
- Path 1: cmake/utils.cmake reads contaminated COMMON_HIPCC_FLAGS from PyTorch's torch.utils.cpp_extension
- Path 2: CMake's enable_language(HIP) runs CMakeDetermineHIPCompiler.cmake, which bakes the warning into CMAKE_HIP_FLAGS
Result: every compile command in build.ninja contains a literal warning string that clang++ interprets as a filename:
clang++: error: no such file or directory: '[WARNING] offload-arch failed with return code 1[stderr] -D__HIP_PLATFORM_AMD__=1'
Fix: Inject a build.ninja sanitizer into setup.py between cmake configure and cmake build (see patch_setup.py below).
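The core of that sanitizer is a regex that strips the quoted warning out of each compile command. A minimal standalone sketch of the idea (the contaminated line below is illustrative, not copied from a real build.ninja):

```python
import re

# Matches the mangled offload-arch warning when it shows up as a quoted
# "flag" in a ninja compile command (double- or single-quoted variant).
WARNING_PATTERN = re.compile(
    r'\s*"\[WARNING\][^"]*"'    # double-quoted: "[WARNING]..."
    r"|\s*'\[WARNING\][^']*'"   # single-quoted variant
)

def sanitize(line: str) -> str:
    """Strip mangled offload-arch warnings from a ninja command line."""
    return WARNING_PATTERN.sub('', line)

# Hypothetical contaminated line, for demonstration only:
line = 'FLAGS = "[WARNING] offload-arch failed with return code 1[stderr]" -D__HIP_PLATFORM_AMD__=1 -O3'
print(sanitize(line))  # the quoted warning is gone, real flags survive
```

The full version in patch_setup.py below applies the same substitution to every build.ninja under the build directory.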
2. gfx1151 not in HIP_SUPPORTED_ARCHS
CMakeLists.txt doesn't include gfx1151. Fix: sed patch to add it.
Runtime Issues
3. MFMA16 assertion failure during CUDA graph capture
paged_attention_ll4mi_QKV_mfma16_kernel uses MFMA16 instructions, which RDNA GPUs such as gfx1151 (RDNA 3.5) don't support. Fix: --enforce-eager to skip CUDA graph capture.
4. Triton flash attention doesn't support sliding-window attention (SWA) on gfx1151
Fix: VLLM_USE_TRITON_FLASH_ATTN=0 (falls back to CK flash attention)
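Both runtime workarounds together look like this as a hypothetical launch wrapper (the flag and env var names come from the fixes above; everything else is illustrative):

```python
import os
import subprocess

# Fix 4: disable Triton flash attention so vLLM falls back to CK.
env = dict(os.environ)
env["VLLM_USE_TRITON_FLASH_ATTN"] = "0"

# Fix 3: --enforce-eager skips CUDA graph capture, avoiding the
# MFMA16 assertion on RDNA hardware.
cmd = [
    "vllm", "serve", "stepfun-ai/Step-Audio-2-mini",
    "--enforce-eager",
    "--max-model-len", "16384",
]

print(" ".join(cmd))
# subprocess.run(cmd, env=env, check=True)  # uncomment on the target box
```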
Working Result
INFO [model_runner.py:1112] Model loading took 16.0381 GiB and 346.263282 seconds
INFO [worker.py:296] the current vLLM instance can use total_gpu_memory (124.00GiB) x gpu_memory_utilization (0.30) = 37.20GiB
INFO [executor_base.py:119] Maximum concurrency for 16384 tokens per request: 8.01x
(APIServer pid=1) INFO 03-01 06:40:01 [api_server.py:1951] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:36] Available routes are:
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO: 127.0.0.1:38398 - "GET /v1/models HTTP/1.1" 200 OK
Hardware
- GMKtec EVO-X1: AMD Ryzen AI Max+ 395, Radeon 8060S (gfx1151), 128GB unified memory
- ROCm: TheRock nightlies 7.11.0a20260106
- Host OS: Ubuntu 24.04 with ROCm 7.1.1 kernel drivers
Dockerfile
# syntax=docker/dockerfile:1
# Step-Audio 2 Mini with native gfx1151 support
#
# ROOT CAUSE: CMake's enable_language(HIP) runs offload-arch to detect GPU.
# In Docker (no GPU), offload-arch fails and its stderr gets captured into
# CMAKE_HIP_FLAGS. This mangled string then appears in every compile command
# in build.ninja, causing clang++ to error on it as a filename.
#
# FIX: Inject a build.ninja sanitizer into setup.py between the cmake
# configure step and the cmake build step. This strips the mangled warning
# from all ninja files before compilation begins.
FROM fedora:43
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
ARG ROCM_INDEX_URL=https://rocm.nightlies.amd.com/v2/gfx1151/
WORKDIR /opt/vllm-build
RUN uv venv --python 3.12
ENV VIRTUAL_ENV=/opt/vllm-build/.venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Install system dependencies
RUN dnf install -y \
wget curl git gcc gcc-c++ make cmake \
python3-pip python3-devel openssl-devel libffi-devel \
ca-certificates tar gzip libatomic \
&& dnf clean all
# Install ROCm Python packages from TheRock nightlies (gfx1151 support)
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --index-url ${ROCM_INDEX_URL} "rocm[libraries,devel]" && \
uv pip install --index-url ${ROCM_INDEX_URL} --pre torch torchaudio torchvision
# Download and extract matching ROCm tarball
RUN --mount=type=cache,target=/var/cache/rocm-downloads \
ROCM_VERSION=$(uv pip show torch | grep Version | awk -F'+rocm' '{print $2}') && \
echo "Detected ROCm Version: $ROCM_VERSION" && \
TARBALL="therock-dist-linux-gfx1151-${ROCM_VERSION}.tar.gz" && \
if [ -f "/var/cache/rocm-downloads/${TARBALL}" ]; then \
echo "Using cached ROCm tarball" && \
ln -s "/var/cache/rocm-downloads/${TARBALL}" . ; \
else \
echo "Downloading ROCm tarball" && \
curl -#LO "https://therock-nightly-tarball.s3.amazonaws.com/${TARBALL}" && \
cp "${TARBALL}" "/var/cache/rocm-downloads/${TARBALL}" ; \
fi && \
echo "Extracting ROCm from ${TARBALL}" && \
mkdir -p rocm-${ROCM_VERSION} && \
tar xzf ${TARBALL} -C rocm-${ROCM_VERSION} && \
rm ${TARBALL} && \
echo "${ROCM_VERSION}" > /opt/rocm_version.txt && \
ln -s /opt/vllm-build/rocm-${ROCM_VERSION} /opt/rocm-current
# Set ROCm environment
ENV ROCM_PATH=/opt/rocm-current
ENV LD_LIBRARY_PATH=$ROCM_PATH/lib
ENV DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
ENV HIP_DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
ENV PYTORCH_ROCM_ARCH="gfx1151"
ENV CUDA_HOME=/opt/rocm-current
ENV VLLM_TORCH_COMPILE_LEVEL=0
# Clone StepFun's vLLM fork
RUN --mount=type=cache,target=/root/.cache/git \
git clone -b step-audio2-mini --depth 1 https://github.com/stepfun-ai/vllm.git
WORKDIR /opt/vllm-build/vllm
# Patch: import amdsmi before torch to avoid crash
RUN sed -i '/from \.version import __version__/a import amdsmi' vllm/__init__.py
# Patch: Add gfx1151 to HIP_SUPPORTED_ARCHS
RUN sed -i 's/set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201")/set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1151;gfx1200;gfx1201")/' CMakeLists.txt && \
grep "HIP_SUPPORTED_ARCHS" CMakeLists.txt | head -1
# Fix: DisabledTqdm passes duplicate 'disable' kwarg with newer huggingface_hub
RUN sed -i 's/super().__init__(\*args, \*\*kwargs, disable=True)/kwargs.pop("disable", None); super().__init__(*args, **kwargs, disable=True)/' \
vllm/model_executor/model_loader/weight_utils.py
# THE KEY FIX: Patch setup.py to sanitize build.ninja after configure
COPY patch_setup.py /tmp/patch_setup.py
RUN python3 /tmp/patch_setup.py
# Install audio dependencies
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --no-cache-dir librosa
# Install build dependencies
RUN --mount=type=cache,target=/root/.cache/uv \
export HIP_VISIBLE_DEVICES=-1 && \
export ROCR_VISIBLE_DEVICES=-1 && \
export CMAKE_PREFIX_PATH=/opt/vllm-build/.venv/lib/python3.12/site-packages/torch && \
uv pip uninstall amdsmi && \
uv pip install "numpy<2" && \
python use_existing_torch.py && \
uv pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm && \
uv pip install -r requirements/rocm.txt
# Build vLLM
RUN --mount=type=cache,target=/root/.cache/uv \
export CMAKE_PREFIX_PATH=/opt/vllm-build/.venv/lib/python3.12/site-packages/torch && \
export HIP_VISIBLE_DEVICES=-1 && \
export ROCR_VISIBLE_DEVICES=-1 && \
MAX_JOBS=1 VERBOSE=1 python setup.py develop && \
uv pip install ${ROCM_PATH}/share/amd_smi && \
echo "StepFun vLLM build complete"
# Build Flash Attention
RUN --mount=type=cache,target=/root/.cache/git \
cd /opt/vllm-build && \
git clone https://github.com/ROCm/flash-attention.git && \
cd flash-attention && \
git checkout main_perf
WORKDIR /opt/vllm-build/flash-attention
RUN sed -i '/from wheel.bdist_wheel import bdist_wheel/a import amdsmi' setup.py && \
sed -i '1i import amdsmi' flash_attn/__init__.py
ENV FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
RUN --mount=type=cache,target=/root/.cache/uv \
python setup.py develop
WORKDIR /opt/vllm-build/vllm
ENV VLLM_USE_V1=0
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV VLLM_LOGGING_LEVEL=INFO
# gfx1151 (RDNA 3.5) doesn't support the MFMA16 instructions used in ROCm paged attention
# Use CK flash attention instead of Triton, and skip CUDA graph capture
ENV VLLM_USE_TRITON_FLASH_ATTN=0
CMD ["vllm", "serve", "stepfun-ai/Step-Audio-2-mini", \
"--served-model-name", "step-audio-2-mini", \
"--port", "8000", \
"--host", "0.0.0.0", \
"--max-model-len", "16384", \
"--max-num-seqs", "32", \
"--tensor-parallel-size", "1", \
"--gpu-memory-utilization", "0.3", \
"--enable-auto-tool-choice", \
"--tool-call-parser", "step_audio_2", \
"--tokenizer-mode", "step_audio_2", \
"--chat_template_content_format", "string", \
"--audio-parser", "step_audio_2_tts_ta4", \
"--trust-remote-code", \
"--dtype", "float16", \
"--enforce-eager", \
"--disable-frontend-multiprocessing"]
patch_setup.py
#!/usr/bin/env python3
"""
Patch StepFun's vLLM setup.py:
1. Sanitize build.ninja files after cmake configure, before cmake build
2. ROCm compatibility fixes (dummy CUDA version, runtime detection)
The key fix is injecting a ninja sanitizer between configure (line 232) and
build (line 244) in build_extensions(). This removes the mangled
offload-arch warning that CMake's enable_language(HIP) bakes into
CMAKE_HIP_FLAGS when offload-arch fails (no GPU in Docker).
"""
with open('setup.py', 'r') as f:
    content = f.read()

# === FIX 1: Inject ninja file sanitizer between configure and build ===
old_block = '''        # Build all the extensions
        for ext in self.extensions:
            self.configure(ext)
            targets.append(target_name(ext.name))

        num_jobs, _ = self.compute_num_jobs()'''
new_block = '''        # Build all the extensions
        for ext in self.extensions:
            self.configure(ext)
            targets.append(target_name(ext.name))

        # === PATCH: Sanitize build.ninja files ===
        # CMake's enable_language(HIP) runs offload-arch which fails in Docker
        # (no GPU). The stderr gets baked into CMAKE_HIP_FLAGS as a literal
        # compiler flag like: "[WARNING] offload-arch failed ... -D__HIP_PLATFORM_AMD__=1"
        # This flag appears in every compile command in build.ninja and causes
        # clang++ to fail with "no such file or directory".
        import glob as _glob
        import re as _re
        _ninja_files = _glob.glob(
            os.path.join(self.build_temp, "**", "build.ninja"),
            recursive=True)
        _ninja_files += _glob.glob(
            os.path.join(self.build_temp, "build.ninja"))
        _ninja_pattern = _re.compile(
            r'\\s*"\\[WARNING\\][^"]*"'  # quoted: "[WARNING]..."
            r'|'
            r"\\s*'\\[WARNING\\][^']*'"  # single-quoted
        )
        for _nf in set(_ninja_files):
            try:
                with open(_nf, 'r') as _f:
                    _ninja_content = _f.read()
                if '[WARNING]' in _ninja_content:
                    _new_content = _ninja_pattern.sub('', _ninja_content)
                    with open(_nf, 'w') as _f:
                        _f.write(_new_content)
                    print(f"[PATCH] Sanitized {_nf}: removed mangled offload-arch warnings")
                else:
                    print(f"[PATCH] {_nf}: no contamination found (OK)")
            except Exception as _e:
                print(f"[PATCH] Warning: could not sanitize {_nf}: {_e}")
        # === END PATCH ===

        num_jobs, _ = self.compute_num_jobs()'''
if old_block in content:
    content = content.replace(old_block, new_block)
    print("Patched setup.py: injected ninja sanitizer between configure and build")
else:
    print("ERROR: Could not find injection point in setup.py")
    print("Looking for the build_extensions method...")
    for i, line in enumerate(content.split('\n'), 1):
        if 'configure(ext)' in line or 'compute_num_jobs' in line:
            print(f"  Line {i}: {line.rstrip()}")
    raise SystemExit(1)

# === FIX 2: ROCm compatibility - dummy CUDA version ===
content = content.replace(
    'def get_nvcc_cuda_version() -> Version:',
    'def get_nvcc_cuda_version() -> Version:\n'
    '    from packaging.version import Version\n'
    '    return Version("12.0")  # Patched for ROCm\n'
    '\n'
    'def get_nvcc_cuda_version_original() -> Version:'
)

# === FIX 3: ROCm compatibility - unknown runtime ===
content = content.replace(
    'raise RuntimeError("Unknown runtime environment")',
    'return "0.1.0+rocm"  # Patched for ROCm'
)

# === FIX 4: ROCm compatibility - unsupported platform ===
content = content.replace(
    '        raise ValueError(\n            "Unsupported platform, please use CUDA, ROCm, Neuron, or CPU.")',
    '        requirements = _read_requirements("rocm.txt")  # Patched: Use ROCm requirements'
)

with open('setup.py', 'w') as f:
    f.write(content)
print("All setup.py patches applied successfully")
Run
docker run --rm --device=/dev/kfd --device=/dev/dri --network=host step-audio-rocm
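Once the container is up, a quick smoke test against the OpenAI-compatible chat endpoint can look like the sketch below ("step-audio-2-mini" matches the --served-model-name flag in the CMD; the commented request only succeeds while the server is running):

```python
import json
from urllib import request

# Payload for the OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "step-audio-2-mini",  # matches --served-model-name above
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}
body = json.dumps(payload).encode()
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Requires the `docker run` server above to be listening on port 8000:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url, len(body), "bytes")
```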