Step-Audio-2-mini running on AMD RDNA 3.5 (gfx1151 / Strix Halo) via ROCm — build fixes and Dockerfile #86

@npathak13


Summary

Got Step-Audio-2-mini serving on AMD Radeon 8060S (gfx1151, Strix Halo APU, 128GB unified memory) using the step-audio2-mini branch of stepfun-ai/vllm. The vLLM fork had two build bugs and two runtime issues on this hardware. Full root cause analysis and fixes below.

Related vLLM upstream issue: vllm-project/vllm#35642

Build Issues

1. Mangled compiler flags from offload-arch in Docker

When building with HIP in Docker (no GPU at build time), offload-arch fails and its stderr gets captured into CMAKE_HIP_FLAGS through two paths:

  • Path 1: cmake/utils.cmake reads contaminated COMMON_HIPCC_FLAGS from PyTorch's torch.utils.cpp_extension
  • Path 2: CMake's enable_language(HIP) runs CMakeDetermineHIPCompiler.cmake which bakes the warning into CMAKE_HIP_FLAGS

Result: every compile command in build.ninja contains a literal warning string that clang++ interprets as a filename:

clang++: error: no such file or directory: '[WARNING] offload-arch failed with return code 1[stderr] -D__HIP_PLATFORM_AMD__=1'

Fix: Inject a build.ninja sanitizer into setup.py between cmake configure and cmake build (see patch_setup.py below).
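
To confirm whether a given build tree is affected before patching, the generated ninja files can be scanned for the baked-in warning. A minimal sketch (`build_temp` here stands for whatever directory cmake configured into):

```python
import glob
import os

def find_contaminated_ninja(build_temp):
    """Return build.ninja files whose flags contain the mangled
    offload-arch warning baked in by enable_language(HIP)."""
    hits = []
    pattern = os.path.join(build_temp, "**", "build.ninja")
    for nf in glob.glob(pattern, recursive=True):
        with open(nf) as f:
            if "[WARNING] offload-arch failed" in f.read():
                hits.append(nf)
    return hits
```

Any file it reports carries the literal warning string inside its flag lines, which is exactly what produces the clang++ "no such file or directory" error above.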

2. gfx1151 not in HIP_SUPPORTED_ARCHS

CMakeLists.txt doesn't include gfx1151. Fix: sed patch to add it.

Runtime Issues

3. MFMA16 assertion failure during CUDA graph capture

paged_attention_ll4mi_QKV_mfma16_kernel uses MFMA16 matrix instructions, which gfx1151 (RDNA 3.5) does not support. Fix: run with --enforce-eager to skip CUDA graph capture.

4. Triton flash attention doesn't support sliding-window attention (SWA) on gfx1151

Fix: VLLM_USE_TRITON_FLASH_ATTN=0 (uses CK flash attention instead)

Working Result

INFO [model_runner.py:1112] Model loading took 16.0381 GiB and 346.263282 seconds
INFO [worker.py:296] the current vLLM instance can use total_gpu_memory (124.00GiB) x gpu_memory_utilization (0.30) = 37.20GiB
INFO [executor_base.py:119] Maximum concurrency for 16384 tokens per request: 8.01x
INFO [api_server.py:1951] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO:     Application startup complete.
(APIServer pid=1) INFO 03-01 06:40:01 [api_server.py:1951] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:36] Available routes are:
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 03-01 06:40:01 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO:     127.0.0.1:38398 - "GET /v1/models HTTP/1.1" 200 OK

Hardware

  • GMKtec EVO-X1: AMD Ryzen AI Max+ 395, Radeon 8060S (gfx1151), 128GB unified memory
  • ROCm: TheRock nightlies 7.11.0a20260106
  • Host OS: Ubuntu 24.04 with ROCm 7.1.1 kernel drivers

Dockerfile

# syntax=docker/dockerfile:1
# Step-Audio 2 Mini with native gfx1151 support
#
# ROOT CAUSE: CMake's enable_language(HIP) runs offload-arch to detect GPU.
# In Docker (no GPU), offload-arch fails and its stderr gets captured into
# CMAKE_HIP_FLAGS. This mangled string then appears in every compile command
# in build.ninja, causing clang++ to error on it as a filename.
#
# FIX: Inject a build.ninja sanitizer into setup.py between the cmake
# configure step and the cmake build step. This strips the mangled warning
# from all ninja files before compilation begins.

FROM fedora:43
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

ARG ROCM_INDEX_URL=https://rocm.nightlies.amd.com/v2/gfx1151/

WORKDIR /opt/vllm-build
RUN uv venv --python 3.12
ENV VIRTUAL_ENV=/opt/vllm-build/.venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Install system dependencies
RUN dnf install -y \
    wget curl git gcc gcc-c++ make cmake \
    python3-pip python3-devel openssl-devel libffi-devel \
    ca-certificates tar gzip libatomic \
    && dnf clean all

# Install ROCm Python packages from TheRock nightlies (gfx1151 support)
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --index-url ${ROCM_INDEX_URL} "rocm[libraries,devel]" && \
    uv pip install --index-url ${ROCM_INDEX_URL} --pre torch torchaudio torchvision

# Download and extract matching ROCm tarball
RUN --mount=type=cache,target=/var/cache/rocm-downloads \
    ROCM_VERSION=$(uv pip show torch | grep Version | awk -F'+rocm' '{print $2}') && \
    echo "Detected ROCm Version: $ROCM_VERSION" && \
    TARBALL="therock-dist-linux-gfx1151-${ROCM_VERSION}.tar.gz" && \
    if [ -f "/var/cache/rocm-downloads/${TARBALL}" ]; then \
        echo "Using cached ROCm tarball" && \
        ln -s "/var/cache/rocm-downloads/${TARBALL}" . ; \
    else \
        echo "Downloading ROCm tarball" && \
        curl -#LO "https://therock-nightly-tarball.s3.amazonaws.com/${TARBALL}" && \
        cp "${TARBALL}" "/var/cache/rocm-downloads/${TARBALL}" ; \
    fi && \
    echo "Extracting ROCm from ${TARBALL}" && \
    mkdir -p rocm-${ROCM_VERSION} && \
    tar xzf ${TARBALL} -C rocm-${ROCM_VERSION} && \
    rm ${TARBALL} && \
    echo "${ROCM_VERSION}" > /opt/rocm_version.txt && \
    ln -s /opt/vllm-build/rocm-${ROCM_VERSION} /opt/rocm-current

# Set ROCm environment
ENV ROCM_PATH=/opt/rocm-current
ENV LD_LIBRARY_PATH=$ROCM_PATH/lib
ENV DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
ENV HIP_DEVICE_LIB_PATH=$ROCM_PATH/llvm/amdgcn/bitcode
ENV PYTORCH_ROCM_ARCH="gfx1151"
ENV CUDA_HOME=/opt/rocm-current
ENV VLLM_TORCH_COMPILE_LEVEL=0

# Clone StepFun's vLLM fork
RUN --mount=type=cache,target=/root/.cache/git \
    git clone -b step-audio2-mini --depth 1 https://github.com/stepfun-ai/vllm.git

WORKDIR /opt/vllm-build/vllm

# Patch: import amdsmi before torch to avoid crash
RUN sed -i '/from \.version import __version__/a import amdsmi' vllm/__init__.py

# Patch: Add gfx1151 to HIP_SUPPORTED_ARCHS
RUN sed -i 's/set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201")/set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1151;gfx1200;gfx1201")/' CMakeLists.txt && \
    grep "HIP_SUPPORTED_ARCHS" CMakeLists.txt | head -1

# Fix: DisabledTqdm passes duplicate 'disable' kwarg with newer huggingface_hub
RUN sed -i 's/super().__init__(\*args, \*\*kwargs, disable=True)/kwargs.pop("disable", None); super().__init__(*args, **kwargs, disable=True)/' \
    vllm/model_executor/model_loader/weight_utils.py

# THE KEY FIX: Patch setup.py to sanitize build.ninja after configure
COPY patch_setup.py /tmp/patch_setup.py
RUN python3 /tmp/patch_setup.py

# Install audio dependencies
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --no-cache-dir librosa

# Install build dependencies
RUN --mount=type=cache,target=/root/.cache/uv \
    export HIP_VISIBLE_DEVICES=-1 && \
    export ROCR_VISIBLE_DEVICES=-1 && \
    export CMAKE_PREFIX_PATH=/opt/vllm-build/.venv/lib/python3.12/site-packages/torch && \
    uv pip uninstall amdsmi && \
    uv pip install "numpy<2" && \
    python use_existing_torch.py && \
    uv pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm && \
    uv pip install -r requirements/rocm.txt

# Build vLLM
RUN --mount=type=cache,target=/root/.cache/uv \
    export CMAKE_PREFIX_PATH=/opt/vllm-build/.venv/lib/python3.12/site-packages/torch && \
    export HIP_VISIBLE_DEVICES=-1 && \
    export ROCR_VISIBLE_DEVICES=-1 && \
    MAX_JOBS=1 VERBOSE=1 python setup.py develop && \
    uv pip install ${ROCM_PATH}/share/amd_smi && \
    echo "StepFun vLLM build complete"

# Build Flash Attention
RUN --mount=type=cache,target=/root/.cache/git \
    cd /opt/vllm-build && \
    git clone https://github.com/ROCm/flash-attention.git && \
    cd flash-attention && \
    git checkout main_perf

WORKDIR /opt/vllm-build/flash-attention

RUN sed -i '/from wheel.bdist_wheel import bdist_wheel/a import amdsmi' setup.py && \
    sed -i '1i import amdsmi' flash_attn/__init__.py

ENV FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
RUN --mount=type=cache,target=/root/.cache/uv \
    python setup.py develop

WORKDIR /opt/vllm-build/vllm

ENV VLLM_USE_V1=0
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV VLLM_LOGGING_LEVEL=INFO
# gfx1151 (RDNA 3.5) doesn't support MFMA16 instructions used in ROCm paged attention
# Use CK flash attention instead of Triton, and skip CUDA graph capture (--enforce-eager)
ENV VLLM_USE_TRITON_FLASH_ATTN=0

CMD ["vllm", "serve", "stepfun-ai/Step-Audio-2-mini", \
     "--served-model-name", "step-audio-2-mini", \
     "--port", "8000", \
     "--host", "0.0.0.0", \
     "--max-model-len", "16384", \
     "--max-num-seqs", "32", \
     "--tensor-parallel-size", "1", \
     "--gpu-memory-utilization", "0.3", \
     "--enable-auto-tool-choice", \
     "--tool-call-parser", "step_audio_2", \
     "--tokenizer-mode", "step_audio_2", \
     "--chat_template_content_format", "string", \
     "--audio-parser", "step_audio_2_tts_ta4", \
     "--trust-remote-code", \
     "--dtype", "float16", \
     "--enforce-eager", \
     "--disable-frontend-multiprocessing"]

patch_setup.py

#!/usr/bin/env python3
"""
Patch StepFun's vLLM setup.py:
1. Sanitize build.ninja files after cmake configure, before cmake build
2. ROCm compatibility fixes (dummy CUDA version, runtime detection)

The key fix is injecting a ninja sanitizer between configure (line 232) and
build (line 244) in build_extensions(). This removes the mangled
offload-arch warning that CMake's enable_language(HIP) bakes into
CMAKE_HIP_FLAGS when offload-arch fails (no GPU in Docker).
"""

with open('setup.py', 'r') as f:
    content = f.read()

# === FIX 1: Inject ninja file sanitizer between configure and build ===
old_block = '''        # Build all the extensions
        for ext in self.extensions:
            self.configure(ext)
            targets.append(target_name(ext.name))

        num_jobs, _ = self.compute_num_jobs()'''

new_block = '''        # Build all the extensions
        for ext in self.extensions:
            self.configure(ext)
            targets.append(target_name(ext.name))

        # === PATCH: Sanitize build.ninja files ===
        # CMake's enable_language(HIP) runs offload-arch which fails in Docker
        # (no GPU). The stderr gets baked into CMAKE_HIP_FLAGS as a literal
        # compiler flag like: "[WARNING] offload-arch failed ... -D__HIP_PLATFORM_AMD__=1"
        # This flag appears in every compile command in build.ninja and causes
        # clang++ to fail with "no such file or directory".
        import glob as _glob
        import re as _re
        _ninja_files = _glob.glob(
            os.path.join(self.build_temp, "**", "build.ninja"),
            recursive=True)
        _ninja_files += _glob.glob(
            os.path.join(self.build_temp, "build.ninja"))
        _ninja_pattern = _re.compile(
            r'\\s*"\\[WARNING\\][^"]*"'  # quoted: "[WARNING]..."
            r'|'
            r"\\s*'\\[WARNING\\][^']*'"  # single-quoted
        )
        for _nf in set(_ninja_files):
            try:
                with open(_nf, 'r') as _f:
                    _ninja_content = _f.read()
                if '[WARNING]' in _ninja_content:
                    _new_content = _ninja_pattern.sub('', _ninja_content)
                    with open(_nf, 'w') as _f:
                        _f.write(_new_content)
                    print(f"[PATCH] Sanitized {_nf}: removed mangled offload-arch warnings")
                else:
                    print(f"[PATCH] {_nf}: no contamination found (OK)")
            except Exception as _e:
                print(f"[PATCH] Warning: could not sanitize {_nf}: {_e}")
        # === END PATCH ===

        num_jobs, _ = self.compute_num_jobs()'''

if old_block in content:
    content = content.replace(old_block, new_block)
    print("Patched setup.py: injected ninja sanitizer between configure and build")
else:
    print("ERROR: Could not find injection point in setup.py")
    print("Looking for the build_extensions method...")
    for i, line in enumerate(content.split('\n'), 1):
        if 'configure(ext)' in line or 'compute_num_jobs' in line:
            print(f"  Line {i}: {line.rstrip()}")
    raise SystemExit(1)

# === FIX 2: ROCm compatibility - dummy CUDA version ===
content = content.replace(
    'def get_nvcc_cuda_version() -> Version:',
    'def get_nvcc_cuda_version() -> Version:\n'
    '    from packaging.version import Version\n'
    '    return Version("12.0")  # Patched for ROCm\n'
    '\n'
    'def get_nvcc_cuda_version_original() -> Version:'
)

# === FIX 3: ROCm compatibility - unknown runtime ===
content = content.replace(
    'raise RuntimeError("Unknown runtime environment")',
    'return "0.1.0+rocm"  # Patched for ROCm'
)

# === FIX 4: ROCm compatibility - unsupported platform ===
content = content.replace(
    '        raise ValueError(\n            "Unsupported platform, please use CUDA, ROCm, Neuron, or CPU.")',
    '        requirements = _read_requirements("rocm.txt")  # Patched: Use ROCm requirements'
)

with open('setup.py', 'w') as f:
    f.write(content)

print("All setup.py patches applied successfully")

Run

docker run --rm --device=/dev/kfd --device=/dev/dri --network=host step-audio-rocm
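
Once the container reports startup complete, a text-only request against the OpenAI-compatible endpoint makes a quick smoke test. A sketch assuming the defaults from the CMD above (host port 8000, served model name step-audio-2-mini):

```python
import json
import urllib.request

def chat(prompt, base_url="http://127.0.0.1:8000", model="step-audio-2-mini"):
    """Send a minimal /v1/chat/completions request and return the reply text."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For example, `print(chat("Say hello."))` once the /health route returns 200.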
