| 1 | +# How to Install FlagTree Triton Compiler on Huawei Ascend NPU: A Step-by-Step Guide |
| 2 | + |
| 3 | +> This tutorial is based on real hands-on experience, documenting the full process of building FlagTree from source on an openEuler + Ascend910 environment — including 4 failed build attempts and how we debugged them. Hopefully this saves you some headaches. |
| 4 | + |
| 5 | +## 1. What Is FlagTree? |
| 6 | + |
| 7 | +FlagTree is a **multi-backend unified Triton compiler** developed by the FlagOS team, forked from triton-lang/triton. It supports multiple AI chip backends including NVIDIA, Huawei Ascend, Hygon DCU, Moore Threads, and more. |
| 8 | + |
| 9 | +In short: **if you want to write Triton kernels on Ascend NPU, FlagTree is the compiler you need.** |
| 10 | + |
| 11 | +Project repository: `github.com/flagos-ai/flagtree` |
| 12 | + |
| 13 | +## 2. Environment |
| 14 | + |
| 15 | +Here is the environment we are working with: |
| 16 | + |
| 17 | +| Item | Version | |
| 18 | +|------|---------| |
| 19 | +| OS | openEuler 22.03 LTS SP4, Linux 5.10.0 aarch64 | |
| 20 | +| Python | 3.9.9 | |
| 21 | +| Device | Ascend910_9382 (16 NPUs) | |
| 22 | +| CANN | 8.5.0 | |
| 23 | +| PyTorch | 2.8.0+cpu | |
| 24 | +| torch_npu | 2.8.0.post2 | |
| 25 | +| GCC | 10.3.1 | |
| 26 | + |
| 27 | +> **Note**: This is an **aarch64** machine, not x86_64. This matters because many pre-built toolchains are x86-only — you need the aarch64 versions. |
| 28 | + |
| 29 | +## 3. Installation Steps |
| 30 | + |
| 31 | +### Step 1: Clone the FlagTree Repository |
| 32 | + |
| 33 | +```bash |
| 34 | +git clone https://github.com/flagos-ai/flagtree.git ~/FlagTree |
| 35 | +cd ~/FlagTree |
| 36 | +git submodule update --init --recursive |
| 37 | +``` |
| 38 | + |
| 39 | +After cloning, verify that the third-party dependencies are in place: |
| 40 | + |
| 41 | +```bash |
| 42 | +ls third_party/ascend/AscendNPU-IR/ # Ascend NPU IR submodule |
| 43 | +ls third_party/flir/ # FLIR (FlagTree Linalg IR) submodule |
| 44 | +``` |
| 45 | + |
| 46 | +If these directories are empty, the submodules were not pulled properly. Re-run `git submodule update --init --recursive`. |
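If you prefer a scripted check over reading `ls` output, the sketch below reports which submodule directories are missing or empty. The helper name `find_empty_dirs` is ours (it is not part of FlagTree), and the two paths are simply the ones listed above.

```python
from pathlib import Path


def find_empty_dirs(repo_root, rel_paths):
    """Return the relative paths that are missing or contain no entries."""
    empty = []
    for rel in rel_paths:
        path = Path(repo_root).expanduser() / rel
        # A submodule that was never pulled is either absent or an empty directory.
        if not path.is_dir() or not any(path.iterdir()):
            empty.append(rel)
    return empty


if __name__ == "__main__":
    # Paths taken from the check above; change the root if you cloned elsewhere.
    submodules = ["third_party/ascend/AscendNPU-IR", "third_party/flir"]
    missing = find_empty_dirs("~/FlagTree", submodules)
    if missing:
        print("Empty or missing submodules:", ", ".join(missing))
    else:
        print("All submodule directories are populated.")
```

An empty result means both directories have content; it does not prove that the checked-out commits are the ones the build expects.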
| 47 | + |
| 48 | +### Step 2: Check Build Dependencies |
| 49 | + |
| 50 | +FlagTree requires the following build tools: |
| 51 | + |
| 52 | +```bash |
| 53 | +cmake --version # >= 3.18 (we used 4.2.3) |
| 54 | +ninja --version # >= 1.11 (we used 1.13.0) |
| 55 | +pip install "pybind11>=2.13.1"   # installs or upgrades to 2.13.1 or newer |
| 56 | +``` |
| 57 | + |
| 58 | +> If cmake or ninja is missing, `pip install cmake ninja` installs both. The setup.py also auto-installs them into a temporary build environment. |
| 59 | + |
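Version strings are easy to misjudge by eye (3.9 is older than 3.18, although it sorts later as text), so a numeric comparison can help. The following is a minimal sketch, assuming `cmake --version` and `ninja --version` print a dotted version number somewhere in their output; the function names are illustrative, not part of any tool.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Extract the first dotted version number from text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, minimum):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # Minimum versions are the ones listed in the block above.
    for tool, minimum in {"cmake": "3.18", "ninja": "1.11"}.items():
        if shutil.which(tool) is None:
            print(f"{tool}: not found on PATH")
            continue
        out = subprocess.run([tool, "--version"], capture_output=True, text=True).stdout
        try:
            ok = meets_minimum(out, minimum)
        except ValueError:
            print(f"{tool}: could not parse version output")
            continue
        print(f"{tool}: {'ok' if ok else 'too old'} (need >= {minimum})")
```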
| 60 | +### Step 3: Obtain Pre-built LLVM |
| 61 | + |
| 62 | +This step is critical: FlagTree requires an LLVM toolchain with MLIR support to compile Triton. |
| 63 | + |
| 64 | +For the Ascend backend, FlagTree provides a pre-built LLVM: |
| 65 | + |
| 66 | +```bash |
| 67 | +# If you have internet access, setup.py will download it automatically. |
| 68 | +# If not, download and extract manually to ~/.flagtree/ascend/ |
| 69 | +mkdir -p ~/.flagtree/ascend |
| 70 | +cd ~/.flagtree/ascend |
| 71 | +wget https://baai-cp-web.ks3-cn-beijing.ksyuncs.com/trans/llvm-a66376b0-ubuntu-aarch64-python311-compat_v0.3.0.tar.gz |
| 72 | +tar xzf llvm-a66376b0-ubuntu-aarch64-python311-compat_v0.3.0.tar.gz |
| 73 | +``` |
| 74 | + |
| 75 | +After extraction, the directory structure looks like: |
| 76 | + |
| 77 | +``` |
| 78 | +~/.flagtree/ascend/llvm-a66376b0-ubuntu-aarch64-python311-compat/ |
| 79 | +├── bin/ # clang, clang++, mlir-opt, etc. |
| 80 | +├── include/ # LLVM/MLIR headers |
| 81 | +└── lib/ # LLVM/MLIR static libraries + libstdc++.so.6.0.30 |
| 82 | +``` |
| 83 | + |
| 84 | +> **Key point**: This pre-built LLVM ships with its own `libstdc++.so.6.0.30`. You will need this later. |
| 85 | + |
| 86 | +### Step 4: Extract Build Dependencies |
| 87 | + |
| 88 | +FlagTree ships a pre-packaged dependency tarball in its repository: |
| 89 | + |
| 90 | +```bash |
| 91 | +cd ~ |
| 92 | +tar xzf ~/FlagTree/build-deps-triton_3.2.x-linux-aarch64.tar.gz |
| 93 | +``` |
| 94 | + |
| 95 | +This extracts googletest, the JSON library, and other build dependencies into `~/.triton/`. |
| 96 | + |
| 97 | +### Step 5: Set Environment Variables (Critical!) |
| 98 | + |
| 99 | +This step is **the most error-prone part** of the entire process. It took us 4 attempts to get it right. Here are the key lessons learned. |
| 100 | + |
| 101 | +```bash |
| 102 | +# 1. Specify the backend as Ascend |
| 103 | +export FLAGTREE_BACKEND=ascend |
| 104 | + |
| 105 | +# 2. Point to the pre-built LLVM |
| 106 | +export LLVM_SYSPATH=~/.flagtree/ascend/llvm-a66376b0-ubuntu-aarch64-python311-compat |
| 107 | + |
| 108 | +# 3. Add LLVM's bin to PATH (CMake needs to find clang/clang++) |
| 109 | +export PATH=$LLVM_SYSPATH/bin:$PATH |
| 110 | + |
| 111 | +# 4. [CRITICAL] Add LLVM's lib to the linker search path |
| 112 | +# The pre-built LLVM's static libraries require GLIBCXX_3.4.30, |
| 113 | +# but the system GCC 10 only provides GLIBCXX_3.4.28. |
| 114 | +# The LLVM bundle includes libstdc++.so.6.0.30 — the linker must find it. |
| 115 | +export LIBRARY_PATH=$LLVM_SYSPATH/lib:${LIBRARY_PATH:-} |
| 116 | +export LD_LIBRARY_PATH=$LLVM_SYSPATH/lib:${LD_LIBRARY_PATH:-} |
| 117 | + |
| 118 | +# 5. Offline build (optional — prevents downloads during build) |
| 119 | +export TRITON_OFFLINE_BUILD=1 |
| 120 | + |
| 121 | +# 6. Disable Proton (profiling tool, not needed for Ascend) |
| 122 | +export TRITON_BUILD_PROTON=OFF |
| 123 | + |
| 124 | +# 7. [CRITICAL] Append extra CMake arguments to fix two build issues: |
| 125 | +# - Disable -Werror (LLVM headers trigger dangling-assignment-gsl warnings) |
| 126 | +# - Tell the linker to search LLVM's lib directory |
| 127 | +export TRITON_APPEND_CMAKE_ARGS="-DLLVM_ENABLE_WERROR=OFF \ |
| 128 | + -DCMAKE_CXX_FLAGS=-Wno-error=dangling-assignment-gsl \ |
| 129 | + -DCMAKE_EXE_LINKER_FLAGS=-L$LLVM_SYSPATH/lib \ |
| 130 | + -DCMAKE_SHARED_LINKER_FLAGS=-L$LLVM_SYSPATH/lib" |
| 131 | + |
| 132 | +# 8. Limit parallel jobs (one compile job per core can exhaust memory on many-core aarch64 machines) |
| 133 | +export MAX_JOBS=16 |
| 134 | +``` |
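Because most of our failed attempts came down to one missing variable, a pre-flight check before building can save a full rebuild cycle. The sketch below only inspects the variables set above; `check_build_env` is a hypothetical helper of ours, not something FlagTree provides.

```python
import os


def check_build_env(env):
    """Return a list of human-readable problems found in the given environment mapping."""
    problems = []
    if env.get("FLAGTREE_BACKEND") != "ascend":
        problems.append("FLAGTREE_BACKEND is not set to 'ascend'")

    llvm = env.get("LLVM_SYSPATH", "")
    if not llvm:
        problems.append("LLVM_SYSPATH is not set")
        return problems  # the remaining checks all depend on LLVM_SYSPATH

    def listed(var, subdir):
        # True if $LLVM_SYSPATH/<subdir> is one of the entries of the path-list variable.
        wanted = os.path.normpath(os.path.join(llvm, subdir))
        entries = [e for e in env.get(var, "").split(os.pathsep) if e]
        return any(os.path.normpath(e) == wanted for e in entries)

    if not listed("PATH", "bin"):
        problems.append("PATH does not contain $LLVM_SYSPATH/bin (clang will not be found)")
    for var in ("LIBRARY_PATH", "LD_LIBRARY_PATH"):
        if not listed(var, "lib"):
            problems.append(f"{var} does not contain $LLVM_SYSPATH/lib")
    if "dangling-assignment-gsl" not in env.get("TRITON_APPEND_CMAKE_ARGS", ""):
        problems.append("TRITON_APPEND_CMAKE_ARGS lacks -Wno-error=dangling-assignment-gsl")
    return problems


if __name__ == "__main__":
    for line in check_build_env(os.environ) or ["Environment looks consistent."]:
        print(line)
```

Run it in the same shell session in which you exported the variables, right before starting the build. It checks that the values are consistent with each other, not that the directories actually exist.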
| 135 | + |
| 136 | +### Step 6: Build and Install |
| 137 | + |
| 138 | +Everything is ready. Start the build: |
| 139 | + |
| 140 | +```bash |
| 141 | +cd ~/FlagTree/python |
| 142 | + |
| 143 | +# Clean any previously failed build artifacts |
| 144 | +rm -rf build/ |
| 145 | + |
| 146 | +# Install in editable mode (convenient for development and debugging) |
| 147 | +pip install -e . -v 2>&1 | tee ~/flagtree_build.log |
| 148 | +``` |
| 149 | + |
| 150 | +The build takes approximately **10-20 minutes** (depending on `MAX_JOBS` and machine performance). |
| 151 | + |
| 152 | +If everything goes well, you should see: |
| 153 | + |
| 154 | +``` |
| 155 | +Successfully installed flagtree-0.5.0+gitXXXXXXX |
| 156 | +``` |
| 157 | + |
| 158 | +### Step 7: Verify the Installation |
| 159 | + |
| 160 | +```python |
| 161 | +import triton |
| 162 | +print(triton.__version__) # 3.2.0 |
| 163 | +print(triton.__file__) # Should point to ~/FlagTree/python/triton/__init__.py |
| 164 | +``` |
| 165 | + |
| 166 | +Check that the Ascend backend is available: |
| 167 | + |
| 168 | +```python |
| 169 | +from triton.backends.ascend import driver as ascend_driver |
| 170 | +print("Ascend backend loaded!") |
| 171 | +``` |
| 172 | + |
| 173 | +Run a simple kernel test: |
| 174 | + |
| 175 | +```python |
| 176 | +import triton |
| 177 | +import triton.language as tl |
| 178 | +import torch |
| 179 | +import torch_npu |
| 180 | + |
| 181 | +@triton.jit |
| 182 | +def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr): |
| 183 | + pid = tl.program_id(axis=0) |
| 184 | + block_start = pid * BLOCK_SIZE |
| 185 | + offsets = block_start + tl.arange(0, BLOCK_SIZE) |
| 186 | + mask = offsets < n_elements |
| 187 | + x = tl.load(x_ptr + offsets, mask=mask) |
| 188 | + y = tl.load(y_ptr + offsets, mask=mask) |
| 189 | + output = x + y |
| 190 | + tl.store(output_ptr + offsets, output, mask=mask) |
| 191 | + |
| 192 | +n = 1024 |
| 193 | +x = torch.randn(n, dtype=torch.float32).to('npu') |
| 194 | +y = torch.randn(n, dtype=torch.float32).to('npu') |
| 195 | +out = torch.empty_like(x) |
| 196 | + |
| 197 | +add_kernel[(1,)](x, y, out, n, BLOCK_SIZE=1024) |
| 198 | + |
| 199 | +# Verify results on CPU (avoids CANN OPP issues) |
| 200 | +diff = torch.max(torch.abs(out.cpu() - (x.cpu() + y.cpu()))).item() |
| 201 | +print(f"Max diff: {diff}") # Should be 0.0 |
| 202 | +``` |
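The test above launches a single program whose `BLOCK_SIZE` equals `n`, so the mask never excludes any lane. To see what the `offsets` and `mask` arithmetic does when `n` is not a multiple of the block size, here is a plain-Python emulation of the kernel's indexing. It involves neither Triton nor the NPU, and the name `emulate_add` is purely illustrative.

```python
def emulate_add(x, y, block_size):
    """Emulate add_kernel's indexing on Python lists, one 'program' per block."""
    n = len(x)
    out = [None] * n
    # Ceiling division: enough programs to cover all n elements.
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):
        block_start = pid * block_size
        for off in range(block_start, block_start + block_size):
            if off < n:  # the mask: lanes past the end neither load nor store
                out[off] = x[off] + y[off]
    return out, num_programs


if __name__ == "__main__":
    result, programs = emulate_add(list(range(1000)), [1] * 1000, 256)
    print(programs, result[:3], result[-1])
```

With 1000 elements and a block size of 256, four programs are needed and the last one has only 232 active lanes. On the device, the matching launch would use a grid of `(triton.cdiv(n, BLOCK_SIZE),)` instead of `(1,)`; that variant is not part of the verified run above.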
| 203 | + |
| 204 | +## 4. Troubleshooting: Lessons from 4 Failed Builds |
| 205 | + |
| 206 | +If you follow the steps above exactly, you should succeed on the first try. But if you are curious why those "weird" environment variables are necessary, here is the record of our 4 failed attempts: |
| 207 | + |
| 208 | +### Pitfall 1: clang Not Found |
| 209 | + |
| 210 | +**Symptom**: CMake error — `CMAKE_C_COMPILER: clang is not a full path and was not found in the PATH` |
| 211 | + |
| 212 | +**Root cause**: `LLVM_SYSPATH` was set, but the LLVM `bin` directory was not added to `PATH`. The CMakeLists.txt hardcodes `set(CMAKE_C_COMPILER clang)`, which only searches by name in `PATH`. |
| 213 | + |
| 214 | +**Fix**: `export PATH=$LLVM_SYSPATH/bin:$PATH` |
| 215 | + |
| 216 | +### Pitfall 2: -Werror Causes Compilation Failure |
| 217 | + |
| 218 | +**Symptom**: |
| 219 | +``` |
| 220 | +mlir/IR/OperationSupport.h:1000:27: error: object backing the pointer |
| 221 | +will be destroyed [-Werror,-Wdangling-assignment-gsl] |
| 222 | +``` |
| 223 | + |
| 224 | +**Root cause**: FlagTree builds with `-Werror` by default (all warnings treated as errors). However, the pre-built LLVM headers trigger a `dangling-assignment-gsl` warning introduced in clang-21. This is not a code bug — the compiler simply became stricter. |
| 225 | + |
| 226 | +**Fix**: Append `-Wno-error=dangling-assignment-gsl` via `TRITON_APPEND_CMAKE_ARGS`. |
| 227 | + |
| 228 | +### Pitfall 3: Linker Error — undefined reference to std::__throw_bad_array_new_length |
| 229 | + |
| 230 | +**Symptom**: |
| 231 | +``` |
| 232 | +undefined reference to `std::__throw_bad_array_new_length()' |
| 233 | +``` |
| 234 | + |
| 235 | +**Root cause**: `std::__throw_bad_array_new_length` was added to libstdc++ in GCC 11 (symbol version GLIBCXX_3.4.29). Our system has GCC 10.3, whose libstdc++ stops at GLIBCXX_3.4.28, but the pre-built LLVM static libraries were compiled with GCC 12+ and require GLIBCXX_3.4.30. |
| 236 | + |
| 237 | +**Fix**: The pre-built LLVM ships its own `libstdc++.so.6.0.30`. Set `LIBRARY_PATH` and `LD_LIBRARY_PATH` to point to the LLVM `lib` directory, and add `-L$LLVM_SYSPATH/lib` to the CMake linker flags. |
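To confirm which libstdc++ actually provides the newer symbol versions, you can list the `GLIBCXX_*` version tags embedded in each library. The sketch below does roughly what `strings <lib> | grep GLIBCXX` does, then sorts the versions numerically; `glibcxx_versions` is our own helper name.

```python
import re
from pathlib import Path


def glibcxx_versions(lib_path):
    """Return the sorted GLIBCXX_* version tuples embedded in a libstdc++ binary."""
    data = Path(lib_path).expanduser().read_bytes()
    # Version tags are stored as plain ASCII strings such as GLIBCXX_3.4.30.
    found = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", data))
    return sorted(tuple(int(p) for p in v.decode().split(".")) for v in found)


def newest_glibcxx(lib_path):
    """Return the highest GLIBCXX version as a string, or None if none is found."""
    versions = glibcxx_versions(lib_path)
    return ".".join(map(str, versions[-1])) if versions else None
```

Calling `newest_glibcxx` on the system `libstdc++.so.6` (its location varies by distribution) and on `libstdc++.so.6.0.30` inside the LLVM `lib` directory should show the gap described above: 3.4.28 versus 3.4.30 in our environment.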
| 238 | + |
| 239 | +### Pitfall 4: Using GCC Instead of Clang? Dead End. |
| 240 | + |
| 241 | +**Symptom**: Setting `FLAGTREE_USE_SYSTEM_CC=1` to compile with GCC 10 results in a flood of template syntax errors. |
| 242 | + |
| 243 | +**Root cause**: FlagTree's C++ code (especially the FLIR and AscendNPU-IR components) extensively uses clang-specific template syntax that GCC 10 cannot parse. |
| 244 | + |
| 245 | +**Lesson**: **Do not compile FlagTree with the system GCC 10; use the clang from the LLVM bundle.** The `FLAGTREE_USE_SYSTEM_CC` flag did not work in our Ascend aarch64 environment. |
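The first three pitfalls each leave a recognizable string in the build output, so the log written by `tee` in Step 6 can be scanned for them. The sketch below matches only the error fragments quoted in this section; the table and the `diagnose` function are our own illustration. Pitfall 4 has no single signature and is not covered.

```python
from pathlib import Path

# (fragment of the error message, suggested fix), taken from the pitfalls above.
KNOWN_FAILURES = [
    ("clang is not a full path and was not found in the PATH",
     "Pitfall 1: add $LLVM_SYSPATH/bin to PATH"),
    ("-Wdangling-assignment-gsl",
     "Pitfall 2: append -Wno-error=dangling-assignment-gsl via TRITON_APPEND_CMAKE_ARGS"),
    ("__throw_bad_array_new_length",
     "Pitfall 3: add $LLVM_SYSPATH/lib to LIBRARY_PATH, LD_LIBRARY_PATH and the CMake linker flags"),
]


def diagnose(log_text):
    """Return the fixes whose error fragment appears in the build log text."""
    return [hint for fragment, hint in KNOWN_FAILURES if fragment in log_text]


if __name__ == "__main__":
    log = Path("~/flagtree_build.log").expanduser()  # written by the tee command in Step 6
    if log.is_file():
        for line in diagnose(log.read_text(errors="replace")) or ["No known failure signature found."]:
            print(line)
    else:
        print(f"{log} not found; run the build first.")
```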
| 246 | + |
| 247 | +## 5. Environment Variable Quick Reference |
| 248 | + |
| 249 | +Every time you open a new terminal, set the following variables to use FlagTree: |
| 250 | + |
| 251 | +```bash |
| 252 | +# Base environment (CANN + PyTorch + venv) |
| 253 | +source /your/venv/setup_env.sh |
| 254 | + |
| 255 | +# FlagTree runtime |
| 256 | +export LLVM_SYSPATH=~/.flagtree/ascend/llvm-a66376b0-ubuntu-aarch64-python311-compat |
| 257 | +export LD_LIBRARY_PATH=$LLVM_SYSPATH/lib:${LD_LIBRARY_PATH:-} |
| 258 | +``` |
| 259 | + |
| 260 | +> We recommend creating a `setup_flagtree.sh` script to set everything up in one line. |
| 261 | + |
| 262 | +## 6. Summary |
| 263 | + |
| 264 | +| Step | Description | Time | |
| 265 | +|------|-------------|------| |
| 266 | +| Clone repo + submodules | Pull source code | ~5min | |
| 267 | +| Install build deps | cmake, ninja, pybind11 | ~2min | |
| 268 | +| Download pre-built LLVM | ~500MB | ~5min | |
| 269 | +| Extract build deps | googletest, json | ~1min | |
| 270 | +| Set environment variables | The most critical step | ~5min | |
| 271 | +| Build and install | pip install -e . | ~15min | |
| 272 | +| Verify | import triton + kernel test | ~2min | |
| 273 | + |
| 274 | +**Key takeaways**: |
| 275 | +1. You **must use clang** (from the LLVM bundle), not system GCC |
| 276 | +2. You **must add LLVM's lib to the linker path** (libstdc++ version mismatch) |
| 277 | +3. You **must append** `-Wno-error=dangling-assignment-gsl` (new clang-21 warning) |
| 278 | +4. `TRITON_APPEND_CMAKE_ARGS` is your lifeline for passing extra CMake arguments |
| 279 | + |
| 280 | +We hope this tutorial helps anyone working with Triton on Ascend NPU. Feel free to leave questions in the comments! |
| 281 | + |
| 282 | +--- |
| 283 | + |
| 284 | +*This tutorial is based on FlagTree v0.5.0 (commit 4d9e18e), verified on Ascend910 + CANN 8.5.0 + openEuler 22.03 LTS SP4 aarch64.* |