Commit 04784a9

benvanik and claude authored
Rewriting the HAL CTS to support bazel and scale better. (#23644)
Rewrites the HAL Conformance Test Suite (CTS) from a CMake-only template-instantiation system to a link-time composition architecture that works with both Bazel and CMake. The new design compiles each test once and links it against multiple backends, replacing the old approach of generating a separate test binary for every (driver, test) pair. This cuts CTS build work from O(drivers x tests) to O(drivers + tests) and enables Bazel-native CTS support for the first time.

## Motivation

The old CTS had three compounding problems:

**Build scaling.** Each test header was compiled once per driver through CMake `configure_file()` template instantiation. With 16 test suites and 8 drivers, that's 128 separate test binaries, each independently compiling the same test logic against the same HAL API. As we add drivers and tests, build time grows multiplicatively.

**CMake exclusivity.** The template-based code generation was a CMake mechanism with no Bazel equivalent. This meant CTS tests couldn't run in Bazel-based workflows, and adding Bazel support would have required reimplementing the entire generation system in Starlark, with the same scaling problem.

**Invisible test logic.** The old system generated `.cc` files at CMake configure time from a `.cc.in` template that `#include`d test headers. The actual test source lived in `.h` files that were neither standalone translation units nor normal headers: they required specific macros to be defined by the template before inclusion. This made tests hard to navigate, hard to debug (breakpoints in generated files), and hard to understand for new contributors.

## Design: link-time composition

The new CTS uses a registration-based architecture where test logic and backend configuration are independent concerns connected at link time.

**Tests** are ordinary `.cc` files compiled into object libraries. Each test class inherits from `CtsTestBase` and registers itself with the CTS registry via a static initializer macro:

```cpp
class AllocatorTest : public CtsTestBase<> { ... };
TEST_P(AllocatorTest, BufferCompatibility) { ... }
CTS_REGISTER_TEST_SUITE(AllocatorTest);
```

**Backends** register themselves the same way: a single `.cc` file per driver that provides a device factory, capability tags, and executable format information:

```cpp
static bool registered_ = (CtsRegistry::RegisterBackend({
    "local_task",
    {.name = "local_task", .factory = CreateLocalTaskDevice},
    {"async_queue", "events", "file_io", "indirect"},
    {{.name = "vmvx", .format = "vmvx-bytecode-fb", .data_fn = ...}},
}), true);
```

**Composition** happens at link time: the build system links a backend `.cc` against selected test object libraries and a shared `test_main.cc`. At program start, static initializers populate the registry, then `main()` calls `CtsRegistry::InstantiateAll()` to create gtest parameterized test instances for every (backend, test suite) pair that the backend's capabilities satisfy.

The result: each test file compiles exactly once. Adding a new driver means writing one `backends.cc` file and one short build rule; the test objects are already compiled and waiting to be linked.

## Build system integration

### Test suite macros

Both Bazel and CMake provide an `iree_hal_cts_test_suite()` macro that generates the complete set of CTS test binaries for a driver. A typical driver CTS configuration is around 20 lines of build rules:

```python
iree_hal_cts_test_suite(
    backends_lib = ":backends",
    executable_formats = {
        "amdgpu": {
            "target_device": "amdgpu",
            "identifier": "iree_cts_testdata_amdgpu",
            "backend_name": "amdgpu",
            "format_string": '"amdgcn-amd-amdhsa--{ROCM_TARGET}"',
            "flags": ["--iree-rocm-target={ROCM_TARGET}", ...],
        },
    },
    flag_values = {"ROCM_TARGET": "//build_tools/bazel:rocm_test_target"},
)
```

This produces 7 test binaries per driver (5 non-executable suites + 2 executable suites), each containing all tests in its category parameterized across the driver's backends and formats. Compare to the old system's 16 separate binaries per driver.

### iree_hal_executable rules

The CTS dispatch tests need compiled HAL executables as test data. New `iree_hal_executable` and `iree_hal_executables` Starlark rules handle this compilation in both Bazel and CMake:

- Compile MLIR sources to `.bin` files using `iree-compile --compile-mode=hal-executable`
- Embed the binaries as C data arrays via `iree_c_embed_data`
- Support template variables in compiler flags via `flag_values`, resolved at analysis time from Bazel `string_flag` build settings or file targets

The `flag_values` mechanism enables hardware-specific compilation without hard-coding target architectures. For example, AMDGPU tests compile for `gfx1100` by default, but a developer can override at build time:

```
bazel test --//build_tools/bazel:rocm_test_target=gfx942 //runtime/src/iree/hal/drivers/amdgpu/cts/...
```

### Test organization

Tests are organized by HAL API area, matching the structure developers navigate when implementing a new driver:

```
runtime/src/iree/hal/cts/
  buffer/          allocator, mapping
  command_buffer/  basic ops, fill, copy, update, dispatch variants
  core/            driver, event, semaphore, executable, executable_cache
  file/            file mapping
  queue/           host calls, semaphore submission
  testdata/        MLIR sources for dispatch tests
  util/            registry, test_base, test_main
```

## Runtime features

### Capability-based test filtering

Backends declare their capabilities as tags (`"events"`, `"indirect"`, `"file_io"`, etc.) at registration time. Test suites declare tag requirements. The registry automatically skips tests for backends that lack required capabilities, so no per-driver exclusion lists are needed.

Command buffer tests get special treatment: each test runs in both direct and indirect recording modes, with indirect-mode tests filtered to backends that advertise the `"indirect"` tag.

### Test exclusions and expected failures

Backends can declare permanent exclusions (features that will never be supported) and temporary expected failures with explanations:

```cpp
.unsupported_tests = {{"FileTest.*", "WebGPU has no file I/O support"}},
.expected_failures = {{"SemaphoreTest.WaitThenSignal",
                       "Requires async signal from host thread; "
                       "blocked on WebGPU event loop integration"}},
```

Expected failures are skipped by default. Setting `IREE_CTS_VERIFY_XFAILS=1` runs them instead, flagging unexpected passes (XPASS) as test failures. This catches stale xfail entries that should be removed after fixes land.

### GPU device caching

GPU backends can't afford to create and destroy devices per test: cloud GPU runners have reliability issues with rapid device churn. The test base caches backend resources (driver, device group, device, allocator) across all tests for a given backend, creating them on first access and releasing them in the correct order at program exit. Individual tests hold their own references for isolation while sharing the underlying resources.

## Other improvements in this PR

**Semaphore failure propagation in local_task.** CTS tests exposed a bug where failures during command buffer dispatch were silently swallowed, leaving semaphores in a permanently waiting state. The fix captures dispatch failures and propagates them to signal semaphores, converting them to error state so waiters get a clear failure instead of hanging.

**File descriptor exhaustion diagnostics.** `eventfd()` and `pipe()` failures from hitting fd limits now return `RESOURCE_EXHAUSTED` with actionable diagnostics (suggesting `ulimit -n` or `sysctl` adjustments) instead of a generic errno translation.

**Bazel build for the HIP HAL driver.** Adds BUILD.bazel files for the HIP driver, registration module, and utility library, plus a `hip-api-headers` third-party dependency. This is the first step toward full Bazel support for AMD GPU workflows.

**Three new dispatch tests.** `dispatch_constants_bindings` (push constants with buffer bindings), `dispatch_multi_entrypoint` (multiple entry points in one executable), and `dispatch_multi_workgroup` (multi-dimensional workgroup dispatch) increase coverage beyond what the old CTS tested.

---------

Co-authored-by: Claude <noreply@anthropic.com>
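
The capability-based pairing described in the commit message can be illustrated with a small model. The following is a Python sketch only, not the C++ `CtsRegistry` implementation: the class names `Backend` and `SuiteSpec`, the function `instantiate_all`, the `webgpu` capability set, and the `file_io` requirement on `FileTest` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Backend:
    # Hypothetical stand-in for a registered backend and its capability tags.
    name: str
    capabilities: frozenset


@dataclass(frozen=True)
class SuiteSpec:
    # Hypothetical stand-in for a registered test suite and its required tags.
    name: str
    required: frozenset = field(default_factory=frozenset)


def instantiate_all(backends, suites):
    """Return (backend, suite) name pairs whose tag requirements are met."""
    return [
        (backend.name, suite.name)
        for backend in backends
        for suite in suites
        # A suite runs on a backend only if every required tag is advertised.
        if suite.required <= backend.capabilities
    ]


backends = [
    Backend("local_task",
            frozenset({"async_queue", "events", "file_io", "indirect"})),
    Backend("webgpu", frozenset()),  # assumed: advertises no optional tags
]
suites = [
    SuiteSpec("AllocatorTest"),
    SuiteSpec("FileTest", frozenset({"file_io"})),  # assumed requirement
]

pairs = instantiate_all(backends, suites)
for backend_name, suite_name in pairs:
    print(f"{backend_name}/{suite_name}")
```

In this model the `webgpu` backend never gets a `FileTest` instance because it lacks the `file_io` tag, which mirrors the idea that filtering comes from declared capabilities instead of per-driver exclusion lists.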
1 parent 92bbb28 commit 04784a9

104 files changed

Lines changed: 7377 additions & 3073 deletions

CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -652,6 +652,7 @@ include(iree_c_embed_data)
 include(iree_amdgpu_binary)
 include(iree_bitcode_library)
 include(iree_bytecode_module)
+include(iree_hal_executable)
 include(iree_c_module)
 include(iree_python)
 include(iree_lit_test)
@@ -1088,6 +1089,7 @@ if(IREE_HAL_DRIVER_CUDA)
 endif()

 if(IREE_HAL_DRIVER_HIP)
+  add_subdirectory(build_tools/third_party/hip-api-headers EXCLUDE_FROM_ALL)
   add_subdirectory(build_tools/third_party/rccl EXCLUDE_FROM_ALL)
 endif()

MODULE.bazel

Lines changed: 3 additions & 0 deletions
@@ -71,13 +71,16 @@ use_repo(
 iree_ext = use_extension("//build_tools/bazel:extensions.bzl", "iree_extension")
 use_repo(
     iree_ext,
+    "amdgpu_device_libs",
     "com_github_dvidelabs_flatcc",
     "com_google_benchmark",
     "com_google_googletest",
+    "hip_api_headers",
     "hsa_runtime_headers",
     "iree_cuda",
     "llvm-raw",
     "nccl",
+    "rccl",
     "spirv_cross",
     "stablehlo",
     "tracy_client",

build_tools/bazel/BUILD.bazel

Lines changed: 9 additions & 0 deletions
@@ -4,6 +4,8 @@
 # See https://llvm.org/LICENSE.txt for license information.
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

+load("@bazel_skylib//rules:common_settings.bzl", "string_flag")
+
 package(
     default_visibility = ["//visibility:public"],
     features = ["layering_check"],
@@ -45,3 +47,10 @@ config_setting(
     name = "iree_is_windows",
     constraint_values = ["@platforms//os:windows"],
 )
+
+# ROCM GPU chip target for test and sample compilation. Override with:
+# --//build_tools/bazel:rocm_test_target=gfx942
+string_flag(
+    name = "rocm_test_target",
+    build_setting_default = "gfx1100",
+)
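
To illustrate how a build setting like `rocm_test_target` could feed the `{ROCM_TARGET}` placeholders mentioned in the commit message, here is a minimal Python sketch of placeholder substitution. It is not the Starlark rule itself: `resolve_flags` is a hypothetical helper, and it assumes the label in `flag_values` has already been resolved to a plain string value.

```python
def resolve_flags(flags, resolved_values):
    """Replace each {KEY} placeholder in flags with its resolved value.

    resolved_values maps placeholder names to strings; in the real rule
    these would come from build settings or file targets (assumed here).
    """
    result = []
    for flag in flags:
        for key, value in resolved_values.items():
            flag = flag.replace("{" + key + "}", value)
        result.append(flag)
    return result


flags = ["--iree-rocm-target={ROCM_TARGET}"]

# Default value of the string_flag shown in the diff above.
default_flags = resolve_flags(flags, {"ROCM_TARGET": "gfx1100"})

# Value after a command-line override such as
# --//build_tools/bazel:rocm_test_target=gfx942.
override_flags = resolve_flags(flags, {"ROCM_TARGET": "gfx942"})

print(default_flags)
print(override_flags)
```

Flags without a placeholder pass through unchanged, so the same flag list can mix fixed and target-dependent options.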

build_tools/bazel/build_test_all.sh

Lines changed: 7 additions & 0 deletions
@@ -46,6 +46,9 @@ fi
 if ! [[ -v IREE_HIP_DISABLE ]]; then
   IREE_HIP_DISABLE=1
 fi
+if ! [[ -v IREE_AMDGPU_DISABLE ]]; then
+  IREE_AMDGPU_DISABLE=1
+fi
 if ! [[ -v IREE_METAL_DISABLE ]]; then
   IREE_METAL_DISABLE=1
 fi
@@ -69,6 +72,7 @@ declare -a test_env_args=(
   --test_env="LD_PRELOAD=libvulkan.so.1"
   --test_env=IREE_CUDA_DISABLE="${IREE_CUDA_DISABLE}"
   --test_env=IREE_HIP_DISABLE="${IREE_HIP_DISABLE}"
+  --test_env=IREE_AMDGPU_DISABLE="${IREE_AMDGPU_DISABLE}"
   --test_env=IREE_METAL_DISABLE="${IREE_METAL_DISABLE}"
   --test_env=IREE_VULKAN_DISABLE="${IREE_VULKAN_DISABLE}"
   --test_env=IREE_NVIDIA_GPU_TESTS_DISABLE="${IREE_NVIDIA_GPU_TESTS_DISABLE}"
@@ -93,6 +97,9 @@ fi
 if (( IREE_HIP_DISABLE == 1 )); then
   default_test_tag_filters+=("-driver=hip")
 fi
+if (( IREE_AMDGPU_DISABLE == 1 )); then
+  default_test_tag_filters+=("-driver=amdgpu")
+fi
 if (( IREE_METAL_DISABLE == 1 )); then
   default_test_tag_filters+=("-driver=metal")
 fi
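
The script's pattern (default each `*_DISABLE` variable to 1 when unset, then exclude the matching driver tag) can be modeled in a few lines. This Python sketch covers only the three drivers visible in the hunks above; `default_tag_filters` is a hypothetical name, and the real script handles more variables than shown here.

```python
# Environment variable -> test tag, limited to the drivers visible in the diff.
DRIVER_SWITCHES = (
    ("IREE_HIP_DISABLE", "driver=hip"),
    ("IREE_AMDGPU_DISABLE", "driver=amdgpu"),
    ("IREE_METAL_DISABLE", "driver=metal"),
)


def default_tag_filters(env):
    """Build exclusion filters; a driver is excluded unless explicitly enabled."""
    filters = []
    for variable, tag in DRIVER_SWITCHES:
        # Unset means disabled (1), matching the `[[ -v VAR ]]` default.
        # An empty value is treated as 0, as bash arithmetic would.
        disabled = int(env.get(variable, "1") or 0)
        if disabled == 1:
            filters.append("-" + tag)
    return filters


all_defaults = default_tag_filters({})
amdgpu_enabled = default_tag_filters({"IREE_AMDGPU_DISABLE": "0"})

print(all_defaults)
print(amdgpu_enabled)
```

Under this model, running AMDGPU tests requires opting in by setting `IREE_AMDGPU_DISABLE=0` in the environment.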

build_tools/bazel/extensions.bzl

Lines changed: 24 additions & 0 deletions
@@ -6,6 +6,7 @@

 """Bzlmod extension for IREE repository rules."""

+load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
 load("@bazel_tools//tools/build_defs/repo:local.bzl", "local_repository", "new_local_repository")
 load("//build_tools/bazel:workspace.bzl", "cuda_auto_configure")

@@ -74,20 +75,43 @@ def _iree_extension_impl(module_ctx):
         path = "third_party/nccl",
     )

+    # HIP API headers
+    new_local_repository(
+        name = "hip_api_headers",
+        build_file = "@iree_core//:build_tools/third_party/hip-api-headers/BUILD.overlay",
+        path = "third_party/hip-build-deps",
+    )
+
     # HSA runtime headers
     new_local_repository(
         name = "hsa_runtime_headers",
         build_file = "@iree_core//:build_tools/third_party/hsa-runtime-headers/BUILD.overlay",
         path = "third_party/hsa-runtime-headers",
     )

+    # RCCL
+    new_local_repository(
+        name = "rccl",
+        build_file = "@iree_core//:build_tools/third_party/rccl/BUILD.overlay",
+        path = "third_party/rccl",
+    )
+
     # WebGPU headers
     new_local_repository(
         name = "webgpu_headers",
         build_file = "@iree_core//:build_tools/third_party/webgpu-headers/BUILD.overlay",
         path = "third_party/webgpu-headers",
     )

+    # AMDGPU device library bitcode (ocml, ockl) for ROCM compilation.
+    # Matches the CMake fetch in compiler/plugins/target/ROCM/CMakeLists.txt.
+    http_archive(
+        name = "amdgpu_device_libs",
+        urls = ["https://github.com/shark-infra/amdgpu-device-libs/releases/download/v20231101/amdgpu-device-libs-llvm-6086c272a3a59eb0b6b79dcbe00486bf4461856a.tgz"],
+        sha256 = "336362416c68fdd8bb80328f65ca7ebaa0c119ea19c95df6df30c832a4df39b9",
+        build_file = "@iree_core//:build_tools/third_party/amdgpu_device_libs/BUILD.overlay",
+    )
+
     # CUDA auto-configuration
     cuda_auto_configure(
         name = "iree_cuda",
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+# Copyright 2026 The IREE Authors
+#
+# Licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+"""Rule for bundling files into a directory (TreeArtifact).
+
+Bazel operates on individual files, but some tools require a directory
+path rather than individual file paths. iree_directory bridges this gap:
+it copies source files into a declared directory, producing a single
+TreeArtifact output whose path can be used as a directory reference.
+
+Usage:
+
+    load("//build_tools/bazel:iree_directory.bzl", "iree_directory")
+
+    iree_directory(
+        name = "my_data_dir",
+        srcs = glob(["*.dat"]),
+    )
+
+When used as a dependency in iree_hal_executable's flag_values, the
+placeholder resolves to the directory path (since TreeArtifacts produce
+a single output whose path is the directory itself).
+"""
+
+def _iree_directory_impl(ctx):
+    """Copies source files into a declared directory."""
+    srcs = ctx.files.srcs
+    if not srcs:
+        fail("iree_directory requires at least one source file")
+    directory = ctx.actions.declare_directory(ctx.attr.name)
+    args = ctx.actions.args()
+    args.add(directory.path)
+    args.add_all(srcs)
+    ctx.actions.run_shell(
+        inputs = srcs,
+        outputs = [directory],
+        arguments = [args],
+        command = 'dest="$1"; shift; mkdir -p "$dest" && cp "$@" "$dest"',
+    )
+    return [DefaultInfo(files = depset([directory]))]
+
+iree_directory = rule(
+    implementation = _iree_directory_impl,
+    attrs = {
+        "srcs": attr.label_list(mandatory = True, allow_files = True),
+    },
+)
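
The shell command in the rule above (`mkdir -p "$dest" && cp "$@" "$dest"`) copies every source into one flat directory. The following Python sketch reproduces that behavior outside Bazel to make the effect concrete; `bundle_into_directory` and the file names are hypothetical, and the sketch runs in a temporary directory instead of a Bazel action.

```python
import os
import shutil
import tempfile


def bundle_into_directory(srcs, dest):
    """Copy each source file into dest, creating dest if needed.

    Like `cp "$@" "$dest"`, files land directly in dest under their
    base names, so sources with the same base name would overwrite
    each other.
    """
    os.makedirs(dest, exist_ok=True)
    for src in srcs:
        shutil.copy(src, dest)
    return sorted(os.listdir(dest))


with tempfile.TemporaryDirectory() as tmp:
    # Create two illustrative input files.
    src_dir = os.path.join(tmp, "src")
    os.makedirs(src_dir)
    srcs = []
    for name in ("a.dat", "b.dat"):
        path = os.path.join(src_dir, name)
        with open(path, "w") as handle:
            handle.write(name)
        srcs.append(path)

    bundled = bundle_into_directory(srcs, os.path.join(tmp, "my_data_dir"))
    print(bundled)
```

Because the copy flattens paths, a caller passing sources from several directories would need unique base names for every file to survive.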
