Runtime TXN generation by jgmelber · Pull Request #3002 · Xilinx/mlir-aie

jgmelber · 2026-03-26T17:48:09Z

Summary

Add dynamic runtime TXN generation: compile a single XCLBIN once, then generate NPU instruction streams at runtime for arbitrary problem sizes. The compiler emits a standalone C++ function that builds TXN binaries parameterized by runtime values (e.g., matrix dimensions M, K, N).

Key changes

TxnEncoding.h — Header-only library for encoding NPU TXN instructions at runtime. Zero MLIR/LLVM dependencies; shared between compiler and generated host code as a single source of truth for instruction format.
ConvertAIEXToEmitC pass — New conversion pass that lowers AIEX runtime sequence ops (npu.write32, npu.sync, npu.address_patch, npu.blockwrite) to EmitC dialect, producing compilable C++ via translateToCpp.
--aie-generate-txn-cpp flag — New aiecc option that generates a C++ header alongside the XCLBIN. The header contains an inline function generate_txn_sequence(...) that returns std::vector<uint32_t> of NPU instruction words.
Dynamic npu.dma_memcpy_nd — Extended DMA-to-NPU lowering to support SSA-parameterized sizes, strides, and offsets. Dynamic values flow through arith ops to compute BD register words at runtime.
RuntimeSequenceOp changes — Removed IsolatedFromAbove to allow referencing parent DeviceOp values. SCF-to-CF conversion now scoped to aie.core ops only (walk-and-convert, no exclusion lists).
IRON Python support — RuntimeScalar type for scalar runtime parameters, write_rtp() for runtime-tunable parameters, BD ID allocator for dynamic DMA tasks, direct npu_dma_memcpy_nd emission path.
Placement infrastructure — New AIEPlaceTiles pass with sequential placer algorithm for logical-to-physical tile mapping.
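
To make the "header-only, zero dependencies" idea concrete, here is a minimal sketch of what such an encoder could look like. It is not the contents of TxnEncoding.h: the namespace `txn_sketch`, the function names, and the position of the write32 value word are assumptions. The 6-word write32 and 7-word maskwrite32 layouts (opcode at word 0, address at word 2, maskwrite value and mask at words 4 and 5, trailing size-in-bytes word) follow the extract_constants helper quoted in the review comments of this PR.

```cpp
#include <cstdint>
#include <vector>

namespace txn_sketch {

// Opcode values as implied by the extract_constants comments (write32 is
// opcode 0, maskwrite32 is opcode 3). Names here are hypothetical.
constexpr uint32_t OPC_WRITE = 0;
constexpr uint32_t OPC_MASKWRITE = 3;

// Append a 6-word write32: [opcode, 0, address, 0, value, size_in_bytes].
// Placing the value at word 4 is an assumption of this sketch.
inline void append_write32(std::vector<uint32_t> &txn, uint32_t address,
                           uint32_t value) {
  constexpr uint32_t kWords = 6;
  const uint32_t words[kWords] = {OPC_WRITE, 0u,    address,
                                  0u,        value, kWords * 4u};
  txn.insert(txn.end(), words, words + kWords);
}

// Append a 7-word maskwrite32:
// [opcode, 0, address, 0, value, mask, size_in_bytes].
inline void append_maskwrite32(std::vector<uint32_t> &txn, uint32_t address,
                               uint32_t value, uint32_t mask) {
  constexpr uint32_t kWords = 7;
  const uint32_t words[kWords] = {OPC_MASKWRITE, 0u,   address,    0u,
                                  value,         mask, kWords * 4u};
  txn.insert(txn.end(), words, words + kWords);
}

} // namespace txn_sketch
```

Because the sketch only depends on `<cstdint>` and `<vector>`, the same file could in principle be included by both a compiler backend and generated host code, which is the property the PR description emphasizes.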

Examples

Dynamic GEMM (single_core_dynamic/): One XCLBIN, any M/K/N (multiples of 32). Verified 32x32x32 through 128x128x128 on NPU Strix Halo.
Dynamic passthrough (passthrough_kernel/): Runtime-configurable buffer sizes with parameterized TXN.
TXN comparison tool (compare_txn.cpp): Validates dynamic-generated TXN matches static compiler output.
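
Since the dynamic GEMM accepts only M/K/N that are multiples of 32, a host program would typically validate the sizes before generating a TXN stream. The following is a minimal sketch of such a check; the helper name `checked_dim` and the namespace are hypothetical and not part of the PR.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

namespace size_sketch {

// Tile granularity stated in the PR description (multiples of 32).
constexpr int64_t kTile = 32;

// Validate one dimension before narrowing it to uint32_t.
// Rejects non-positive values, non-multiples of kTile, and values that
// would not fit in 32 bits.
inline uint32_t checked_dim(const std::string &name, int64_t value) {
  if (value <= 0)
    throw std::invalid_argument(name + " must be positive");
  if (value % kTile != 0)
    throw std::invalid_argument(name + " must be a multiple of 32");
  if (value > static_cast<int64_t>(std::numeric_limits<uint32_t>::max()))
    throw std::invalid_argument(name + " does not fit in 32 bits");
  return static_cast<uint32_t>(value);
}

} // namespace size_sketch
```

Checking the sign before the cast matters because a negative `int64_t` would otherwise wrap to a large unsigned value and silently produce a bogus instruction stream.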

Test coverage

3 new FileCheck tests for EmitC conversion (test/Conversion/AIEXToEmitC/)
test/aiecc/cpp_dynamic_txn.mlir for end-to-end aiecc pipeline
Existing DMA, dialect, and Python tests updated for dynamic op variants

Test plan

EmitC FileCheck tests pass (basic, dynamic values, unsupported ops)
aiecc compilation pipeline succeeds (XCLBIN + TXN C++ in one invocation)
Passthrough kernel: PASS on NPU (dynamic TXN path)
Dynamic GEMM: PASS on NPU at 32x32x32, 64x64x64, 96x96x96, 128x128x128
CI: Lint and format
CI: Build across platforms (Ubuntu 22.04/24.04, Windows, gcc/llvm, assert on/off)
CI: Ryzen AI hardware tests

🤖 Generated with Claude Code

Introduce new operations that accept SSA values instead of static attributes, enabling runtime parameterization of NPU sequences:
- aiex.npu.dyn_write32: Dynamic write with SSA address and value
- aiex.npu.dyn_maskwrite32: Dynamic masked write with SSA operands
- aiex.npu.dyn_dma_memcpy_nd: Fully dynamic N-D DMA with SSA sizes/strides
- aiex.npu.dyn_sync: Dynamic synchronization with SSA tile/channel
These operations can be lowered to templated C++ code for runtime transaction generation, allowing a single compiled artifact to support multiple problem sizes determined at runtime. Added verification to ensure SSA operands have correct types (index or signless integers). Maximum 4 dimensions enforced for DMA operations to match hardware constraints.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implements the dynamic runtime sequences infrastructure:
Phase 1: Extract TXN instruction encoding from AIETargetNPU.cpp into include/aie/Runtime/TxnEncoding.h - a header-only C++ library with zero MLIR/LLVM dependencies. Refactor AIETargetNPU.cpp to use it.
Phase 2: Add ConvertAIEXToEmitC pass that lowers AIEX runtime sequence ops (both static npu.write32/blockwrite/sync/address_patch and dynamic npu.dyn_write32/dyn_maskwrite32/dyn_sync) plus SCF/arith ops into EmitC dialect. The EmitC IR calls aie_runtime::txn_append_* functions.
Phase 3: Wire aie-translate --aie-generate-txn-cpp translation that runs the NPU lowering pipeline then the EmitC pass, producing compilable C++ that generates TXN binaries at runtime.
Phase 4: Add test_dynamic.cpp example that uses the generated C++ instead of loading insts.bin. Verified bit-for-bit identical TXN output and PASS on NPU hardware.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Extract the NPU lowering pipeline (7 passes) into a shared populateNpuLoweringPipeline() function used by both aiecc and aie-translate, eliminating the duplicated pass list. Unify the host test executable: test_dynamic.cpp is deleted and test.cpp gains a USE_DYNAMIC_TXN compile flag that sets a generate_instr callback on the args struct in xrt_test_wrapper.h. Both static (insts.bin) and dynamic (generated TXN) paths now share the same XRT setup, buffer management, verification, and timing infrastructure. Add --aie-generate-txn-cpp and --txn-cpp-name flags to aiecc so C++ TXN generation is accessible through the compiler driver alongside --aie-generate-npu-insts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
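
The unified harness described above can be pictured as an optional callback on the argument struct: when it is set, the instruction stream is generated at runtime; otherwise the pre-compiled words are used. The sketch below is only an illustration of that pattern. The struct layout, the callback signature, and `select_instructions` are assumptions and do not reproduce xrt_test_wrapper.h.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

namespace harness_sketch {

// Hypothetical argument struct. `static_insts` stands in for the words that
// would be loaded from insts.bin; `generate_instr` is empty on the static path.
struct Args {
  std::vector<uint32_t> static_insts;
  std::function<std::vector<uint32_t>(uint32_t)> generate_instr;
};

// Pick the instruction stream for one run. Everything after this point
// (XRT setup, buffers, verification, timing) can be shared by both paths.
inline std::vector<uint32_t> select_instructions(const Args &args,
                                                 uint32_t transfer_size) {
  if (args.generate_instr)
    return args.generate_instr(transfer_size); // dynamic TXN path
  return args.static_insts;                    // static insts.bin path
}

} // namespace harness_sketch
```

With this shape, a compile flag such as USE_DYNAMIC_TXN only needs to decide whether the callback is assigned; the rest of the host code stays identical.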

Add test_dynamic_size.mlir using aiex.npu.dyn_write32 with BLOCKWRITE for BD configuration (required for address_patch compatibility) and dynamic buffer_length as a runtime parameter. Add --dynamic-size flag to test executable. When set, the host passes the transfer size to generate_txn_sequence() at runtime instead of using a pre-compiled instruction binary. The core loops over fixed-size ObjectFIFO tiles, so any multiple of the tile size works. Demonstrated: XCLBIN compiled once at 1024-byte tile size, single executable runs correctly at 1024, 2048, 3072, and 4096 bytes with zero recompilation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Extend NpuWrite32Op, NpuMaskWrite32Op, and NpuSyncOp with optional SSA operands (dyn_address, dyn_value, etc.) so a single op handles both compile-time constant and runtime-parameterized forms. Delete the 4 separate Dyn ops (NpuDynWrite32Op, NpuDynMaskWrite32Op, NpuDynSyncOp, NpuDynDmaMemcpyNdOp) that previously duplicated them.
- Add AttrSizedOperandSegments trait and custom parse/print/verify
- Add custom builders preserving existing call-site signatures
- Merge EmitC conversion handlers (static vs dynamic dispatch)
- Add error guards in NPU binary translation for dynamic operands
- Add Python wrappers: npu_write32_dynamic, npu_maskwrite32_dynamic, npu_sync_dynamic
- Replace hand-written test_dynamic_size.mlir with Python design file (passthrough_kernel_dynamic.py)
100% backward compatible: all 294 existing LIT tests pass unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…amic Worker
Extend IRON APIs to support dynamic (RTP-based) loop bounds:
- Worker: add dynamic_objfifo_lowering parameter
- Runtime: extend sequence() for mixed array/scalar types, add write_rtp()
- RuntimeScalar: new class for scalar runtime sequence parameters
- RtpWriteTask: new task class wrapping npu_rtp_write
- single_core_iron_dynamic.py: IRON-level dynamic GEMM example
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… and LIT tests Extract the 2 dynamic designs (low-level, IRON) and their test harness from single_core/ into a new single_core_dynamic/ directory. Add a new placed dynamic variant using shim_dma_single_bd_task. Create LIT tests for all 3 variants and a passthrough_kernel dynamic LIT test. Fix dynamic_gemm_txn.h: add missing MASKWRITE before S2MM queue push (required for XRT completion token) and restructure C output BDs to use the batched pingpong pattern matching the static compiler. Verified on NPU2 hardware: all 3 dynamic sizes (32x32x32, 64x64x64, 128x128x128) PASS with a single XCLBIN for both low-level and placed variants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The dynamic TXN generator now parses the static instruction stream (generated alongside the XCLBIN) to discover the RTP buffer address and S2MM control register values. This makes the dynamic test harness work with all 3 design variants (low-level, placed, IRON) since each may place the RTP buffer at a different address. Previously the IRON variant failed because its buffer allocator placed the RTP at 0x204d00 while the code hardcoded 0x200600. All 3 variants now PASS on NPU2 at 32x32x32, 64x64x64, 128x128x128. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Enable auto-generated C++ TXN code from MLIR runtime sequences with SSA parameters, allowing a single XCLBIN to run matrix multiplications at any M/K/N (multiples of 32) determined at runtime. Key changes:
- Add IsolatedFromAbove trait to RuntimeSequenceOp, preventing SCF-to-CF from entering runtime sequences while still lowering core bodies. Also prevents constant hoisting across the isolation boundary.
- Extend DmaToNpuPattern with dynamic code path: when sizes/strides are SSA values, compute BD words via arith ops and emit npu.write32. Fixes bf16 d0_size (multiply-first-then-divide), stride underflow guards (size>1 check), and repeat_count off-by-one.
- Extend EmitC conversion for scf.for with iter_args (VariableOp + LoadOp + AssignOp pattern), scf.if with results, and new arith ops (TruncI, ExtUI, ExtSI, MinSI, MaxSI). Add pre-scan for values hoisted outside runtime_sequence and cross-reference fixup pass.
- Add dyn_arg_plus to NpuAddressPatchOp and dyn_value to NpuWriteRTPOp for runtime-parameterized buffer offsets and RTP writes.
- Scope SCF-to-CF in aiecc via markOpRecursivelyLegal on RuntimeSequenceOp, and disable cross-region constant CSE in AIEVectorTransferLoweringPass.
- Add unified aiecc compilation (--aie-generate-xclbin + --aie-generate-txn-cpp) producing both XCLBIN and C++ TXN from the same MLIR with identical buffer addresses.
Verified on NPU Strix Halo: 32x32x32, 64x64x64, 64x32x64, 96x96x96, 128x64x128, 128x128x128 all PASS against reference matmul.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
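
The three DMA fixes named in this commit are all integer-arithmetic pitfalls, so a small sketch may help. It assumes, for illustration only, that the innermost size is expressed in 32-bit words and that stride and repeat-count fields are stored as "value minus one"; the helper names are hypothetical and this is not the code in AIEDmaToNpu.

```cpp
#include <cstdint>

namespace dma_sketch {

// Innermost size in 32-bit words: multiply first, then divide.
// For bf16 (2 bytes per element), 32 elements occupy 16 words.
inline uint32_t d0_size_words(uint32_t elems, uint32_t elem_bytes) {
  const uint64_t bytes = static_cast<uint64_t>(elems) * elem_bytes;
  return static_cast<uint32_t>(bytes / 4u);
}

// The buggy ordering: elem_bytes / 4 truncates to 0 for bf16, so the
// whole product collapses to 0.
inline uint32_t d0_size_words_divide_first(uint32_t elems,
                                           uint32_t elem_bytes) {
  return elems * (elem_bytes / 4u);
}

// Assumed "stride minus one" encoding, guarded so that a unit-size
// dimension (whose stride may be 0) cannot underflow to 0xFFFFFFFF.
inline uint32_t encode_stride(uint32_t size, uint32_t stride) {
  return (size > 1u && stride > 0u) ? stride - 1u : 0u;
}

// Assumed "count minus one" encoding for the repeat count, guarded
// against an outer size of 0.
inline uint32_t encode_repeat(uint32_t outer_size) {
  return outer_size > 0u ? outer_size - 1u : 0u;
}

} // namespace dma_sketch
```

When these values are SSA operands, the guards have to be emitted as compare-and-select operations in the generated code, because they can no longer be folded at compile time.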

Adapt to upstream API changes and fix IsolatedFromAbove interaction with MLIR's dialect conversion infrastructure:
- Move convert-vector-to-aievec from resource allocation pipeline to per-core LLVM lowering, preventing vectorization of scalar arith ops (e.g. arith.minsi → aievec.min) inside runtime_sequence
- Walk RuntimeSequenceOps explicitly in DmaToNpu, DMATasksToNPU, LowerSetLock, SubstituteShimDMA, since applyPartialConversion no longer descends into IsolatedFromAbove regions in newer LLVM
- Skip materialize pass in AIETranslateToCppTxn (runtime_sequence is already in final form)
- Add type casts in EmitC yield handler for mixed i32/opaque types
- Disable constant CSE in AIEMaterializeBDChains
- Update Python API: link_with on external_func, TraceShimRouting enum
- Add hasVerifier to RunOp, dyn_arg_plus to AIEInsertTraceFlows
All 6 GEMM sizes verified on NPU Strix Halo after rebase.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix getAsValue to zero-extend narrow values (was truncate-only)
- Fix yieldTargets ArrayRef invalidation by using stack directly
- Fail on unsupported ops in EmitC instead of silently emitting comments
- Fix 0x80000000u token bit cast to int32_t
- Remove dead preSCFModule global variable
- Add IsolatedFromAbove negative test for RuntimeSequenceOp
- Add bf16 d0_stride hardware constraint comment
- Remove dead vectorized variable, name ROWS_PER_BLOCK constant
- Deduplicate trace event list into module-level constant
- Fix npu_time_min initialization to numeric_limits::max
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Stale submodule pointer from before rebase. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AIETargetCppTxn.cpp and AIENpuLowering.cpp link AIEXTransforms, which uses BdIdGenerator from AIETransforms. Without this transitive dependency, static Release builds fail with undefined references to BdIdGenerator::nextBdId etc. in AIEAssignRuntimeSequenceBDIDs.cpp. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

IsolatedFromAbove broke 62 existing tests that reference device-scope values (tiles, locks) from inside runtime_sequence. Instead, protect against constant hoisting by stripping runtime_sequences from LLVM lowering clones (where convert-vector-to-aievec's canonicalizer was the source of the hoisting). The markOpRecursivelyLegal SCF→CF scoping and enableConstantCSE(false) in AIEVectorTransferLowering remain as the primary guards. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


github-actions · 2026-03-26T22:25:58Z

Coverage Report

Created: 2026-05-05 23:18


Filename	Function Coverage	Line Coverage	Region Coverage	Branch Coverage
Conversion/AIEToConfiguration/AIEToConfiguration.cpp	91.30%	67.58%	61.27%	45.19%
Conversion/AIEXToEmitC/AIEXToEmitC.cpp	68.00%	49.65%	46.83%	37.50%
Dialect/AIE/Transforms/AIEAssignCoreLinkFiles.cpp	100.00%	100.00%	100.00%	84.62%
Dialect/AIE/Transforms/AIEInsertTraceFlows.cpp	77.78%	84.32%	81.13%	73.74%
Dialect/AIE/Transforms/AIEVectorTransferLowering.cpp	83.33%	79.17%	72.73%	50.00%
Dialect/AIEX/IR/AIEXDialect.cpp	98.57%	81.27%	81.00%	66.05%
Dialect/AIEX/Transforms/AIEDMATasksToNPU.cpp	95.45%	85.16%	89.06%	80.53%
Dialect/AIEX/Transforms/AIEDmaToNpu.cpp	100.00%	82.72%	76.65%	55.75%
Dialect/AIEX/Transforms/AIELowerSetLock.cpp	100.00%	82.35%	80.00%	50.00%
Dialect/AIEX/Transforms/AIEMaterializeBDChains.cpp	100.00%	84.71%	80.00%	57.14%
Totals	91.90%	77.10%	74.12%	61.17%

Generated by llvm-cov -- llvm version 18.1.3


andrej

Nice to see this starting to take shape. The biggest question is if we want to deprecate the attributes; I'd be in favor of it, although it would mean touching potentially a lot of tests (but the actual code would be smaller).

The GEMM test unfortunately doesn't seem to use the added infrastructure. I think we are already on the same page, but just to make sure, what I'm envisioning looks more like this:

User writes a single runtime sequence in MLIR/Python (pseudocode):

aiex.runtime_sequence @my_sequence(%A: memref, %B: memref, %C: memref, %param_M: int, %param_K: int, %param_N: int) {
  ...
  aie.dma_memcpy_nd(...)
  ...
}

User calls compiler roughly like so:

aie-opt --aie-to-cpp aie.mlir -o my_runtime_sequence.h

which produces something like this using emitC (important -- this is compiler generated from the above MLIR, not manually written like in the GEMM test):

#include <txn_encoding.h>
std::vector<uint32_t> my_sequence(void *A, void *B, void *C, int param_M, int param_K, int param_N) {
    std::vector<uint32_t> txn;
    aie_runtime::txn_append_write32(txn, param_M, ...);
    ...
}

and then can use that generated file in their test.cpp like so:

#include "my_runtime_sequence.h"

int main(){
   // setup XRT
   xrt::kernel my_kernel = // get out of xclbin
   std::vector<uint32_t> insts = my_sequence( my params ... );
   my_kernel(insts, a, b, c);
}

So ideally the GEMM test's test.cpp at the end of this would not look significantly more complicated than the existing ones do.

Again, cool to see this taking shape!

andrej · 2026-03-26T22:38:30Z

+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// (c) Copyright 2025 Advanced Micro Devices, Inc.


andrej · 2026-03-26T22:43:23Z

+        OptionalAttr<I32Attr>:$value,
+        Optional<I32>:$dyn_value
  );
  let results = (outs );
-  let assemblyFormat = [{ `(` $buffer `,` $index `,` $value `)` attr-dict
-  }];
+  let hasCustomAssemblyFormat = 1;
+  let hasVerifier = 1;
  let description = [{
-    rtp write operator
+    rtp write operator.
+    When `dyn_value` is provided, it supplies the RTP value at runtime
+    instead of the static `value` attribute.
+  }];
+  let extraClassDeclaration = [{
+    bool hasDynamicValue() { return getDynValue() != nullptr; }
  }];


I'm a little worried about code bloat with having every parameter for these ops duplicated, once as an attribute and once as an SSA value, along with the added custom verifier and assembly format for each op.

Could we consider removing the attributes altogether and instead use SSA values, with arith.constant for the static case? All existing lowerings can get the value from arith.constant and throw an error if it's not a constant, this emitC pass can use the actual SSA values. This approach would of course touch a lot of code (all examples etc. that use these ops with attributes would have to be rewritten to use arith.constant), but I think AI could handle it. I think it would be cleaner and might remove the need for customAssemblyFormat and hasVerifier for every op (haven't gotten to those yet but assume they're there because of this).

andrej · 2026-03-26T22:44:51Z

+    Optionally, SSA values can be provided for 'dyn_address', 'dyn_value', and
+    'dyn_mask' to enable runtime-parameterized sequences.
+
+    Static syntax (unchanged): `aiex.npu.maskwrite32 {address = 123 : ui32, ...}`


Suggest removing "(unchanged)" from these comments

andrej · 2026-03-26T22:45:33Z

+//===- TxnEncoding.h - Standalone TXN instruction encoding -------*- C++
+//-*-===//


suggest formatting onto a single line

andrej · 2026-03-26T22:50:35Z

+  // Use encoding library for the core format, then fix up col/row field.
+  aie_runtime::txn_append_blockwrite(instructions, *address, payload.data(),
+                                     payload.size());

-  // XAIE_IO_BLOCKWRITE
-  words[0] = XAIE_IO_BLOCKWRITE;
-  words[2] = op.getAddress();
+  // The encoding library leaves word[1] as 0. If col/row are present, set it.
  auto col = op.getColumn();
  auto row = op.getRow();
  if (col && row) {
-    words[1] = (*col & 0xff) | ((*row & 0xff) << 8);
+    // word[1] is at position (current_size - headerSize - count + 1)
+    size_t headerPos = instructions.size() - 4 - payload.size();
+    instructions[headerPos + 1] = (*col & 0xff) | ((*row & 0xff) << 8);
  }


Why doesn't the encoding library aie_runtime::txn_append_blockwrite take row/col?
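
One way to address this question is for the encoder itself to accept an optional tile coordinate, so the caller never has to compute a header position and patch word 1 afterwards. The sketch below illustrates that shape. The function name, the `TileCoord` struct, and the opcode value are assumptions; the word layout (col/row in word 1, address in word 2, total size in bytes in word 3) follows the diff above and the extract_constants helper quoted later in this review.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

namespace blockwrite_sketch {

constexpr uint32_t OPC_BLOCKWRITE = 1; // assumed opcode value
constexpr size_t HEADER_WORDS = 4;

struct TileCoord {
  uint32_t col;
  uint32_t row;
};

// Append [opcode, col|row<<8, address, size_in_bytes, payload...].
// When no tile is given, word 1 stays 0, matching the current behavior
// described in the diff.
inline void append_blockwrite(std::vector<uint32_t> &txn, uint32_t address,
                              const uint32_t *data, size_t count,
                              std::optional<TileCoord> tile = std::nullopt) {
  const uint32_t word1 =
      tile ? ((tile->col & 0xffu) | ((tile->row & 0xffu) << 8)) : 0u;
  const uint32_t size_bytes =
      static_cast<uint32_t>((HEADER_WORDS + count) * sizeof(uint32_t));
  txn.push_back(OPC_BLOCKWRITE);
  txn.push_back(word1);
  txn.push_back(address);
  txn.push_back(size_bytes);
  txn.insert(txn.end(), data, data + count);
}

} // namespace blockwrite_sketch
```

A default argument keeps existing call sites unchanged while removing the index arithmetic on `instructions.size()` from the translator.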

andrej · 2026-03-26T23:30:19Z

I think this file could benefit from some code deduplication and cleanup. If NpuWriteBd op were changed to also accept SSA values, maybe there would be less of a need for separate code paths here. I'd like to avoid having to make every change in two places (dynamic and static path) for future changes to these ops.

andrej · 2026-03-26T23:31:20Z

    GreedyRewriteConfig rewriter_config = GreedyRewriteConfig();
    rewriter_config.setRegionSimplificationLevel(
        GreedySimplifyRegionLevel::Disabled);
+    rewriter_config.enableConstantCSE(false);


Why is this needed?

andrej · 2026-03-26T23:31:58Z

+//===- AIETargetCppTxn.cpp - EmitC-based C++ TXN translation ------*- C++
+//-*-===//


andrej · 2026-03-26T23:36:59Z

+/// Extract design-specific constants from the static instruction stream.
+///
+/// The static instructions always begin with:
+///   [4-word TXN header]
+///   [6-word write32: RTP write 0 — rtp_addr = words[header+2]]
+///   [6-word write32: RTP write 1]
+///   ... then DMA configuration including a maskwrite before S2MM push ...
+///
+/// We scan for the first write32 (opcode 0) to get the RTP address, and
+/// for the first maskwrite32 (opcode 3) to get the S2MM control register
+/// address and its value/mask.
+inline DesignConstants extract_constants(const std::vector<uint32_t> &insts) {
+  DesignConstants c{};
+  constexpr uint32_t HEADER_SIZE = 4;
+
+  bool found_rtp = false, found_s2mm = false;
+  size_t i = HEADER_SIZE;
+  while (i < insts.size() && (!found_rtp || !found_s2mm)) {
+    uint32_t opcode = insts[i];
+
+    if (opcode == aie_runtime::TXN_OPC_WRITE && !found_rtp) {
+      // First write32: RTP address is at word [i+2]
+      c.rtp_addr = insts[i + 2];
+      found_rtp = true;
+      i += 6;
+    } else if (opcode == aie_runtime::TXN_OPC_MASKWRITE && !found_s2mm) {
+      // First maskwrite: S2MM control register
+      c.s2mm_ctrl = insts[i + 2];
+      c.s2mm_ctrl_val = insts[i + 4];
+      c.s2mm_ctrl_mask = insts[i + 5];
+      found_s2mm = true;
+      i += 7;
+    } else {
+      // Skip this op by reading its size field
+      uint32_t op_size_bytes = 0;
+      if (opcode == aie_runtime::TXN_OPC_WRITE)
+        op_size_bytes = insts[i + 5];
+      else if (opcode == aie_runtime::TXN_OPC_MASKWRITE)
+        op_size_bytes = insts[i + 6];
+      else if (opcode == aie_runtime::TXN_OPC_BLOCKWRITE)
+        op_size_bytes = insts[i + 3];
+      else if (opcode == aie_runtime::TXN_OPC_TCT)
+        op_size_bytes = insts[i + 1];
+      else if (opcode == aie_runtime::TXN_OPC_DDR_PATCH)
+        op_size_bytes = insts[i + 1];
+      else
+        break; // unknown opcode
+
+      i += op_size_bytes / sizeof(uint32_t);
+    }
+  }
+
+  if (!found_rtp)
+    throw std::runtime_error("Could not find RTP write in static instructions");
+  if (!found_s2mm)
+    throw std::runtime_error(
+        "Could not find S2MM maskwrite in static instructions");
+
+  return c;
+}


Might it make sense to add a compiler option to export some of these "magic values" at compile time, into say a JSON that could be ingested at runtime? Some of them, like the controller ID, probably should also just get baked into the code generated by emitC.
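
As long as the scan-based approach stays, it may be worth hardening: the quoted helper reads `insts[i + 2]` through `insts[i + 6]` without checking the stream length, and a size field of zero would leave `i` unchanged. Below is a sketch of a bounds-checked variant under stated assumptions. The opcode value for blockwrite is assumed, the TCT and DDR_PATCH opcodes are omitted because their values are not given here, and the function returns `std::nullopt` instead of throwing.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

namespace scan_sketch {

constexpr uint32_t OPC_WRITE = 0;
constexpr uint32_t OPC_BLOCKWRITE = 1; // assumed value
constexpr uint32_t OPC_MASKWRITE = 3;
constexpr size_t HEADER_WORDS = 4;

struct Constants {
  uint32_t rtp_addr;
  uint32_t s2mm_ctrl;
  uint32_t s2mm_ctrl_val;
  uint32_t s2mm_ctrl_mask;
};

// Scan for the first write32 and the first maskwrite32, refusing to read
// past the end of the stream or to advance by fewer words than an op header.
inline std::optional<Constants>
extract_constants_checked(const std::vector<uint32_t> &insts) {
  Constants c{};
  bool found_rtp = false, found_s2mm = false;
  size_t i = HEADER_WORDS;
  while (i < insts.size() && !(found_rtp && found_s2mm)) {
    const uint32_t opcode = insts[i];
    size_t words = 0;
    if (opcode == OPC_WRITE) {
      if (i + 6 > insts.size())
        return std::nullopt; // truncated write32
      if (!found_rtp) {
        c.rtp_addr = insts[i + 2];
        found_rtp = true;
      }
      words = 6;
    } else if (opcode == OPC_MASKWRITE) {
      if (i + 7 > insts.size())
        return std::nullopt; // truncated maskwrite32
      if (!found_s2mm) {
        c.s2mm_ctrl = insts[i + 2];
        c.s2mm_ctrl_val = insts[i + 4];
        c.s2mm_ctrl_mask = insts[i + 5];
        found_s2mm = true;
      }
      words = 7;
    } else if (opcode == OPC_BLOCKWRITE) {
      if (i + HEADER_WORDS > insts.size())
        return std::nullopt; // truncated blockwrite header
      words = insts[i + 3] / sizeof(uint32_t);
      if (words < HEADER_WORDS)
        return std::nullopt; // zero or short size would stall the scan
    } else {
      return std::nullopt; // opcode not handled by this sketch
    }
    if (words > insts.size() - i)
      return std::nullopt; // op claims to extend past the stream
    i += words;
  }
  if (!found_rtp || !found_s2mm)
    return std::nullopt;
  return c;
}

} // namespace scan_sketch
```

Exporting the constants at compile time, as suggested above, would make this kind of parsing unnecessary; the sketch only shows how the existing approach could fail loudly rather than read out of bounds.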

andrej · 2026-03-26T23:39:25Z

This doesn't seem to use the code generated from the emitC, but instead reconstructs the instruction sequence manually using the encoding library.

Remove dynamic_gemm_txn.h and the #ifdef USE_GENERATED_TXN paths from test_dynamic.cpp and Makefile — the auto-generated C++ TXN path is now the only path. Add compare_txn.cpp to tracking, remove AI agent artifacts (AGENTS.md, .codex/), clean build artifacts, and stage all previously unstaged working-copy changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nces

- C1: Add null check for blockwrite data in non-fused EmitC path
- C2: Propagate errors from NPU binary translator (void → LogicalResult)
- C3: Fix syntax error in test_dynamic.cpp option chain
- C4: Add set-once guard to RuntimeScalar.op setter
- C5: Fix BD ID aliasing for unplaced tiles (id(tile) → stable key)
- M2: Guard repeatCount underflow when sizes[3] == 0 in dynamic DMA
- M5: Restrict dynamic operand verifiers to 32-bit integers only
- M9: Fix CMake variable syntax in AIEXToEmitC CMakeLists.txt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- M1: Replace O(n) txn_prepend_header insert with txn_init + in-place overwrite
- M4: Add explicit error diagnostics for unhandled ops in EmitC pass
- M8: Document emit_free() no-op for direct NPU DMA tasks
- M10: Raise TypeError on unsupported sequence argument types
- m1: Remove redundant static keyword in anonymous namespace
- m2: Add comments explaining blockwrite fusion op consumption
- m3: Move raw_string_ostream outside loop in EmitC blockwrite emission
- m4: Make EmitC pass a no-op when no runtime sequences exist
- m5: Document all address_patch word fields in TxnEncoding.h
- m6: Fix arg_plus type to uint32_t for consistency
- m10: Replace std::distance with early-exit loops in DMATasksToNPU
- m14: Move __task_group_index to instance variable
- m15: Use deque for O(1) BD ID allocation
- m19: Fix "prohibitted" typo
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
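
Item M1 replaces a front insertion, which shifts every word already in the vector, with a reserve-then-overwrite pattern. A minimal sketch of that pattern follows. `txn_init` is named in the commit message, while `txn_finalize` and the choice of which header words hold the op count and total size are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

namespace header_sketch {

constexpr size_t HEADER_WORDS = 4; // 4-word TXN header

// Start a stream with zeroed placeholder header words, so ops can be
// appended directly behind them.
inline std::vector<uint32_t> txn_init() {
  return std::vector<uint32_t>(HEADER_WORDS, 0u);
}

// Fill in the header in place once the body is complete. No element is
// moved, so this is O(1) regardless of stream length.
// Assumed layout for this sketch: word 2 = op count, word 3 = size in bytes.
inline void txn_finalize(std::vector<uint32_t> &txn, uint32_t op_count) {
  if (txn.size() < HEADER_WORDS)
    throw std::logic_error("stream was not created with txn_init");
  txn[2] = op_count;
  txn[3] = static_cast<uint32_t>(txn.size() * sizeof(uint32_t));
}

} // namespace header_sketch
```

By contrast, `txn.insert(txn.begin(), ...)` on a finished stream copies the entire body once more, which is the O(n) cost the commit removes.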

- M6: Document RuntimeSequenceOp non-IsolatedFromAbove contract
- m7: Remove redundant module clone in generateCppTxnCode
- m8: Use emitc::FuncOp with inline specifier to prevent ODR violations
- m9: Fix formatString to replace all occurrences of {0}/{1}
- m12: Add tellg() and gcount() error checks in compare_txn.cpp
- m13: Validate M/K/N are positive before uint32_t cast
- m16: Raise RuntimeError on MLIR verification failure
- m17: Document DMATask offset/sizes/strides parameters
- m18: Remove unused _orig_npu_rtp_write saved reference
- m20: Replace std::optional<std::string> with std::string for cachedPeanoDir
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three new tests in test/Conversion/AIEXToEmitC/:
- basic_txn_cpp.mlir: end-to-end static ops through aie-translate
- dynamic_values.mlir: SSA-parameterized ops with uint32_t casts
- unsupported_ops.mlir: negative test for npu.push_queue error
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The dynamic TXN path needs the full aiecc pipeline (placement, objectFIFO lowering) before EmitC conversion. Switch from aie-translate to aiecc and generate both XCLBIN and C++ TXN in one invocation, matching the GEMM Makefile pattern. Also use the dynamic XCLBIN in the run_dynamic target. Verified on NPU: passthrough PASS, GEMM 32-128 all PASS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Replace the fragile module-level ConversionTarget exclusion approach with a clean walk over CoreOp instances. Each core's region gets SCF-to-CF applied individually, so runtime_sequence SCF ops are never visited — no exclusion lists, no markOpRecursivelyLegal needed. Verified on NPU: passthrough PASS, GEMM 32-128 all PASS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


The NpuWriteRTPOp now prints typed operands (0 : ui32, 7 : i32) instead of bare integers. Update CHECK lines to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Runs the auto-generated TXN path (aiecc --aie-generate-txn-cpp) with M=128, K=128, N=128 on NPU2 Strix hardware. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Passthrough dynamic: accept -d npu/npu2 flag, add run_makefile_dynamic.lit
- GEMM dynamic: accept --dev npu, make devicename configurable, add run_makefile_dynamic.lit
- Fix buffer_resolution.py FileCheck for typed rtp_write assembly format
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three coordinated toolchain changes that together let dynamic runtime_sequence Python code use natural arith on SSA i32 args (e.g. tile counts derived from runtime M/K/N) and still lower through aie-translate --aie-generate-txn-cpp:
* AIEX.td: relax aiex.npu.dma_memcpy_nd offsets/sizes/strides from Variadic<I64> to Variadic<I32>. i32 matches the underlying NPU descriptor register width and is what AIEDmaToNpu lowers every dynamic operand to anyway.
* python/dialects/aiex.py: add _cast_to_i32 helper and apply it in NpuDmaMemcpyNd.__init__ so callers can pass index (from scf.for ivar), i64, or any signless int and have it canonically cast at the IR boundary.
* AIEXToEmitC.cpp: pull in arith::populateCeilFloorDivExpandOpsPatterns so SSA `M // m` (which becomes arith.floordivsi) expands to divsi+cmp+select that populateArithToEmitCPatterns can then convert. Adds MLIRArithTransforms link dep.
Also: runtime_lib/CMakeLists.txt installs cxxopts.hpp, test_utils.h, xrt_test_wrapper.h unconditionally so the wheels build (which doesn't enable the test_lib ExternalProject) still produces a working install tree for host test executables.
Static IR path is unaffected: i64 constants still fold to i32 via getAsValue, and the canonicalizer's all-static rebuild path uses empty SSA value lists.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
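
The floor-division expansion mentioned above matters because C++ `/` on signed integers truncates toward zero, whereas Python's `//` rounds toward negative infinity. The sketch below shows the idea of "divide, compare, select" in plain C++. It illustrates the semantics only and is not the exact sequence MLIR emits; the namespace and function name are hypothetical.

```cpp
#include <cstdint>

namespace arith_sketch {

// Floor division built from a truncating divide plus a correction.
// Precondition: b != 0 (and not INT32_MIN / -1, which overflows).
inline int32_t floordiv(int32_t a, int32_t b) {
  const int32_t q = a / b;                      // truncates toward zero
  const bool inexact = (a % b) != 0;            // compare
  const bool signs_differ = (a < 0) != (b < 0); // compare
  return (inexact && signs_differ) ? q - 1 : q; // select
}

} // namespace arith_sketch
```

For the tile counts in this PR both operands are positive, so the correction never fires, but the lowering still has to be expressible in C++ for the generated function to compile.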

Rewrites the dynamic GEMM example to mirror single_core.py section-for-section, keeping the device tile graph, ObjectFIFO plumbing, and core body byte-identical to the static version. The runtime_sequence is the only place that diverges, and the diff there is now small and self-documenting:
* range -> range_ (scf.for over SSA i32 bound)
* if num_tile_rows <= 0: break -> with if_(num_tile_rows > 0, hasElse=False)
* min(...) -> arith.minsi(...) (Python min cannot bool() i1)
* the inner per-tile-row loop is unrolled to range(rows_per_block//2) with an if_ guard, since bd_id must be Python-time integer
* dma_wait(outC) -> npu_sync(...) (dma_wait inside scf.if has terminator-conversion issues today)
* runtime args A, B, C, M, K, N replace fixed shape captures
Makefile: drop stale --dynamic-txn flag (no longer accepted by the script) and pass -m -k -n --dtype_in --dtype_out --dev so the embedded kernel object name matches the one the Makefile actually built.
Smoke-tested on NPU2 / Strix HX 370 with M=K=N in {32,64,128} against a single XCLBIN: all PASS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Two fixes that together make the dynamic-DMA TXN stream structurally identical to (and bit-equivalent with) the static stream when the runtime arg resolves to the same constant.
1. AIEDmaToNpu: drop a duplicate controller_id maskwrite32 in the dynamic path. The inline emission ran unconditionally above the `if (repeatCountDynamic)` switch, but the static-repeat-count `else` branch creates an NpuPushQueueOp which PushQueuetoWrite32Pattern then expands into its own controller_id maskwrite32 + push-queue write32. Move the inline emission into the dynamic-repeat-count branch so the two paths are mutually exclusive.
2. AIEXToEmitC: make the blockwrite + dynamic-write32-overrides + address_patch fusion in cloneBlock robust to interleaved arith ops. The dma-to-npu lowering inserts arith.constant / andi / shli / ori chains between the override write32s to compute their dynamic values, and the previous strict scan bailed on the first non-write32 op. The scan now skips pure helper ops while still treating other AIEX ops and side-effecting ops as fences. After a successful match, pre-clone the intervening source ops (except the consumed override write32s themselves) so the dyn_value SSA refs are mapped into the new IR before emitTxnBlockWriteDynamicWords runs.
Net effect: the dynamic-DMA path now emits a single `txn_append_blockwrite` per BD with array slots patched at runtime (`v60[0] = ...; v60[3] = ...;`), instead of one blockwrite plus N post-hoc `txn_append_write32` calls. Op count and order in the generated TXN stream now match the static path exactly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
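
The "patched array slots" idea from the net-effect paragraph can be sketched in isolation: start from the compile-time BD words, overwrite only the slots whose values depend on runtime arguments, and hand the result to a single blockwrite. The helper below is hypothetical, and the slot indices and payload length used with it are placeholders rather than the real BD layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

namespace fuse_sketch {

// One runtime override: which payload word to replace and with what.
struct Override {
  size_t slot;
  uint32_t value;
};

// Return a copy of the static BD template with the dynamic slots patched.
// `at()` throws std::out_of_range if an override targets a missing slot.
inline std::vector<uint32_t>
patched_payload(std::vector<uint32_t> bd_template,
                const std::vector<Override> &overrides) {
  for (const Override &o : overrides)
    bd_template.at(o.slot) = o.value;
  return bd_template;
}

} // namespace fuse_sketch
```

Emitting the patched payload through one blockwrite keeps the op count equal to the static stream, which is what makes a word-for-word comparison between the two streams meaningful.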

Two related changes:
1. Revert AIE_NpuDmaMemcpyNdOp's offsets/sizes/strides operand type from `Variadic<I32>` back to `Variadic<I64>`. The earlier i32 switch broke the entire static-IR test corpus (test/aiecc, test/npu-xrt, programming_examples) which uses `arith.constant ... : i64` as the source for these operands. AIEDmaToNpu's `getAsValue` already does width coercion in either direction, so the IR-level type is a convention rather than a correctness requirement. Update the Python `npu_dma_memcpy_nd` wrapper to cast SSA i32 values up to i64 at the IR boundary (renamed `_cast_to_i32` -> `_cast_to_i64`). A TODO is left in the .td file documenting the intended future tightening to i32 once the existing test corpus has been migrated.
2. Move the dynamic-vs-static TXN equivalence check out of programming_examples/basic/passthrough_kernel/ and into a proper compiler test under test/aiecc/. Programming examples should demonstrate runnable designs, not host compiler-correctness checks.
- Delete `compare_txn.cpp` and the `compare_dynamic_static_txn` / `build/generated_txn.h` Makefile targets that only existed to feed it; remove the matching RUN lines from run_makefile_dynamic.lit and run_strix_makefile_dynamic.lit. The `run_dynamic` build/run path stays intact.
- Add `test/aiecc/cpp_static_vs_dynamic_txn.mlir` (driver + static N=4096 MLIR), `test/aiecc/Inputs/static_vs_dynamic_txn/passthrough_dynamic.mlir` (dynamic mirror with `%n : i32`), and three small C++ harness files (`compare_main.cpp`, `gen_static.cpp`, `gen_dynamic.cpp`). The two generated headers each define `generate_txn_sequence`, so they're wrapped in separate translation units before being linked into a single comparator that asserts the two TXN word streams are bit-identical.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
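
The comparator described in item 2 boils down to a word-by-word equality check that reports where two streams first diverge. Here is a minimal single-file sketch of that check. The two `generate_txn_sequence` stubs are stand-ins placed in hypothetical namespaces (the real test isolates the generated headers in separate translation units instead), and their contents are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Stand-in for the header generated from the static N=4096 design.
namespace gen_static {
inline std::vector<uint32_t> generate_txn_sequence() {
  return {0u, 0u, 0x1000u, 0u, 4096u, 24u};
}
} // namespace gen_static

// Stand-in for the header generated from the dynamic design (`%n : i32`).
namespace gen_dynamic {
inline std::vector<uint32_t> generate_txn_sequence(uint32_t n) {
  return {0u, 0u, 0x1000u, 0u, n, 24u};
}
} // namespace gen_dynamic

namespace compare_sketch {

// Index of the first differing word, or std::nullopt when the streams are
// bit-identical. A length mismatch counts as a difference at the shorter
// stream's end.
inline std::optional<size_t> first_mismatch(const std::vector<uint32_t> &a,
                                            const std::vector<uint32_t> &b) {
  const size_t common = std::min(a.size(), b.size());
  for (size_t i = 0; i < common; ++i)
    if (a[i] != b[i])
      return i;
  if (a.size() != b.size())
    return common;
  return std::nullopt;
}

} // namespace compare_sketch
```

Returning the mismatch index rather than a plain boolean makes a failing test point directly at the offending instruction word.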

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jgmelber and others added 15 commits March 26, 2026 08:11

clang-format

1bb1992

clang-format and black formatting

6c28cc9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reset cmake/modulesXilinx submodule to match main

0066a60

Stale submodule pointer from before rebase. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jgmelber force-pushed the dynamic-runtime-sequences branch from 2c839d4 to 93acc43 Compare March 26, 2026 19:36

jgmelber force-pushed the dynamic-runtime-sequences branch from 47bb7b4 to c97ce07 Compare March 26, 2026 22:06

Fix Python formatting for CI (black)

5286ec4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jgmelber force-pushed the dynamic-runtime-sequences branch from 3e32ea1 to 5286ec4 Compare March 26, 2026 22:24

Reset cmake/modulesXilinx submodule to match main

934d90b

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jgmelber force-pushed the dynamic-runtime-sequences branch from 53af3d9 to 934d90b Compare March 26, 2026 22:29

Format Python files to match CI black version

84a7704

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

andrej requested changes Mar 26, 2026

View reviewed changes

jgmelber and others added 5 commits May 5, 2026 15:04

Merge remote-tracking branch 'origin/main' into dynamic-runtime-sequences

93c6450

jgmelber and others added 8 commits May 5, 2026 15:41

Format C++ and Python files for CI (clang-format, black)

2943c94

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix CI format check: clang-format and black

240b7b3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix remaining clang-format issues

88c66ee

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix Python formatting for CI (black 26.3.1)

bdf4d07

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reformat test files with black 26.3.1 to match CI

1c58970

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jgmelber changed the title ~~Runtime TXN generation library and compiler support~~ Runtime TXN generation May 5, 2026

jgmelber and others added 9 commits May 5, 2026 16:40

Fix buffer_resolution.py FileCheck: update rtp_write assembly format

f6e8a06

The NpuWriteRTPOp now prints typed operands (0 : ui32, 7 : i32) instead of bare integers. Update CHECK lines to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add lit test for dynamic GEMM TXN generation

b8cafe9

Runs the auto-generated TXN path (aiecc --aie-generate-txn-cpp) with M=128, K=128, N=128 on NPU2 Strix hardware. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[WIP] testing and using blockwrites

e0d55eb

Apply clang-format to prior changes

758ab3d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

hunhoffe added this to the IRON 1.4 milestone May 7, 2026

Runtime TXN generation #3002

jgmelber wants to merge 41 commits into main from dynamic-runtime-sequences

4 participants

		//===- TxnEncoding.h - Standalone TXN instruction encoding -------*- C++
		//-*-===//

		//===- AIETargetCppTxn.cpp - EmitC-based C++ TXN translation ------*- C++
		//-*-===//

jgmelber commented Mar 26, 2026 (edited)

Summary

Key changes

Examples

Test coverage

Test plan

github-actions Bot commented Mar 26, 2026 (edited)

Coverage Report

Created: 2026-05-05 23:18

Generated by llvm-cov -- llvm version 18.1.3
