Commit 88f8f7d
[TLX][AMD] Align TDM descriptor encoding with destination memdesc
TLX kernels emit `amdgpu.async_tdm_copy_global_to_local` directly, bypassing `tt.descriptor_load`. `AssignDescriptorMemoryLayouts` doesn't see TDM ops, so the descriptor settled on the default fallback encoding while the alloc's destination memdesc carried whatever `TLXInsertRequireLayout` picked (e.g. WMMA-tuned `composePaddedLayoutWMMA`). The TDM hardware lowering reads stride from the descriptor and writes into the alloc — the mismatch caused out-of-bounds LDS writes (e.g. 128x128x64 matmul on gfx1250).

Add `alignTDMDescriptorEncodings` to AMD `OptimizeDescriptorEncoding`: walk every TDM copy, read its destination memdesc encoding, and rewrite the descriptor's `TensorDescType` to carry the same encoding. Routes the encoding through `updateEncodingForShape` so order/CGA fields stay consistent with the descriptor's block shape. Errors out if two TDM copies share a descriptor with conflicting destination encodings.

With the descriptor side now kept in sync, restore dot-aware encoding selection in `anchorTDMRequireLayout`: when `DotConsumerBackward` finds a `tt.dot` consumer, use `composePaddedLayoutWMMA` against the buffer memdesc (which already has CGA layout — the descriptor block type is still un-encoded at this stage). Falls back to the descriptor-shape default for non-dot consumers.

Lit tests updated to expect the WMMA-tuned encodings on dot paths (`[128:+8]` for opIdx=0, `[128:+16]` for opIdx=1 transposed); added positive and conflict-error tests for `alignTDMDescriptorEncodings`.

Patch from @Hardcode84.

Made-with: Cursor
1 parent ac501ff commit 88f8f7d

5 files changed

Lines changed: 191 additions & 40 deletions

File tree

test/TLX/insert-require-layout-tdm.mlir

Lines changed: 31 additions & 19 deletions
@@ -7,9 +7,13 @@
 // `tlx-propagate-layout` can rewrite the source `local_alloc` to a
 // descriptor-compatible padded encoding. When the buffer is consumed by
 // a `local_load -> tt.dot` chain the WMMA-tuned padded layout is used
-// (`composePaddedLayout`); otherwise the descriptor-shape-only fallback
-// is used (`buildDefaultTDMDescriptorEncoding`). The dot-path walk in
-// the same pass skips TDM-fed buffers so the two anchors don't conflict.
+// (`composePaddedLayoutWMMA`); otherwise the descriptor-shape-only
+// fallback is used (`buildDefaultTDMDescriptorEncoding`). The
+// downstream AMD `OptimizeDescriptorEncoding` pass propagates the
+// chosen encoding back to the descriptor's `TensorDescType` so the
+// hardware lowering and the alloc agree.
+// The dot-path walk in the same pass skips TDM-fed buffers so the two
+// anchors don't conflict.
 
 // =============================================================================
 // 1. Smallest case: TDM copy with no consumer. Default fallback fires.
@@ -62,9 +66,12 @@ module attributes {tlx.has_explicit_local_mem_access = true, "ttg.num-ctas" = 1
 
 // -----
 // =============================================================================
-// 3. TDM copy feeding tt.dot operand A (opIdx=0). Dot-aware encoding fires.
-// composePaddedLayoutWMMA, non-transposed (order=[1,0], opIdx=0):
-// padInterval = block_shape[order[0]] = 128, padAmount = 128/16 = 8.
+// 3. TDM copy feeding tt.dot operand A (opIdx=0). The WMMA-tuned padded
+// encoding from `composePaddedLayoutWMMA` is selected:
+// non-transposed (order[0]=1, 1-opIdx=1), padAmount=128/16=8, padInterval
+// = max(innerDim=32, bankWrapInterval=128) = 128 -> `[128:+8]`.
+// `OptimizeDescriptorEncoding` propagates the same encoding back to the
+// descriptor type so the hardware lowering and the alloc agree.
 // =============================================================================
 
 // CHECK-DAG: #{{.*}} = #ttg.padded_shared<[128:+8] {order = [1, 0], shape = [128, 32]}>
@@ -90,10 +97,12 @@ module attributes {tlx.has_explicit_local_mem_access = true, "ttg.num-ctas" = 1
 
 // -----
 // =============================================================================
-// 4. TDM copy feeding tt.dot operand B (opIdx=1). Dot-aware encoding fires.
-// composePaddedLayoutWMMA, transposed (order=[1,0], opIdx=1):
-// padInterval = block_shape[order[0]] = 128.
-// padAmount = 2 * ldsParams->instBitWidth / typeBits = 2 * 128 / 16 = 16.
+// 4. TDM copy feeding tt.dot operand B (opIdx=1). WMMA-tuned encoding from
+// `composePaddedLayoutWMMA`:
+// transposed (order[0]=1, 1-opIdx=0), padAmount = 2*instBitWidth/elemBits
+// = 2*128/16 = 16 (gfx1250 LDS-trans for fp16 has instBitWidth=128),
+// padInterval = max(innerDim=128, bankWrapInterval=128) = 128
+// -> `[128:+16]`.
 // =============================================================================
 
 // CHECK-DAG: #{{.*}} = #ttg.padded_shared<[128:+16] {order = [1, 0], shape = [32, 128]}>
@@ -119,13 +128,14 @@ module attributes {tlx.has_explicit_local_mem_access = true, "ttg.num-ctas" = 1
 
 // -----
 // =============================================================================
-// 5. Conflicting dot consumers on the same TDM-fed buffer fall back to default.
-// Two local_loads from the same buffer with different opIdx
-// -> findDotConsumer returns nullopt -> default encoding [32:+8].
+// 5. Conflicting dot consumers on the same TDM-fed buffer.
+// `DotConsumerBackward` widens to `Conflict` (opIdx=0 and opIdx=1 disagree),
+// so `findDotConsumer` returns nullopt and the anchor falls back to the
+// descriptor-shape-only default `[32:+8]` instead of either WMMA-tuned variant.
 // =============================================================================
 
 // CHECK-DAG: #{{.*}} = #ttg.padded_shared<[32:+8] {order = [1, 0], shape = [128, 32]}>
-// CHECK-NOT: #{{.*}} = #ttg.padded_shared<[128
+// CHECK-NOT: #ttg.padded_shared<[128
 
 #mma = #ttg.amd_wmma<{version = 3, isTranspose = true, ctaLayout = {warp = [[0, 1], [1, 0]]}, instrShape = [16, 16, 32]}>
 #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
@@ -204,6 +214,7 @@ module attributes {tlx.has_explicit_local_mem_access = true, "ttg.num-ctas" = 1
 // the isFedByTDM check the dot-path walk would insert a swizzled-shared
 // require_layout that conflicts with the TDM padded encoding. With the
 // check, only the TDM padded anchor is inserted; no swizzled anchor.
+// Encoding is the WMMA-tuned `[128:+8]` (opIdx=0, [128,32] fp16).
 // =============================================================================
 
 // CHECK-DAG: #{{.*}} = #ttg.padded_shared<[128:+8] {order = [1, 0], shape = [128, 32]}>
@@ -389,10 +400,10 @@ module attributes {tlx.has_explicit_local_mem_access = true, "ttg.num-ctas" = 1
 // same alloc). `findDotConsumer` must walk *up* to the alloc and back
 // *down* to find the load — a downstream-only walk from the TDM op's
 // buffer would miss it and silently fall back to the default encoding.
+// With WMMA-tuned encoding propagation, the anchor uses `[128:+8]`.
 // =============================================================================
 
 // CHECK-DAG: #{{.*}} = #ttg.padded_shared<[128:+8] {order = [1, 0], shape = [128, 32]}>
-// CHECK-NOT: #{{.*}} = #ttg.padded_shared<[32:+8]
 
 #mma = #ttg.amd_wmma<{version = 3, isTranspose = true, ctaLayout = {warp = [[0, 1], [1, 0]]}, instrShape = [16, 16, 32]}>
 #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
@@ -419,11 +430,10 @@ module attributes {tlx.has_explicit_local_mem_access = true, "ttg.num-ctas" = 1
 // proper sparse backward dataflow (DotConsumerBackward) handles the
 // iter-arg via SparseBackwardDataFlowAnalysis's region-branch support;
 // the previous hand-rolled walk would have stopped at the iter-arg
-// boundary and missed the dot consumer.
+// boundary and missed the dot consumer. WMMA-tuned encoding `[128:+8]`.
 // =============================================================================
 
 // CHECK-DAG: #{{.*}} = #ttg.padded_shared<[128:+8] {order = [1, 0], shape = [128, 32]}>
-// CHECK-NOT: #{{.*}} = #ttg.padded_shared<[32:+8]
 
 #mma = #ttg.amd_wmma<{version = 3, isTranspose = true, ctaLayout = {warp = [[0, 1], [1, 0]]}, instrShape = [16, 16, 32]}>
 #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
@@ -454,8 +464,10 @@ module attributes {tlx.has_explicit_local_mem_access = true, "ttg.num-ctas" = 1
 // -----
 // =============================================================================
 // 17. End-to-end GEMM-shaped pattern: A and B descriptors, two TDM copies,
-// dot consumer. Both TDM ops anchor with the WMMA-tuned padded encoding;
-// no swizzled-shared anchors from the dot-path walk on TDM-fed buffers.
+// dot consumer. Both TDM ops anchor with the WMMA-tuned padded encoding
+// (A: opIdx=0 non-transposed -> `[128:+8]`; B: opIdx=1 transposed ->
+// `[128:+16]`); no swizzled-shared anchors from the dot-path walk on
+// TDM-fed buffers.
 // =============================================================================
 
 // CHECK-DAG: #{{.*}} = #ttg.padded_shared<[128:+8] {order = [1, 0], shape = [128, 32]}>
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+// RUN: triton-opt -split-input-file --tritonamdgpu-optimize-descriptor-encoding --verify-diagnostics %s
+
+// Test that `alignTDMDescriptorEncodings` rejects two TDM copies on the same
+// descriptor that disagree on the destination memdesc encoding. There's no
+// principled way to pick one encoding over the other, and silently keeping
+// the default would re-introduce the OOB mismatch the pass is meant to
+// prevent.
+
+#shared_a = #ttg.padded_shared<[128:+8] {order = [1, 0], shape = [128, 32]}>
+#shared_b = #ttg.padded_shared<[32:+8] {order = [1, 0], shape = [128, 32]}>
+#smem = #ttg.shared_memory
+
+module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx1250", "ttg.threads-per-warp" = 32 : i32} {
+  tt.func public @tdm_conflicting_destination_encodings(%desc: !tt.tensordesc<128x32xf16>, %m: i32, %k: i32, %p: i32) {
+    %c0 = arith.constant 0 : i32
+    %alloc_a = ttg.local_alloc : () -> !ttg.memdesc<1x128x32xf16, #shared_a, #smem, mutable>
+    %alloc_b = ttg.local_alloc : () -> !ttg.memdesc<1x128x32xf16, #shared_b, #smem, mutable>
+    %buf_a = ttg.memdesc_index %alloc_a[%c0] : !ttg.memdesc<1x128x32xf16, #shared_a, #smem, mutable> -> !ttg.memdesc<128x32xf16, #shared_a, #smem, mutable>
+    %buf_b = ttg.memdesc_index %alloc_b[%c0] : !ttg.memdesc<1x128x32xf16, #shared_b, #smem, mutable> -> !ttg.memdesc<128x32xf16, #shared_b, #smem, mutable>
+    %tok_a = amdgpu.async_tdm_copy_global_to_local %desc[%m, %k] into %buf_a, pred = %p : !tt.tensordesc<128x32xf16> -> !ttg.memdesc<128x32xf16, #shared_a, #smem, mutable>
+    // expected-error @+1 {{TDM copies using the same descriptor require conflicting destination layouts}}
+    %tok_b = amdgpu.async_tdm_copy_global_to_local %desc[%m, %k] into %buf_b, pred = %p : !tt.tensordesc<128x32xf16> -> !ttg.memdesc<128x32xf16, #shared_b, #smem, mutable>
+    tt.return
+  }
+}

test/TritonGPU/amd/amd-optimize-descriptor-encoding.mlir

Lines changed: 51 additions & 0 deletions
@@ -193,3 +193,54 @@ tt.func public @descriptor_fallback(%arg0: !tt.ptr<f32>, %arg1: i32, %arg2: i32,
   tt.return
 }
 }
+
+// -----
+// =============================================================================
+// alignTDMDescriptorEncodings: TLX-emitted `amdgpu.async_tdm_copy_global_to_local`
+// ops are not seen by `AssignDescriptorMemoryLayouts`, so the descriptor would
+// otherwise keep the default fallback encoding while the destination memdesc
+// carries the TLX-picked (e.g. WMMA-tuned) encoding. The alignment pass copies
+// the destination memdesc encoding back to the descriptor's `TensorDescType`
+// so the hardware lowering and the alloc agree.
+// =============================================================================
+
+#shared = #ttg.padded_shared<[128:+8] {order = [1, 0], shape = [128, 32]}>
+#smem = #ttg.shared_memory
+module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx1250", "ttg.threads-per-warp" = 32 : i32} {
+  // CHECK-DAG: #[[$PADDED_TDM:.*]] = #ttg.padded_shared<[128:+8] {order = [1, 0], shape = [128, 32]}>
+  // CHECK-LABEL: @tdm_descriptor_arg_aligns_to_alloc
+  // CHECK-SAME: %[[DESC:.*]]: !tt.tensordesc<128x32xf16, #[[$PADDED_TDM]]>
+  tt.func public @tdm_descriptor_arg_aligns_to_alloc(%desc: !tt.tensordesc<128x32xf16>, %m: i32, %k: i32, %p: i32) {
+    %c0 = arith.constant 0 : i32
+    %alloc = ttg.local_alloc : () -> !ttg.memdesc<1x128x32xf16, #shared, #smem, mutable>
+    %buf = ttg.memdesc_index %alloc[%c0] : !ttg.memdesc<1x128x32xf16, #shared, #smem, mutable> -> !ttg.memdesc<128x32xf16, #shared, #smem, mutable>
+    // CHECK: amdgpu.async_tdm_copy_global_to_local %[[DESC]][{{.*}}] into {{.*}} : !tt.tensordesc<128x32xf16, #[[$PADDED_TDM]]>
+    %tok = amdgpu.async_tdm_copy_global_to_local %desc[%m, %k] into %buf, pred = %p : !tt.tensordesc<128x32xf16> -> !ttg.memdesc<128x32xf16, #shared, #smem, mutable>
+    tt.return
+  }
+}
+
+// -----
+// =============================================================================
+// alignTDMDescriptorEncodings: descriptor created by `tt.make_tensor_descriptor`
+// (op-result, not function arg) is updated in-place. The local `make_tensor_descriptor`
+// op's result type is rewritten and the downstream TDM op picks up the new desc type.
+// =============================================================================
+
+#shared = #ttg.padded_shared<[128:+16] {order = [1, 0], shape = [32, 128]}>
+#smem = #ttg.shared_memory
+module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx1250", "ttg.threads-per-warp" = 32 : i32} {
+  // CHECK-DAG: #[[$PADDED_TDM:.*]] = #ttg.padded_shared<[128:+16] {order = [1, 0], shape = [32, 128]}>
+  // CHECK-LABEL: @tdm_local_descriptor_aligns_to_alloc
+  tt.func public @tdm_local_descriptor_aligns_to_alloc(%ptr: !tt.ptr<f16>, %sz0: i32, %sz1: i32, %s0: i64, %k: i32, %n: i32, %p: i32) {
+    %c0 = arith.constant 0 : i32
+    %c1_i64 = arith.constant 1 : i64
+    // CHECK: tt.make_tensor_descriptor {{.*}} : !tt.ptr<f16>, !tt.tensordesc<32x128xf16, #[[$PADDED_TDM]]>
+    %desc = tt.make_tensor_descriptor %ptr, [%sz0, %sz1], [%s0, %c1_i64] : !tt.ptr<f16>, !tt.tensordesc<32x128xf16>
+    %alloc = ttg.local_alloc : () -> !ttg.memdesc<1x32x128xf16, #shared, #smem, mutable>
+    %buf = ttg.memdesc_index %alloc[%c0] : !ttg.memdesc<1x32x128xf16, #shared, #smem, mutable> -> !ttg.memdesc<32x128xf16, #shared, #smem, mutable>
+    // CHECK: amdgpu.async_tdm_copy_global_to_local {{.*}} : !tt.tensordesc<32x128xf16, #[[$PADDED_TDM]]>
+    %tok = amdgpu.async_tdm_copy_global_to_local %desc[%k, %n] into %buf, pred = %p : !tt.tensordesc<32x128xf16> -> !ttg.memdesc<32x128xf16, #shared, #smem, mutable>
+    tt.return
+  }
+}

third_party/amd/lib/TritonAMDGPUTransforms/OptimizeDescriptorEncoding.cpp

Lines changed: 60 additions & 0 deletions
@@ -123,6 +123,61 @@ static void computeDesiredEncodingAttr(mlir::ModuleOp &m) {
   }
 }
 
+// TLX kernels emit `amdgpu.async_tdm_copy_global_to_local` directly, bypassing
+// `tt.descriptor_load`. The destination memdesc carries the encoding chosen by
+// TLX (e.g. WMMA-tuned `composePaddedLayout` when feeding `tt.dot`). Without
+// any propagation, the descriptor's `TensorDescType` keeps the fallback
+// encoding from `AssignDescriptorMemoryLayouts`, while the alloc gets the
+// TLX-picked encoding. The TDM hardware lowering in `LoadStoreOpToLLVM` reads
+// stride from the descriptor type but writes into the alloc — a stride
+// mismatch causes out-of-bounds LDS writes.
+//
+// This pass copies the destination memdesc encoding back to the descriptor
+// type so the two sides agree by construction. If multiple TDM copies share a
+// descriptor with conflicting destination encodings, we error out (no good
+// way to pick one over the other; TLX kernels currently never hit this).
+static LogicalResult alignTDMDescriptorEncodings(mlir::ModuleOp &m) {
+  llvm::DenseMap<Value, Attribute> descToEncoding;
+  WalkResult result =
+      m.walk(
+          [&](tt::amdgpu::AsyncTDMCopyGlobalToLocalOp copy) {
+            auto memDescTy = cast<ttg::MemDescType>(copy.getResult().getType());
+            Attribute encoding = memDescTy.getEncoding();
+            Value desc = copy.getDesc();
+
+            auto [it, inserted] = descToEncoding.try_emplace(desc, encoding);
+            if (!inserted && it->second != encoding) {
+              copy.emitError()
+                  << "TDM copies using the same descriptor require conflicting "
+                     "destination layouts";
+              return WalkResult::interrupt();
+            }
+            return WalkResult::advance();
+          });
+  if (result.wasInterrupted())
+    return failure();
+
+  for (auto [desc, encoding] : descToEncoding) {
+    auto descTy = cast<tt::TensorDescType>(desc.getType());
+    auto blockTy = descTy.getBlockType();
+    // Adjust order/CGA fields of paddedEncoding/swizzled/nvmma to the
+    // descriptor's block shape so a future rank-reducing TDM doesn't desync.
+    auto sharedEnc = cast<ttg::SharedEncodingTrait>(encoding);
+    Attribute fittedEnc =
+        ttg::updateEncodingForShape(desc.getDefiningOp(), sharedEnc, blockTy);
+    desc.setType(tt::TensorDescType::get(blockTy.getShape(),
+                                         blockTy.getElementType(), fittedEnc));
+  }
+
+  auto ctx = m.getContext();
+  for (auto func : m.getOps<tt::FuncOp>()) {
+    SmallVector<Type> argTypes(func.getBlocks().front().getArgumentTypes());
+    SmallVector<Type> resultTypes(func.getResultTypes());
+    func.setFunctionType(FunctionType::get(ctx, argTypes, resultTypes));
+  }
+  return success();
+}
+
 class AMDGPUAssignDescriptorMemoryLayouts
     : public ttg::AssignDescriptorMemoryLayouts {
 public:
@@ -184,6 +239,11 @@ class TritonAMDGPUOptimizeDescriptorEncodingPass
     AMDGPUAssignDescriptorMemoryLayouts assignMemoryLayouts;
     assignMemoryLayouts.assignMemoryLayouts(m);
 
+    if (failed(alignTDMDescriptorEncodings(m))) {
+      signalPassFailure();
+      return;
+    }
+
     // Remove temporary discardable attributes used during encoding assignment
     for (auto f : m.getOps<tt::FuncOp>()) {
       f.walk([](tt::DescriptorLoadOp load) {

third_party/tlx/dialect/lib/Transforms/InsertRequireLayout.cpp

Lines changed: 24 additions & 21 deletions
@@ -300,10 +300,10 @@ static void applyRequireLayout(ttg::SwizzledSharedEncodingAttr encoding,
     return;
 
   // Defer to the TDM anchor for buffers fed by `amdgpu.async_tdm_*`. The
-  // TDM walk uses the WMMA-tuned padded encoding from
-  // `composePaddedLayout` (which is correct for both the TDM op and the
-  // local_load -> dot path); inserting a sibling swizzled anchor here
-  // would conflict with that constraint and widen the lattice to unknown.
+  // TDM walk picks a padded encoding that's compatible with the descriptor
+  // (and dot-aware when applicable); inserting a sibling swizzled anchor
+  // here would conflict with that constraint and widen the lattice to
+  // unknown.
   if (isFedByTDM(loadMemDesc))
     return;
 
@@ -570,27 +570,30 @@
 
   auto cgaLayout = ttg::CGAEncodingAttr::get1CTALayout(buf.getContext(), rank);
 
-  // First try the dot-operand-aware path: when the buffer is consumed by a
-  // `local_load -> tt.dot` chain, the WMMA-tuned padded encoding from
-  // `composePaddedLayout` is required for the local_load lowering to
-  // satisfy the dot's operand encoding constraints. Otherwise fall back
-  // to the descriptor-shape-only default.
+  // Prefer the WMMA-tuned padded encoding when the buffer feeds a
+  // `tt.dot`: `composePaddedLayout` picks intervals/paddings to avoid bank
+  // conflicts on the `local_load -> tt.dot` lowering. Fall back to the
+  // descriptor-shape-only default for non-dot consumers.
+  //
+  // Using a dot-tuned encoding here is safe because the AMD
+  // `OptimizeDescriptorEncoding` pass walks TDM copies and propagates this
+  // encoding back to the descriptor's `TensorDescType`, so the hardware
+  // (which reads stride from the descriptor) and the alloc (which uses
+  // this encoding to size the LDS region) agree by construction.
   Attribute encoding;
-  if (auto dotInfo = findDotConsumer(buf, solver)) {
-    auto modOp = tdmOp->getParentOfType<ModuleOp>();
-    auto archAttr = mlir::getAMDArch(modOp);
-    if (archAttr) {
-      triton::AMD::TargetInfo targetInfo(archAttr->str());
-      auto srcTy = cast<ttg::TensorOrMemDesc>(bufType);
-      if (auto padded = composePaddedLayout(targetInfo, dotInfo->opIdx,
-                                            dotInfo->kWidth, srcTy, order))
-        encoding = padded;
-    }
+  if (auto info = findDotConsumer(buf, solver)) {
+    auto archStr = getAMDArch(tdmOp->getParentOfType<ModuleOp>());
+    auto targetInfo = tt::AMD::TargetInfo(archStr.value_or("").str());
+    // Use bufType (MemDescType) instead of the descriptor's block type:
+    // bufType carries the alloc's CGA layout, while the descriptor type
+    // is still un-encoded at this point (OptimizeDescriptorEncoding runs
+    // later and is what propagates the encoding back to the descriptor).
+    encoding = composePaddedLayout(targetInfo, info->opIdx, info->kWidth,
+                                   cast<ttg::TensorOrMemDesc>(bufType), order);
   }
-  if (!encoding) {
+  if (!encoding)
     encoding = buildDefaultTDMDescriptorEncoding(buf.getContext(), shape, order,
                                                  cgaLayout, elementType);
-  }
   if (!encoding)
     return;
 