Enable prefetch iteration #382
base: sycl-develop
Conversation
CopyOp::PREFETCH::copy(base_addr + l * traits.stride_l * dtype_size,
                       (traits.width * dtype_size_bits) / sizeof_bits_v<int8_t>, traits.height,
                       (traits.pitch * dtype_size_bits) / sizeof_bits_v<int8_t>,
                       intel::coord_t{(int)(x * dtype_size_bits / inst_size_bits), y});
Will dtype_size_bits / inst_size_bits always be 1 by construction here? Will this work for U4?
Some prefetches use instructions with a different size than the dtype, so it will not always be 1.
I put this in:
static_assert(dtype_size_bits / inst_size_bits == 1, "Non-1 case");
and ran everything I could think of (cmake test_examples, cmake test_unit, ninja copy_debug), but I never hit it. Maybe we just don't currently test a code path that uses it this way, but in that case we should probably have it tested at least once.
I agree, but that should be a separate PR.
So, this test is doing mixed precision (uint4, f32). For the narrow type it uses XE_2D_U4x32x64_LD_N, and I confirmed that it uses cute::XE_2D_U8x8x32_LD_N for the narrow prefetch. Shouldn't this be a non-1 case? But it still doesn't hit my static_assert above.
No, it should not for prefetch, but it should for load. This is for the cases where the difference in type is between the cutlass atom XE_2D_U4x32x64_LD_N and the underlying instruction __builtin_IB_subgroup_block_read_flat_u8_m32k32v1.
As far as I can tell, for prefetch, there are no such cases. And the current prefetch_selector implementation will never return a prefetch instruction with a sub-byte type. Am I missing something?
I am concerned that all the changes in this function are untested and currently untestable. What is the purpose of introducing this just now if it's not used?
I am not introducing anything new here - this code already existed. I am just replacing sizeof_bits_v<dtype> with dtype_size_bits, as we do not have dtype available now.
Enables iteration of the prefetch atom to cover the prefetch tile. In other words, it relaxes the requirement that the prefetch tile size match the prefetch atom size.
This is done by using the same path for prefetch that the NVIDIA code uses: going through the copy implementation.
This PR also: