Skip to content

Enable prefetch iteration #382

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: sycl-develop
Choose a base branch
from

Conversation

t4c1
Copy link
Collaborator

@t4c1 t4c1 commented May 19, 2025

Enables iteration of prefetch atom to cover prefetch tile. In other words relaxes the requirement for the prefetch tile size to match prefetch atom size.

This is done by using the same path for prefetch that nvidia code uses - going through copy implementation.

This PR also:

  • includes a lot of bugfixes for prefetch atom layouts
  • adds some missing copy traits
  • removes some duplicated prefetch implementations (_V atoms duplicating what is already in _N)

CopyOp::PREFETCH::copy(base_addr + l * traits.stride_l * dtype_size,
(traits.width * dtype_size_bits) / sizeof_bits_v<int8_t>, traits.height,
(traits.pitch * dtype_size_bits) / sizeof_bits_v<int8_t>,
intel::coord_t{(int)(x * dtype_size_bits / inst_size_bits), y});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will dtype_size_bits / inst_size_bits always be 1 by construction here? Will this work for U4?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some prefetches use instructions with different size than dtype, so it will not always be 1.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this in:

    static_assert(dtype_size_bits / inst_size_bits == 1, "Non-1 case");

and ran everything I could think of cmake test_examples cmake test_unit ninja copy_debug but I never hit it. Maybe we just don't currently test a code path that uses it this way, but in that case we probably should have it tested at least once.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but that should be a separate PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this test is doing mixed precision (uint4, f32). For the narrow type it uses XE_2D_U4x32x64_LD_N and I confirmed that it uses cute::XE_2D_U8x8x32_LD_N for narrow prefetch.

Shouldn't this be a non-1 case? But it still doesn't hit my static_assert above.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it should not for prefetch, but it should for load. This is for the cases where the difference in type is between the cutlass atom XE_2D_U4x32x64_LD_N and the underlying instruction __builtin_IB_subgroup_block_read_flat_u8_m32k32v1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, for prefetch, there are no such cases. And the current prefetch_selector implementation will never return a prefetch instruction with a sub-byte type. Am I missing something?

I am concerned that all the changes in this function are untested and currently untestable. What is the purpose of introducing this just now if it's not used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not introducing anything new here - this code already existed. I am just replacing sizeof_bits_v<dtype> with dtype_size_bits as we do not have dtype available now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants