Enable prefetch iteration #382
base: sycl-develop
Conversation
CopyOp::PREFETCH::copy(base_addr + l * traits.stride_l * dtype_size,
                       (traits.width * dtype_size_bits) / sizeof_bits_v<int8_t>, traits.height,
                       (traits.pitch * dtype_size_bits) / sizeof_bits_v<int8_t>,
                       intel::coord_t{(int)(x * dtype_size_bits / inst_size_bits), y});
Will dtype_size_bits / inst_size_bits always be 1 by construction here? Will this work for U4?
Some prefetches use instructions with a different size than the dtype, so it will not always be 1.
I put this in:
static_assert(dtype_size_bits / inst_size_bits == 1, "Non-1 case");
and ran everything I could think of (cmake test_examples, cmake test_unit, ninja copy_debug), but I never hit it. Maybe we just don't currently test a code path that uses it this way, but in that case we should probably have it tested at least once.
I agree, but that should be a separate PR.
So, this test is doing mixed precision (uint4, f32). For the narrow type it uses XE_2D_U4x32x64_LD_N, and I confirmed that it uses cute::XE_2D_U8x8x32_LD_N for the narrow prefetch. Shouldn't this be a non-1 case? But it still doesn't hit my static_assert above.
No, it should not for prefetch, but it should for load. This is for the cases where the difference in type is between the cutlass atom XE_2D_U4x32x64_LD_N and the underlying instruction __builtin_IB_subgroup_block_read_flat_u8_m32k32v1.
As far as I can tell, for prefetch, there are no such cases. And the current prefetch_selector implementation will never return a prefetch instruction with a sub-byte type. Am I missing something?
I am concerned that all the changes in this function are untested and currently untestable. What is the purpose of introducing this just now if it's not used?
I am not introducing anything new here - this code already existed. I am just replacing sizeof_bits_v<dtype> with dtype_size_bits, as we do not have dtype available now.
Enables iteration of the prefetch atom to cover the prefetch tile. In other words, it relaxes the requirement that the prefetch tile size match the prefetch atom size.
This is done by using the same path for prefetch that the NVIDIA code uses: going through the copy implementation.
This PR also: