Use +sme for Apple #303

Open · wants to merge 1 commit into base: master
Conversation

@cielavenir (Contributor) commented Nov 9, 2024

Today I went to Biccamera ( 😂 ) and checked hw.optional. I found FEAT_SME but not FEAT_SVE. 1

This means that for Apple, the +sve code has to be compiled with +sme instead.

This is potentially quite a breaking change, so I'd like it to be tested by people who have an M4 Mac.

Call for tester(s): if you have an M4 Mac, please try running the tests on your machine~~
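For anyone else who wants to check their own machine, the feature flags can be queried with sysctl. A minimal sketch, assuming the `hw.optional.arm.FEAT_*` keys that macOS exposes on Apple silicon; on other systems the keys simply won't be reported:

```shell
#!/bin/sh
# Query Apple-silicon feature flags: prints 1 if present, 0 if absent.
# On non-macOS systems (or Intel Macs) the keys don't exist, so we
# suppress the error and report "not reported" instead.
for feat in FEAT_SME FEAT_SVE; do
    val=$(sysctl -n "hw.optional.arm.$feat" 2>/dev/null)
    echo "$feat: ${val:-not reported}"
done
```

On an M4 Mac this is expected to show FEAT_SME as 1 and FEAT_SVE as 0 (or absent), matching the observation above.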

Footnotes

  1. Why would Apple ship an Armv9 chip without FEAT_SVE?

@@ -87,7 +91,7 @@ cdecl(gf_vect_mad_sve):
 	/* vector length agnostic */
 .Lloopsve_vl:
 	whilelo p0.b, x_pos, x_len
-	b.none .return_pass
+	b.eq .return_pass
@cielavenir (Contributor, Author) commented on these lines:

According to https://llvm.org/doxygen/AArch64AsmParser_8cpp_source.html, b.none is the same as b.eq when +sve is specified.
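For context, the two spellings name the same condition. A sketch of why, assuming the standard SVE flag semantics (whilelo sets NZCV from the generated predicate, with Z set when no element is active):

```asm
        // After an SVE `whilelo`, the flags describe the predicate:
        // Z = 1 when no predicate element is active.
        whilelo p0.b, x_pos, x_len
        b.none  .return_pass    // SVE condition alias: branch if no active
                                // elements, i.e. branch when Z is set
        // ...which assembles to the exact same encoding as...
        b.eq    .return_pass    // the plain AArch64 spelling of "Z set"
```

So the change is purely a spelling change to satisfy the assembler when +sve is not enabled; the generated instruction is identical.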

@pablodelara (Contributor)

@liuqinfei could you look into this issue? Thanks again! ;)

@liuqinfei (Contributor)

> @liuqinfei could you look into this issue? Thanks again! ;)

In fact, I don't have an Apple computer that supports SVE on hand, so I can't verify this patch. Maybe you can share verification results from machines with and without SVE. @cielavenir

@cielavenir (Contributor, Author)

I don't have one either; I only checked that it compiles.

So we need to call for tester(s) with an M4 Mac; otherwise we have to wait for the next GitHub runner (not image) update.

@pablodelara (Contributor)

We are looking into releasing 2.31.1 as soon as next week, with just bug fixes. If we have someone that can test this, then we can include it in the next release.

@pablodelara (Contributor)

Let's hold this PR for the next release, once more testing is done.

@tipabu (Contributor) commented Apr 24, 2025

I've got an M4 MacBook Air -- doesn't seem to work for me:

tburke@2025-air isa-l % git clean -fdx && ./autogen.sh && ./configure --prefix ~/.local/ && make test
...
  CPPAS    mem/aarch64/mem_zero_detect_neon.lo
  CPPAS    mem/aarch64/mem_multibinary_arm.lo
  CC       mem/aarch64/mem_aarch64_dispatcher.lo
  CCLD     libisal.la
copying selected object files to avoid basename conflicts...
  CCLD     erasure_code/gf_vect_mul_base_test
erasure_code/gf_vect_mul_base_test
gf_vect_mul_base_test:
Random tests  done: Pass
Completed run: erasure_code/gf_vect_mul_base_test
  CC       erasure_code/gf_vect_dot_prod_base_test.o
  CCLD     erasure_code/gf_vect_dot_prod_base_test
erasure_code/gf_vect_dot_prod_base_test
gf_vect_dot_prod_base: 250x8192 done all: Pass
Completed run: erasure_code/gf_vect_dot_prod_base_test
  CC       erasure_code/gf_vect_dot_prod_test.o
  CCLD     erasure_code/gf_vect_dot_prod_test
erasure_code/gf_vect_dot_prod_test
make: *** [erasure_code/gf_vect_dot_prod_test.run] Illegal instruction: 4

Oddly enough, running via lldb (to try to get a better handle on where things went wrong) doesn't trip the same error:

tburke@2025-air isa-l % ./libtool --mode=execute lldb -o run erasure_code/gf_vect_dot_prod_test
(lldb) target create "/Users/tburke/Code/isa-l/erasure_code/.libs/gf_vect_dot_prod_test"
Current executable set to '/Users/tburke/Code/isa-l/erasure_code/.libs/gf_vect_dot_prod_test' (arm64).
(lldb) run
gf_vect_dot_prod: 16x8192 done all: Pass
Process 7194 launched: '/Users/tburke/Code/isa-l/erasure_code/.libs/gf_vect_dot_prod_test' (arm64)
Process 7194 exited with status = 0 (0x00000000)

On master, make test passes everything.

@cielavenir (Contributor, Author)

@tipabu thank you for testing. Maybe the current code being accepted under +sme is an assembler bug...

@pablodelara (Contributor)

@tipabu, what about "master" branch?

@tipabu (Contributor) commented Apr 28, 2025

@pablodelara, on master (91da2ad add RISCV CI) all tests pass and perf suite runs fine.

@cielavenir cielavenir marked this pull request as draft April 28, 2025 23:59
@pablodelara (Contributor)

Thanks @tipabu. So it looks like this PR is not needed...

@cielavenir (Contributor, Author)

@tipabu actually, according to https://qiita.com/zacky1972/items/b7b5dd456fe021b30eb2, I need to wrap the function with smstart sm and smstop sm. I have implemented that. If you have time, could you try again?

(compilation is tested in https://github.com/cielavenir/isa-l/actions/runs/14731321078)
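The wrapping referred to above looks roughly like this. A sketch only, not the actual patch; the function name and loop body are illustrative, and the key assumption is that Apple silicon accepts the SVE instructions only in SME streaming mode:

```asm
cdecl(gf_vect_dot_prod_sve):
        smstart sm          // enter SME streaming mode, so the SVE
                            // instructions below run as streaming SVE
        // ... existing vector-length-agnostic SVE loop
        //     (whilelo / loads / table lookups / stores) ...
        smstop  sm          // leave streaming mode before returning;
                            // callers expect the non-streaming state
        ret
```

Without the smstart/smstop pair, the SVE instructions trap at runtime even though the assembler accepts them with +sme, which matches the "Illegal instruction" seen earlier.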

@tipabu (Contributor) commented Apr 29, 2025

@cielavenir Tests now pass! And looking at someone's investigation, we shouldn't need to worry about losing SVE checks for Macs; no Apple silicon supports it.

> So it looks like this PR is not needed...

@pablodelara Only insofar as Macs were always getting the NEON implementation.

@cielavenir (Contributor, Author)

great thank you~

@cielavenir cielavenir marked this pull request as ready for review April 29, 2025 23:06
@pablodelara (Contributor)

@cielavenir can you clean up the commits (so there is no "Merge branch 'master'"...)? Good opportunity to rebase against the latest 'master' branch.

Signed-off-by: Taiju Yamada <[email protected]>
@cielavenir (Contributor, Author)

@pablodelara rebased.

@tipabu (Contributor) commented Apr 30, 2025

So one concern: This seems to be slightly slower than master. On this branch:

tburke@2025-air isa-l % make erasure_code/erasure_code_perf.run
erasure_code/erasure_code_perf
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 4 0 5 1 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime =    3062483 usecs, bandwidth 49464 MB in 3.0625 sec = 16151.90 MB/s
erasure_code_decode_warm: runtime =    3001748 usecs, bandwidth 59388 MB in 3.0017 sec = 19784.75 MB/s
done all: Pass
Completed run: erasure_code/erasure_code_perf

Whereas on master:

tburke@2025-air isa-l % make erasure_code/erasure_code_perf.run
erasure_code/erasure_code_perf
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 4 0 5 1 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime =    3039658 usecs, bandwidth 49832 MB in 3.0397 sec = 16394.04 MB/s
erasure_code_decode_warm: runtime =    3027461 usecs, bandwidth 65886 MB in 3.0275 sec = 21763.11 MB/s
done all: Pass
Completed run: erasure_code/erasure_code_perf

Multiple runs had similar results (±100MB/s on encode, ±200MB/s on decode, give or take).

@cielavenir (Contributor, Author)

@tipabu on the https://github.com/cielavenir/isa-l/tree/featSME_CI branch, I changed it to call smstart/smstop only in the dispatched function. Adding smstart to each subroutine called by ec_encode_data_sve could cause an overhead issue.

If this is faster, I will rebase the featSME branch again.
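The idea is to hoist the mode switch one level up, so it is paid once per call into the dispatched entry point instead of once per inner subroutine. Again a sketch; the names are illustrative and the inner routines are assumed to be callable with streaming mode already on:

```asm
cdecl(ec_encode_data_sve):
        smstart sm          // single mode switch for the whole encode call
        // ... loop over parity rows, calling gf_vect_mad_sve and friends,
        //     which no longer toggle streaming mode themselves ...
        smstop  sm
        ret
```

This trades one smstart/smstop pair per outer call against one pair per inner subroutine call, which is where the suspected overhead was.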

@tipabu (Contributor) commented May 1, 2025

@cielavenir I see roughly the same on cielavenir@aad8c5c:

tburke@2025-air isa-l % make erasure_code/erasure_code_perf.run
erasure_code/erasure_code_perf
Testing with 8 data buffers and 6 parity buffers (num errors = 4, in [ 4 0 5 1 ])
erasure_code_perf: 14x9344 4
erasure_code_encode_warm: runtime =    3062620 usecs, bandwidth 49503 MB in 3.0626 sec = 16163.74 MB/s
erasure_code_decode_warm: runtime =    3006801 usecs, bandwidth 59369 MB in 3.0068 sec = 19745.01 MB/s
done all: Pass
Completed run: erasure_code/erasure_code_perf

@cielavenir (Contributor, Author)

Now I'm not sure whether this is smstart overhead or whether the SME implementation is simply not faster than the NEON one...

4 participants