Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aarch64 - update EqDbl to use SIMD/FP instructions #7914

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

swalk-cavium
Copy link
Contributor

This change updates the EqDbl and NeqDbl sequences to use different instructions.
It reduces the number of instructions needed and removes the dependency on the
status register.

Before

(29) t8:Bool = EqDbl t7:Dbl, t5:Dbl
    Main:   
          0x51003aec  d53b4210              mrs x16, nzcv
          0x51003af0  1e602020              fcmp d1, d0
          0x51003af4  da9f13f2              csetm x18, eq 
          0x51003af8  9e670240              fmov d0, x18 
          0x51003afc  d51b4210              msr nzcv, x16
          0x51003b00  9e660000              fmov x0, d0
          0x51003b04  53001c00              uxtb w0, w0
          0x51003b08  12000000              and w0, w0, #0x1

After

(29) t8:Bool = EqDbl t7:Dbl, t5:Dbl
    Main:   
          0x1460407c  5e60e420              fcmeq d0, d1, d0
          0x14604080  9e660000              fmov x0, d0
          0x14604084  12000000              and w0, w0, #0x1

This also updates the vixl disassembler to recognize two new instruction formats:
Advanced SIMD Scalar Three Same, and Advanced SIMD Scalar two-register Misc.
Whenever the enumeration clashed with an existing definition a prefix was added.
See STS, and STM respectively.

The regression suite was run with 6 option sets. No additional failures were observed.

@swalk-cavium
Copy link
Contributor Author

@mxw - Hi Max, Can you take a look at this one? Thanks.

@mxw mxw self-assigned this Jul 24, 2017
Copy link
Contributor

@mxw mxw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have the one comment. I'll leave it to one of @cmuellner, @dave-estes, @jim-saxman, and @apinski-cavium to vet the meat of the PR.

@@ -1545,7 +1533,7 @@ void lower(const VLS& e, movtdb& i, Vlabel b, size_t z) {
lower_impl(e.unit, b, z, [&] (Vout& v) {
auto d = v.makeReg();
v << copy{i.s, d};
v << movtqb{d, i.d};
v << copy{d, i.d};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of copying through a tmp reg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxw - The code generator wouldn't accept it without the extra copy. Something to do with changing register classes I suspect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the register widths are propagated through copies, and they already don't match since this is a truncation operation...

Can you paste the exact error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxw - Looks like I didn't capture that in my daily log. I'll have to try and recreate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxw - I was unable to recreate the issue. I'll reduce the copy and retest.

@hhvm-bot
Copy link
Contributor

@mxw has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jim-saxman
Copy link
Contributor

Hi @swalk-cavium, This patch doesn't create any more unit test failures, and it passed all of OSS performance test suite.

@hhvm-bot
Copy link
Contributor

@swalk-cavium updated the pull request - view changes - changes since last import

@swalk-cavium
Copy link
Contributor Author

@mxw - MOP results the same with only 1 copy, so updated pull request.

@facebook-github-bot
Copy link
Contributor

@mxw has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mxw
Copy link
Contributor

mxw commented Aug 17, 2017

Before I accept this, I'd like someone else to give a second opinion on the change (not just on test results) to confirm it doesn't affect the behavior of the instruction in any way.

@swalk-cavium
Copy link
Contributor Author

@mxw - Hi Max, Here's how I tested the sequences before incorporating in hhvm
https://pastebin.com/PC6Ym7ZY

@swalk-cavium
Copy link
Contributor Author

@mxw - Here's the table I used for the unordered comparison case.
https://pastebin.com/psgHNebB

SNaN - signalling NaN,
QNaN - quiet NaN

Copy link
Contributor

@dave-estes-QCOM dave-estes-QCOM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a slick optimization. It's not super-intuitive, but a few comments will sort that out quickly.

auto d = v.makeReg();
v << copy{i.s, d};
v << movtqb{d, i.d};
v << copy{i.s, i.d};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to do this if we end up reverting f9ecdf1? Might want to split this change out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dave-estes - Hi Dave, I'm not sure. If your new version goes in first, I'll have to retest.

@@ -1352,6 +1352,22 @@ void Assembler::frintz(const FPRegister& fd,
}


void Assembler::fcmeq(const FPRegister& fd,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more readable if this was named Assembler::fcmeqz().

@@ -909,29 +909,17 @@ void Vgen::emit(const unpcklpd& i) {
///////////////////////////////////////////////////////////////////////////////

void Vgen::emit(const cmpsd& i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think renaming the 2 argument fcmeq (see next comment) will help, but I also think a line or two of comments will make this more clear too.

@swalk-cavium
Copy link
Contributor Author

@mxw - I think I might have to update this to make it check the hardware capabilities. I just noticed at least one of the forms says, ARMv8.2. Something similar to what we did for the LSE Atomics.

@swalk-cavium
Copy link
Contributor Author

@mxw - Um, disregard the previous comment. SIMD instructions are required in the
ARM SBSA (Server Base System Architecture).

@facebook-github-bot
Copy link
Contributor

@swalk-cavium updated the pull request - view changes - changes since last import

@facebook-github-bot
Copy link
Contributor

@mxw has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mxw
Copy link
Contributor

mxw commented Aug 23, 2017

@swalk-cavium—Cool, thanks for this change. I'll land it as soon as internal tests pass.

@facebook-github-bot
Copy link
Contributor

@mxw has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@swalk-cavium
Copy link
Contributor Author

@mxw - Hi Max, Any word on your testing infrastructure issue? Can this one land?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants