Experiment with instruction level parallelism #39


Contributor

@wind0204 wind0204 commented Dec 22, 2023

If you're interested, please check whether this version performs better than the non-parallel version.

I decided to write a parallel version after seeing that PMOVMSKB (r32, xmm) has a latency of 3 cycles and a throughput of 1/2 per cycle on AMD Zen 2. (Funnily enough, the throughput on Zen 1 is twice that of Zen 2.) You should also try changing NR_VEC from 2 to 3.
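A minimal sketch of the idea, not the code from this pull request: each iteration loads NR_VEC vectors and runs an independent compare + PMOVMSKB chain on each, so the 3-cycle latency of one chain can overlap with the others. The function name `find_pixel` and the 32-bit pixel layout are assumptions made for illustration; only `NR_VEC` comes from the comment above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

#define NR_VEC 2  /* vectors per iteration; the comment above suggests also trying 3 */

/* Hypothetical helper: index of the first pixel equal to color, or -1 if none. */
ptrdiff_t find_pixel(const uint32_t *px, size_t n, uint32_t color)
{
    const __m128i needle = _mm_set1_epi32((int)color);
    size_t i = 0;

    for (; i + 4 * NR_VEC <= n; i += 4 * NR_VEC) {
        int mask[NR_VEC];

        /* The NR_VEC compare/movemask chains do not depend on each other,
           so the CPU is free to overlap them. */
        for (int v = 0; v < NR_VEC; v++) {
            __m128i data = _mm_loadu_si128((const __m128i *)(px + i + 4 * v));
            mask[v] = _mm_movemask_epi8(_mm_cmpeq_epi32(data, needle)); /* PMOVMSKB */
        }

        /* PMOVMSKB yields 4 mask bits per 32-bit pixel; find the first hit. */
        for (int v = 0; v < NR_VEC; v++) {
            if (mask[v] == 0)
                continue;
            for (int lane = 0; lane < 4; lane++)
                if (mask[v] & (0xF << (4 * lane)))
                    return (ptrdiff_t)(i + 4 * v + lane);
        }
    }

    /* Scalar tail for the pixels that do not fill a whole iteration. */
    for (; i < n; i++)
        if (px[i] == color)
            return (ptrdiff_t)i;
    return -1;
}
```

Whether 2 or 3 vectors per iteration is faster depends on the CPU, so this is something to benchmark rather than assume.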

The instruction tables: https://www.agner.org/optimize/instruction_tables.pdf

@wind0204
Contributor Author

wind0204 commented Dec 22, 2023

I think I gained a lot by replacing PMOVMSKB with MOVMSKPS in commit 5230e71: the throughput can be doubled (1/2 -> 1/1).
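For readers unfamiliar with the two instructions, here is a small sketch of the swap using intrinsics. It assumes 32-bit pixels; the function names `pixel_mask_epi8` and `pixel_mask_ps` are made up for this example and are not taken from the commit.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE _mm_movemask_ps */
#include <stdint.h>

/* PMOVMSKB: one mask bit per byte, so 4 bits per 32-bit pixel (16 bits total). */
int pixel_mask_epi8(const uint32_t *px, uint32_t color)
{
    __m128i data = _mm_loadu_si128((const __m128i *)px);
    __m128i eq = _mm_cmpeq_epi32(data, _mm_set1_epi32((int)color));
    return _mm_movemask_epi8(eq);
}

/* MOVMSKPS: one mask bit per 32-bit lane, so 1 bit per pixel (4 bits total).
   The cast only reinterprets the register; it emits no instruction. */
int pixel_mask_ps(const uint32_t *px, uint32_t color)
{
    __m128i data = _mm_loadu_si128((const __m128i *)px);
    __m128i eq = _mm_cmpeq_epi32(data, _mm_set1_epi32((int)color));
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
}
```

Because PCMPEQD sets every bit of a matching lane, the sign bit that MOVMSKPS reads carries the same information as the four byte bits from PMOVMSKB, and the resulting mask maps directly to a pixel index.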

@wind0204
Contributor Author

The learning material I'm using now is this.

@iseahound
Owner

One thing I can say for sure is that the number of cycles does seem to be 3. That's why I'm using 3 instances of parallel pixelsearch in pixelsearch4x.c

I'm not at my computer at the moment; it's being repaired.

@wind0204
Contributor Author

I can check whether the code works when my wife is asleep. ;-p
Otherwise I'd have to launch a Windows VM, but I can't be bothered. 😄

@iseahound
Owner

I can do the testing/benchmarking at a later date. For parallel pixelsearch, as far as I remember, previous testing showed a 1.5x improvement (3 pixelsearches in parallel) compared to invoking pixelsearch 3 times.

@iseahound
Owner

I don't use ptest because it's not available on all architectures (see the Steam hardware survey) and it's slower.
https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition

@wind0204
Contributor Author

wind0204 commented Dec 22, 2023

> I don't use ptest because it's not available on all architectures (see the Steam hardware survey) and it's slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition

I agree. I would support them all even if the percentage were 0.01%.

Are you considering writing more than one version of the code and determining which version to use by running the CPUID instruction at the beginning of the code? That'd be nice for people with more recent hardware.

@wind0204
Contributor Author

wind0204 commented Dec 22, 2023

> I don't use ptest because it's not available on all architectures (see the Steam hardware survey) and it's slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition

I think there could be some performance gain from using PTEST, though. You can quickly POR the vectors together (POR has a reciprocal throughput of about 0.25 cycles, i.e. roughly 4 per cycle) and then issue only one PTEST, which has a throughput of 1 per cycle. Without PTEST you would have to MOVMSKPS each vector in every loop iteration, and each MOVMSKPS needs 1 cycle.
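A rough sketch of that PTEST variant, assuming GCC or Clang (for the `target` attribute), an SSE4.1-capable CPU, and a block of 16 32-bit pixels. The name `any_match_ptest` is hypothetical and this is not code from the repository.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 (PTEST) */
#include <stdint.h>

/* Hypothetical helper: returns 1 if any of the 16 pixels at px equals color.
   The target attribute lets this one function use SSE4.1 even when the rest
   of the file is compiled for plain SSE2 (GCC/Clang only). */
__attribute__((target("sse4.1")))
int any_match_ptest(const uint32_t *px, uint32_t color)
{
    const __m128i needle = _mm_set1_epi32((int)color);

    /* Four independent compares. */
    __m128i e0 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + 0)), needle);
    __m128i e1 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + 4)), needle);
    __m128i e2 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + 8)), needle);
    __m128i e3 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + 12)), needle);

    /* POR the results as a small tree, then issue a single PTEST. */
    __m128i acc = _mm_or_si128(_mm_or_si128(e0, e1), _mm_or_si128(e2, e3));

    /* _mm_testz_si128(a, a) returns 1 when a is all zeros. */
    return !_mm_testz_si128(acc, acc);
}
```

This only answers "is there a match somewhere in the block"; locating the exact pixel still needs a second step on the rare iteration that hits.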

@wind0204
Contributor Author

wind0204 commented Dec 23, 2023

> I don't use ptest because it's not available on all architectures (see the Steam hardware survey) and it's slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition

> I think there could be some performance gain from using PTEST, though. You can quickly POR the vectors together (POR has a reciprocal throughput of about 0.25 cycles, i.e. roughly 4 per cycle) and then issue only one PTEST, which has a throughput of 1 per cycle. Without PTEST you would have to MOVMSKPS each vector in every loop iteration, and each MOVMSKPS needs 1 cycle.

I woke up this morning with an idea: I can still use almost the same number of cycles without requiring SSE4. :D POR all of the vectors together, then run a single MOVMSKPS, then check the result with TEST or something similar.
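The SSE2-only idea can be sketched like this, under the same assumptions as before (16 32-bit pixels per block, hypothetical function name `any_match_sse2`). It is an illustration of the approach described above, not the code that ended up in the follow-up pull request.

```c
#include <emmintrin.h>  /* SSE2 only */
#include <stdint.h>

/* Hypothetical helper: returns 1 if any of the 16 pixels at px equals color. */
int any_match_sse2(const uint32_t *px, uint32_t color)
{
    const __m128i needle = _mm_set1_epi32((int)color);

    /* Four independent compares. */
    __m128i e0 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + 0)), needle);
    __m128i e1 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + 4)), needle);
    __m128i e2 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + 8)), needle);
    __m128i e3 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + 12)), needle);

    /* POR everything together so only one MOVMSKPS is needed per block. */
    __m128i acc = _mm_or_si128(_mm_or_si128(e0, e1), _mm_or_si128(e2, e3));

    /* A nonzero mask means at least one lane matched in at least one vector;
       the compiler typically turns the comparison into TEST + branch. */
    return _mm_movemask_ps(_mm_castsi128_ps(acc)) != 0;
}
```

As with the PTEST variant, the combined mask no longer says which vector matched, so the individual compare results have to be inspected again once a block reports a hit.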

@iseahound
Owner

Sorry, I'm rewriting the git history to fix line endings, so this pull request is out of sync. I noticed that git blame wasn't showing the full history. This is cleanup work to get development started again.

Also, I can't reopen this pull request because github won't let me.

@wind0204
Contributor Author

> Also, I can't reopen this pull request because github won't let me.

Oh, I'll re-open it then.

@iseahound
Owner

You might want to hold off on that. I'll be rewriting history again, so all changes will break until I'm satisfied with the repo.

@wind0204
Contributor Author

Oh, I'll do it when you ask me to then

@wind0204
Contributor Author

wind0204 commented Jan 1, 2024

> Are you considering writing more than one version of the code and determining which version to use by running the CPUID instruction at the beginning of the code? That'd be nice for people with more recent hardware.

Today I learned: CPUID is a serializing instruction, which means it stalls the parallel execution of modern processors. We should use it sparingly.
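One way to keep CPUID out of the hot path is to query it once and cache the answer. The sketch below assumes GCC or Clang on x86 (it uses `__get_cpuid` from `<cpuid.h>`); the helper name `have_sse41` is made up for illustration and has nothing to do with how ImagePut calls CPUID.

```c
#include <cpuid.h>    /* GCC/Clang helper: __get_cpuid */
#include <stdbool.h>

/* Hypothetical helper: reports SSE4.1 support, executing CPUID only on the
   first call. Not thread-safe as written. */
bool have_sse41(void)
{
    static int cached = -1;  /* -1 means "not queried yet" */

    if (cached < 0) {
        unsigned int eax, ebx, ecx, edx;
        /* CPUID leaf 1: ECX bit 19 reports SSE4.1. __get_cpuid returns 0
           if the requested leaf is not supported. */
        cached = __get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 19));
    }
    return cached != 0;
}
```

A caller would check this once at startup, pick the SSE2 or SSE4.1 search routine (for example through a function pointer), and never touch CPUID inside the search loop.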

@iseahound
Owner

The CPUID instruction is already inside the ImagePut library. I don't think I'm calling it correctly however. It apparently needs to be called twice, and I'm only calling it (and saving the output inside a map) once.

@iseahound
Owner

Repo should be fixed now. I removed all instances of linefeed and deleted empty commits. This fixes git blame only showing up to the commit where the entire file was changed.

@wind0204 wind0204 deleted the pr-experiment_with_instruction_level_parallelism branch January 2, 2024 01:14
@wind0204
Contributor Author

wind0204 commented Jan 2, 2024

Here's the new PR: #40
