Description
Describe the bug
This is a feature request. I'm happy to contribute some of it, but I don't know whether my solutions will be optimal. I compress a lot of data with typesize=12, and with shuffle enabled this falls back to unshuffle_generic, which is slow. It would be nice to have 12-byte variants of all the platform-specific shuffle code. They might not be as fast as the power-of-2 typesizes, but they should still be much faster than generic.
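For context, here is a minimal sketch of what a generic byte shuffle/unshuffle does, to show why it is slow for typesize=12. The names `shuffle_ref` and `unshuffle_ref` are hypothetical and this is not the library's actual code. It only assumes the usual layout where byte j of every element is stored contiguously in stream j.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference shuffle: gather byte j of every element into
 * stream j, so dest holds `typesize` streams of `nelem` bytes each. */
static void shuffle_ref(size_t typesize, size_t nelem,
                        const uint8_t *src, uint8_t *dest) {
    for (size_t j = 0; j < typesize; j++) {
        for (size_t i = 0; i < nelem; i++) {
            dest[j * nelem + i] = src[i * typesize + j];
        }
    }
}

/* Hypothetical reference unshuffle (the inverse). Each output element
 * needs one byte from each of the `typesize` streams, so every load is
 * a single-byte strided access. */
static void unshuffle_ref(size_t typesize, size_t nelem,
                          const uint8_t *src, uint8_t *dest) {
    for (size_t i = 0; i < nelem; i++) {
        for (size_t j = 0; j < typesize; j++) {
            dest[i * typesize + j] = src[j * nelem + i];
        }
    }
}
```

With typesize=12 the inner loop reads from 12 separate streams one byte at a time. A SIMD variant would instead load a full vector from each stream and interleave them with permutes.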
To Reproduce
Decompress any data that was compressed using shuffle with typesize=12, and observe that unshuffle_generic dominates the overall time.
Expected behavior
Unshuffle for typesize=12 should be approximately as fast as for typesize=8 or typesize=16.
Logs
If applicable, add logs to help explain your problem.
System information:
- OS: [e.g. OSX]
- Compiler [e.g. gcc, clang]
- Version [e.g. 2.0.1]
Additional context
I think it would be nice to support all possible typesizes up to some limit, since for most of them the speedup over the generic implementation could be quite significant.
Here's my attempt at avx512-unshuffle: #648