[GR-74060] Implement AVX2 and AVX512 paths for byte compress and shift operations #13182

Open
graalvmbot wants to merge 1 commit into master from gbarany/GR-74060

@graalvmbot (Collaborator)

Implements fallback paths for vectorized byte shift and compress operations when no direct AVX512 hardware instructions are available. Shifting is done by extending to a larger element size, using an appropriate shift, and narrowing back. Compress is implemented in terms of 128-bit shuffles; for larger vectors, the final result is assembled on the stack out of the results of compressing 128-bit lanes.
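
As a rough illustration of the two ideas described above, here is a scalar Java model, not the actual Graal code generation. Each loop iteration stands in for one SIMD element, and the class name `ByteLaneSketch`, its methods, and the `LANE_BYTES` constant are hypothetical names chosen for this sketch. The shift helpers assume a shift count between 0 and 7, and the compress helper copies only the kept bytes of each lane, which may differ from how the real implementation stores lanes to the stack.

```java
import java.util.Arrays;

// Scalar model only: every name here is hypothetical and does not come from
// the Graal sources. Each loop iteration stands in for one SIMD element.
class ByteLaneSketch {
    static final int LANE_BYTES = 16; // one 128-bit lane

    // Byte shift left modeled as widen -> shift -> narrow (assumes 0 <= count <= 7).
    static byte[] shiftLeft(byte[] src, int count) {
        byte[] out = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            int wide = src[i] & 0xFF; // zero-extend to a wider element
            wide <<= count;           // shift at the wider element size
            out[i] = (byte) wide;     // narrow back, dropping the overflow bits
        }
        return out;
    }

    // Logical byte shift right, same widen -> shift -> narrow pattern.
    static byte[] shiftRightUnsigned(byte[] src, int count) {
        byte[] out = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            int wide = src[i] & 0xFF; // zero-extension keeps the shift logical
            wide >>>= count;
            out[i] = (byte) wide;
        }
        return out;
    }

    // Compress one 128-bit lane into 'lane'; returns how many bytes were kept.
    static int compressLane(byte[] src, boolean[] mask, int base, int end, byte[] lane) {
        int kept = 0;
        for (int i = base; i < end; i++) {
            if (mask[i]) {
                lane[kept++] = src[i];
            }
        }
        return kept;
    }

    // Compress a larger vector lane by lane, assembling the result in a
    // zero-filled buffer that stands in for the stack slot.
    static byte[] compress(byte[] src, boolean[] mask) {
        byte[] buffer = new byte[src.length];
        byte[] lane = new byte[LANE_BYTES];
        int outPos = 0; // running offset: total bytes kept by earlier lanes
        for (int base = 0; base < src.length; base += LANE_BYTES) {
            int end = Math.min(base + LANE_BYTES, src.length);
            int kept = compressLane(src, mask, base, end, lane);
            System.arraycopy(lane, 0, buffer, outPos, kept);
            outPos += kept;
        }
        return buffer;
    }

    public static void main(String[] args) {
        byte[] src = new byte[32];
        boolean[] mask = new boolean[32];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
            mask[i] = (i % 2 == 0); // keep every other byte
        }
        System.out.println(Arrays.toString(compress(src, mask)));
        System.out.println(Arrays.toString(shiftLeft(new byte[] {(byte) 0x81, 0x01}, 1)));
    }
}
```

The running offset is the part that makes the lane-wise approach work: each lane's kept bytes land directly after those of the previous lane, so the buffer ends up holding the compressed result of the whole vector followed by zeros.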

https://github.com/debe/libfindchars benchmark results from my old-ish i9-10920X workstation:

AVX2

                                                           ----- C2 -----  --- Graal ----
Benchmark                            (config)   Mode  Cnt   Score   Error   Score   Error  Units
SweepBenchmark.regex                 2-15-0-2  thrpt    5  15.030 ± 0.029  17.965 ± 0.077  ops/s
SweepBenchmark.regex                 4-15-0-2  thrpt    5  15.044 ± 0.021  17.812 ± 0.032  ops/s
SweepBenchmark.regex                 8-15-0-2  thrpt    5  14.830 ± 0.060  18.079 ± 0.022  ops/s
SweepBenchmark.regex                12-15-0-2  thrpt    5  14.710 ± 0.073  18.079 ± 0.064  ops/s
SweepBenchmark.regex                  8-5-0-2  thrpt    5  24.186 ± 0.066  26.095 ± 0.011  ops/s
SweepBenchmark.regex                 8-30-0-2  thrpt    5   9.997 ± 1.032  11.452 ± 0.087  ops/s
SweepBenchmark.regex                 8-50-0-2  thrpt    5   7.317 ± 0.082   8.734 ± 0.143  ops/s
SweepBenchmark.regex                 8-15-1-2  thrpt    5   9.261 ± 0.016  12.804 ± 0.026  ops/s
SweepBenchmark.regex                 8-15-0-1  thrpt    5  14.948 ± 0.008  17.901 ± 0.037  ops/s
SweepBenchmark.regex                 8-15-0-4  thrpt    5  14.864 ± 0.070  17.946 ± 0.084  ops/s
SweepBenchmark.regex                 8-15-0-8  thrpt    5  14.856 ± 0.041  18.047 ± 0.074  ops/s
SweepBenchmark.regexWithConversion   2-15-0-2  thrpt    5  13.820 ± 0.105  17.439 ± 0.192  ops/s
SweepBenchmark.regexWithConversion   4-15-0-2  thrpt    5  13.796 ± 0.084  17.439 ± 0.124  ops/s
SweepBenchmark.regexWithConversion   8-15-0-2  thrpt    5  14.004 ± 0.059  17.331 ± 0.574  ops/s
SweepBenchmark.regexWithConversion  12-15-0-2  thrpt    5  14.041 ± 0.039  17.565 ± 0.099  ops/s
SweepBenchmark.regexWithConversion    8-5-0-2  thrpt    5  22.137 ± 0.121  25.915 ± 0.090  ops/s
SweepBenchmark.regexWithConversion   8-30-0-2  thrpt    5   9.390 ± 0.233  11.483 ± 0.158  ops/s
SweepBenchmark.regexWithConversion   8-50-0-2  thrpt    5   7.151 ± 0.043   8.274 ± 1.107  ops/s
SweepBenchmark.regexWithConversion   8-15-1-2  thrpt    5   7.795 ± 0.295  10.539 ± 0.031  ops/s 
SweepBenchmark.regexWithConversion   8-15-0-1  thrpt    5  13.578 ± 0.083  17.435 ± 0.144  ops/s
SweepBenchmark.regexWithConversion   8-15-0-4  thrpt    5  13.986 ± 0.117  17.745 ± 0.196  ops/s
SweepBenchmark.regexWithConversion   8-15-0-8  thrpt    5  14.006 ± 0.109  17.560 ± 0.156  ops/s
SweepBenchmark.simdC2Jit             2-15-0-2  thrpt    5  15.702 ± 0.184  92.406 ± 0.030  ops/s
SweepBenchmark.simdC2Jit             4-15-0-2  thrpt    5  16.182 ± 0.165  93.510 ± 0.133  ops/s
SweepBenchmark.simdC2Jit             8-15-0-2  thrpt    5  15.438 ± 0.171  80.587 ± 0.133  ops/s
SweepBenchmark.simdC2Jit            12-15-0-2  thrpt    5  15.331 ± 0.185  78.441 ± 0.167  ops/s
SweepBenchmark.simdC2Jit              8-5-0-2  thrpt    5  32.318 ± 0.325  87.664 ± 0.225  ops/s
SweepBenchmark.simdC2Jit             8-30-0-2  thrpt    5   8.511 ± 0.052  73.431 ± 0.124  ops/s
SweepBenchmark.simdC2Jit             8-50-0-2  thrpt    5   6.242 ± 0.040  65.357 ± 0.236  ops/s
SweepBenchmark.simdC2Jit             8-15-1-2  thrpt    5  11.690 ± 0.047  50.358 ± 0.656  ops/s
SweepBenchmark.simdC2Jit             8-15-0-1  thrpt    5  15.961 ± 0.134  93.151 ± 0.140  ops/s
SweepBenchmark.simdC2Jit             8-15-0-4  thrpt    5  15.987 ± 0.188  95.057 ± 0.169  ops/s
SweepBenchmark.simdC2Jit             8-15-0-8  thrpt    5  16.472 ± 0.141  93.240 ± 0.070  ops/s
SweepBenchmark.simdCompiled          2-15-0-2  thrpt    5  15.689 ± 0.190  95.202 ± 0.139  ops/s
SweepBenchmark.simdCompiled          4-15-0-2  thrpt    5  16.048 ± 0.190  93.917 ± 0.142  ops/s
SweepBenchmark.simdCompiled          8-15-0-2  thrpt    5  15.508 ± 0.142  74.007 ± 0.098  ops/s
SweepBenchmark.simdCompiled         12-15-0-2  thrpt    5  15.565 ± 0.055  72.782 ± 0.113  ops/s
SweepBenchmark.simdCompiled           8-5-0-2  thrpt    5  33.033 ± 0.366  85.474 ± 0.216  ops/s
SweepBenchmark.simdCompiled          8-30-0-2  thrpt    5   9.088 ± 0.091  70.767 ± 0.153  ops/s
SweepBenchmark.simdCompiled          8-50-0-2  thrpt    5   6.199 ± 0.051  63.415 ± 0.250  ops/s
SweepBenchmark.simdCompiled          8-15-1-2  thrpt    5  14.284 ± 0.104  48.948 ± 8.457  ops/s
SweepBenchmark.simdCompiled          8-15-0-1  thrpt    5  15.825 ± 0.181  95.638 ± 0.074  ops/s
SweepBenchmark.simdCompiled          8-15-0-4  thrpt    5  15.710 ± 0.181  94.250 ± 0.093  ops/s
SweepBenchmark.simdCompiled          8-15-0-8  thrpt    5  16.157 ± 0.217  94.970 ± 0.156  ops/s

AVX512, no VBMI2

                                                           ----- C2 -----  ---- Graal ----
Benchmark                            (config)   Mode  Cnt   Score   Error    Score   Error  Units
SweepBenchmark.regex                 2-15-0-2  thrpt    5  13.867 ± 0.072   17.972 ± 0.019  ops/s
SweepBenchmark.regex                 4-15-0-2  thrpt    5  13.910 ± 0.073   18.985 ± 0.081  ops/s
SweepBenchmark.regex                 8-15-0-2  thrpt    5  13.466 ± 0.110   19.600 ± 0.100  ops/s
SweepBenchmark.regex                12-15-0-2  thrpt    5  14.032 ± 0.057   17.703 ± 0.085  ops/s
SweepBenchmark.regex                20-15-0-2  thrpt    5  14.432 ± 0.153   17.970 ± 0.091  ops/s
SweepBenchmark.regex                  8-5-0-2  thrpt    5  20.954 ± 0.048   27.721 ± 0.031  ops/s
SweepBenchmark.regex                 8-30-0-2  thrpt    5  10.757 ± 0.031   12.027 ± 0.174  ops/s
SweepBenchmark.regex                 8-50-0-2  thrpt    5   7.457 ± 0.108    9.093 ± 0.054  ops/s
SweepBenchmark.regex                 8-15-1-2  thrpt    5   9.365 ± 0.025   12.176 ± 0.019  ops/s
SweepBenchmark.regex                 8-15-2-2  thrpt    5   6.195 ± 0.020    8.592 ± 0.018  ops/s
SweepBenchmark.regex                 8-15-3-2  thrpt    5   5.641 ± 0.010    8.134 ± 0.016  ops/s
SweepBenchmark.regex                 8-15-0-1  thrpt    5  13.649 ± 0.105   19.248 ± 0.083  ops/s
SweepBenchmark.regex                 8-15-0-4  thrpt    5  13.462 ± 0.029   17.964 ± 0.037  ops/s
SweepBenchmark.regex                 8-15-0-8  thrpt    5  13.150 ± 0.096   18.890 ± 0.080  ops/s
SweepBenchmark.regexWithConversion   2-15-0-2  thrpt    5  13.744 ± 0.168   17.801 ± 0.078  ops/s
SweepBenchmark.regexWithConversion   4-15-0-2  thrpt    5  13.723 ± 0.232   17.546 ± 0.054  ops/s
SweepBenchmark.regexWithConversion   8-15-0-2  thrpt    5  13.748 ± 0.169   17.334 ± 0.268  ops/s
SweepBenchmark.regexWithConversion  12-15-0-2  thrpt    5  13.379 ± 0.097   17.484 ± 0.040  ops/s
SweepBenchmark.regexWithConversion  20-15-0-2  thrpt    5  13.746 ± 0.166   17.539 ± 0.057  ops/s
SweepBenchmark.regexWithConversion    8-5-0-2  thrpt    5  21.639 ± 0.147   25.946 ± 0.040  ops/s
SweepBenchmark.regexWithConversion   8-30-0-2  thrpt    5  10.268 ± 0.283   11.591 ± 0.043  ops/s
SweepBenchmark.regexWithConversion   8-50-0-2  thrpt    5   7.309 ± 0.231    8.498 ± 0.148  ops/s
SweepBenchmark.regexWithConversion   8-15-1-2  thrpt    5   7.923 ± 0.229   10.909 ± 0.054  ops/s
SweepBenchmark.regexWithConversion   8-15-2-2  thrpt    5   5.568 ± 0.074    7.703 ± 0.016  ops/s
SweepBenchmark.regexWithConversion   8-15-3-2  thrpt    5   5.091 ± 0.253    6.704 ± 0.015  ops/s
SweepBenchmark.regexWithConversion   8-15-0-1  thrpt    5  13.401 ± 0.078   17.532 ± 0.063  ops/s
SweepBenchmark.regexWithConversion   8-15-0-4  thrpt    5  13.439 ± 0.087   17.458 ± 0.192  ops/s
SweepBenchmark.regexWithConversion   8-15-0-8  thrpt    5  13.364 ± 0.055   17.533 ± 0.075  ops/s
SweepBenchmark.simdC2Jit             2-15-0-2  thrpt    5  14.730 ± 0.179  105.139 ± 1.267  ops/s
SweepBenchmark.simdC2Jit             4-15-0-2  thrpt    5  14.871 ± 0.153  105.135 ± 0.088  ops/s
SweepBenchmark.simdC2Jit             8-15-0-2  thrpt    5  14.684 ± 0.133  105.114 ± 0.257  ops/s
SweepBenchmark.simdC2Jit            12-15-0-2  thrpt    5  14.239 ± 0.127   84.657 ± 0.089  ops/s
SweepBenchmark.simdC2Jit            20-15-0-2  thrpt    5  14.238 ± 0.125   84.225 ± 0.079  ops/s
SweepBenchmark.simdC2Jit              8-5-0-2  thrpt    5  32.286 ± 0.332  118.308 ± 0.197  ops/s
SweepBenchmark.simdC2Jit             8-30-0-2  thrpt    5   8.011 ± 0.064   92.980 ± 0.178  ops/s
SweepBenchmark.simdC2Jit             8-50-0-2  thrpt    5   5.489 ± 0.009   80.916 ± 0.819  ops/s
SweepBenchmark.simdC2Jit             8-15-1-2  thrpt    5  13.505 ± 0.139   64.742 ± 1.431  ops/s
SweepBenchmark.simdC2Jit             8-15-2-2  thrpt    5  13.564 ± 0.136   65.330 ± 0.638  ops/s
SweepBenchmark.simdC2Jit             8-15-3-2  thrpt    5  11.590 ± 0.052   60.246 ± 0.588  ops/s
SweepBenchmark.simdC2Jit             8-15-0-1  thrpt    5  14.887 ± 0.143  105.403 ± 0.254  ops/s
SweepBenchmark.simdC2Jit             8-15-0-4  thrpt    5  14.805 ± 0.151  105.299 ± 0.129  ops/s
SweepBenchmark.simdC2Jit             8-15-0-8  thrpt    5  14.694 ± 0.133  105.443 ± 0.035  ops/s
SweepBenchmark.simdCompiled          2-15-0-2  thrpt    5  14.680 ± 0.123  104.267 ± 0.052  ops/s
SweepBenchmark.simdCompiled          4-15-0-2  thrpt    5  14.740 ± 0.128  104.226 ± 0.049  ops/s
SweepBenchmark.simdCompiled          8-15-0-2  thrpt    5  14.806 ± 0.105  104.264 ± 0.238  ops/s
SweepBenchmark.simdCompiled         12-15-0-2  thrpt    5  14.080 ± 0.102   84.752 ± 0.145  ops/s
SweepBenchmark.simdCompiled         20-15-0-2  thrpt    5  14.146 ± 0.133   84.702 ± 0.036  ops/s
SweepBenchmark.simdCompiled           8-5-0-2  thrpt    5  32.280 ± 0.268  118.720 ± 0.753  ops/s
SweepBenchmark.simdCompiled          8-30-0-2  thrpt    5   7.986 ± 0.053   92.323 ± 0.075  ops/s
SweepBenchmark.simdCompiled          8-50-0-2  thrpt    5   5.435 ± 0.008   80.346 ± 0.212  ops/s
SweepBenchmark.simdCompiled          8-15-1-2  thrpt    5  13.528 ± 0.238   65.521 ± 1.039  ops/s
SweepBenchmark.simdCompiled          8-15-2-2  thrpt    5  13.436 ± 0.118   59.264 ± 2.396  ops/s
SweepBenchmark.simdCompiled          8-15-3-2  thrpt    5  12.888 ± 0.124   58.988 ± 0.403  ops/s
SweepBenchmark.simdCompiled          8-15-0-1  thrpt    5  14.762 ± 0.096  104.201 ± 0.026  ops/s
SweepBenchmark.simdCompiled          8-15-0-4  thrpt    5  14.696 ± 0.141  104.151 ± 0.159  ops/s
SweepBenchmark.simdCompiled          8-15-0-8  thrpt    5  14.765 ± 0.126  104.177 ± 1.330  ops/s

The simdC2Jit and simdCompiled cases, where we beat C2 by roughly 3-15x (about 5-7x in most configurations), seem to be cases where C2 fails to inline VectorSupport::compressExpandOp into ByteVector::compressTemplate (callee is too large), at least according to the agent that I asked to look at C2's logs. Without that inlining failure, we're broadly on par.
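
To put numbers on the gap, the Graal/C2 throughput ratio can be computed directly from rows of the tables above. The snippet below is only a small arithmetic sketch; the class name `SpeedupSketch` and its method are hypothetical, and the scores are copied from three of the simd rows.

```java
// Hypothetical helper for reading the tables: ratio of Graal to C2 throughput.
class SpeedupSketch {
    static double speedup(double c2Score, double graalScore) {
        return graalScore / c2Score;
    }

    public static void main(String[] args) {
        // AVX2, SweepBenchmark.simdC2Jit, config 2-15-0-2
        System.out.printf("%.2fx%n", speedup(15.702, 92.406));
        // AVX2, SweepBenchmark.simdCompiled, config 8-5-0-2
        System.out.printf("%.2fx%n", speedup(33.033, 85.474));
        // AVX512 (no VBMI2), SweepBenchmark.simdCompiled, config 8-50-0-2
        System.out.printf("%.2fx%n", speedup(5.435, 80.346));
    }
}
```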

oracle-contributor-agreement bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) on Mar 20, 2026