implement bit unpacking optimizations for arm64 by achille-roussel · Pull Request #2 · parquet-go/bitpack

achille-roussel · 2025-10-24T22:47:43Z

Measurable improvements:

ARM64 vs Purego Benchmark Results

  Highly Optimized BitWidths (1-8, 16) - ~5x Faster

  | BitWidth | Speedup | Time Reduction | Throughput Increase           |
  |----------|---------|----------------|-------------------------------|
  | 1        | 4.88x   | -79.49%        | +387.63% (3.85 → 18.76 GiB/s) |
  | 2        | 4.78x   | -79.09%        | +378.25% (3.73 → 17.83 GiB/s) |
  | 3        | 5.30x   | -81.12%        | +429.67% (3.58 → 18.93 GiB/s) |
  | 4        | 5.19x   | -80.75%        | +418.90% (3.48 → 18.08 GiB/s) |
  | 5        | 5.54x   | -81.94%        | +453.82% (3.41 → 18.88 GiB/s) |
  | 6        | 4.85x   | -79.39%        | +385.28% (3.58 → 17.37 GiB/s) |
  | 7        | 5.20x   | -80.77%        | +420.11% (3.41 → 17.71 GiB/s) |
  | 8        | 4.17x   | -76.00%        | +316.78% (3.81 → 15.87 GiB/s) |
  | 16       | 4.85x   | -79.38%        | +385.06% (3.72 → 18.05 GiB/s) |

  Scalar Optimized BitWidths (10-15, 17-27) - ~9-31% Faster

  | BitWidth | Speedup    | Time Reduction | Throughput Increase |
  |----------|------------|----------------|---------------------|
  | 10       | 1.11x      | -9.60%         | +10.63%             |
  | 11       | 1.09x      | -8.50%         | +9.31%              |
  | 14       | 1.19x      | -15.67%        | +18.61%             |
  | 15       | 1.18x      | -15.09%        | +17.75%             |
  | 17       | 1.27x      | -21.43%        | +27.25%             |
  | 18       | 1.23x      | -18.54%        | +22.77%             |
  | 19       | 1.27x      | -20.96%        | +26.51%             |
  | 20       | 1.26x      | -20.33%        | +25.51%             |
  | 21-27    | 1.22-1.31x | -18% to -24%   | +22% to +31%        |

Full benchmark breakdown:

goos: darwin
goarch: arm64
pkg: github.com/parquet-go/bitpack
cpu: Apple M2 Pro
                           │ /tmp/bench_arm64_final.txt │       /tmp/bench_purego_final.txt        │
                           │           sec/op           │     sec/op      vs base                  │
UnpackInt32/bitWidth=1-12                  25.41n ±  4%    123.90n ±  2%   +387.60% (p=0.000 n=10)
UnpackInt32/bitWidth=2-12                  26.75n ±  2%    127.90n ±  8%   +378.13% (p=0.000 n=10)
UnpackInt32/bitWidth=3-12                  25.18n ±  4%    133.40n ± 77%   +429.68% (p=0.000 n=10)
UnpackInt32/bitWidth=4-12                  26.38n ±  0%    137.05n ±  6%   +419.52% (p=0.000 n=10)
UnpackInt32/bitWidth=5-12                  25.25n ±  2%    139.85n ±  4%   +453.86% (p=0.000 n=10)
UnpackInt32/bitWidth=6-12                  27.46n ±  2%    133.25n ±  1%   +385.25% (p=0.000 n=10)
UnpackInt32/bitWidth=7-12                  26.92n ± 13%    140.05n ±  2%   +420.15% (p=0.000 n=10)
UnpackInt32/bitWidth=8-12                  30.05n ± 10%    125.25n ±  2%   +316.74% (p=0.000 n=10)
UnpackInt32/bitWidth=9-12                  170.6n ± 11%     150.3n ±  7%    -11.90% (p=0.029 n=10)
UnpackInt32/bitWidth=10-12                 135.1n ±  8%     149.4n ±  3%    +10.62% (p=0.002 n=10)
UnpackInt32/bitWidth=11-12                 138.8n ±  3%     151.8n ± 19%     +9.29% (p=0.000 n=10)
UnpackInt32/bitWidth=12-12                 132.6n ± 11%     145.0n ±  7%     +9.39% (p=0.041 n=10)
UnpackInt32/bitWidth=13-12                 139.9n ± 21%     156.0n ±  5%          ~ (p=0.063 n=10)
UnpackInt32/bitWidth=14-12                 129.7n ±  6%     153.8n ±  2%    +18.58% (p=0.000 n=10)
UnpackInt32/bitWidth=15-12                 133.7n ±  5%     157.4n ±  0%    +17.77% (p=0.000 n=10)
UnpackInt32/bitWidth=16-12                 26.43n ±  2%    128.15n ±  6%   +384.96% (p=0.000 n=10)
UnpackInt32/bitWidth=17-12                 132.0n ±  1%     168.0n ±  2%    +27.27% (p=0.000 n=10)
UnpackInt32/bitWidth=18-12                 133.1n ±  4%     163.4n ±  3%    +22.76% (p=0.000 n=10)
UnpackInt32/bitWidth=19-12                 141.4n ±  4%     178.9n ±  2%    +26.52% (p=0.000 n=10)
UnpackInt32/bitWidth=20-12                 132.4n ±  1%     166.2n ±  7%    +25.52% (p=0.000 n=10)
UnpackInt32/bitWidth=21-12                 155.7n ±  2%     190.5n ±  1%    +22.35% (p=0.000 n=10)
UnpackInt32/bitWidth=22-12                 154.0n ±  1%     194.3n ±  2%    +26.17% (p=0.000 n=10)
UnpackInt32/bitWidth=23-12                 150.8n ± 16%     191.4n ±  2%    +26.93% (p=0.000 n=10)
UnpackInt32/bitWidth=24-12                 130.6n ±  0%     166.7n ± 11%    +27.69% (p=0.000 n=10)
UnpackInt32/bitWidth=25-12                 146.2n ±  3%     180.0n ±  1%    +23.08% (p=0.000 n=10)
UnpackInt32/bitWidth=26-12                 148.0n ±  1%     183.1n ±  1%    +23.68% (p=0.000 n=10)
UnpackInt32/bitWidth=27-12                 150.1n ±  3%     196.3n ±  2%    +30.78% (p=0.000 n=10)
UnpackInt32/bitWidth=28-12                 147.0n ±  1%     221.8n ± 11%    +50.88% (p=0.000 n=10)
UnpackInt32/bitWidth=29-12                 153.3n ±  2%     224.8n ± 33%    +46.72% (p=0.000 n=10)
UnpackInt32/bitWidth=30-12                 151.8n ±  1%     245.0n ± 19%    +61.48% (p=0.000 n=10)
UnpackInt32/bitWidth=31-12                 154.3n ±  1%     220.5n ± 17%    +42.90% (p=0.000 n=10)
UnpackInt32/bitWidth=32-12                 8.548n ±  1%   140.150n ±  8%  +1539.47% (p=0.000 n=10)
geomean                                    81.72n           162.3n          +98.55%

                           │ /tmp/bench_arm64_final.txt │      /tmp/bench_purego_final.txt      │
                           │            B/s             │      B/s       vs base                │
UnpackInt32/bitWidth=1-12                18.763Gi ±  3%   3.848Gi ±  2%  -79.49% (p=0.000 n=10)
UnpackInt32/bitWidth=2-12                17.825Gi ±  2%   3.727Gi ±  7%  -79.09% (p=0.000 n=10)
UnpackInt32/bitWidth=3-12                18.934Gi ±  4%   3.575Gi ± 43%  -81.12% (p=0.000 n=10)
UnpackInt32/bitWidth=4-12                18.078Gi ±  0%   3.484Gi ±  6%  -80.73% (p=0.000 n=10)
UnpackInt32/bitWidth=5-12                18.884Gi ±  2%   3.410Gi ±  4%  -81.94% (p=0.000 n=10)
UnpackInt32/bitWidth=6-12                17.366Gi ±  2%   3.578Gi ±  1%  -79.39% (p=0.000 n=10)
UnpackInt32/bitWidth=7-12                17.712Gi ± 12%   3.405Gi ±  2%  -80.77% (p=0.000 n=10)
UnpackInt32/bitWidth=8-12                15.868Gi ±  9%   3.807Gi ±  2%  -76.01% (p=0.000 n=10)
UnpackInt32/bitWidth=9-12                 2.796Gi ± 13%   3.172Gi ±  6%  +13.46% (p=0.029 n=10)
UnpackInt32/bitWidth=10-12                3.530Gi ±  8%   3.191Gi ±  3%   -9.61% (p=0.002 n=10)
UnpackInt32/bitWidth=11-12                3.435Gi ±  4%   3.142Gi ± 16%   -8.52% (p=0.000 n=10)
UnpackInt32/bitWidth=12-12                3.596Gi ± 10%   3.287Gi ±  6%        ~ (p=0.052 n=10)
UnpackInt32/bitWidth=13-12                3.409Gi ± 17%   3.058Gi ±  5%        ~ (p=0.063 n=10)
UnpackInt32/bitWidth=14-12                3.676Gi ±  6%   3.100Gi ±  2%  -15.69% (p=0.000 n=10)
UnpackInt32/bitWidth=15-12                3.567Gi ±  5%   3.029Gi ±  0%  -15.08% (p=0.000 n=10)
UnpackInt32/bitWidth=16-12               18.047Gi ±  2%   3.721Gi ±  5%  -79.38% (p=0.000 n=10)
UnpackInt32/bitWidth=17-12                3.612Gi ±  1%   2.838Gi ±  2%  -21.42% (p=0.000 n=10)
UnpackInt32/bitWidth=18-12                3.583Gi ±  4%   2.918Gi ±  2%  -18.54% (p=0.000 n=10)
UnpackInt32/bitWidth=19-12                3.372Gi ±  4%   2.666Gi ±  2%  -20.95% (p=0.000 n=10)
UnpackInt32/bitWidth=20-12                3.600Gi ±  1%   2.869Gi ±  7%  -20.32% (p=0.000 n=10)
UnpackInt32/bitWidth=21-12                3.063Gi ±  2%   2.503Gi ±  1%  -18.26% (p=0.000 n=10)
UnpackInt32/bitWidth=22-12                3.097Gi ±  1%   2.455Gi ±  2%  -20.74% (p=0.000 n=10)
UnpackInt32/bitWidth=23-12                3.163Gi ± 14%   2.492Gi ±  2%  -21.23% (p=0.000 n=10)
UnpackInt32/bitWidth=24-12                3.652Gi ±  0%   2.861Gi ± 10%  -21.67% (p=0.000 n=10)
UnpackInt32/bitWidth=25-12                3.261Gi ±  3%   2.650Gi ±  1%  -18.75% (p=0.000 n=10)
UnpackInt32/bitWidth=26-12                3.221Gi ±  1%   2.605Gi ±  1%  -19.14% (p=0.000 n=10)
UnpackInt32/bitWidth=27-12                3.177Gi ±  3%   2.429Gi ±  2%  -23.56% (p=0.000 n=10)
UnpackInt32/bitWidth=28-12                3.245Gi ±  1%   2.150Gi ± 13%  -33.74% (p=0.000 n=10)
UnpackInt32/bitWidth=29-12                3.111Gi ±  2%   2.125Gi ± 25%  -31.71% (p=0.000 n=10)
UnpackInt32/bitWidth=30-12                3.142Gi ±  1%   1.946Gi ± 19%  -38.05% (p=0.000 n=10)
UnpackInt32/bitWidth=31-12                3.090Gi ±  1%   2.162Gi ± 15%  -30.02% (p=0.000 n=10)
UnpackInt32/bitWidth=32-12               55.779Gi ±  1%   3.403Gi ±  7%  -93.90% (p=0.000 n=10)
geomean                                   5.835Gi         2.939Gi        -49.63%

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>

fpetkovski

Really interesting. Were you able to do this with an LLM, or do you know ARM assembly?

achille-roussel · 2025-10-25T07:02:16Z

Completely generated by Claude, with a lot of guidance in the process to get there.

fpetkovski · 2025-10-25T13:12:37Z

unpack_int64_arm64.go

+
+func unpackInt64(dst []int64, src []byte, bitWidth uint) {
+	// For ARM64, we'll use NEON instructions
+	// TODO: Implement NEON optimizations - using default for now


Ah, I missed this part; it looks like the asm implementation is not yet hooked in.

I don't have the implementation for 64-bit integers yet, but the one for 32-bit integers is wired in. I'll follow up with the other.
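
For context on what the "default" path has to do, here is a minimal pure-Go sketch of scalar bit unpacking for 32-bit values. It assumes values are packed back to back starting at the least significant bit of the first byte (the layout Parquet uses for bit-packed runs); the function name `unpackInt32Generic` is hypothetical and this is not the library's actual implementation:

```go
package main

import "fmt"

// unpackInt32Generic decodes len(dst) values of bitWidth bits each from src.
// Assumption: values are packed LSB-first with no padding between them.
// It panics if src holds fewer than len(dst)*bitWidth bits.
// Hypothetical reference code, not the parquet-go/bitpack implementation.
func unpackInt32Generic(dst []int32, src []byte, bitWidth uint) {
	if bitWidth == 0 {
		for i := range dst {
			dst[i] = 0
		}
		return
	}
	mask := uint64(1)<<bitWidth - 1
	var acc uint64 // bit accumulator holding not-yet-consumed input bits
	var nbits uint // number of valid bits currently in acc
	pos := 0       // index of the next unread byte in src
	for i := range dst {
		// Refill the accumulator one byte at a time until a full value is available.
		for nbits < bitWidth {
			acc |= uint64(src[pos]) << nbits
			pos++
			nbits += 8
		}
		dst[i] = int32(uint32(acc & mask))
		acc >>= bitWidth
		nbits -= bitWidth
	}
}

func main() {
	// 1, 2, 3, 4 packed at 3 bits each: 1 | 2<<3 | 3<<6 | 4<<9 = 0x08D1.
	src := []byte{0xD1, 0x08}
	dst := make([]int32, 4)
	unpackInt32Generic(dst, src, 3)
	fmt.Println(dst)
}
```

A loop like this handles every bit width with the same code, which is why it makes a convenient fallback. Architecture-specific kernels can then be specialized per bit width behind the same function signature.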

implement bit unpacking optimizations for arm64

fde8b38

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>

achille-roussel requested a review from fpetkovski October 24, 2025 22:47

achille-roussel self-assigned this Oct 24, 2025

achille-roussel added the enhancement New feature or request label Oct 24, 2025

achille-roussel added 3 commits October 24, 2025 16:09

add NEON implementation for 8 and 16 bits values

f44e135

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>

add NEON optimizations for 1, 2, 4 bits

226acf8

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>

add NEON optimizations for 3, 5, 6, 7 bits

f304527

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>

achille-roussel mentioned this pull request Oct 24, 2025

use parquet-go/bitpack parquet-go/parquet-go#336

Merged

fpetkovski approved these changes Oct 25, 2025

fpetkovski merged commit dc7abba into main Oct 25, 2025
4 checks passed

fpetkovski reviewed Oct 25, 2025

achille-roussel mentioned this pull request Oct 25, 2025

implement 64 bit unpacking optimizations for arm64 #3

Merged

achille-roussel deleted the unpack-arm64 branch October 26, 2025 13:49

fpetkovski merged 4 commits into main from unpack-arm64