implement bit unpacking optimizations for arm64 #2

Merged: fpetkovski merged 4 commits into main from unpack-arm64 on Oct 25, 2025

@achille-roussel (Contributor) commented Oct 24, 2025

Measurable improvements:

ARM64 vs Purego Benchmark Results

  Highly Optimized BitWidths (1-8, 16) - ~5x Faster

  | BitWidth | Speedup | Time Reduction | Throughput Increase           |
  |----------|---------|----------------|-------------------------------|
  | 1        | 4.88x   | -79.49%        | +387.63% (3.85 → 18.76 GiB/s) |
  | 2        | 4.78x   | -79.09%        | +378.25% (3.73 → 17.83 GiB/s) |
  | 3        | 5.30x   | -81.12%        | +429.67% (3.58 → 18.93 GiB/s) |
  | 4        | 5.19x   | -80.75%        | +418.90% (3.48 → 18.08 GiB/s) |
  | 5        | 5.54x   | -81.94%        | +453.82% (3.41 → 18.88 GiB/s) |
  | 6        | 4.85x   | -79.39%        | +385.28% (3.58 → 17.37 GiB/s) |
  | 7        | 5.20x   | -80.77%        | +420.11% (3.41 → 17.71 GiB/s) |
  | 8        | 4.17x   | -76.00%        | +316.78% (3.81 → 15.87 GiB/s) |
  | 16       | 4.85x   | -79.38%        | +385.06% (3.72 → 18.05 GiB/s) |

  Scalar Optimized BitWidths (10-15, 17-27) - ~9-31% Higher Throughput

  | BitWidth | Speedup    | Time Reduction | Throughput Increase |
  |----------|------------|----------------|---------------------|
  | 10       | 1.11x      | -9.60%         | +10.63%             |
  | 11       | 1.09x      | -8.50%         | +9.31%              |
  | 14       | 1.19x      | -15.67%        | +18.61%             |
  | 15       | 1.18x      | -15.09%        | +17.75%             |
  | 17       | 1.27x      | -21.43%        | +27.25%             |
  | 18       | 1.23x      | -18.54%        | +22.77%             |
  | 19       | 1.27x      | -20.96%        | +26.51%             |
  | 20       | 1.26x      | -20.33%        | +25.51%             |
  | 21-27    | 1.22-1.31x | -18% to -23%   | +22% to +31%        |

Full benchmark breakdown:

goos: darwin
goarch: arm64
pkg: github.com/parquet-go/bitpack
cpu: Apple M2 Pro
                           │ /tmp/bench_arm64_final.txt │       /tmp/bench_purego_final.txt        │
                           │           sec/op           │     sec/op      vs base                  │
UnpackInt32/bitWidth=1-12                  25.41n ±  4%    123.90n ±  2%   +387.60% (p=0.000 n=10)
UnpackInt32/bitWidth=2-12                  26.75n ±  2%    127.90n ±  8%   +378.13% (p=0.000 n=10)
UnpackInt32/bitWidth=3-12                  25.18n ±  4%    133.40n ± 77%   +429.68% (p=0.000 n=10)
UnpackInt32/bitWidth=4-12                  26.38n ±  0%    137.05n ±  6%   +419.52% (p=0.000 n=10)
UnpackInt32/bitWidth=5-12                  25.25n ±  2%    139.85n ±  4%   +453.86% (p=0.000 n=10)
UnpackInt32/bitWidth=6-12                  27.46n ±  2%    133.25n ±  1%   +385.25% (p=0.000 n=10)
UnpackInt32/bitWidth=7-12                  26.92n ± 13%    140.05n ±  2%   +420.15% (p=0.000 n=10)
UnpackInt32/bitWidth=8-12                  30.05n ± 10%    125.25n ±  2%   +316.74% (p=0.000 n=10)
UnpackInt32/bitWidth=9-12                  170.6n ± 11%     150.3n ±  7%    -11.90% (p=0.029 n=10)
UnpackInt32/bitWidth=10-12                 135.1n ±  8%     149.4n ±  3%    +10.62% (p=0.002 n=10)
UnpackInt32/bitWidth=11-12                 138.8n ±  3%     151.8n ± 19%     +9.29% (p=0.000 n=10)
UnpackInt32/bitWidth=12-12                 132.6n ± 11%     145.0n ±  7%     +9.39% (p=0.041 n=10)
UnpackInt32/bitWidth=13-12                 139.9n ± 21%     156.0n ±  5%          ~ (p=0.063 n=10)
UnpackInt32/bitWidth=14-12                 129.7n ±  6%     153.8n ±  2%    +18.58% (p=0.000 n=10)
UnpackInt32/bitWidth=15-12                 133.7n ±  5%     157.4n ±  0%    +17.77% (p=0.000 n=10)
UnpackInt32/bitWidth=16-12                 26.43n ±  2%    128.15n ±  6%   +384.96% (p=0.000 n=10)
UnpackInt32/bitWidth=17-12                 132.0n ±  1%     168.0n ±  2%    +27.27% (p=0.000 n=10)
UnpackInt32/bitWidth=18-12                 133.1n ±  4%     163.4n ±  3%    +22.76% (p=0.000 n=10)
UnpackInt32/bitWidth=19-12                 141.4n ±  4%     178.9n ±  2%    +26.52% (p=0.000 n=10)
UnpackInt32/bitWidth=20-12                 132.4n ±  1%     166.2n ±  7%    +25.52% (p=0.000 n=10)
UnpackInt32/bitWidth=21-12                 155.7n ±  2%     190.5n ±  1%    +22.35% (p=0.000 n=10)
UnpackInt32/bitWidth=22-12                 154.0n ±  1%     194.3n ±  2%    +26.17% (p=0.000 n=10)
UnpackInt32/bitWidth=23-12                 150.8n ± 16%     191.4n ±  2%    +26.93% (p=0.000 n=10)
UnpackInt32/bitWidth=24-12                 130.6n ±  0%     166.7n ± 11%    +27.69% (p=0.000 n=10)
UnpackInt32/bitWidth=25-12                 146.2n ±  3%     180.0n ±  1%    +23.08% (p=0.000 n=10)
UnpackInt32/bitWidth=26-12                 148.0n ±  1%     183.1n ±  1%    +23.68% (p=0.000 n=10)
UnpackInt32/bitWidth=27-12                 150.1n ±  3%     196.3n ±  2%    +30.78% (p=0.000 n=10)
UnpackInt32/bitWidth=28-12                 147.0n ±  1%     221.8n ± 11%    +50.88% (p=0.000 n=10)
UnpackInt32/bitWidth=29-12                 153.3n ±  2%     224.8n ± 33%    +46.72% (p=0.000 n=10)
UnpackInt32/bitWidth=30-12                 151.8n ±  1%     245.0n ± 19%    +61.48% (p=0.000 n=10)
UnpackInt32/bitWidth=31-12                 154.3n ±  1%     220.5n ± 17%    +42.90% (p=0.000 n=10)
UnpackInt32/bitWidth=32-12                 8.548n ±  1%   140.150n ±  8%  +1539.47% (p=0.000 n=10)
geomean                                    81.72n           162.3n          +98.55%

                           │ /tmp/bench_arm64_final.txt │      /tmp/bench_purego_final.txt      │
                           │            B/s             │      B/s       vs base                │
UnpackInt32/bitWidth=1-12                18.763Gi ±  3%   3.848Gi ±  2%  -79.49% (p=0.000 n=10)
UnpackInt32/bitWidth=2-12                17.825Gi ±  2%   3.727Gi ±  7%  -79.09% (p=0.000 n=10)
UnpackInt32/bitWidth=3-12                18.934Gi ±  4%   3.575Gi ± 43%  -81.12% (p=0.000 n=10)
UnpackInt32/bitWidth=4-12                18.078Gi ±  0%   3.484Gi ±  6%  -80.73% (p=0.000 n=10)
UnpackInt32/bitWidth=5-12                18.884Gi ±  2%   3.410Gi ±  4%  -81.94% (p=0.000 n=10)
UnpackInt32/bitWidth=6-12                17.366Gi ±  2%   3.578Gi ±  1%  -79.39% (p=0.000 n=10)
UnpackInt32/bitWidth=7-12                17.712Gi ± 12%   3.405Gi ±  2%  -80.77% (p=0.000 n=10)
UnpackInt32/bitWidth=8-12                15.868Gi ±  9%   3.807Gi ±  2%  -76.01% (p=0.000 n=10)
UnpackInt32/bitWidth=9-12                 2.796Gi ± 13%   3.172Gi ±  6%  +13.46% (p=0.029 n=10)
UnpackInt32/bitWidth=10-12                3.530Gi ±  8%   3.191Gi ±  3%   -9.61% (p=0.002 n=10)
UnpackInt32/bitWidth=11-12                3.435Gi ±  4%   3.142Gi ± 16%   -8.52% (p=0.000 n=10)
UnpackInt32/bitWidth=12-12                3.596Gi ± 10%   3.287Gi ±  6%        ~ (p=0.052 n=10)
UnpackInt32/bitWidth=13-12                3.409Gi ± 17%   3.058Gi ±  5%        ~ (p=0.063 n=10)
UnpackInt32/bitWidth=14-12                3.676Gi ±  6%   3.100Gi ±  2%  -15.69% (p=0.000 n=10)
UnpackInt32/bitWidth=15-12                3.567Gi ±  5%   3.029Gi ±  0%  -15.08% (p=0.000 n=10)
UnpackInt32/bitWidth=16-12               18.047Gi ±  2%   3.721Gi ±  5%  -79.38% (p=0.000 n=10)
UnpackInt32/bitWidth=17-12                3.612Gi ±  1%   2.838Gi ±  2%  -21.42% (p=0.000 n=10)
UnpackInt32/bitWidth=18-12                3.583Gi ±  4%   2.918Gi ±  2%  -18.54% (p=0.000 n=10)
UnpackInt32/bitWidth=19-12                3.372Gi ±  4%   2.666Gi ±  2%  -20.95% (p=0.000 n=10)
UnpackInt32/bitWidth=20-12                3.600Gi ±  1%   2.869Gi ±  7%  -20.32% (p=0.000 n=10)
UnpackInt32/bitWidth=21-12                3.063Gi ±  2%   2.503Gi ±  1%  -18.26% (p=0.000 n=10)
UnpackInt32/bitWidth=22-12                3.097Gi ±  1%   2.455Gi ±  2%  -20.74% (p=0.000 n=10)
UnpackInt32/bitWidth=23-12                3.163Gi ± 14%   2.492Gi ±  2%  -21.23% (p=0.000 n=10)
UnpackInt32/bitWidth=24-12                3.652Gi ±  0%   2.861Gi ± 10%  -21.67% (p=0.000 n=10)
UnpackInt32/bitWidth=25-12                3.261Gi ±  3%   2.650Gi ±  1%  -18.75% (p=0.000 n=10)
UnpackInt32/bitWidth=26-12                3.221Gi ±  1%   2.605Gi ±  1%  -19.14% (p=0.000 n=10)
UnpackInt32/bitWidth=27-12                3.177Gi ±  3%   2.429Gi ±  2%  -23.56% (p=0.000 n=10)
UnpackInt32/bitWidth=28-12                3.245Gi ±  1%   2.150Gi ± 13%  -33.74% (p=0.000 n=10)
UnpackInt32/bitWidth=29-12                3.111Gi ±  2%   2.125Gi ± 25%  -31.71% (p=0.000 n=10)
UnpackInt32/bitWidth=30-12                3.142Gi ±  1%   1.946Gi ± 19%  -38.05% (p=0.000 n=10)
UnpackInt32/bitWidth=31-12                3.090Gi ±  1%   2.162Gi ± 15%  -30.02% (p=0.000 n=10)
UnpackInt32/bitWidth=32-12               55.779Gi ±  1%   3.403Gi ±  7%  -93.90% (p=0.000 n=10)
geomean                                   5.835Gi         2.939Gi        -49.63%

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>
@achille-roussel achille-roussel self-assigned this Oct 24, 2025
@achille-roussel achille-roussel added the enhancement New feature or request label Oct 24, 2025
@fpetkovski (Collaborator) left a comment

Really interesting. Were you able to do this with an LLM, or do you know ARM assembly?

@achille-roussel (Contributor Author)

Completely generated by Claude, with a lot of guidance in the process to get there.

@fpetkovski fpetkovski merged commit dc7abba into main Oct 25, 2025
4 checks passed

func unpackInt64(dst []int64, src []byte, bitWidth uint) {
	// For ARM64, we'll use NEON instructions
	// TODO: Implement NEON optimizations - using default for now
Collaborator

Ah, I missed this part. It looks like the asm implementation is not yet hooked in.

Contributor Author

I don't have the implementation for 64-bit integers, but the one for 32-bit integers is wired in. I'll follow up with the other.
