Open
Description
We have to encode 64 bit integers all the time (basically every field header requires it), so I tried unrolling the uint64 encoding logic to see what kind of performance gains we can get.
I tried 3 different ways of unrolling the loop:
- Just a naive way translating the while loop to if statements
- Same as 1, but eliminating intermediate local variable writes
- Rather than appending to the buffer for each byte, use the
Array#pack
optimization
The code is below:
Click me for benchmark
# frozen_string_literal: true
require "benchmark/ips"
def uint64_enc1 x, buff
while x != 0
byte = x & 0x7F
x >>= 7
byte |= 0x80 if x > 0
buff << byte
end
end
def uint64_enc2 x, buff
if (x > 0)
byte = (x >> 0) & 0x7F
if (x >> 7) > 0
buff << (byte | 0x80)
byte = (x >> 7) & 0x7F
if (x >> 14) > 0
buff << (byte | 0x80)
byte = (x >> 14) & 0x7F
if (x >> 21) > 0
buff << (byte | 0x80)
byte = (x >> 21) & 0x7F
if (x >> 28) > 0
buff << (byte | 0x80)
byte = (x >> 28) & 0x7F
if (x >> 35) > 0
buff << (byte | 0x80)
byte = (x >> 35) & 0x7F
if (x >> 42) > 0
buff << (byte | 0x80)
byte = (x >> 42) & 0x7F
if (x >> 49) > 0
buff << (byte | 0x80)
byte = (x >> 49) & 0x7F
if (x >> 56) > 0
buff << (byte | 0x80)
byte = (x >> 56) & 0x7F
if (x >> 63) > 0
buff << (byte | 0x80)
buff << 1
else
buff << byte
end
else
buff << byte
end
else
buff << byte
end
else
buff << byte
end
else
buff << byte
end
else
buff << byte
end
else
buff << byte
end
else
buff << byte
end
else
buff << byte
end
end
end
def uint64_enc3 x, buff
if (x > 0)
if (x >> 7) > 0
buff << (((x >> 0) & 0x7F) | 0x80)
if (x >> 14) > 0
buff << (((x >> 7) & 0x7F) | 0x80)
if (x >> 21) > 0
buff << (((x >> 14) & 0x7F) | 0x80)
if (x >> 28) > 0
buff << (((x >> 21) & 0x7F) | 0x80)
if (x >> 35) > 0
buff << (((x >> 28) & 0x7F) | 0x80)
if (x >> 42) > 0
buff << (((x >> 35) & 0x7F) | 0x80)
if (x >> 49) > 0
buff << (((x >> 42) & 0x7F) | 0x80)
if (x >> 56) > 0
buff << (((x >> 49) & 0x7F) | 0x80)
if (x >> 63) > 0
buff << (((x >> 56) & 0x7F) | 0x80)
buff << 1
else
buff << (x >> 56)
end
else
buff << (x >> 49)
end
else
buff << (x >> 42)
end
else
buff << (x >> 35)
end
else
buff << (x >> 28)
end
else
buff << (x >> 21)
end
else
buff << (x >> 14)
end
else
buff << (x >> 7)
end
else
buff << x
end
end
end
def uint64_enc4 x, buff
if (x > 0)
if (x >> 7) > 0
byte0 = (((x >> 0) & 0x7F) | 0x80)
if (x >> 14) > 0
byte1 = (((x >> 7) & 0x7F) | 0x80)
if (x >> 21) > 0
byte2 = (((x >> 14) & 0x7F) | 0x80)
if (x >> 28) > 0
byte3 = (((x >> 21) & 0x7F) | 0x80)
if (x >> 35) > 0
byte4 = (((x >> 28) & 0x7F) | 0x80)
if (x >> 42) > 0
byte5 = (((x >> 35) & 0x7F) | 0x80)
if (x >> 49) > 0
byte6 = (((x >> 42) & 0x7F) | 0x80)
if (x >> 56) > 0
byte7 = (((x >> 49) & 0x7F) | 0x80)
if (x >> 63) > 0
byte8 = (((x >> 56) & 0x7F) | 0x80)
buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, byte8, 1].pack("CCCCCCCCCC")
else
buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, (x >> 56)].pack("CCCCCCCCC")
end
else
buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, (x >> 49)].pack("CCCCCCCC")
end
else
buff << [byte0, byte1, byte2, byte3, byte4, byte5, (x >> 42)].pack("CCCCCCC")
end
else
buff << [byte0, byte1, byte2, byte3, byte4, (x >> 35)].pack("CCCCCC")
end
else
buff << [byte0, byte1, byte2, byte3, (x >> 28)].pack("CCCCC")
end
else
buff << [byte0, byte1, byte2, (x >> 21)].pack("CCCC")
end
else
buff << [byte0, byte1, (x >> 14)].pack("CCC")
end
else
buff << [byte0, (x >> 7)].pack("CC")
end
else
buff << [(x >> 0)].pack("C")
end
end
end
Benchmark.ips do |x|
x.report("uint64_enc1 (loop)") {
uint64_enc1(0xFFFF_FFFF, "".b)
}
x.report("uint64_enc2 (unroll)") {
uint64_enc2(0xFFFF_FFFF, "".b)
}
x.report("uint64_enc3 (unroll better)") {
uint64_enc3(0xFFFF_FFFF, "".b)
}
x.report("uint64_enc4 (Array#pack)") {
uint64_enc4(0xFFFF_FFFF, "".b)
}
end
Results are below.
Interpreter:
$ ruby test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) [arm64-darwin23]
Warming up --------------------------------------
uint64_enc1 (loop) 345.396k i/100ms
uint64_enc2 (unroll) 380.889k i/100ms
uint64_enc3 (unroll better)
415.911k i/100ms
uint64_enc4 (Array#pack)
319.319k i/100ms
Calculating -------------------------------------
uint64_enc1 (loop) 3.413M (± 2.4%) i/s - 17.270M in 5.063010s
uint64_enc2 (unroll) 3.743M (± 1.2%) i/s - 19.044M in 5.088180s
uint64_enc3 (unroll better)
4.119M (± 0.6%) i/s - 20.796M in 5.048463s
uint64_enc4 (Array#pack)
3.144M (± 0.7%) i/s - 15.966M in 5.078173s
YJIT:
$ ruby --yjit test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) +YJIT [arm64-darwin23]
Warming up --------------------------------------
uint64_enc1 (loop) 1.067M i/100ms
uint64_enc2 (unroll) 1.095M i/100ms
uint64_enc3 (unroll better)
1.089M i/100ms
uint64_enc4 (Array#pack)
640.782k i/100ms
Calculating -------------------------------------
uint64_enc1 (loop) 11.533M (± 0.5%) i/s - 58.679M in 5.087999s
uint64_enc2 (unroll) 11.700M (± 0.6%) i/s - 59.114M in 5.052717s
uint64_enc3 (unroll better)
11.760M (± 0.5%) i/s - 59.869M in 5.090842s
uint64_enc4 (Array#pack)
6.785M (± 1.4%) i/s - 33.961M in 5.006533s
Unrolling the loop seems to help the interpreter, but doesn't seem to do much for YJIT. I'm a little bit surprised by how slow the Array#pack
solution is on YJIT, I expected it to fare better since I merged ruby/ruby#8734
Metadata
Metadata
Assignees
Labels
No labels
Activity