Unrolling uint64 encoding

We have to encode 64 bit integers all the time (basically every field header requires it), so I tried unrolling the uint64 encoding logic to see what kind of performance gains we can get.

I tried 3 different ways of unrolling the loop:

1. Just a naive way translating the while loop to if statements
2. Same as 1, but eliminating intermediate local variable writes
3. Rather than appending to the buffer for each byte, use the `Array#pack` optimization

The code is below:

<details>
  <summary>Click me for benchmark</summary>

```ruby
# frozen_string_literal: true

require "benchmark/ips"

def uint64_enc1 x, buff
  while x != 0
    byte = x & 0x7F
    x >>= 7
    byte |= 0x80 if x > 0
    buff << byte
  end
end

def uint64_enc2 x, buff
  if (x > 0)
    byte = (x >> 0) & 0x7F

    if (x >> 7) > 0
      buff << (byte | 0x80)

      byte = (x >> 7) & 0x7F

      if (x >> 14) > 0
        buff << (byte | 0x80)

        byte = (x >> 14) & 0x7F

        if (x >> 21) > 0
          buff << (byte | 0x80)

          byte = (x >> 21) & 0x7F

          if (x >> 28) > 0
            buff << (byte | 0x80)

            byte = (x >> 28) & 0x7F

            if (x >> 35) > 0
              buff << (byte | 0x80)

              byte = (x >> 35) & 0x7F

              if (x >> 42) > 0
                buff << (byte | 0x80)

                byte = (x >> 42) & 0x7F

                if (x >> 49) > 0
                  buff << (byte | 0x80)

                  byte = (x >> 49) & 0x7F

                  if (x >> 56) > 0
                    buff << (byte | 0x80)

                    byte = (x >> 56) & 0x7F

                    if (x >> 63) > 0
                      buff << (byte | 0x80)

                      buff << 1
                    else
                      buff << byte
                    end
                  else
                    buff << byte
                  end
                else
                  buff << byte
                end
              else
                buff << byte
              end
            else
              buff << byte
            end
          else
            buff << byte
          end
        else
          buff << byte
        end
      else
        buff << byte
      end
    else
      buff << byte
    end
  end
end

def uint64_enc3 x, buff
  if (x > 0)
    if (x >> 7) > 0
      buff << (((x >> 0) & 0x7F) | 0x80)

      if (x >> 14) > 0
        buff << (((x >> 7) & 0x7F) | 0x80)

        if (x >> 21) > 0
          buff << (((x >> 14) & 0x7F) | 0x80)

          if (x >> 28) > 0
            buff << (((x >> 21) & 0x7F) | 0x80)

            if (x >> 35) > 0
              buff << (((x >> 28) & 0x7F) | 0x80)

              if (x >> 42) > 0
                buff << (((x >> 35) & 0x7F) | 0x80)

                if (x >> 49) > 0
                  buff << (((x >> 42) & 0x7F) | 0x80)

                  if (x >> 56) > 0
                    buff << (((x >> 49) & 0x7F) | 0x80)

                    if (x >> 63) > 0
                      buff << (((x >> 56) & 0x7F) | 0x80)

                      buff << 1
                    else
                      buff << (x >> 56)
                    end
                  else
                    buff << (x >> 49)
                  end
                else
                  buff << (x >> 42)
                end
              else
                buff << (x >> 35)
              end
            else
              buff << (x >> 28)
            end
          else
            buff << (x >> 21)
          end
        else
          buff << (x >> 14)
        end
      else
        buff << (x >> 7)
      end
    else
      buff << x
    end
  end
end

def uint64_enc4 x, buff
  if (x > 0)
    if (x >> 7) > 0
      byte0 = (((x >> 0) & 0x7F) | 0x80)

      if (x >> 14) > 0
        byte1 = (((x >> 7) & 0x7F) | 0x80)

        if (x >> 21) > 0
          byte2 = (((x >> 14) & 0x7F) | 0x80)

          if (x >> 28) > 0
            byte3 = (((x >> 21) & 0x7F) | 0x80)

            if (x >> 35) > 0
              byte4 = (((x >> 28) & 0x7F) | 0x80)

              if (x >> 42) > 0
                byte5 = (((x >> 35) & 0x7F) | 0x80)

                if (x >> 49) > 0
                  byte6 = (((x >> 42) & 0x7F) | 0x80)

                  if (x >> 56) > 0
                    byte7 = (((x >> 49) & 0x7F) | 0x80)

                    if (x >> 63) > 0
                      byte8 = (((x >> 56) & 0x7F) | 0x80)

                      buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, byte8, 1].pack("CCCCCCCCCC")
                    else
                      buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, (x >> 56)].pack("CCCCCCCCC")
                    end
                  else
                    buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, (x >> 49)].pack("CCCCCCCC")
                  end
                else
                  buff << [byte0, byte1, byte2, byte3, byte4, byte5, (x >> 42)].pack("CCCCCCC")
                end
              else
                buff << [byte0, byte1, byte2, byte3, byte4, (x >> 35)].pack("CCCCCC")
              end
            else
              buff << [byte0, byte1, byte2, byte3, (x >> 28)].pack("CCCCC")
            end
          else
            buff << [byte0, byte1, byte2, (x >> 21)].pack("CCCC")
          end
        else
          buff << [byte0, byte1, (x >> 14)].pack("CCC")
        end
      else
        buff << [byte0, (x >> 7)].pack("CC")
      end
    else
      buff << [(x >> 0)].pack("C")
    end
  end
end

Benchmark.ips do |x|
  x.report("uint64_enc1 (loop)") {
    uint64_enc1(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc2 (unroll)") {
    uint64_enc2(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc3 (unroll better)") {
    uint64_enc3(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc4 (Array#pack)") {
    uint64_enc4(0xFFFF_FFFF, "".b)
  }
end
```

</details>

Results are below.

Interpreter:

```
$ ruby test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)   345.396k i/100ms
uint64_enc2 (unroll)   380.889k i/100ms
uint64_enc3 (unroll better)
                       415.911k i/100ms
uint64_enc4 (Array#pack)
                       319.319k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)      3.413M (± 2.4%) i/s -     17.270M in   5.063010s
uint64_enc2 (unroll)      3.743M (± 1.2%) i/s -     19.044M in   5.088180s
uint64_enc3 (unroll better)
                          4.119M (± 0.6%) i/s -     20.796M in   5.048463s
uint64_enc4 (Array#pack)
                          3.144M (± 0.7%) i/s -     15.966M in   5.078173s
```

YJIT:

```
$ ruby --yjit test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) +YJIT [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)     1.067M i/100ms
uint64_enc2 (unroll)     1.095M i/100ms
uint64_enc3 (unroll better)
                         1.089M i/100ms
uint64_enc4 (Array#pack)
                       640.782k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)     11.533M (± 0.5%) i/s -     58.679M in   5.087999s
uint64_enc2 (unroll)     11.700M (± 0.6%) i/s -     59.114M in   5.052717s
uint64_enc3 (unroll better)
                         11.760M (± 0.5%) i/s -     59.869M in   5.090842s
uint64_enc4 (Array#pack)
                          6.785M (± 1.4%) i/s -     33.961M in   5.006533s
```

Unrolling the loop seems to help the interpreter, but doesn't seem to do much for YJIT.  I'm a little bit surprised by how slow the `Array#pack` solution is on YJIT, I expected it to fare better since I merged https://github.com/ruby/ruby/pull/8734

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unrolling uint64 encoding #97

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Unrolling uint64 encoding #97

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions