Skip to content

Unrolling uint64 encoding #97

Open
@tenderlove

Description

@tenderlove

We have to encode 64 bit integers all the time (basically every field header requires it), so I tried unrolling the uint64 encoding logic to see what kind of performance gains we can get.

I tried 3 different ways of unrolling the loop:

  1. Just a naive way translating the while loop to if statements
  2. Same as 1, but eliminating intermediate local variable writes
  3. Rather than appending to the buffer for each byte, use the Array#pack optimization

The code is below:

Click me for benchmark
# frozen_string_literal: true

require "benchmark/ips"

def uint64_enc1 x, buff
  while x != 0
    byte = x & 0x7F
    x >>= 7
    byte |= 0x80 if x > 0
    buff << byte
  end
end

def uint64_enc2 x, buff
  if (x > 0)
    byte = (x >> 0) & 0x7F

    if (x >> 7) > 0
      buff << (byte | 0x80)

      byte = (x >> 7) & 0x7F

      if (x >> 14) > 0
        buff << (byte | 0x80)

        byte = (x >> 14) & 0x7F

        if (x >> 21) > 0
          buff << (byte | 0x80)

          byte = (x >> 21) & 0x7F

          if (x >> 28) > 0
            buff << (byte | 0x80)

            byte = (x >> 28) & 0x7F

            if (x >> 35) > 0
              buff << (byte | 0x80)

              byte = (x >> 35) & 0x7F

              if (x >> 42) > 0
                buff << (byte | 0x80)

                byte = (x >> 42) & 0x7F

                if (x >> 49) > 0
                  buff << (byte | 0x80)

                  byte = (x >> 49) & 0x7F

                  if (x >> 56) > 0
                    buff << (byte | 0x80)

                    byte = (x >> 56) & 0x7F

                    if (x >> 63) > 0
                      buff << (byte | 0x80)

                      buff << 1
                    else
                      buff << byte
                    end
                  else
                    buff << byte
                  end
                else
                  buff << byte
                end
              else
                buff << byte
              end
            else
              buff << byte
            end
          else
            buff << byte
          end
        else
          buff << byte
        end
      else
        buff << byte
      end
    else
      buff << byte
    end
  end
end

def uint64_enc3 x, buff
  if (x > 0)
    if (x >> 7) > 0
      buff << (((x >> 0) & 0x7F) | 0x80)

      if (x >> 14) > 0
        buff << (((x >> 7) & 0x7F) | 0x80)

        if (x >> 21) > 0
          buff << (((x >> 14) & 0x7F) | 0x80)

          if (x >> 28) > 0
            buff << (((x >> 21) & 0x7F) | 0x80)

            if (x >> 35) > 0
              buff << (((x >> 28) & 0x7F) | 0x80)

              if (x >> 42) > 0
                buff << (((x >> 35) & 0x7F) | 0x80)

                if (x >> 49) > 0
                  buff << (((x >> 42) & 0x7F) | 0x80)

                  if (x >> 56) > 0
                    buff << (((x >> 49) & 0x7F) | 0x80)

                    if (x >> 63) > 0
                      buff << (((x >> 56) & 0x7F) | 0x80)

                      buff << 1
                    else
                      buff << (x >> 56)
                    end
                  else
                    buff << (x >> 49)
                  end
                else
                  buff << (x >> 42)
                end
              else
                buff << (x >> 35)
              end
            else
              buff << (x >> 28)
            end
          else
            buff << (x >> 21)
          end
        else
          buff << (x >> 14)
        end
      else
        buff << (x >> 7)
      end
    else
      buff << x
    end
  end
end

def uint64_enc4 x, buff
  if (x > 0)
    if (x >> 7) > 0
      byte0 = (((x >> 0) & 0x7F) | 0x80)

      if (x >> 14) > 0
        byte1 = (((x >> 7) & 0x7F) | 0x80)

        if (x >> 21) > 0
          byte2 = (((x >> 14) & 0x7F) | 0x80)

          if (x >> 28) > 0
            byte3 = (((x >> 21) & 0x7F) | 0x80)

            if (x >> 35) > 0
              byte4 = (((x >> 28) & 0x7F) | 0x80)

              if (x >> 42) > 0
                byte5 = (((x >> 35) & 0x7F) | 0x80)

                if (x >> 49) > 0
                  byte6 = (((x >> 42) & 0x7F) | 0x80)

                  if (x >> 56) > 0
                    byte7 = (((x >> 49) & 0x7F) | 0x80)

                    if (x >> 63) > 0
                      byte8 = (((x >> 56) & 0x7F) | 0x80)

                      buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, byte8, 1].pack("CCCCCCCCCC")
                    else
                      buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, byte7, (x >> 56)].pack("CCCCCCCCC")
                    end
                  else
                    buff << [byte0, byte1, byte2, byte3, byte4, byte5, byte6, (x >> 49)].pack("CCCCCCCC")
                  end
                else
                  buff << [byte0, byte1, byte2, byte3, byte4, byte5, (x >> 42)].pack("CCCCCCC")
                end
              else
                buff << [byte0, byte1, byte2, byte3, byte4, (x >> 35)].pack("CCCCCC")
              end
            else
              buff << [byte0, byte1, byte2, byte3, (x >> 28)].pack("CCCCC")
            end
          else
            buff << [byte0, byte1, byte2, (x >> 21)].pack("CCCC")
          end
        else
          buff << [byte0, byte1, (x >> 14)].pack("CCC")
        end
      else
        buff << [byte0, (x >> 7)].pack("CC")
      end
    else
      buff << [(x >> 0)].pack("C")
    end
  end
end

Benchmark.ips do |x|
  x.report("uint64_enc1 (loop)") {
    uint64_enc1(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc2 (unroll)") {
    uint64_enc2(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc3 (unroll better)") {
    uint64_enc3(0xFFFF_FFFF, "".b)
  }
  x.report("uint64_enc4 (Array#pack)") {
    uint64_enc4(0xFFFF_FFFF, "".b)
  }
end

Results are below.

Interpreter:

$ ruby test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)   345.396k i/100ms
uint64_enc2 (unroll)   380.889k i/100ms
uint64_enc3 (unroll better)
                       415.911k i/100ms
uint64_enc4 (Array#pack)
                       319.319k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)      3.413M (± 2.4%) i/s -     17.270M in   5.063010s
uint64_enc2 (unroll)      3.743M (± 1.2%) i/s -     19.044M in   5.088180s
uint64_enc3 (unroll better)
                          4.119M (± 0.6%) i/s -     20.796M in   5.048463s
uint64_enc4 (Array#pack)
                          3.144M (± 0.7%) i/s -     15.966M in   5.078173s

YJIT:

$ ruby --yjit test.rb
ruby 3.4.0dev (2024-05-23T18:29:33Z specialize-array-p.. d32ef53305) +YJIT [arm64-darwin23]
Warming up --------------------------------------
  uint64_enc1 (loop)     1.067M i/100ms
uint64_enc2 (unroll)     1.095M i/100ms
uint64_enc3 (unroll better)
                         1.089M i/100ms
uint64_enc4 (Array#pack)
                       640.782k i/100ms
Calculating -------------------------------------
  uint64_enc1 (loop)     11.533M (± 0.5%) i/s -     58.679M in   5.087999s
uint64_enc2 (unroll)     11.700M (± 0.6%) i/s -     59.114M in   5.052717s
uint64_enc3 (unroll better)
                         11.760M (± 0.5%) i/s -     59.869M in   5.090842s
uint64_enc4 (Array#pack)
                          6.785M (± 1.4%) i/s -     33.961M in   5.006533s

Unrolling the loop seems to help the interpreter, but doesn't seem to do much for YJIT. I'm a little bit surprised by how slow the Array#pack solution is on YJIT, I expected it to fare better since I merged ruby/ruby#8734

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions