Is there fast CRC16 in pyboard #12891

skylin008 · 2023-11-05T15:16:50Z

skylin008
Nov 5, 2023

I used the pyboard to communication with RS485 device, when calucation the CRC value, the time beyond 2mS. Follows as my code:
`def calculateCRC(msgIn):
CRC16 = 0xffff
poly = 0xa001

for i in range(len(msgIn)):

    CRC16 = ord(msgIn[i]) ^ CRC16

    for _ in range(8):
        if (CRC16 & 0x0001 == 1):
            CRC16 >>= 1
            CRC16 ^= poly
        else:
            CRC16 >>= 1

return str(hex(CRC16 >> 8)),str(hex(CRC16 & 0x00ff))

`

Is there any algorithm to be suit pyboard?Thanks!

GitHubsSilverBullet · 2023-11-05T16:00:38Z

GitHubsSilverBullet
Nov 5, 2023

I'd rewrite that and use the @micropython.viper decorator.
Wouldn't be surprised if you get a 10×…20× speedup.

1 reply

GitHubsSilverBullet Nov 5, 2023

Oh well, just had some fun. There's probably still room for improvement.

from time import ticks_ms, ticks_diff

def calculateCRC(msgIn):
    CRC16 = 0xffff
    poly = 0xa001
    for i in range(len(msgIn)):
        CRC16 =  msgIn[i] ^ CRC16
        for _ in range(8):
            if (CRC16 & 0x0001 == 1):
                CRC16 >>= 1
                CRC16 ^= poly
            else:
                CRC16 >>= 1
    return str(hex(CRC16 >> 8)),str(hex(CRC16 & 0x00ff))

@micropython.native
def calculateCRC_n(msgIn):
    CRC16 = 0xffff
    poly = 0xa001
    for i in range(len(msgIn)):
        CRC16 =  msgIn[i] ^ CRC16
        for _ in range(8):
            if (CRC16 & 0x0001 == 1):
                CRC16 >>= 1
                CRC16 ^= poly
            else:
                CRC16 >>= 1
    return str(hex(CRC16 >> 8)),str(hex(CRC16 & 0x00ff))

@micropython.viper
def calculateCRC_v(msgIn: ptr8, num: int) -> int:
    CRC16: int = 0xffff
    poly: int = 0xa001
    i: int  = 0
    while i < num:
        CRC16 =  msgIn[i] ^ CRC16
        j: int = 8
        while j > 0:
            if (CRC16 & 0x0001 == 1):
                CRC16 >>= 1
                CRC16 ^= poly
            else:
                CRC16 >>= 1
            j -= 1
        i += 1
    return CRC16

buffer = bytearray(10000)
for i in range(3):
    t0 = ticks_ms()
    result = calculateCRC(buffer)
    t1 = ticks_ms()
    td = ticks_diff(t1,t0)
    print(result,td)

for i in range(3):
    t0 = ticks_ms()
    result = calculateCRC_n(buffer)
    t1 = ticks_ms()
    td = ticks_diff(t1,t0)
    print(result,td)

for i in range(3):
    t0 = ticks_ms()
    result = calculateCRC_v(buffer, len(buffer))
    result = str(hex(result >> 8)),str(hex(result & 0x00ff))
    t1 = ticks_ms()
    td = ticks_diff(t1,t0)
    print(result,td)

('0xc1', '0x76') 967
('0xc1', '0x76') 966
('0xc1', '0x76') 966
('0xc1', '0x76') 440
('0xc1', '0x76') 440
('0xc1', '0x76') 440
('0xc1', '0x76') 30
('0xc1', '0x76') 29
('0xc1', '0x76') 30

rkompass · 2023-11-05T16:28:07Z

rkompass
Nov 5, 2023

In the sdcard driver by @mendenm there is a fast crc16 implementation that uses a 256 byte-wide table and @thumb.asm.

1 reply

rkompass Nov 5, 2023

Correction: The above linked crc16 uses another poly (CCIT).
For a perhaps faster table-based approach, where the poly is user-defined see this older discussion.
Certainly the crc-computing part may be accelerated further by using viper too, or asm.thumb.
As for the pybard: It has hardware crc32, with a fixed polynomial, but not crc16, according to this doc. Related MP code is here.

mendenm · 2023-11-05T18:14:45Z

mendenm
Nov 5, 2023

One thing i may implement on the Pico/rp2 based machines is the use of a DMA channel to compute crc16. I need to look at which CRC16 they compute, and then see if it is fast, including the DMA setup overhead time. The DMA on rp2040 can do this as the data passes thrugh, and if one just sets one up to read the data block and pass it into a single byte somewhere, it would be an awesomely fast CRC16.

…

On Nov 5, 2023, at 12:46 PM, Raul Kompaß ***@***.***> wrote: Correction: The above linked crc16 uses another poly (CCIT). For a perhaps faster table-based approach, where the poly is user-defined see this older discussion. Certainly the crc-computing part may be accelerated further by using viper too, or asm.thumb. As for the pybard: It has hardware crc32, with a fixed polynomial, but not crc16, according to this doc. Related MP code is here. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

2 replies

GitHubsSilverBullet Nov 5, 2023

Viper code takes 3µs/byte. How fast do you need it to be?

rkompass Nov 6, 2023

Here is @GitHubsSilverBullet 's comparison extended for the table approach and asm_thumb. Was curious to see how far it would go.
Found: assembler + table together lead to another 20x speedup.

from time import ticks_us, ticks_diff
from array import array

PRESET = const(0xffff)
POLY = const(0xa001)

def crc16(data): # data.. bytearray or bytes object of data
    crc = PRESET
    for d in data:
        crc ^= d
        for _ in range(8):
            if crc & 1:
                crc = crc>>1 ^ POLY
            else:
                crc >>= 1
    return crc

@micropython.native
def crc16_n(data):
    crc = PRESET
    for d in data:
        crc ^= d
        for _ in range(8):
            if crc & 1:
                crc = crc>>1 ^ POLY
            else:
                crc >>= 1
    return crc

@micropython.viper
def crc16_v(data: ptr8, num: int) -> int:
    crc: int = PRESET
    i: int  = 0
    while i < num:
        crc =  data[i] ^ crc
        j: int = 8
        while j > 0:
            if crc & 1:
                crc = crc>>1 ^ POLY
            else:
                crc = crc>>1
            j -= 1
        i += 1
    return crc

# --- internal ---
# r6 .. 1
# r5 .. loop index (8, 7, .., 1)
# r4 .. working and test register
# --- arguments ---
# r3 .. poly  
# r2 .. n (length of data)
# r1 .. data address
# r0 .. crc
@micropython.asm_thumb
def crc16_a(r0, r1, r2, r3) -> int:
    mov(r6, 1)         # helper value for bit shift
    label(loop)       # --- outer loop --
    ldrb(r4, [r1, 0])    # r4 = data[0]
    add(r1, 1)           # increment data pointer
    eor(r0, r4)          # crc ^=  data[0]
    mov(r5, 8)           # r5 = 8, prepare inner loop
    label(loop_8)        # --- inner loop ---
    mov(r4, 1)
    and_(r4, r0)           # if crc & 1:
    beq(lelse)
    lsr(r0, r6)              # crc >>= 1
    eor(r0, r3)              # crc ^= poly
    sub(r5, 1)
    bne(loop_8)
    b(fin_8)
    label(lelse)           # else:
    lsr(r0, r6)              # crc >>= 1
    sub(r5, 1)
    bne(loop_8)
    label(fin_8)         # --- inner end ---
    sub(r2, 1)         # n -= 1
    bne(loop)         # --- test and outer end ---

# --------------------------------------------------------
# Table based implementations

# create a single entry to the CRC table
def _initial(c):
    crc = c
    for j in range(8):
        if crc & 0x01:
            crc = (crc >> 1) ^ POLY
        else:
            crc = crc >> 1

    return crc

# Create the table
_tab = array("H", [_initial(i) for i in range(256)])

def crc16_t(data):
    crc = PRESET
    for d in data:
        crc = (crc >> 8) ^ _tab[(crc ^ d) & 0xff]
    return crc

@micropython.native
def crc16_tn(data):
    crc = PRESET
    for d in data:
        crc = (crc >> 8) ^ _tab[(crc ^ d) & 0xff]
    return crc

@micropython.viper
def crc16_tv(data: ptr8, num: int, tab: ptr16) -> int:
    crc: int = PRESET
    i: int  = 0
    idx:int = 0
    while i < num:
        idx = (crc ^ data[i]) & 0xff
        crc = (crc >> 8) ^ tab[idx]
        i += 1
    return crc

# --- internal ---
# r7 .. 8
# r6 .. 0xff
# r5 .. idx
# r4 .. working and test register
# --- arguments ---
# r3 .. table address
# r2 .. n (length of data)
# r1 .. data address
# r0 .. crc
@micropython.asm_thumb
def crc16_ta(r0, r1, r2, r3) -> int:
    mov(r7, 8)
    mov(r6, 0xff)
    label(loop)
    ldrb(r5, [r1, 0])  # idx = data[0]
    add(r1, 1)         # increment data pointer
    eor(r5, r0)        # idx ^= crc
    and_(r5, r6)       # idx ^= 0xff
    add(r5, r5, r5)    # double to make idx ptr16
    add(r5, r5, r3)    # table data address
    ldrh(r5, [r5, 0])  # fetch table entry: r5 = tab[idx]
    lsr(r0, r7)        # crc >>= 8
    eor(r0, r5)        # crc ^= tab[idx]
    sub(r2, 1)         # n -= 1
    bne(loop)          # --- test and outer end ---

# -------------- actual tests -----------------------

buffer = bytearray((i%256 for i in range(20_000)))

for i in range(3):
    t0 = ticks_us()
    result = crc16(buffer)
    t1 = ticks_us()
    td = ticks_diff(t1,t0)
    print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')

for i in range(3):
    t0 = ticks_us()
    result = crc16_n(buffer)
    t1 = ticks_us()
    td = ticks_diff(t1,t0)
    print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')

for i in range(3):
    t0 = ticks_us()
    result = crc16_v(buffer, len(buffer))
    t1 = ticks_us()
    td = ticks_diff(t1,t0)
    print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')

for i in range(3):
    t0 = ticks_us()
    result = crc16_a(PRESET, buffer, len(buffer), POLY)
    t1 = ticks_us()
    td = ticks_diff(t1,t0)
    print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')

for i in range(3):
    t0 = ticks_us()
    result = crc16_t(buffer)
    t1 = ticks_us()
    td = ticks_diff(t1,t0)
    print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')

for i in range(3):
    t0 = ticks_us()
    result = crc16_tn(buffer)
    t1 = ticks_us()
    td = ticks_diff(t1,t0)
    print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')

for i in range(3):
    t0 = ticks_us()
    result = crc16_tv(buffer, len(buffer), _tab)
    t1 = ticks_us()
    td = ticks_diff(t1,t0)
    print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')

for i in range(3):
    t0 = ticks_us()
    result = crc16_ta(PRESET, buffer, len(buffer), _tab)
    t1 = ticks_us()
    td = ticks_diff(t1,t0)
    print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')

Results on my Blackpill (96MHz):

crc16: 0x9cb8, 85.5907µs per byte
crc16: 0x9cb8, 85.58925µs per byte
crc16: 0x9cb8, 85.59105µs per byte
crc16: 0x9cb8, 42.5017µs per byte
crc16: 0x9cb8, 42.5013µs per byte
crc16: 0x9cb8, 42.5032µs per byte
crc16: 0x9cb8, 3.473µs per byte
crc16: 0x9cb8, 3.47325µs per byte
crc16: 0x9cb8, 3.47295µs per byte
crc16: 0x9cb8, 1.01305µs per byte
crc16: 0x9cb8, 1.01295µs per byte
crc16: 0x9cb8, 1.013µs per byte
crc16: 0x9cb8, 14.7685µs per byte
crc16: 0x9cb8, 14.76845µs per byte
crc16: 0x9cb8, 14.7679µs per byte
crc16: 0x9cb8, 10.3521µs per byte
crc16: 0x9cb8, 10.3505µs per byte
crc16: 0x9cb8, 10.35255µs per byte
crc16: 0x9cb8, 0.6587µs per byte
crc16: 0x9cb8, 0.65865µs per byte
crc16: 0x9cb8, 0.6587µs per byte
crc16: 0x9cb8, 0.1688µs per byte
crc16: 0x9cb8, 0.1688µs per byte
crc16: 0x9cb8, 0.1688µs per byte

There is a memory-penalty of 512 bytes for the table-based approach, of course.

mendenm · 2023-11-07T00:43:24Z

mendenm
Nov 7, 2023

Nice, @rkompass, you even figured out how to do the asm version in a couple less instructions than I did, by handling the shifting/byte swapping more cleanly, although this is the other common crc16 polynomial. I will have to see if the ccitt/sdcard one can be done with the same improved swapping. Marcus

…

On Nov 6, 2023, at 6:21 PM, Raul Kompaß ***@***.***> wrote: Here is @GitHubsSilverBullet 's comparison extended for the table approach and asm_thumb. Was curious to see how far it would go. Found: assembler + table together lead to another 20x speedup. from time import ticks_us, ticks_diff from array import array PRESET = const(0xffff) POLY = const(0xa001) def crc16(data): # data.. bytearray or bytes object of data crc = PRESET for d in data: crc ^= d for _ in range(8): if crc & 1: crc = crc>>1 ^ POLY else: crc >>= 1 return crc @micropython.native def crc16_n(data): crc = PRESET for d in data: crc ^= d for _ in range(8): if crc & 1: crc = crc>>1 ^ POLY else: crc >>= 1 return crc @micropython.viper def crc16_v(data: ptr8, num: int) -> int: crc: int = PRESET i: int = 0 while i < num: crc = data[i] ^ crc j: int = 8 while j > 0: if crc & 1: crc = crc>>1 ^ POLY else: crc = crc>>1 j -= 1 i += 1 return crc # --- internal --- # r6 .. 1 # r5 .. loop index (8, 7, .., 1) # r4 .. working and test register # --- arguments --- # r3 .. poly # r2 .. n (length of data) # r1 .. data address # r0 .. crc @micropython.asm_thumb def crc16_a(r0, r1, r2, r3) -> int: mov(r6, 1) # helper value for bit shift label(loop) # --- outer loop -- ldrb(r4, [r1, 0]) # r4 = data[0] add(r1, 1) # increment data pointer eor(r0, r4) # crc ^= data[0] mov(r5, 8) # r5 = 8, prepare inner loop label(loop_8) # --- inner loop --- mov(r4, 1) and_(r4, r0) # if crc & 1: beq(lelse) lsr(r0, r6) # crc >>= 1 eor(r0, r3) # crc ^= poly sub(r5, 1) bne(loop_8) b(fin_8) label(lelse) # else: lsr(r0, r6) # crc >>= 1 sub(r5, 1) bne(loop_8) label(fin_8) # --- inner end --- sub(r2, 1) # n -= 1 bne(loop) # --- test and outer end --- # -------------------------------------------------------- # Table based implementations # create a single entry to the CRC table def _initial(c): crc = c for j in range(8): if crc & 0x01: crc = (crc >> 1) ^ POLY else: crc = crc >> 1 return crc # Create the table _tab = array("H", [_initial(i) for i in range(256)]) def crc16_t(data): crc = PRESET for d in data: crc = (crc >> 8) ^ _tab[(crc ^ d) & 0xff] return crc @micropython.native def crc16_tn(data): crc = PRESET for d in data: crc = (crc >> 8) ^ _tab[(crc ^ d) & 0xff] return crc @micropython.viper def crc16_tv(data: ptr8, num: int, tab: ptr16) -> int: crc: int = PRESET i: int = 0 idx:int = 0 while i < num: idx = (crc ^ data[i]) & 0xff crc = (crc >> 8) ^ tab[idx] i += 1 return crc # --- internal --- # r7 .. 8 # r6 .. 0xff # r5 .. idx # r4 .. working and test register # --- arguments --- # r3 .. table address # r2 .. n (length of data) # r1 .. data address # r0 .. crc @micropython.asm_thumb def crc16_ta(r0, r1, r2, r3) -> int: mov(r7, 8) mov(r6, 0xff) label(loop) ldrb(r5, [r1, 0]) # idx = data[0] add(r1, 1) # increment data pointer eor(r5, r0) # idx ^= crc and_(r5, r6) # idx ^= 0xff add(r5, r5, r5) # double to make idx ptr16 add(r5, r5, r3) # table data address ldrh(r5, [r5, 0]) # fetch table entry: r5 = tab[idx] lsr(r0, r7) # crc >>= 8 eor(r0, r5) # crc ^= tab[idx] sub(r2, 1) # n -= 1 bne(loop) # --- test and outer end --- # -------------- actual tests ----------------------- buffer = bytearray((i%256 for i in range(20_000))) for i in range(3): t0 = ticks_us() result = crc16(buffer) t1 = ticks_us() td = ticks_diff(t1,t0) print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte') for i in range(3): t0 = ticks_us() result = crc16_n(buffer) t1 = ticks_us() td = ticks_diff(t1,t0) print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte') for i in range(3): t0 = ticks_us() result = crc16_v(buffer, len(buffer)) t1 = ticks_us() td = ticks_diff(t1,t0) print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte') for i in range(3): t0 = ticks_us() result = crc16_a(PRESET, buffer, len(buffer), POLY) t1 = ticks_us() td = ticks_diff(t1,t0) print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte') for i in range(3): t0 = ticks_us() result = crc16_t(buffer) t1 = ticks_us() td = ticks_diff(t1,t0) print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte') for i in range(3): t0 = ticks_us() result = crc16_tn(buffer) t1 = ticks_us() td = ticks_diff(t1,t0) print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte') for i in range(3): t0 = ticks_us() result = crc16_tv(buffer, len(buffer), _tab) t1 = ticks_us() td = ticks_diff(t1,t0) print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte') for i in range(3): t0 = ticks_us() result = crc16_ta(PRESET, buffer, len(buffer), _tab) t1 = ticks_us() td = ticks_diff(t1,t0) print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte') Results on my Blackpill (96MHz): crc16: 0x9cb8, 85.5907µs per byte crc16: 0x9cb8, 85.58925µs per byte crc16: 0x9cb8, 85.59105µs per byte crc16: 0x9cb8, 42.5017µs per byte crc16: 0x9cb8, 42.5013µs per byte crc16: 0x9cb8, 42.5032µs per byte crc16: 0x9cb8, 3.473µs per byte crc16: 0x9cb8, 3.47325µs per byte crc16: 0x9cb8, 3.47295µs per byte crc16: 0x9cb8, 1.01305µs per byte crc16: 0x9cb8, 1.01295µs per byte crc16: 0x9cb8, 1.013µs per byte crc16: 0x9cb8, 14.7685µs per byte crc16: 0x9cb8, 14.76845µs per byte crc16: 0x9cb8, 14.7679µs per byte crc16: 0x9cb8, 10.3521µs per byte crc16: 0x9cb8, 10.3505µs per byte crc16: 0x9cb8, 10.35255µs per byte crc16: 0x9cb8, 0.6587µs per byte crc16: 0x9cb8, 0.65865µs per byte crc16: 0x9cb8, 0.6587µs per byte crc16: 0x9cb8, 0.1688µs per byte crc16: 0x9cb8, 0.1688µs per byte crc16: 0x9cb8, 0.1688µs per byte There is a memory-penalty of 512 bytes for the table-based approach, of course. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

rkompass Nov 7, 2023

Your code was a clearly a helper. I'm beginning to get an idea now why sometimes bit-reverse is involved in crc computation. Could it be that the ccitt/sdcard computation you implemented is only compatible with the codes presented here after adding such a reverse?
asm_thumb has a bit-reverse instruction btw..

mendenm · 2023-11-07T10:22:06Z

mendenm
Nov 7, 2023

@rkompass yes, I was trying to figure out if the CRC that you are using really amount to a bit or byte reversed version of the SD card CCITT one. I haven't had time to look into it. Marcus

…

On Nov 7, 2023, at 4:30 AM, Raul Kompaß ***@***.***> wrote: Your code was a clearly a helper. I'm beginning to get an idea now why sometimes bit-reverse is involved in crc computation. Could it be that the ccitt/sdcard computation you implemented is only compatible with the codes presented here after adding such a reverse? asm_thumb has a bit-reverse instruction btw.. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

rkompass Nov 7, 2023

Found that it is compatible by bit-reversing the bytes to be processed, the polynomial, the initial value and the crc result:

# Table based implementations of crc  - comparison
from array import array

# ------- implementation by rkompass ------

# create a single entry of CRC table
def _initial(crc, poly): # by roberthh: https://forum.micropython.org/viewtopic.php?t=12837
    for j in range(8):
        if crc & 1:
            crc = (crc >> 1) ^ poly
        else:
            crc = crc >> 1
    return crc

# --- internal ---
# r7 .. 8
# r6 .. 0xff
# r5 .. idx
# r4 .. working and test register
# --- arguments ---
# r3 .. table address
# r2 .. n (length of data)
# r1 .. data address
# r0 .. crc
@micropython.asm_thumb
def crc16_ta(r0, r1, r2, r3) -> int:
    mov(r7, 8)
    mov(r6, 0xff)
    label(loop)
    ldrb(r5, [r1, 0])  # idx = data[0]
    add(r1, 1)         # increment data pointer
    eor(r5, r0)        # idx ^= crc
    and_(r5, r6)       # idx ^= 0xff
    add(r5, r5, r5)    # double to make idx ptr16
    add(r5, r5, r3)    # table data address
    ldrh(r5, [r5, 0])  # fetch table entry: r5 = tab[idx]
    lsr(r0, r7)        # crc >>= 8
    eor(r0, r5)        # crc ^= tab[idx]
    sub(r2, 1)         # n -= 1
    bne(loop)          # --- test and outer end ---


# ------- alternative implementation by mendenm ------.

# create a single entry of CRC table
def _initial2(crc, poly):
    crc <<= 8;
    for bit in range(8):
        crc = crc << 1
        if ((crc & 0x10000) != 0):  crc ^= poly
    return crc & 0xffff

# @micropython.viper
# def crc16_tv2(crc: int, data: ptr8, n: int, tab: ptr16) -> int:
#     i:int = 0
#     while i < n:
#         crc = ((crc << 8) & 0xffff) ^ tab[((crc>>8) ^ data[i]) & 0xff]
#         i += 1
#     return crc

# r3 .. count
# r2 .. CRC
# r1 .. table (address)
# r0 .. data (address)
@micropython.asm_thumb
def crc16_ta2(r0, r1, r2, r3) -> int:
    mov(r4, 0)
    mvn(r4, r4) #all ones now
    mov(r7,16)
    lsr(r4, r7) #R4 = half-word of ones
    mov(r5, 0xff) #used for byte masking
    label(loop)
    mov(r6, r2) #copy current CRC
    mov(r7, 8) 
    lsr(r6, r7) #crc >> 8
    ldrb(r7, [r0,0]) #fetch new byte dp[idx]
    add(r0,1) #push to next byte address
    eor(r6, r7) #r6 = (crc>>8) ^ dp[idx]
    and_(r6, r5) #mask byte ( (crc>>8) ^ dp[idx]) & 0xff
    add(r6, r6, r6) #double for table offset
    add(r6, r6, r1) #table data address
    ldrh(r6,[r6,0]) #fetch table syndrome
    mov(r7,8)
    lsl(r2, r7) # CRC << 8
    and_(r2, r4) #(crc << 8) & 0xffff)
    eor(r2, r6) #new CRC
    sub(r3, 1)
    bne(loop)
    mov(r0, r2)

# --  Bit reverse functions by Peter Hinch: https://github.com/peterhinch/micropython-samples/blob/master/reverse/reverse.py ------

def rbit8(v):           # bit reverse of 8 bit value
    v = (v & 0x0f) << 4 | (v & 0xf0) >> 4
    v = (v & 0x33) << 2 | (v & 0xcc) >> 2
    return (v & 0x55) << 1 | (v & 0xaa) >> 1

def rbit16(v):          # bit reverse of 16 bit value
    v = (v & 0x00ff) << 8 | (v & 0xff00) >> 8
    v = (v & 0x0f0f) << 4 | (v & 0xf0f0) >> 4
    v = (v & 0x3333) << 2 | (v & 0xcccc) >> 2
    return (v & 0x5555) << 1 | (v & 0xaaaa) >> 1


# --------  Simple comparison ----------

data = b'\xFF\x03\x02\x15\x28\x9F\x1F'   # crc == 0 if data == b'\xFF\x03\x02\x15\x28\x9F\x1E'
PRESET = 0x0000
POLY = 0x1021

_tab2 = array("H", (_initial2(byt, POLY) for byt in range(256)))
# crc2 = crc16_tv2(PRESET, data, len(data), _tab2)
crc2 = crc16_ta2(data, _tab2, PRESET, len(data))
print("\nCRC2: ", hex(crc2), len(data))
print(f'Binary repr: {crc2:016b}')

_tab = array("H", [_initial(i, rbit16(POLY)) for i in range(256)]) # Create the table, based on _initial(), POLY
crc = rbit16(crc16_ta(rbit16(PRESET), bytes((rbit8(b) for b in data)), len(data), _tab))
print("\nbit-reversed CRC of bit-reversed data: ", hex(crc), len(data))
print(f'Binary repr: {crc:016b}')

result:

CRC2:  0x4c9c 7
Binary repr: 0100110010011100

bit-reversed CRC of bit-reversed data:  0x4c9c 7
Binary repr: 0100110010011100

-- certainly something that every informatics student is aware of (?!?).

Now the question is: If we want to be able to represent every crc-16 variant, do we need the reverse? Apparently. Can the bit-reverse of the bytes be done fast enough, or should there rather be switch of the inner loop?
See these crc-16 variants.

mendenm · 2023-11-07T16:44:14Z

mendenm
Nov 7, 2023

@rkompass The fast way to bit-reverse bytes is a 256 byte lookup table, if you don't have a built-in function to do it. The rbit8 and rbit16 functions below can only be considered placeholders; they are very slow.

…

On Nov 7, 2023, at 11:40 AM, Raul Kompaß ***@***.***> wrote: Found that it is compatible by bit-reversing the bytes to be processed, the polynomial, the initial value and the crc result: # Table based implementations of crc - comparison from array import array # ------- implementation by rkompass ------ # create a single entry of CRC table def _initial(crc, poly): # by roberthh: https://forum.micropython.org/viewtopic.php?t=12837 for j in range(8): if crc & 1: crc = (crc >> 1) ^ poly else: crc = crc >> 1 return crc # --- internal --- # r7 .. 8 # r6 .. 0xff # r5 .. idx # r4 .. working and test register # --- arguments --- # r3 .. table address # r2 .. n (length of data) # r1 .. data address # r0 .. crc @micropython.asm_thumb def crc16_ta(r0, r1, r2, r3) -> int: mov(r7, 8) mov(r6, 0xff) label(loop) ldrb(r5, [r1, 0]) # idx = data[0] add(r1, 1) # increment data pointer eor(r5, r0) # idx ^= crc and_(r5, r6) # idx ^= 0xff add(r5, r5, r5) # double to make idx ptr16 add(r5, r5, r3) # table data address ldrh(r5, [r5, 0]) # fetch table entry: r5 = tab[idx] lsr(r0, r7) # crc >>= 8 eor(r0, r5) # crc ^= tab[idx] sub(r2, 1) # n -= 1 bne(loop) # --- test and outer end --- # ------- alternative implementation by mendenm ------. # create a single entry of CRC table def _initial2(crc, poly): crc <<= 8; for bit in range(8): crc = crc << 1 if ((crc & 0x10000) != 0): crc ^= poly return crc & 0xffff # @micropython.viper # def crc16_tv2(crc: int, data: ptr8, n: int, tab: ptr16) -> int: # i:int = 0 # while i < n: # crc = ((crc << 8) & 0xffff) ^ tab[((crc>>8) ^ data[i]) & 0xff] # i += 1 # return crc # r3 .. count # r2 .. CRC # r1 .. table (address) # r0 .. data (address) @micropython.asm_thumb def crc16_ta2(r0, r1, r2, r3) -> int: mov(r4, 0) mvn(r4, r4) #all ones now mov(r7,16) lsr(r4, r7) #R4 = half-word of ones mov(r5, 0xff) #used for byte masking label(loop) mov(r6, r2) #copy current CRC mov(r7, 8) lsr(r6, r7) #crc >> 8 ldrb(r7, [r0,0]) #fetch new byte dp[idx] add(r0,1) #push to next byte address eor(r6, r7) #r6 = (crc>>8) ^ dp[idx] and_(r6, r5) #mask byte ( (crc>>8) ^ dp[idx]) & 0xff add(r6, r6, r6) #double for table offset add(r6, r6, r1) #table data address ldrh(r6,[r6,0]) #fetch table syndrome mov(r7,8) lsl(r2, r7) # CRC << 8 and_(r2, r4) #(crc << 8) & 0xffff) eor(r2, r6) #new CRC sub(r3, 1) bne(loop) mov(r0, r2) # -- Bit reverse functions by Peter Hinch: https://github.com/peterhinch/micropython-samples/blob/master/reverse/reverse.py ------ def rbit8(v): # bit reverse of 8 bit value v = (v & 0x0f) << 4 | (v & 0xf0) >> 4 v = (v & 0x33) << 2 | (v & 0xcc) >> 2 return (v & 0x55) << 1 | (v & 0xaa) >> 1 def rbit16(v): # bit reverse of 16 bit value v = (v & 0x00ff) << 8 | (v & 0xff00) >> 8 v = (v & 0x0f0f) << 4 | (v & 0xf0f0) >> 4 v = (v & 0x3333) << 2 | (v & 0xcccc) >> 2 return (v & 0x5555) << 1 | (v & 0xaaaa) >> 1 # -------- Simple comparison ---------- data = b'\xFF\x03\x02\x15\x28\x9F\x1F' # crc == 0 if data == b'\xFF\x03\x02\x15\x28\x9F\x1E' PRESET = 0x0000 POLY = 0x1021 _tab2 = array("H", (_initial2(byt, POLY) for byt in range(256))) # crc2 = crc16_tv2(PRESET, data, len(data), _tab2) crc2 = crc16_ta2(data, _tab2, PRESET, len(data)) print("\nCRC2: ", hex(crc2), len(data)) print(f'Binary repr: {crc2:016b}') _tab = array("H", [_initial(i, rbit16(POLY)) for i in range(256)]) # Create the table, based on _initial(), POLY crc = rbit16(crc16_ta(rbit16(PRESET), bytes((rbit8(b) for b in data)), len(data), _tab)) print("\nbit-reversed CRC of bit-reversed data: ", hex(crc), len(data)) print(f'Binary repr: {crc:016b}') result: CRC2: 0x4c9c 7 Binary repr: 0100110010011100 bit-reversed CRC of bit-reversed data: 0x4c9c 7 Binary repr: 0100110010011100 -- certainly something that every informatics student is aware of (?!?). Now the question is: If we want to be able to represent every crc-16 variant, do we need the reverse? Apparently. Can the bit-reverse of the bytes be done fast enough, or should there rather be switch of the inner loop? See these crc-16 variants. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

2 replies

rkompass Nov 7, 2023

There is

rbit(Rd, Rn) Rd = bit_reverse(Rn)
bit_reverse(Rn) returns the bit-reversed contents of Rn. If Rn contains bits b31 b30..b0 Rd will be set to b0 b1 b2..b31

rkompass Nov 7, 2023

So this is 2 instructions at most (reversing the 32-bit word and shifting 24 bits for byte-alignment).

mendenm · 2023-11-07T16:46:47Z

mendenm
Nov 7, 2023

@rkompass I suspect that the slightly longer asm_thumb code I write saves more time & effort than having to bit-reverse everything for the other (faster) polynomial. I had sort of thought, though, that the bit reversal was going to be the solution, since some devices shift in LSB first, and others MSB first.

…

On Nov 7, 2023, at 11:40 AM, Raul Kompaß ***@***.***> wrote: Found that it is compatible by bit-reversing the bytes to be processed, the polynomial, the initial value and the crc result: # Table based implementations of crc - comparison from array import array # ------- implementation by rkompass ------ # create a single entry of CRC table def _initial(crc, poly): # by roberthh: https://forum.micropython.org/viewtopic.php?t=12837 for j in range(8): if crc & 1: crc = (crc >> 1) ^ poly else: crc = crc >> 1 return crc # --- internal --- # r7 .. 8 # r6 .. 0xff # r5 .. idx # r4 .. working and test register # --- arguments --- # r3 .. table address # r2 .. n (length of data) # r1 .. data address # r0 .. crc @micropython.asm_thumb def crc16_ta(r0, r1, r2, r3) -> int: mov(r7, 8) mov(r6, 0xff) label(loop) ldrb(r5, [r1, 0]) # idx = data[0] add(r1, 1) # increment data pointer eor(r5, r0) # idx ^= crc and_(r5, r6) # idx ^= 0xff add(r5, r5, r5) # double to make idx ptr16 add(r5, r5, r3) # table data address ldrh(r5, [r5, 0]) # fetch table entry: r5 = tab[idx] lsr(r0, r7) # crc >>= 8 eor(r0, r5) # crc ^= tab[idx] sub(r2, 1) # n -= 1 bne(loop) # --- test and outer end --- # ------- alternative implementation by mendenm ------. # create a single entry of CRC table def _initial2(crc, poly): crc <<= 8; for bit in range(8): crc = crc << 1 if ((crc & 0x10000) != 0): crc ^= poly return crc & 0xffff # @micropython.viper # def crc16_tv2(crc: int, data: ptr8, n: int, tab: ptr16) -> int: # i:int = 0 # while i < n: # crc = ((crc << 8) & 0xffff) ^ tab[((crc>>8) ^ data[i]) & 0xff] # i += 1 # return crc # r3 .. count # r2 .. CRC # r1 .. table (address) # r0 .. data (address) @micropython.asm_thumb def crc16_ta2(r0, r1, r2, r3) -> int: mov(r4, 0) mvn(r4, r4) #all ones now mov(r7,16) lsr(r4, r7) #R4 = half-word of ones mov(r5, 0xff) #used for byte masking label(loop) mov(r6, r2) #copy current CRC mov(r7, 8) lsr(r6, r7) #crc >> 8 ldrb(r7, [r0,0]) #fetch new byte dp[idx] add(r0,1) #push to next byte address eor(r6, r7) #r6 = (crc>>8) ^ dp[idx] and_(r6, r5) #mask byte ( (crc>>8) ^ dp[idx]) & 0xff add(r6, r6, r6) #double for table offset add(r6, r6, r1) #table data address ldrh(r6,[r6,0]) #fetch table syndrome mov(r7,8) lsl(r2, r7) # CRC << 8 and_(r2, r4) #(crc << 8) & 0xffff) eor(r2, r6) #new CRC sub(r3, 1) bne(loop) mov(r0, r2) # -- Bit reverse functions by Peter Hinch: https://github.com/peterhinch/micropython-samples/blob/master/reverse/reverse.py ------ def rbit8(v): # bit reverse of 8 bit value v = (v & 0x0f) << 4 | (v & 0xf0) >> 4 v = (v & 0x33) << 2 | (v & 0xcc) >> 2 return (v & 0x55) << 1 | (v & 0xaa) >> 1 def rbit16(v): # bit reverse of 16 bit value v = (v & 0x00ff) << 8 | (v & 0xff00) >> 8 v = (v & 0x0f0f) << 4 | (v & 0xf0f0) >> 4 v = (v & 0x3333) << 2 | (v & 0xcccc) >> 2 return (v & 0x5555) << 1 | (v & 0xaaaa) >> 1 # -------- Simple comparison ---------- data = b'\xFF\x03\x02\x15\x28\x9F\x1F' # crc == 0 if data == b'\xFF\x03\x02\x15\x28\x9F\x1E' PRESET = 0x0000 POLY = 0x1021 _tab2 = array("H", (_initial2(byt, POLY) for byt in range(256))) # crc2 = crc16_tv2(PRESET, data, len(data), _tab2) crc2 = crc16_ta2(data, _tab2, PRESET, len(data)) print("\nCRC2: ", hex(crc2), len(data)) print(f'Binary repr: {crc2:016b}') _tab = array("H", [_initial(i, rbit16(POLY)) for i in range(256)]) # Create the table, based on _initial(), POLY crc = rbit16(crc16_ta(rbit16(PRESET), bytes((rbit8(b) for b in data)), len(data), _tab)) print("\nbit-reversed CRC of bit-reversed data: ", hex(crc), len(data)) print(f'Binary repr: {crc:016b}') result: CRC2: 0x4c9c 7 Binary repr: 0100110010011100 bit-reversed CRC of bit-reversed data: 0x4c9c 7 Binary repr: 0100110010011100 -- certainly something that every informatics student is aware of (?!?). Now the question is: If we want to be able to represent every crc-16 variant, do we need the reverse? Apparently. Can the bit-reverse of the bytes be done fast enough, or should there rather be switch of the inner loop? See these crc-16 variants. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

15 replies

rkompass Nov 8, 2023

@robert-hh: Your crc16_nibble() algorithm is very special: It's equivalent to the first (right shifting) crc calculations considered here for such polynomials:

0b1000100010001000, 0b1000000010001000, 0b1000000100001000, 0b1000001000001000, 0b1000010000001000 and 0b1000100000001000.
0o10201 is the second last of those.

These polynomials have to be right-shifted by 3 bits (so the 1-bit becomes bit 0) in the algorithm.

After some thinking I realized that the whole CRC-theory being based on modulo 2 polynomials differs from ordinary binary multiplication and division / remainder computation just by the omission of carry-over.
So employing the * poly multiplication only works correctly if there is no carry-over. As it is embedded in a 4-bit pruning the polynomials have to have a form where all 1-bits have to be at least four places away from each other.

We can conclude that the nibble algorithm is a special one, that cannot be generalized for the multitude of crc variants.
For the special crc that it is employed for it gives a 4-times speed advantage in python:
python version: crc16

python version: crc16
crc16: 0x0000, 81.53585µs per byte
crc16: 0x0000, 81.5333µs per byte
crc16: 0x0000, 81.53485µs per byte
python nibble version: crc16_nibble
crc16: 0x0000, 19.19325µs per byte
crc16: 0x0000, 19.19515µs per byte
crc16: 0x0000, 19.19375µs per byte

If we want to get rid of the need of using a 256x2 byte wide table we might use a 16x2 byte wide table doing the crc calculation in 4 bits at a time.

robert-hh Nov 8, 2023
Collaborator

@rkompass Thank you for the analysis. It is old code. I used it ~25years ago as C code. Not sure for what. it may be an implementation of a XModem protocol. Including the bit reversals and using the native decorators, the nibble version is 3 times slower than the table version, and about 50% faster than the bit shift version. Using viper code for the bit reversals it's 40% slower than the table version, but 4 times faster than the single bit shift. That matches your observation. Using plain python, the nibble variant is about 3 times slower than the table version, but 2 times faster that then single bit variant.
Having said that, we should as well look at a nibble table version with 2 times 16 value entries, requiring 64 byte instead of 512 byte for the table. Or have a binascii.crc16() method, which then could use any available port specific mechanism.

rkompass Nov 8, 2023

@robert-hh Obviously there are similar thoughts:
I just finished another round of coding, including viper and asm_thumb versions of the 4bit table crc, which requires 32 bytes.
I will present the table of benchmark results here.

Also related: My last suggestion in the sdcard discussion, where I suggested that

binascii.crc_hqx(data, value)
Compute a 16-bit CRC value of data, starting with value as the initial CRC, and return the result. This uses the CRC-CCITT polynomial x16 + x12 + x5 + 1, often represented as 0x1021. This CRC is used in the binhex4 format.

could/should be implemented.

My thoughts at present are:
Having this as part of binascii, i.e. coded internally in C, it would be present in optimal speed at every architecture.
The new sdcard driver could benefit from that, without bloating its code.

For other purposes it would be beneficial to have a reasonably fast crc16 library that is universal, like originally requested in this discussion. I imagine a class that has certain variants available by name, like in https://crccalc.com, that at initialization sets the polynomial, starting crc and direction of bit shift resp. reversal. and allows to configure those, of course, by desired values.

rkompass Nov 8, 2023

Here an overview of the benchmark results. The code is in the attached zip.

CRC-16 computation time in µs/byte on Blackpill board (96MHz), in 

                python     native      viper      asm_thumb
                -------------------------------------------
no table
(8bit-loop)      83.4       42.5        3.6         1.0

4-bit table
(32 bytes)       22.7       15.2        0.93        0.29

8-bit table
(512 bytes)      12.9       10.3        0.59        0.17

crc16_bench.zip

robert-hh Nov 8, 2023
Collaborator

Running the crc16_bench on a Teensy 4.0 return 10 times smaller numbers. But the ratio between python/native/viper/asm is the same.
Uncommenting the nibble version shows, that it's performance is similar to the 4 bit table version. Looking at the similar complex code, this is not surprising. Only that the range of polynomials is limited.

mendenm · 2023-11-08T16:54:26Z

mendenm
Nov 8, 2023

@rkompass @robert-hh This looks really promising for the 4-bit table! It's pretty fast, and certainly a lot more compact. I like it.

…

On Nov 8, 2023, at 10:18 AM, Raul Kompaß ***@***.***> wrote: @robert-hh Obviously there are similar thoughts: I just finished another round of coding, including viper and asm_thumb versions of the 4bit table crc, which requires 32 bytes. I will present the table of benchmark results here. Also related: My last suggestion in the sdcard discussion, where I suggested that binascii.crc_hqx(data, value) Compute a 16-bit CRC value of data, starting with value as the initial CRC, and return the result. This uses the CRC-CCITT polynomial x16 + x12 + x5 + 1, often represented as 0x1021. This CRC is used in the binhex4 format. could/should be implemented. My thoughts at present are: Having this as part of binascii, i.e. coded internally in C, it would be present in optimal speed at every architecture. The new sdcard driver could benefit from that, without bloating its code. For other purposes it would be beneficial to have a reasonably fast crc16 library that is universal, like originally requested in this discussion. I imagine a class that has certain variants available by name, like in https://crccalc.com, that at initialization sets the polynomial, starting crc and direction of bit shift resp. reversal. and allows to configure those, of course, by desired values. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

rkompass Nov 8, 2023

Hello Marcus,

yes, I also thought of the 4-bit table version as sufficient and lean tradeoff.
Given that with quite fast SPI at 12 MHz baudrate the sdcard reading takes with your driver ~1.5 µs / byte this makes a 60-70% (viper) or 25-35% (asm) overhead. Much less with slower SPI of course.
I'll do optimized 4-bit table viper and asm_thumb versions for your left-shift crc variant. The benchmark was for right-shift.

Raul

mendenm · 2023-11-08T18:51:43Z

mendenm
Nov 8, 2023

@rkompass @robert-hh OK, I had to play! I wrote a version of the SD-card left-shift for the DMA on the RP2040 at 125 MHz. The resulting crc is the same value I get for my sdcard crc16 module, when I use DMA CRC algorithm 2 (CCITT, not bit reversed) and seed 0. Note that this timing is for 20_000 byte arrays. It goes to 0.02 µs per byte for 100_000 bytes. In other words, as I implemented it here, the time is almost 100% in setting up the DMA channel in python. I suspect if one really needed to do this for something useful, you would keep a channel all set up, write the source address and transfer count register, and hit go. It would probably then be right down to the 8 ns per byte limited by the clock speed. I'm not sure why this would be useful, but when the DMA client for rp2040 gets finalized, I may add this to the heap of options. Now I have other things I need to do for real work... Marcus dma-based crc16 for sd cards crc16: 0x3359, 0.07150001µs per byte crc16: 0x3359, 0.0729µs per byte crc16: 0x3359, 0.07185µs per byte python version: crc16 crc16: 0x9cb8, 78.6903µs per byte crc16: 0x9cb8, 78.6903µs per byte crc16: 0x9cb8, 78.6894µs per byte native version: crc16_n crc16: 0x9cb8, 33.9129µs per byte crc16: 0x9cb8, 33.9116µs per byte crc16: 0x9cb8, 33.9125µs per byte viper version: crc16_v crc16: 0x9cb8, 2.50005µs per byte crc16: 0x9cb8, 2.49915µs per byte crc16: 0x9cb8, 2.49895µs per byte asm_thumb version: crc16_a crc16: 0x9cb8, 0.57935µs per byte crc16: 0x9cb8, 0.5795µs per byte crc16: 0x9cb8, 0.5792µs per byte 8bit-table-based python version: crc16_t crc16: 0x9cb8, 10.23065µs per byte crc16: 0x9cb8, 10.22925µs per byte crc16: 0x9cb8, 10.22905µs per byte 8bit-table-based native version: crc16_tn crc16: 0x9cb8, 7.37165µs per byte crc16: 0x9cb8, 7.3707µs per byte crc16: 0x9cb8, 7.37125µs per byte 8bit-table-based viper version: crc16_tv crc16: 0x9cb8, 0.4513µs per byte crc16: 0x9cb8, 0.4508µs per byte crc16: 0x9cb8, 0.45075µs per byte 8bit-table-based asm_thumb version: crc16_ta crc16: 0x9cb8, 0.1151µs per byte crc16: 0x9cb8, 0.11465µs per byte crc16: 0x9cb8, 0.11465µs per byte 4bit-table-based python version: crc16_t4 crc16: 0x9cb8, 19.97395µs per byte crc16: 0x9cb8, 19.9732µs per byte crc16: 0x9cb8, 19.97325µs per byte 4bit-table-based native version: crc16_t4n crc16: 0x9cb8, 12.63475µs per byte crc16: 0x9cb8, 12.6351µs per byte crc16: 0x9cb8, 12.63515µs per byte 4bit-table-based viper version: crc16_t4v crc16: 0x9cb8, 0.6752µs per byte crc16: 0x9cb8, 0.67485µs per byte crc16: 0x9cb8, 0.6749µs per byte 4bit-table-based asm_thumb version: crc16_t4 crc16: 0x9cb8, 0.19475µs per byte crc16: 0x9cb8, 0.1948µs per byte crc16: 0x9cb8, 0.19475µs per byte 8bit-table-based right-shift viper version: crc16_tv2 crc16: 0x9cb8, 0.4846µs per byte crc16: 0x9cb8, 0.48365µs per byte crc16: 0x9cb8, 0.4839µs per byte 8bit-table-based right-shift asm_thumb version: crc16_ta2 crc16: 0x9cb8, 0.1316µs per byte crc16: 0x9cb8, 0.1316µs per byte crc16: 0x9cb8, 0.1317µs per byte

…

On Nov 8, 2023, at 12:42 PM, Raul Kompaß ***@***.***> wrote: Hello Marcus, yes, I also thought of the 4-bit table version as sufficient and lean tradeoff. Given that with quite fast SPI at 12 MHz baudrate the sdcard reading takes with your driver ~1.5 µs / byte this makes a 60-70% (viper) or 25-35% (asm) overhead. Much less with slower SPI of course. I'll do optimized 4-bit table viper and asm_thumb versions for your left-shift crc variant. The benchmark was for right-shift. Raul — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

2 replies

rkompass Nov 8, 2023

Nice played, indeed 8 times faster than the asm_thumb version, which roughly corresponds with the number of instructions per byte.

The resulting crc is the same value I get for my sdcard crc16 module, when I use DMA CRC algorithm 2 (CCITT, not bit reversed) and seed 0.

O.k., the crc benchmark polynomial had a different setting, and of course it's shifting is different.

If you had the full DMA available you could do the CRC while retrieving the sdcard data through the SPI, of course. => no overhead at all.
I'm assuming that hard SPI uses the DMA, and using the CRC of the DMA could be made an option.
SPI itself has an 8 bit CRC, see here, which is a different one.

mendenm Nov 9, 2023

So the answer is yes, SPI on rp2040 uses DMA, and it would only take about 2 lines of code to make it compute the CRC. One would have to make a new property or something of SPI into which the CRC is saved, and which would do something intelligent on platforms for which it isn't. It would result in zero-overhead CRCs, since the DMA doesn't need extra cycles for this.

There is one minor issue that would require an extension to the soon-to-be-real DMA class. The CRC computation uses a 'sniffer' which is shared by all the DMAs. A new method to claim and release the sniffer would be necessary to make sure two channels aren't using it at the same time.

rkompass · 2023-11-09T18:14:10Z

rkompass
Nov 9, 2023

I was guessing that something like the following would hold:
The 8-bit table based right-shift and left-shift crc implementations are quite related to each other.
So much that there is no need to have two implementations of them.

Idea: if you have to do a bit-reversal in the byte to be processed, and that byte goes as index into a table, then alternatively also the
table may be built differently in such a way that it's indices are bit-reversed.

Actually just a byte-swap suffices to transforms one implementation into the other,
(see the comments to crc16_tv() and crc16_tv2() in the code below for a pointer), as the bit-swap of table indices already is
done by the different table initialization functions.

The CRC PRESET has to be byte-swapped and the resulting CRC. This is demonstrated in the following code:

# Equivalence of 8-bit table based left-shift and right-shift crc16

from array import array

# bit reverse of 8 bit value
def rbit8(v):           
    v = (v & 0x0f) << 4 | (v & 0xf0) >> 4
    v = (v & 0x33) << 2 | (v & 0xcc) >> 2
    return (v & 0x55) << 1 | (v & 0xaa) >> 1

def rbit16(v):          # bit reverse of 16 bit value
    v = (v & 0x00ff) << 8 | (v & 0xff00) >> 8
    v = (v & 0x0f0f) << 4 | (v & 0xf0f0) >> 4
    v = (v & 0x3333) << 2 | (v & 0xcccc) >> 2
    return (v & 0x5555) << 1 | (v & 0xaaaa) >> 1

# swap bytes of 2-byte number
def swbyte2(v):
    return v >> 8 | (v & 0xff) << 8

# bit reverse bytes of 2-byte number
def rbytes2(v):
    return rbit8(v >> 8) << 8 | rbit8(v & 0xff)

# create a single entry of CRC table, right shift version
def _initial(crc, poly): # by roberthh: https://forum.micropython.org/viewtopic.php?t=12837
    for j in range(8):
        if crc & 1:
            crc = (crc >> 1) ^ poly
        else:
            crc = crc >> 1
    return crc

# create a single entry of CRC table, left shift version
def _initial2(crc, poly):
    crc <<= 8
    for bit in range(8):
        crc = crc << 1
        if crc & 0x10000:
            crc ^= poly
    return crc & 0xffff

# crc computation, right shift version
@micropython.viper
def crc16_tv(crc: int, data: ptr8, n: int, tab: ptr16) -> int:
    i: int  = 0
    idx:int = 0
    while i < n:
        idx = (crc & 0xff) ^ data[i]              #  crc_lowbyte  ^ data
        crc = (crc >> 8) ^ tab[idx]               # (crc highbyte -> lowbyte) ^ tab
        i += 1
    return crc

# crc computation, left shift version, note that just the roles of high and low crc byte are exchanged
@micropython.viper
def crc16_tv2(crc: int, data: ptr8, n: int, tab: ptr16) -> int:
    i:int = 0
    idx:int = 0
    while i < n:
        idx = (crc >> 8) ^ data[i]               # (crc highbyte -> lowbyte) ^ data
        crc = ((crc & 0xff) << 8) ^ tab[idx]     # (crc lowbyte -> highbyte) ^ tab
        i += 1
    return crc

data = b'\xFF\x03\x02\x15\x28\x9F\x1F'   # crc == 0 if data == b'\xFF\x03\x02\x15\x28\x9F\x1E'
PRESET = 0x0000
POLY = 0x1021

print('\nmendenms left shift 8-bit table viper version:')
_tab = array("H", (_initial2(i, POLY) for i in range(256)))
crc = crc16_tv2(PRESET, data, len(data), _tab)
print("              CRC: ", hex(crc), len(data))
print(f'Binary repr: {crc:016b}')

print('\nright shift table viper version, just byte-reversed table:')
_tab = array("H", [swbyte2(_initial2(i, POLY)) for i in range(256)]) # Create the table, based on _initial(), POLY
crc = swbyte2(crc16_tv(swbyte2(PRESET), data, len(data), _tab))
print("byte-reversed CRC: ", hex(crc), len(data))
print(f'Binary repr: {crc:016b}')

print('\nright shift table viper version, just byte-reversed table:')
_tab = array("H", [rbytes2(_initial(rbit8(i), rbit16(POLY))) for i in range(256)]) # Create the table, based on _initial(), POLY
crc = swbyte2(crc16_tv(swbyte2(PRESET), data, len(data), _tab))
print("byte-reversed CRC: ", hex(crc), len(data))
print(f'Binary repr: {crc:016b}')


# _initial2(i, POLY) is equivalent to rbit16(_initial(rbit8(i), rbit16(POLY)) :
#
tab = array("H", (rbit16(_initial(rbit8(i), rbit16(POLY))) for i in range(256)))
tab2 = array("H", (_initial2(i, POLY) for i in range(256)))

print('\nindex                 tab               tab2')
for i in range(3):
    print(f'  {i:3d}         0b{tab[i]:016b}  0b{tab2[i]:016b}')
print('                     ...               ...')
for i in range(253,256):
    print(f'  {i:3d}         0b{tab[i]:016b}  0b{tab2[i]:016b}')
print('                     ---               ---')
for i in (0b10110111, 0b11101101):
    print(f'0b{i:08b}    0b{tab[i]:016b}  0b{tab2[i]:016b}')

Of course, also the table initialization functions are related:
Bit-reflections, done on all inputs and outputs transform one initialization function into the other.
_initial2(i, POLY) is equivalent to rbit16(_initial(rbit8(i), rbit16(POLY))

So there is also no need to have two implementations of the lookup-table initialization functions.

And of course: No need to do bit-reflections while computing the CRC, as long as it is done for speed, i.e. table-based.

With a nibble (4-bit) lookup table the thing seems a bit different. Here we have to distinguish whether the lower or higher nibble is digested first. For the general case there have to be different functions to be held, presumably. Or an option of nibble-order.

1 reply

rkompass Nov 9, 2023

Mathematical insight beats programming skill.
The above observation, which seems totally plausible, seems not to be present in the answers and expert example code, even e.g. on
stackoverflow.
It can't be I just discoverd something new.?. It's too simple for that.
But everyone seems to compute reflected bytes, if needed, for CRC. Despite using a lookup-table for speed.

mendenm · 2023-11-09T19:25:50Z

mendenm
Nov 9, 2023

@rkompass Have you tested the swapped algorithms with an odd number of bytes going in? Maybe the hanging byte doesn't work. The must be (I say with no confidence at all) a reason that they have been so universally written as they are. And I would think (again with no great confidence) the number theorists would understand all teh reflection formulae that go with these operations. Marcus

…

On Nov 9, 2023, at 2:17 PM, Raul Kompaß ***@***.***> wrote: Mathematical insight beats programming skill. The above observation, which seems totally plausible, seems not to be present in the answers and expert example code, even e.g. on stackoverflow. It can't be I just discoverd something new.?. It's too simple for that. But everyone seems to compute reflected bytes, if needed, for CRC. Despite using a lookup-table for speed. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

0 replies

robert-hh · 2023-11-09T19:38:01Z

robert-hh
Nov 9, 2023
Collaborator

We should not over-engineer this topic.
My suggestion is to provide just two 16 entry table based versions for the straight and reflected approaches, with options to set the polynomial and the initial value. And then leave everything else to the user. One can add links to other variants in the documentation.

0 replies

mendenm · 2023-11-09T19:51:27Z

mendenm
Nov 9, 2023

@robert-hh Your point is why I moved this whole issue outside of the new proposed sdcard.py code. You can hand in any function you want, so it's up to the user. I may put one of the most general codes @rkompass provides in crc16.py in the sdcard package, in case user wants a simple default. However, there are interesting cases in which the default isn't optimal. The rp2040 computes CRCs in its DMA, which is used by the SPI transfer code in the rp2 port. With trivial modification to the SPI code, we can turn on CRCs there, and get the SDcards to get their CRC computed at no cost by the hardware. Marcus

…

On Nov 9, 2023, at 2:38 PM, Robert Hammelrath ***@***.***> wrote: We should not over-engineer this topic. My suggestion is to provide just two 16 entry table based versions for the straight and reflected approaches, with options to set the polynomial and the initial value. And then leave everything else to the user. One can add links to other variants in the documentation. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

robert-hh Nov 9, 2023
Collaborator

That can all got into a separate CRC section of micropython-lib. And I simply do not know if the SPI devices of all supported MCUs support CRC calculation.

mendenm · 2023-11-09T20:31:51Z

mendenm
Nov 9, 2023

@robert-hh I seriously doubt if most MCU DMA units support CRC computation. That is why it would have to go in the port-specific SPI module (for example). The rp2040 SPI module does use DMA, so it would be easily adapted. And, the way I am structuring sdcard.py, the function provided by the SPI module could be handed directly in to __init__, so you can use it if it exists. I agree at this point that the CRCs probably belong in a separate place, although I have included one simple example in the sdcard module that I just created a PR for. That crc16.py will probably disappear once the improved versions from @rkompass get included. Marxu

…

On Nov 9, 2023, at 3:02 PM, Robert Hammelrath ***@***.***> wrote: That can all got into a separate CRC section of micropython-lib. And I simply do not know if the SPI devices of all supported MCUs support CRC calculation. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

0 replies

rkompass · 2023-11-09T21:34:16Z

rkompass
Nov 9, 2023

We should not over-engineer this topic.

Agreed. My tendency, always :-| But otoh this makes things simpler too. Turns out, that the four 8-bit table-based universal implementations (raw, native, viper and asm_thumb) are already optimally implemented (the core of those), for both directions.

The 4-bit left-shifting variant was giving me headaches. I tried many combinations of settings and always got a wrong CRC.
After reading the attached intro I seem to understand things better.

RNWilliams93_PainlessGuidetoCRC.zip

Have you tested the swapped algorithms with an odd number of bytes going in?

The code example uses 7 bytes. I will test of course against the crc calculator website. Later.

0 replies

rkompass · 2023-11-11T13:15:36Z

rkompass
Nov 11, 2023

An update:
The low-level left-shift CRC computation cannot be made as fast as the right-shift variant.

CRC-16 left-shift computation time in µs/byte on Blackpill board (96MHz), in 

                python     native      viper      asm_thumb
                -------------------------------------------
no table
(8bit-loop)      92.7       45.4        3.8         1.17

4-bit table
(32 bytes)      n.impl.    n.impl.      1.05        0.35

8-bit table
(512 bytes)     n.impl.    n.impl.      0.65        0.21

older right-shift variant for comparison:

CRC-16 right-shift computation time in µs/byte on Blackpill board (96MHz), in 

                python     native      viper      asm_thumb
                -------------------------------------------
no table
(8bit-loop)      83.4       42.5        3.6         1.0

4-bit table
(32 bytes)       22.7       15.2        0.93        0.29

8-bit table
(512 bytes)      12.9       10.3        0.59        0.17

This is due to the algorithm itself, but also because there are more constants involved than cannot be held in registers at a time as seen in the 4-bit table based asm_thumb version:

# --- internal ---
# r7 .. 4
# r6 .. 0x0f
# r5 .. idx or tab[idx]
# r4 .. data[0] working and test register
# --- arguments ---
# r3 .. table address
# r2 .. n (length of data)
# r1 .. data address
# r0 .. crc
@micropython.asm_thumb
def crc16_t4a2(r0, r1, r2, r3) -> int:
    mov(r7, 4)
    mov(r6, 0x0f)
    label(loop)
    ldrb(r4, [r1, 0])  # d = data[0]
    add(r1, 1)         # increment data pointer
    mov(r5, r0)        # idx = crc
    lsr(r5, r7)
    lsr(r5, r7)        # idx >>= 8
    eor(r5, r4)        # idx ^= d
    lsr(r5, r7)        # idx >>= 4
    and_(r5, r6)       # idx &= 0x0f
    add(r5, r5, r5)    # double to make idx ptr16
    add(r5, r5, r3)    # table data address
    ldrh(r5, [r5, 0])  # fetch table entry: idx = tab[idx]
    lsl(r0, r7)        # crc <<= 4
    eor(r0, r5)        # crc ^= idx
    mov(r5, r0)        # idx = crc
    lsr(r5, r7)
    lsr(r5, r7)
    lsr(r5, r7)        # idx >>= 12
    eor(r5, r4)        # idx ^= d
    and_(r5, r6)       # idx &= 0x0f
    add(r5, r5, r5)    # double to make idx ptr16
    add(r5, r5, r3)    # table data address
    ldrh(r5, [r5, 0])  # fetch table entry: idx = tab[idx]
    lsl(r0, r7)        # crc <<= 4
    eor(r0, r5)        # crc ^= idx
    sub(r2, 1)         # n -= 1
    bne(loop)          # --- test and outer end ---
    movw(r6, 0xffff)
    and_(r0, r6)

For a left shift of 12 bits we had to use three times lsr(r5, r7), where r7=4, because r8 is not available. Same with value 8.

Fortunately we found another symmetry that allows to use the right-shift 4-bit table based algorithm (with swapped order of nibble evaluation of the read-in data bytes) with the left-shift-initialized table.

So we are at 0.285µs per byte (instead of 0.35) at the left-shift variant too.

@mendenm: You need 4-bit table base viper (for universality) and asm_thumb versions of your CRC, right?

1 reply

mendenm Nov 11, 2023

@rkompass I think that would be a useful thing to have. We still need to think about how all these CRC functions will be presented. We probably just want a collection of completely independent files, with one in each file, so the use can only load the one they intend to use and which is compatible with their platform.

rkompass · 2023-11-11T20:03:04Z

rkompass
Nov 11, 2023

@mendenm:
For the SD card driver here is now my new 4-bit table based CRC.
Like your previous version it does try on @micropython.asm_thumb and falls back on the viper version.
I included the assert:
assert crc16_ccitt(0x0000, b'\x31\x32\x33\x34\x35\x36\x37\x38\x39', 9) == 0x31C3, "Imported CRC function failed"

Your speed test reports ~ 20-25 % less bytes/s, which is due to the 4-bit table speed/memory trade off.
On the memory side there is this benefit: The import takes only 672 bytes instead of 3216 bytes with the old version (may vary on platform/test).

Note that the crc16 function takes 3 arguments now: crc16_ccitt(crc: int, data: ptr8, n: int) where n is len(data)
crc16_ccitt.zip

9 replies

mendenm Nov 11, 2023

@robert-hh I think the final release will pretty much end up with one version in each file, and the user installs and imports the version they want. That is why I made the mechanism in the new sdcard driver just accept a crc16 function as an argument. The sdcard module itself doesn't care where it comes from. It gives complete flexibility to break this up. This is a case where we don't have to have a single 'optimum' module. There are multiple parameters that can be selected (size vs. speed, asm vs. regular python, etc.), and which ones are very use-case dependent.

I would appreciate you, as a real expert in micropython, taking a look at my new code in the PR micropython/micropython-lib#765 and giving any input you have. Note that the crc16 module in that PR will get superseded by some of the improved versions in the crc project going on here.

rkompass Nov 11, 2023

@robert-hh The first point would be mended by the wrapper, while still enabling one to import the underscored basic function.

As for the second point: Would be another try/except fallback to ordinary Python (8-bit table based for speed) an option?
Or evaluation of architecture and the relevant support bits? Which will change too (e.g. with Risc-V).

Or do you rather suggest a selection of files for: MP, viper, asm_thumb, asm_xtensa? This no problem as long as we only have this one CRC variant. With all possible variants the combinatorics would prevent his, of course.

What shall I do?

mendenm Nov 11, 2023

@rkompass I personally don't mind the combinatorics. You have already written them, and files that give the user choices, without needing to be installed, are harmless. I am follow the lead of the microdot project, which I find quite nicely executed in the respect.

rkompass Nov 11, 2023

Interesting: I rather don't like excessive combinatorics of files (just my personal opinion) - in the long run.

I like to do all variants now for pleasure and for finding the optimal trade-off.
No problem to do Xtensa now, as I'm just in it anyway.
But later I know I'll have forgotten (and like to forget) about all this and then I'll avoid to read about the necessary selections somewhat like a newbie. Because then my mind will be full of other stuff.

This is why I thought of this variant:
You just putting the CRCs in their current form into the sdcard.py file.
Because they are slim enough, and you minimized the driver, this will not produce memory overhead.
CRC16 is optional anyway. So you may do a check whether it is usable and only then (and if it is selected as option, of course) activate the internal CRC check. The driver would still work on all platforms like before and naive users wouldn't take notice.
Perhaps a message at init, in case CRC was selected and is not usable - that's all.
CRC7 as ordinary python, because it is used at init() only.

But your argument taking the crc function (being default None) is better.

mendenm Nov 11, 2023

note that crc7 is used on all commands, not just at init. however, because it is only for the 5-byte command blocks, it doesn't have to be nearly as optimized as the crc16, which is for all the data (and for multi-block reads and writes, you get a LOT of data for one command!).

xyzzy42 · 2024-02-08T12:31:35Z

xyzzy42
Feb 8, 2024

I did a quick benchmark on an ESP32-S3 running at 240 MHz.

I tried the micropython.native version by @robert-hh, from the forum thread referenced here, the crc16_tv() viper version, and I also wrote an mpy interface to a CRC16 routine in the ESP32-S3's ROM and benchmarked that. I used a 400 byte buffer to test on.

8bit-table-based native version: crc16_tn
crc16: 1255.38 µs total, 3.138442 µs/byte

8bit-table-based viper version: crc16_tv
crc16: 98.28 µs total, 0.245688 µs/byte

ESP32-S3 ROM code: esp32.crc16_le
crc16: 27.52 µs total, 0.068797 µs/byte

Compared to @mendenm's benchmarks for a rp2040 in an earlier post, it seems like the ESP32-S3 is about 10x faster for both the native and viper versions. Which is quite a bit more than I would have expected.

And the ESP32 code is about 4x faster still.

0 replies

Is there fast CRC16 in pyboard #12891

Uh oh!

Replies: 18 comments · 38 replies

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

robert-hh Nov 8, 2023 Collaborator

Uh oh!

Uh oh!

Uh oh!

robert-hh Nov 8, 2023 Collaborator

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

robert-hh Nov 9, 2023 Collaborator

Uh oh!

Uh oh!

robert-hh Nov 9, 2023 Collaborator

Uh oh!

Replies: 18 comments 38 replies

robert-hh Nov 8, 2023
Collaborator

robert-hh Nov 8, 2023
Collaborator

robert-hh
Nov 9, 2023
Collaborator

robert-hh Nov 9, 2023
Collaborator