Is there fast CRC16 in pyboard #12891
Replies: 18 comments 38 replies
-
|
I'd rewrite that and use the @micropython.viper decorator. |
Beta Was this translation helpful? Give feedback.
-
|
In the sdcard driver by @mendenm there is a fast crc16 implementation that uses a 256 byte-wide table and |
Beta Was this translation helpful? Give feedback.
-
|
One thing i may implement on the Pico/rp2 based machines is the use of a DMA channel to compute crc16. I need to look at which CRC16 they compute, and then see if it is fast, including the DMA setup overhead time. The DMA on rp2040 can do this as the data passes thrugh, and if one just sets one up to read the data block and pass it into a single byte somewhere, it would be an awesomely fast CRC16.
… On Nov 5, 2023, at 12:46 PM, Raul Kompaß ***@***.***> wrote:
Correction: The above linked crc16 uses another poly (CCIT).
For a perhaps faster table-based approach, where the poly is user-defined see this older discussion.
Certainly the crc-computing part may be accelerated further by using viper too, or asm.thumb.
As for the pybard: It has hardware crc32, with a fixed polynomial, but not crc16, according to this doc. Related MP code is here.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
|
Nice, @rkompass, you even figured out how to do the asm version in a couple less instructions than I did, by handling the shifting/byte swapping more cleanly, although this is the other common crc16 polynomial. I will have to see if the ccitt/sdcard one can be done with the same improved swapping.
Marcus
… On Nov 6, 2023, at 6:21 PM, Raul Kompaß ***@***.***> wrote:
Here is @GitHubsSilverBullet 's comparison extended for the table approach and asm_thumb. Was curious to see how far it would go.
Found: assembler + table together lead to another 20x speedup.
from time import ticks_us, ticks_diff
from array import array
PRESET = const(0xffff)
POLY = const(0xa001)
def crc16(data): # data.. bytearray or bytes object of data
crc = PRESET
for d in data:
crc ^= d
for _ in range(8):
if crc & 1:
crc = crc>>1 ^ POLY
else:
crc >>= 1
return crc
@micropython.native
def crc16_n(data):
crc = PRESET
for d in data:
crc ^= d
for _ in range(8):
if crc & 1:
crc = crc>>1 ^ POLY
else:
crc >>= 1
return crc
@micropython.viper
def crc16_v(data: ptr8, num: int) -> int:
crc: int = PRESET
i: int = 0
while i < num:
crc = data[i] ^ crc
j: int = 8
while j > 0:
if crc & 1:
crc = crc>>1 ^ POLY
else:
crc = crc>>1
j -= 1
i += 1
return crc
# --- internal ---
# r6 .. 1
# r5 .. loop index (8, 7, .., 1)
# r4 .. working and test register
# --- arguments ---
# r3 .. poly
# r2 .. n (length of data)
# r1 .. data address
# r0 .. crc
@micropython.asm_thumb
def crc16_a(r0, r1, r2, r3) -> int:
mov(r6, 1) # helper value for bit shift
label(loop) # --- outer loop --
ldrb(r4, [r1, 0]) # r4 = data[0]
add(r1, 1) # increment data pointer
eor(r0, r4) # crc ^= data[0]
mov(r5, 8) # r5 = 8, prepare inner loop
label(loop_8) # --- inner loop ---
mov(r4, 1)
and_(r4, r0) # if crc & 1:
beq(lelse)
lsr(r0, r6) # crc >>= 1
eor(r0, r3) # crc ^= poly
sub(r5, 1)
bne(loop_8)
b(fin_8)
label(lelse) # else:
lsr(r0, r6) # crc >>= 1
sub(r5, 1)
bne(loop_8)
label(fin_8) # --- inner end ---
sub(r2, 1) # n -= 1
bne(loop) # --- test and outer end ---
# --------------------------------------------------------
# Table based implementations
# create a single entry to the CRC table
def _initial(c):
crc = c
for j in range(8):
if crc & 0x01:
crc = (crc >> 1) ^ POLY
else:
crc = crc >> 1
return crc
# Create the table
_tab = array("H", [_initial(i) for i in range(256)])
def crc16_t(data):
crc = PRESET
for d in data:
crc = (crc >> 8) ^ _tab[(crc ^ d) & 0xff]
return crc
@micropython.native
def crc16_tn(data):
crc = PRESET
for d in data:
crc = (crc >> 8) ^ _tab[(crc ^ d) & 0xff]
return crc
@micropython.viper
def crc16_tv(data: ptr8, num: int, tab: ptr16) -> int:
crc: int = PRESET
i: int = 0
idx:int = 0
while i < num:
idx = (crc ^ data[i]) & 0xff
crc = (crc >> 8) ^ tab[idx]
i += 1
return crc
# --- internal ---
# r7 .. 8
# r6 .. 0xff
# r5 .. idx
# r4 .. working and test register
# --- arguments ---
# r3 .. table address
# r2 .. n (length of data)
# r1 .. data address
# r0 .. crc
@micropython.asm_thumb
def crc16_ta(r0, r1, r2, r3) -> int:
mov(r7, 8)
mov(r6, 0xff)
label(loop)
ldrb(r5, [r1, 0]) # idx = data[0]
add(r1, 1) # increment data pointer
eor(r5, r0) # idx ^= crc
and_(r5, r6) # idx ^= 0xff
add(r5, r5, r5) # double to make idx ptr16
add(r5, r5, r3) # table data address
ldrh(r5, [r5, 0]) # fetch table entry: r5 = tab[idx]
lsr(r0, r7) # crc >>= 8
eor(r0, r5) # crc ^= tab[idx]
sub(r2, 1) # n -= 1
bne(loop) # --- test and outer end ---
# -------------- actual tests -----------------------
buffer = bytearray((i%256 for i in range(20_000)))
for i in range(3):
t0 = ticks_us()
result = crc16(buffer)
t1 = ticks_us()
td = ticks_diff(t1,t0)
print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')
for i in range(3):
t0 = ticks_us()
result = crc16_n(buffer)
t1 = ticks_us()
td = ticks_diff(t1,t0)
print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')
for i in range(3):
t0 = ticks_us()
result = crc16_v(buffer, len(buffer))
t1 = ticks_us()
td = ticks_diff(t1,t0)
print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')
for i in range(3):
t0 = ticks_us()
result = crc16_a(PRESET, buffer, len(buffer), POLY)
t1 = ticks_us()
td = ticks_diff(t1,t0)
print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')
for i in range(3):
t0 = ticks_us()
result = crc16_t(buffer)
t1 = ticks_us()
td = ticks_diff(t1,t0)
print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')
for i in range(3):
t0 = ticks_us()
result = crc16_tn(buffer)
t1 = ticks_us()
td = ticks_diff(t1,t0)
print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')
for i in range(3):
t0 = ticks_us()
result = crc16_tv(buffer, len(buffer), _tab)
t1 = ticks_us()
td = ticks_diff(t1,t0)
print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')
for i in range(3):
t0 = ticks_us()
result = crc16_ta(PRESET, buffer, len(buffer), _tab)
t1 = ticks_us()
td = ticks_diff(t1,t0)
print(f'crc16: 0x{result:4x}, {td/len(buffer)}µs per byte')
Results on my Blackpill (96MHz):
crc16: 0x9cb8, 85.5907µs per byte
crc16: 0x9cb8, 85.58925µs per byte
crc16: 0x9cb8, 85.59105µs per byte
crc16: 0x9cb8, 42.5017µs per byte
crc16: 0x9cb8, 42.5013µs per byte
crc16: 0x9cb8, 42.5032µs per byte
crc16: 0x9cb8, 3.473µs per byte
crc16: 0x9cb8, 3.47325µs per byte
crc16: 0x9cb8, 3.47295µs per byte
crc16: 0x9cb8, 1.01305µs per byte
crc16: 0x9cb8, 1.01295µs per byte
crc16: 0x9cb8, 1.013µs per byte
crc16: 0x9cb8, 14.7685µs per byte
crc16: 0x9cb8, 14.76845µs per byte
crc16: 0x9cb8, 14.7679µs per byte
crc16: 0x9cb8, 10.3521µs per byte
crc16: 0x9cb8, 10.3505µs per byte
crc16: 0x9cb8, 10.35255µs per byte
crc16: 0x9cb8, 0.6587µs per byte
crc16: 0x9cb8, 0.65865µs per byte
crc16: 0x9cb8, 0.6587µs per byte
crc16: 0x9cb8, 0.1688µs per byte
crc16: 0x9cb8, 0.1688µs per byte
crc16: 0x9cb8, 0.1688µs per byte
There is a memory-penalty of 512 bytes for the table-based approach, of course.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
|
@rkompass yes, I was trying to figure out if the CRC that you are using really amount to a bit or byte reversed version of the SD card CCITT one. I haven't had time to look into it.
Marcus
… On Nov 7, 2023, at 4:30 AM, Raul Kompaß ***@***.***> wrote:
Your code was a clearly a helper. I'm beginning to get an idea now why sometimes bit-reverse is involved in crc computation. Could it be that the ccitt/sdcard computation you implemented is only compatible with the codes presented here after adding such a reverse?
asm_thumb has a bit-reverse instruction btw..
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
|
@rkompass The fast way to bit-reverse bytes is a 256 byte lookup table, if you don't have a built-in function to do it. The rbit8 and rbit16 functions below can only be considered placeholders; they are very slow.
… On Nov 7, 2023, at 11:40 AM, Raul Kompaß ***@***.***> wrote:
Found that it is compatible by bit-reversing the bytes to be processed, the polynomial, the initial value and the crc result:
# Table based implementations of crc - comparison
from array import array
# ------- implementation by rkompass ------
# create a single entry of CRC table
def _initial(crc, poly): # by roberthh: https://forum.micropython.org/viewtopic.php?t=12837
for j in range(8):
if crc & 1:
crc = (crc >> 1) ^ poly
else:
crc = crc >> 1
return crc
# --- internal ---
# r7 .. 8
# r6 .. 0xff
# r5 .. idx
# r4 .. working and test register
# --- arguments ---
# r3 .. table address
# r2 .. n (length of data)
# r1 .. data address
# r0 .. crc
@micropython.asm_thumb
def crc16_ta(r0, r1, r2, r3) -> int:
mov(r7, 8)
mov(r6, 0xff)
label(loop)
ldrb(r5, [r1, 0]) # idx = data[0]
add(r1, 1) # increment data pointer
eor(r5, r0) # idx ^= crc
and_(r5, r6) # idx ^= 0xff
add(r5, r5, r5) # double to make idx ptr16
add(r5, r5, r3) # table data address
ldrh(r5, [r5, 0]) # fetch table entry: r5 = tab[idx]
lsr(r0, r7) # crc >>= 8
eor(r0, r5) # crc ^= tab[idx]
sub(r2, 1) # n -= 1
bne(loop) # --- test and outer end ---
# ------- alternative implementation by mendenm ------.
# create a single entry of CRC table
def _initial2(crc, poly):
crc <<= 8;
for bit in range(8):
crc = crc << 1
if ((crc & 0x10000) != 0): crc ^= poly
return crc & 0xffff
# @micropython.viper
# def crc16_tv2(crc: int, data: ptr8, n: int, tab: ptr16) -> int:
# i:int = 0
# while i < n:
# crc = ((crc << 8) & 0xffff) ^ tab[((crc>>8) ^ data[i]) & 0xff]
# i += 1
# return crc
# r3 .. count
# r2 .. CRC
# r1 .. table (address)
# r0 .. data (address)
@micropython.asm_thumb
def crc16_ta2(r0, r1, r2, r3) -> int:
mov(r4, 0)
mvn(r4, r4) #all ones now
mov(r7,16)
lsr(r4, r7) #R4 = half-word of ones
mov(r5, 0xff) #used for byte masking
label(loop)
mov(r6, r2) #copy current CRC
mov(r7, 8)
lsr(r6, r7) #crc >> 8
ldrb(r7, [r0,0]) #fetch new byte dp[idx]
add(r0,1) #push to next byte address
eor(r6, r7) #r6 = (crc>>8) ^ dp[idx]
and_(r6, r5) #mask byte ( (crc>>8) ^ dp[idx]) & 0xff
add(r6, r6, r6) #double for table offset
add(r6, r6, r1) #table data address
ldrh(r6,[r6,0]) #fetch table syndrome
mov(r7,8)
lsl(r2, r7) # CRC << 8
and_(r2, r4) #(crc << 8) & 0xffff)
eor(r2, r6) #new CRC
sub(r3, 1)
bne(loop)
mov(r0, r2)
# -- Bit reverse functions by Peter Hinch: https://github.com/peterhinch/micropython-samples/blob/master/reverse/reverse.py ------
def rbit8(v): # bit reverse of 8 bit value
v = (v & 0x0f) << 4 | (v & 0xf0) >> 4
v = (v & 0x33) << 2 | (v & 0xcc) >> 2
return (v & 0x55) << 1 | (v & 0xaa) >> 1
def rbit16(v): # bit reverse of 16 bit value
v = (v & 0x00ff) << 8 | (v & 0xff00) >> 8
v = (v & 0x0f0f) << 4 | (v & 0xf0f0) >> 4
v = (v & 0x3333) << 2 | (v & 0xcccc) >> 2
return (v & 0x5555) << 1 | (v & 0xaaaa) >> 1
# -------- Simple comparison ----------
data = b'\xFF\x03\x02\x15\x28\x9F\x1F' # crc == 0 if data == b'\xFF\x03\x02\x15\x28\x9F\x1E'
PRESET = 0x0000
POLY = 0x1021
_tab2 = array("H", (_initial2(byt, POLY) for byt in range(256)))
# crc2 = crc16_tv2(PRESET, data, len(data), _tab2)
crc2 = crc16_ta2(data, _tab2, PRESET, len(data))
print("\nCRC2: ", hex(crc2), len(data))
print(f'Binary repr: {crc2:016b}')
_tab = array("H", [_initial(i, rbit16(POLY)) for i in range(256)]) # Create the table, based on _initial(), POLY
crc = rbit16(crc16_ta(rbit16(PRESET), bytes((rbit8(b) for b in data)), len(data), _tab))
print("\nbit-reversed CRC of bit-reversed data: ", hex(crc), len(data))
print(f'Binary repr: {crc:016b}')
result:
CRC2: 0x4c9c 7
Binary repr: 0100110010011100
bit-reversed CRC of bit-reversed data: 0x4c9c 7
Binary repr: 0100110010011100
-- certainly something that every informatics student is aware of (?!?).
Now the question is: If we want to be able to represent every crc-16 variant, do we need the reverse? Apparently. Can the bit-reverse of the bytes be done fast enough, or should there rather be switch of the inner loop?
See these crc-16 variants.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
|
@rkompass
I suspect that the slightly longer asm_thumb code I write saves more time & effort than having to bit-reverse everything for the other (faster) polynomial. I had sort of thought, though, that the bit reversal was going to be the solution, since some devices shift in LSB first, and others MSB first.
… On Nov 7, 2023, at 11:40 AM, Raul Kompaß ***@***.***> wrote:
Found that it is compatible by bit-reversing the bytes to be processed, the polynomial, the initial value and the crc result:
# Table based implementations of crc - comparison
from array import array
# ------- implementation by rkompass ------
# create a single entry of CRC table
def _initial(crc, poly): # by roberthh: https://forum.micropython.org/viewtopic.php?t=12837
for j in range(8):
if crc & 1:
crc = (crc >> 1) ^ poly
else:
crc = crc >> 1
return crc
# --- internal ---
# r7 .. 8
# r6 .. 0xff
# r5 .. idx
# r4 .. working and test register
# --- arguments ---
# r3 .. table address
# r2 .. n (length of data)
# r1 .. data address
# r0 .. crc
@micropython.asm_thumb
def crc16_ta(r0, r1, r2, r3) -> int:
mov(r7, 8)
mov(r6, 0xff)
label(loop)
ldrb(r5, [r1, 0]) # idx = data[0]
add(r1, 1) # increment data pointer
eor(r5, r0) # idx ^= crc
and_(r5, r6) # idx ^= 0xff
add(r5, r5, r5) # double to make idx ptr16
add(r5, r5, r3) # table data address
ldrh(r5, [r5, 0]) # fetch table entry: r5 = tab[idx]
lsr(r0, r7) # crc >>= 8
eor(r0, r5) # crc ^= tab[idx]
sub(r2, 1) # n -= 1
bne(loop) # --- test and outer end ---
# ------- alternative implementation by mendenm ------.
# create a single entry of CRC table
def _initial2(crc, poly):
crc <<= 8;
for bit in range(8):
crc = crc << 1
if ((crc & 0x10000) != 0): crc ^= poly
return crc & 0xffff
# @micropython.viper
# def crc16_tv2(crc: int, data: ptr8, n: int, tab: ptr16) -> int:
# i:int = 0
# while i < n:
# crc = ((crc << 8) & 0xffff) ^ tab[((crc>>8) ^ data[i]) & 0xff]
# i += 1
# return crc
# r3 .. count
# r2 .. CRC
# r1 .. table (address)
# r0 .. data (address)
@micropython.asm_thumb
def crc16_ta2(r0, r1, r2, r3) -> int:
mov(r4, 0)
mvn(r4, r4) #all ones now
mov(r7,16)
lsr(r4, r7) #R4 = half-word of ones
mov(r5, 0xff) #used for byte masking
label(loop)
mov(r6, r2) #copy current CRC
mov(r7, 8)
lsr(r6, r7) #crc >> 8
ldrb(r7, [r0,0]) #fetch new byte dp[idx]
add(r0,1) #push to next byte address
eor(r6, r7) #r6 = (crc>>8) ^ dp[idx]
and_(r6, r5) #mask byte ( (crc>>8) ^ dp[idx]) & 0xff
add(r6, r6, r6) #double for table offset
add(r6, r6, r1) #table data address
ldrh(r6,[r6,0]) #fetch table syndrome
mov(r7,8)
lsl(r2, r7) # CRC << 8
and_(r2, r4) #(crc << 8) & 0xffff)
eor(r2, r6) #new CRC
sub(r3, 1)
bne(loop)
mov(r0, r2)
# -- Bit reverse functions by Peter Hinch: https://github.com/peterhinch/micropython-samples/blob/master/reverse/reverse.py ------
def rbit8(v): # bit reverse of 8 bit value
v = (v & 0x0f) << 4 | (v & 0xf0) >> 4
v = (v & 0x33) << 2 | (v & 0xcc) >> 2
return (v & 0x55) << 1 | (v & 0xaa) >> 1
def rbit16(v): # bit reverse of 16 bit value
v = (v & 0x00ff) << 8 | (v & 0xff00) >> 8
v = (v & 0x0f0f) << 4 | (v & 0xf0f0) >> 4
v = (v & 0x3333) << 2 | (v & 0xcccc) >> 2
return (v & 0x5555) << 1 | (v & 0xaaaa) >> 1
# -------- Simple comparison ----------
data = b'\xFF\x03\x02\x15\x28\x9F\x1F' # crc == 0 if data == b'\xFF\x03\x02\x15\x28\x9F\x1E'
PRESET = 0x0000
POLY = 0x1021
_tab2 = array("H", (_initial2(byt, POLY) for byt in range(256)))
# crc2 = crc16_tv2(PRESET, data, len(data), _tab2)
crc2 = crc16_ta2(data, _tab2, PRESET, len(data))
print("\nCRC2: ", hex(crc2), len(data))
print(f'Binary repr: {crc2:016b}')
_tab = array("H", [_initial(i, rbit16(POLY)) for i in range(256)]) # Create the table, based on _initial(), POLY
crc = rbit16(crc16_ta(rbit16(PRESET), bytes((rbit8(b) for b in data)), len(data), _tab))
print("\nbit-reversed CRC of bit-reversed data: ", hex(crc), len(data))
print(f'Binary repr: {crc:016b}')
result:
CRC2: 0x4c9c 7
Binary repr: 0100110010011100
bit-reversed CRC of bit-reversed data: 0x4c9c 7
Binary repr: 0100110010011100
-- certainly something that every informatics student is aware of (?!?).
Now the question is: If we want to be able to represent every crc-16 variant, do we need the reverse? Apparently. Can the bit-reverse of the bytes be done fast enough, or should there rather be switch of the inner loop?
See these crc-16 variants.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
|
@rkompass @robert-hh This looks really promising for the 4-bit table! It's pretty fast, and certainly a lot more compact. I like it.
… On Nov 8, 2023, at 10:18 AM, Raul Kompaß ***@***.***> wrote:
@robert-hh Obviously there are similar thoughts:
I just finished another round of coding, including viper and asm_thumb versions of the 4bit table crc, which requires 32 bytes.
I will present the table of benchmark results here.
Also related: My last suggestion in the sdcard discussion, where I suggested that
binascii.crc_hqx(data, value)
Compute a 16-bit CRC value of data, starting with value as the initial CRC, and return the result. This uses the CRC-CCITT polynomial x16 + x12 + x5 + 1, often represented as 0x1021. This CRC is used in the binhex4 format.
could/should be implemented.
My thoughts at present are:
Having this as part of binascii, i.e. coded internally in C, it would be present in optimal speed at every architecture.
The new sdcard driver could benefit from that, without bloating its code.
For other purposes it would be beneficial to have a reasonably fast crc16 library that is universal, like originally requested in this discussion. I imagine a class that has certain variants available by name, like in https://crccalc.com, that at initialization sets the polynomial, starting crc and direction of bit shift resp. reversal. and allows to configure those, of course, by desired values.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
|
@rkompass @robert-hh
OK, I had to play! I wrote a version of the SD-card left-shift for the DMA on the RP2040 at 125 MHz. The resulting crc is the same value I get for my sdcard crc16 module, when I use DMA CRC algorithm 2 (CCITT, not bit reversed) and seed 0.
Note that this timing is for 20_000 byte arrays. It goes to 0.02 µs per byte for 100_000 bytes. In other words, as I implemented it here, the time is almost 100% in setting up the DMA channel in python. I suspect if one really needed to do this for something useful, you would keep a channel all set up, write the source address and transfer count register, and hit go. It would probably then be right down to the 8 ns per byte limited by the clock speed.
I'm not sure why this would be useful, but when the DMA client for rp2040 gets finalized, I may add this to the heap of options.
Now I have other things I need to do for real work...
Marcus
dma-based crc16 for sd cards
crc16: 0x3359, 0.07150001µs per byte
crc16: 0x3359, 0.0729µs per byte
crc16: 0x3359, 0.07185µs per byte
python version: crc16
crc16: 0x9cb8, 78.6903µs per byte
crc16: 0x9cb8, 78.6903µs per byte
crc16: 0x9cb8, 78.6894µs per byte
native version: crc16_n
crc16: 0x9cb8, 33.9129µs per byte
crc16: 0x9cb8, 33.9116µs per byte
crc16: 0x9cb8, 33.9125µs per byte
viper version: crc16_v
crc16: 0x9cb8, 2.50005µs per byte
crc16: 0x9cb8, 2.49915µs per byte
crc16: 0x9cb8, 2.49895µs per byte
asm_thumb version: crc16_a
crc16: 0x9cb8, 0.57935µs per byte
crc16: 0x9cb8, 0.5795µs per byte
crc16: 0x9cb8, 0.5792µs per byte
8bit-table-based python version: crc16_t
crc16: 0x9cb8, 10.23065µs per byte
crc16: 0x9cb8, 10.22925µs per byte
crc16: 0x9cb8, 10.22905µs per byte
8bit-table-based native version: crc16_tn
crc16: 0x9cb8, 7.37165µs per byte
crc16: 0x9cb8, 7.3707µs per byte
crc16: 0x9cb8, 7.37125µs per byte
8bit-table-based viper version: crc16_tv
crc16: 0x9cb8, 0.4513µs per byte
crc16: 0x9cb8, 0.4508µs per byte
crc16: 0x9cb8, 0.45075µs per byte
8bit-table-based asm_thumb version: crc16_ta
crc16: 0x9cb8, 0.1151µs per byte
crc16: 0x9cb8, 0.11465µs per byte
crc16: 0x9cb8, 0.11465µs per byte
4bit-table-based python version: crc16_t4
crc16: 0x9cb8, 19.97395µs per byte
crc16: 0x9cb8, 19.9732µs per byte
crc16: 0x9cb8, 19.97325µs per byte
4bit-table-based native version: crc16_t4n
crc16: 0x9cb8, 12.63475µs per byte
crc16: 0x9cb8, 12.6351µs per byte
crc16: 0x9cb8, 12.63515µs per byte
4bit-table-based viper version: crc16_t4v
crc16: 0x9cb8, 0.6752µs per byte
crc16: 0x9cb8, 0.67485µs per byte
crc16: 0x9cb8, 0.6749µs per byte
4bit-table-based asm_thumb version: crc16_t4
crc16: 0x9cb8, 0.19475µs per byte
crc16: 0x9cb8, 0.1948µs per byte
crc16: 0x9cb8, 0.19475µs per byte
8bit-table-based right-shift viper version: crc16_tv2
crc16: 0x9cb8, 0.4846µs per byte
crc16: 0x9cb8, 0.48365µs per byte
crc16: 0x9cb8, 0.4839µs per byte
8bit-table-based right-shift asm_thumb version: crc16_ta2
crc16: 0x9cb8, 0.1316µs per byte
crc16: 0x9cb8, 0.1316µs per byte
crc16: 0x9cb8, 0.1317µs per byte
… On Nov 8, 2023, at 12:42 PM, Raul Kompaß ***@***.***> wrote:
Hello Marcus,
yes, I also thought of the 4-bit table version as sufficient and lean tradeoff.
Given that with quite fast SPI at 12 MHz baudrate the sdcard reading takes with your driver ~1.5 µs / byte this makes a 60-70% (viper) or 25-35% (asm) overhead. Much less with slower SPI of course.
I'll do optimized 4-bit table viper and asm_thumb versions for your left-shift crc variant. The benchmark was for right-shift.
Raul
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
|
I was guessing that something like the following would hold: Idea: if you have to do a bit-reversal in the byte to be processed, and that byte goes as index into a table, then alternatively also the Actually just a byte-swap suffices to transforms one implementation into the other, The CRC PRESET has to be byte-swapped and the resulting CRC. This is demonstrated in the following code: # Equivalence of 8-bit table based left-shift and right-shift crc16
from array import array
# bit reverse of 8 bit value
def rbit8(v):
v = (v & 0x0f) << 4 | (v & 0xf0) >> 4
v = (v & 0x33) << 2 | (v & 0xcc) >> 2
return (v & 0x55) << 1 | (v & 0xaa) >> 1
def rbit16(v): # bit reverse of 16 bit value
v = (v & 0x00ff) << 8 | (v & 0xff00) >> 8
v = (v & 0x0f0f) << 4 | (v & 0xf0f0) >> 4
v = (v & 0x3333) << 2 | (v & 0xcccc) >> 2
return (v & 0x5555) << 1 | (v & 0xaaaa) >> 1
# swap bytes of 2-byte number
def swbyte2(v):
return v >> 8 | (v & 0xff) << 8
# bit reverse bytes of 2-byte number
def rbytes2(v):
return rbit8(v >> 8) << 8 | rbit8(v & 0xff)
# create a single entry of CRC table, right shift version
def _initial(crc, poly): # by roberthh: https://forum.micropython.org/viewtopic.php?t=12837
for j in range(8):
if crc & 1:
crc = (crc >> 1) ^ poly
else:
crc = crc >> 1
return crc
# create a single entry of CRC table, left shift version
def _initial2(crc, poly):
crc <<= 8
for bit in range(8):
crc = crc << 1
if crc & 0x10000:
crc ^= poly
return crc & 0xffff
# crc computation, right shift version
@micropython.viper
def crc16_tv(crc: int, data: ptr8, n: int, tab: ptr16) -> int:
i: int = 0
idx:int = 0
while i < n:
idx = (crc & 0xff) ^ data[i] # crc_lowbyte ^ data
crc = (crc >> 8) ^ tab[idx] # (crc highbyte -> lowbyte) ^ tab
i += 1
return crc
# crc computation, left shift version, note that just the roles of high and low crc byte are exchanged
@micropython.viper
def crc16_tv2(crc: int, data: ptr8, n: int, tab: ptr16) -> int:
i:int = 0
idx:int = 0
while i < n:
idx = (crc >> 8) ^ data[i] # (crc highbyte -> lowbyte) ^ data
crc = ((crc & 0xff) << 8) ^ tab[idx] # (crc lowbyte -> highbyte) ^ tab
i += 1
return crc
data = b'\xFF\x03\x02\x15\x28\x9F\x1F' # crc == 0 if data == b'\xFF\x03\x02\x15\x28\x9F\x1E'
PRESET = 0x0000
POLY = 0x1021
print('\nmendenms left shift 8-bit table viper version:')
_tab = array("H", (_initial2(i, POLY) for i in range(256)))
crc = crc16_tv2(PRESET, data, len(data), _tab)
print(" CRC: ", hex(crc), len(data))
print(f'Binary repr: {crc:016b}')
print('\nright shift table viper version, just byte-reversed table:')
_tab = array("H", [swbyte2(_initial2(i, POLY)) for i in range(256)]) # Create the table, based on _initial(), POLY
crc = swbyte2(crc16_tv(swbyte2(PRESET), data, len(data), _tab))
print("byte-reversed CRC: ", hex(crc), len(data))
print(f'Binary repr: {crc:016b}')
print('\nright shift table viper version, just byte-reversed table:')
_tab = array("H", [rbytes2(_initial(rbit8(i), rbit16(POLY))) for i in range(256)]) # Create the table, based on _initial(), POLY
crc = swbyte2(crc16_tv(swbyte2(PRESET), data, len(data), _tab))
print("byte-reversed CRC: ", hex(crc), len(data))
print(f'Binary repr: {crc:016b}')
# _initial2(i, POLY) is equivalent to rbit16(_initial(rbit8(i), rbit16(POLY)) :
#
tab = array("H", (rbit16(_initial(rbit8(i), rbit16(POLY))) for i in range(256)))
tab2 = array("H", (_initial2(i, POLY) for i in range(256)))
print('\nindex tab tab2')
for i in range(3):
print(f' {i:3d} 0b{tab[i]:016b} 0b{tab2[i]:016b}')
print(' ... ...')
for i in range(253,256):
print(f' {i:3d} 0b{tab[i]:016b} 0b{tab2[i]:016b}')
print(' --- ---')
for i in (0b10110111, 0b11101101):
print(f'0b{i:08b} 0b{tab[i]:016b} 0b{tab2[i]:016b}')Of course, also the table initialization functions are related: So there is also no need to have two implementations of the lookup-table initialization functions. And of course: No need to do bit-reflections while computing the CRC, as long as it is done for speed, i.e. table-based. With a nibble (4-bit) lookup table the thing seems a bit different. Here we have to distinguish whether the lower or higher nibble is digested first. For the general case there have to be different functions to be held, presumably. Or an option of nibble-order. |
Beta Was this translation helpful? Give feedback.
-
|
@rkompass
Have you tested the swapped algorithms with an odd number of bytes going in? Maybe the hanging byte doesn't work. The must be (I say with no confidence at all) a reason that they have been so universally written as they are. And I would think (again with no great confidence) the number theorists would understand all teh reflection formulae that go with these operations.
Marcus
… On Nov 9, 2023, at 2:17 PM, Raul Kompaß ***@***.***> wrote:
Mathematical insight beats programming skill.
The above observation, which seems totally plausible, seems not to be present in the answers and expert example code, even e.g. on
stackoverflow.
It can't be I just discoverd something new.?. It's too simple for that.
But everyone seems to compute reflected bytes, if needed, for CRC. Despite using a lookup-table for speed.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
|
We should not over-engineer this topic. |
Beta Was this translation helpful? Give feedback.
-
|
@robert-hh
Your point is why I moved this whole issue outside of the new proposed sdcard.py code. You can hand in any function you want, so it's up to the user. I may put one of the most general codes @rkompass provides in crc16.py in the sdcard package, in case user wants a simple default.
However, there are interesting cases in which the default isn't optimal. The rp2040 computes CRCs in its DMA, which is used by the SPI transfer code in the rp2 port. With trivial modification to the SPI code, we can turn on CRCs there, and get the SDcards to get their CRC computed at no cost by the hardware.
Marcus
… On Nov 9, 2023, at 2:38 PM, Robert Hammelrath ***@***.***> wrote:
We should not over-engineer this topic.
My suggestion is to provide just two 16 entry table based versions for the straight and reflected approaches, with options to set the polynomial and the initial value. And then leave everything else to the user. One can add links to other variants in the documentation.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
|
@robert-hh
I seriously doubt if most MCU DMA units support CRC computation. That is why it would have to go in the port-specific SPI module (for example).
The rp2040 SPI module does use DMA, so it would be easily adapted. And, the way I am structuring sdcard.py, the function provided by the SPI module could be handed directly in to __init__, so you can use it if it exists.
I agree at this point that the CRCs probably belong in a separate place, although I have included one simple example in the sdcard module that I just created a PR for. That crc16.py will probably disappear once the improved versions from @rkompass get included.
Marxu
… On Nov 9, 2023, at 3:02 PM, Robert Hammelrath ***@***.***> wrote:
That can all got into a separate CRC section of micropython-lib. And I simply do not know if the SPI devices of all supported MCUs support CRC calculation.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
Agreed. My tendency, always :-| But otoh this makes things simpler too. Turns out, that the four 8-bit table-based universal implementations (raw, native, viper and asm_thumb) are already optimally implemented (the core of those), for both directions. The 4-bit left-shifting variant was giving me headaches. I tried many combinations of settings and always got a wrong CRC. RNWilliams93_PainlessGuidetoCRC.zip
The code example uses 7 bytes. I will test of course against the crc calculator website. Later. |
Beta Was this translation helpful? Give feedback.
-
|
An update: older right-shift variant for comparison: This is due to the algorithm itself, but also because there are more constants involved than cannot be held in registers at a time as seen in the 4-bit table based asm_thumb version: # --- internal ---
# r7 .. 4
# r6 .. 0x0f
# r5 .. idx or tab[idx]
# r4 .. data[0] working and test register
# --- arguments ---
# r3 .. table address
# r2 .. n (length of data)
# r1 .. data address
# r0 .. crc
@micropython.asm_thumb
def crc16_t4a2(r0, r1, r2, r3) -> int:
mov(r7, 4)
mov(r6, 0x0f)
label(loop)
ldrb(r4, [r1, 0]) # d = data[0]
add(r1, 1) # increment data pointer
mov(r5, r0) # idx = crc
lsr(r5, r7)
lsr(r5, r7) # idx >>= 8
eor(r5, r4) # idx ^= d
lsr(r5, r7) # idx >>= 4
and_(r5, r6) # idx &= 0x0f
add(r5, r5, r5) # double to make idx ptr16
add(r5, r5, r3) # table data address
ldrh(r5, [r5, 0]) # fetch table entry: idx = tab[idx]
lsl(r0, r7) # crc <<= 4
eor(r0, r5) # crc ^= idx
mov(r5, r0) # idx = crc
lsr(r5, r7)
lsr(r5, r7)
lsr(r5, r7) # idx >>= 12
eor(r5, r4) # idx ^= d
and_(r5, r6) # idx &= 0x0f
add(r5, r5, r5) # double to make idx ptr16
add(r5, r5, r3) # table data address
ldrh(r5, [r5, 0]) # fetch table entry: idx = tab[idx]
lsl(r0, r7) # crc <<= 4
eor(r0, r5) # crc ^= idx
sub(r2, 1) # n -= 1
bne(loop) # --- test and outer end ---
movw(r6, 0xffff)
and_(r0, r6)For a left shift of 12 bits we had to use three times Fortunately we found another symmetry that allows to use the right-shift 4-bit table based algorithm (with swapped order of nibble evaluation of the read-in data bytes) with the left-shift-initialized table. So we are at 0.285µs per byte (instead of 0.35) at the left-shift variant too. @mendenm: You need 4-bit table base viper (for universality) and asm_thumb versions of your CRC, right? |
Beta Was this translation helpful? Give feedback.
-
|
@mendenm: Your speed test reports ~ 20-25 % less bytes/s, which is due to the 4-bit table speed/memory trade off. Note that the crc16 function takes 3 arguments now: |
Beta Was this translation helpful? Give feedback.
-
|
I did a quick benchmark on an ESP32-S3 running at 240 MHz. I tried the micropython.native version by @robert-hh, from the forum thread referenced here, the Compared to @mendenm's benchmarks for a rp2040 in an earlier post, it seems like the ESP32-S3 is about 10x faster for both the native and viper versions. Which is quite a bit more than I would have expected. And the ESP32 code is about 4x faster still. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
I used the pyboard to communication with RS485 device, when calucation the CRC value, the time beyond 2mS. Follows as my code:
`def calculateCRC(msgIn):
CRC16 = 0xffff
poly = 0xa001
`
Is there any algorithm to be suit pyboard?Thanks!
Beta Was this translation helpful? Give feedback.
All reactions