-
|
Another approach to passing 32-bit (rather than 30-bit) ints to a Viper or asm function is to populate a 32-bit integer array with the values and pass the array as an arg. |
-
|
The one I had high hopes might work was

    SIO = const(0xd0000000)

    @micropython.viper
    def callback(_):
        sio = ptr32(SIO)

but even that involves a 700ns lookup. I assume the const substitution happens before the viper decorator even gets a chance to see it?

On the subject of array() and bytearray(), if For me, this 100x difference really hammered home the importance of taking local pointers once at the start of a viper function, rather than doing any namespace lookups inside tight loops, as the documentation already cautions.

I see that

    x[0] = 324
    x[1] = 324
    x[2] = 324

takes 40ns whereas

    for i in range(3):
        x[i] = 324

takes 560ns and even

    i = int(0)
    while i < 3:
        x[i] = 324
        i = i + 1

takes 510ns, although luckily viper still doesn't need to allocate for the range(3). (Within viper, I don't think I need the int(0) there - it would automatically be int32 when assigned as 0?)

It'd be interesting to build a debug version of the code emitters and see what they're actually doing here. I'll need to work out how, though... |
-
|
Yes, I found the documentation and started to dig through the emitnative.c code generator a bit too, although I have not yet figured out how to build a debug version of mpy-cross that logs the generated assembler from viper so I can experiment with its behaviour more directly. (I rather lazily started disassembling the relevant bytes of the .mpy file, but that's a horrific way to do it!)

My measurement harness is as simple as

    from machine import Pin, Timer
    import micropython

    Pin(0, Pin.OUT)

    @micropython.viper
    def callback(_):
        sio = ptr32(uint(0xd0) << 24)
        sio[6] = 1
        # Insert something here
        sio[8] = 1

    # Timer is imported by name above, so don't go through the machine module
    timer = Timer(freq=1000, mode=Timer.PERIODIC,
                  callback=callback, hard=True)

where the block to test is pasted inline. (Obviously it can't use the predefined sio as that would be cheating.) Without any additions, this produces a 14ns pulse: essentially the two cycles to write sio[8]. I can offset the trigger to measure just the extra time added by the code spliced in. Inserting

I can't find any way to write the fast version that has the literal 0xd0000000 in it, nor any way to express addresses between 0x40000000 and 0xc0000000 faster/clearer than a shifted smaller constant. If python had macros or define-time/inline expanded functions, I could write a helper that takes a define-time constant and emits something that viper will optimise well, but sadly python is not scheme.

Yes, I expected the while loop to be super-fast too, and I don't really understand why it's not. With no other code spliced in than:

    i = 0
    while i < 4:
        i = i + 1

the loop over four values of i doing nothing still costs 390ns or about 59 cycles.

    i = 4
    while i:
        i = i - 1

is a little bit better at 240ns but still not brilliant: I guess about 34 cycles? |
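For what it's worth, the awkward address range can be checked on the host with a few lines of plain Python. This is only a sketch: it assumes small ints on 32-bit MicroPython ports are 31-bit signed values (-0x40000000 to 0x3fffffff), and the helper names are made up for illustration rather than taken from MicroPython.

```python
# Assumption: small ints on 32-bit MicroPython ports are 31-bit signed.
SMALL_INT_MIN = -0x40000000
SMALL_INT_MAX = 0x3fffffff


def fits_small_int(n):
    """True if n can be held without allocating a big-int object."""
    return SMALL_INT_MIN <= n <= SMALL_INT_MAX


def as_signed32(addr):
    """Reinterpret an unsigned 32-bit address as a signed 32-bit value."""
    addr &= 0xffffffff
    return addr - 0x100000000 if addr & 0x80000000 else addr


def shifted_form(addr):
    """Return (value, shift) such that value << shift == addr,
    with all trailing zero bits moved into the shift."""
    shift = 0
    while addr and not addr & 1:
        addr >>= 1
        shift += 1
    return addr, shift


def literal_strategy(addr):
    """Classify how an address could be written as a literal (hypothetical labels)."""
    if fits_small_int(addr):
        return "plain"              # e.g. ptr32(0x20000000)
    if fits_small_int(as_signed32(addr)):
        return "negative"           # e.g. ptr32(-0x30000000) for 0xd0000000
    return "build-at-runtime"       # e.g. ptr32(uint(value) << shift)
```

With these assumptions, 0xd0000000 comes out as "negative" (-0x30000000) and anything from 0x40000000 up to but not including 0xc0000000 as "build-at-runtime". The shifted form only helps when the address has enough trailing zero bits, as 0xd0000000 = 13 << 28 does.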
-
|
Nice, there's basically zero jitter in the overhead on your test harness. Saves firing up a scope! I added two extra columns, one for PS For me the RVR register comes up as zero so I added |
-
|
Here's a slightly boiled down version which disables interrupts during the test:

    import machine
    import micropython

    @micropython.viper
    def test() -> uint:
        ppb = ptr32(-0x20000000)        # PPB = 0xe0000000
        ppb[0x3804] = 0b101             # CSR = CLKSOURCE | ENABLE
        ppb[0x3805] = -1                # RVR = maximum
        state = machine.disable_irq()
        t0 = ppb[0x3806]                # t0 = CVR
        t1 = ppb[0x3806]                # t1 = CVR
        x = ptr32(-0x30000000)          # Line to benchmark
        t2 = ppb[0x3806]                # t2 = CVR
        machine.enable_irq(state)
        # Difference between t1 - t2 and t0 - t1 masked with RVR:
        return uint(t1 - t2 - t0 + t1) & ppb[0x3805]

    print(*(test() for _ in range(20)))

and some corresponding results for variants you and I have measured above: When I look at the initially puzzling difference between |
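The arithmetic in the return statement can be tried on the host without any hardware. The sketch below assumes the SysTick counter is a 24-bit down-counter clocked at the 150MHz processor clock, so the result is a count of cycles rather than a time; the function names are illustrative only.

```python
SYSTICK_MASK = 0x00ffffff  # SysTick current-value register is 24 bits wide


def elapsed_cycles(t0, t1, t2, mask=SYSTICK_MASK):
    """Cycles spent in the benchmarked line, given three readings of a
    down-counter: t0 and t1 back to back, t2 after the line under test."""
    overhead = t0 - t1   # cost of one back-to-back counter read
    measured = t1 - t2   # counter read plus the benchmarked line
    # Masking makes the result correct even if the counter wrapped.
    return (measured - overhead) & mask


def cycles_to_ns(cycles, clock_hz=150_000_000):
    """Convert a cycle count to whole nanoseconds (assumed clock rate)."""
    return cycles * 1_000_000_000 // clock_hz
```

For example, readings of 1000, 990 and 950 give 30 cycles, which is 200ns at 150MHz. The same 30 cycles come out when the counter wraps between t0 and t1, for instance with readings of 5, 0xfffffb and 0xffffd3.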
-
|
Another curious factor with gpio timing is that:

    p = Pin(10)
    p(1)  # set pin high

is actually noticeably faster than:

    p = Pin(10)
    p.value(1)  # set pin high

This is because in the second case there's a dictionary lookup internally to find the |
-
|
Interesting, and 420ish cycles vs 700ish cycles (a bit jittery): almost twice as fast as you say. I didn't know you could call pin objects directly like that! I wondered if I'd overlooked it when reading the documentation. The machine.Pin reference does mention Maybe I should cook up a docs PR? Another fun one I stumbled across: MicroPython interns strings, so comparing two strings in viper is cheap and constant time, as is comparing two pin objects (say). There isn't an [Edit after reading the code: no, comparing string reprs wouldn't be cheap like comparing objects is, because |
-
|
I'm getting strange results with @arachsys's timing framework, which I was using to time an interrupt routine:

    import array
    import machine
    import micropython  # needed for the @micropython.viper decorator below
    import time

    L_BUF = 16
    b_v = array.array('H', range(L_BUF))
    b_t = array.array('I', range(L_BUF))
    b_i = 0
    t_c = machine.ADC(machine.ADC.CORE_TEMP)
    t_f = machine.Timer()

    def fill(_):
        global b_i
        b_t[b_i] = time.ticks_us()
        b_v[b_i] = t_c.read_u16()
        b_i += 1
        if b_i == L_BUF:
            t_f.deinit()

    F_FILL = 16_000  # Hz
    t_f.init(mode=machine.Timer.PERIODIC, freq=F_FILL, callback=fill)
    time.sleep(2 * L_BUF / F_FILL)  # Allow time to fill buffer before printing results.
    print('b_v =', b_v)
    assert b_i == L_BUF
    print('freq =', F_FILL, 'Hz (period =', 1_000_000 / F_FILL, 'us).')
    print('Time differences from timer call:', [f - s for s, f in zip(b_t[:L_BUF - 1], b_t[1:])])

    @micropython.viper
    def test_v() -> uint:
        ppb = ptr32(0xe0000000)         # PPB
        ppb[0x3804] = 0b101             # CSR = PROC_CLKSOURCE | ENABLE
        ppb[0x3805] = -1                # RVR = maximum
        state = machine.disable_irq()
        t0 = ppb[0x3806]                # t0 = CVR
        t1 = ppb[0x3806]                # t1 = CVR
        fill(0)                         # Line to benchmark
        t2 = ppb[0x3806]                # t2 = CVR
        machine.enable_irq(state)
        # Difference between t1 - t2 and t0 - t1 masked with RVR:
        return uint(t1 - t2 - t0 + t1) & ppb[0x3805]

    b_i = 0
    print('Time intervals from viper timing framework:', [test_v() for _ in range(L_BUF)])
    print('Time differences from viper call:', [f - s for s, f in zip(b_t[:L_BUF - 1], b_t[1:])])

Results on RP Pico 2 are:

The code runs the ISR, fill, from a timer and from the viper timing framework. Run from the timer, it runs roughly every 62.5 us (as expected for a 16 kHz clock). Run from the viper framework, it runs roughly every 42 us. But the viper timing framework reports that each call takes around 4750 us! How can there be a call every 42 us if each call takes 4750 us? Note that there is not a constant ratio between the viper framework time and the time interval between calls! Can anyone explain what is going on? |
-
While measuring jitter on hard vs soft IRQs on rp2350 with a scope, I got distracted into benchmarking different ways to do fast GPIO from a viper hard IRQ handler. I found the numbers interesting, so thought I'd post them in case anyone else is interested too.
As a baseline, with the default 150MHz clock frequency, if pin = machine.Pin(0, machine.Pin.OUT), calling pin.value(1) from a viper function takes around 4us with roughly 700ns jitter.

A more direct mem32[0xd0000018] = 1 takes about 2us with c. 500ns jitter.

Of course, viper can write memory directly and this is much faster. If sio is a ptr32 to 0xd0000000, the directly equivalent sio[6] = 1 takes only 14ns.

But there's a little trap here: initialising sio = ptr32(0xd0000000) takes 700ns! The same goes for sio = ptr32(13 << 28) or sio = ptr32(int(0xd0000000)). However, sio = ptr32(int(13) << 28) or sio = ptr32(int(0xd0) << 24) are fine and take just 26ns.

Even though we're using viper, the argument to ptr32()/int() is a python integer, and if that's more than 30 bits we end up dereferencing an object. I understand why this happens but still managed to forget and end up surprised by it!

Similarly, something like ptr8(0)[0xd0000018] won't work at all because we're trying to index zero with an object, not a machine int. However, ptr32(0)[0x34000006] = 1 works fine and is fast at 40ns (= 14ns + 26ns).

Final measurement: if we define
then calling wibble() from another viper function costs 1.5us, so it can be quite costly to break up a viper callback in the absence of any way to create a define-time macro.
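The relationship between the byte addresses and the ptr32 indices quoted above is just a division by four, which can be checked on the host. A minimal sketch, assuming the same 31-bit signed small-int limit as before (0x3fffffff); word_index is a made-up helper name, not part of MicroPython.

```python
SIO_BASE = 0xd0000000
SMALL_INT_MAX = 0x3fffffff  # assumed small-int limit on 32-bit ports


def word_index(byte_address, base=0):
    """Index to use with a ptr32 at `base` to reach `byte_address`."""
    offset = byte_address - base
    if offset % 4:
        raise ValueError("address is not 32-bit aligned relative to base")
    return offset // 4


# Index relative to a ptr32 at the SIO base: matches sio[6]
sio_index = word_index(0xd0000018, SIO_BASE)

# Index relative to ptr32(0): matches ptr32(0)[0x34000006]
zero_index = word_index(0xd0000018)
```

Here zero_index is 0x34000006, which is below the assumed small-int limit, whereas the byte index 0xd0000018 that ptr8(0) would need is above it.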