#+tanco-format: 0.2
This is a test suite for pre.b4a, a parsing and regular expression toolkit for the b4 virtual machine.
> m1 @M ?d zp
ds: [-1]
> m0 @M ?d zp
ds: [0]
We will use the @M register to track whether
or not a pattern match has succeeded.
For brevity, the words m0 and m1 set this flag.
m1 also doubles as the command to match an empty
string.
> :S :K .. .. ..
> ln ?d zp
ds: [0]
> :S .. .. .. :K
> ln ?d /q
ds: [3]
@S marks the start of the current token,
and @K points to the current input character.
Subtracting @S from @K gives us the match length.
> pre ln ?d zp
ds: [0]
> c4 fw ln ?d /q
ds: [4]
> c4 !S m0 # set values we DON'T want
> pre @K @M ?d # init
ds: [4 -1]
The word pre sets @K to @S and @M to -1.
> :A m\ c4 rt
> m0 ^A ?d
ds: []
> m1 ^A ?d
ds: [4]
The word m\ causes a word to exit early if the match flag is not set.
The early exit works by popping one return value off the control
stack and then using the rt opcode.
Niladic here means that it doesn’t expect any arguments.
> :A m! c4 rt
> # when m0 the argument gets dropped
> m0 c2 ^A ?d
ds: []
> m1 c2 ^A ?d
ds: [2 4]
The word m! is like m\ in that it exits early if @M is 0,
but it also drops a single argument.
“Monadic” here means “taking one argument”.
> :S 'x 'y ..
> pre
> C0 any @M ln ?d /R
ds: [C0 -1 1]
> C1 any @M ln ?d /R
ds: [C1 -1 2]
> C2 any @M ln ?d /R
ds: [C2 0 2]
> m1 C3 end @M ln ?d /R
ds: [C3 -1 2]
> :S :K "hello"
> :Y lb 'h chr rt # should match and increment k
> # first test the match guard:
> m0 ^Y @M ln ?d zp zp
ds: [0 0]
> m1 ^Y @M ln ?d zp zp
ds: [-1 1]
> @S !K # reset pointer
> :N lb 'x chr rt # should not match
> ^N @M ln ?d zp zp
ds: [0 0]
chr sets the match bit according to whether or not the
current character ch matches the value on the stack.
chr should use m! internally and match only if @M is set.
If so, it increments @K by one position by calling nc.
> :S :K 'm ..
> # if m0, zap arg and do nothing
> m0 'z lte @M ln ?d /R
ds: [0 0]
> # if m1, run the test
> m1 'a lte @M ln ?d /R
ds: [0 0]
> m1 'z lte @M ln ?d /R
ds: [-1 0]
> m1 'm lte @M ln ?d /R
ds: [-1 0]
lte lets us tests whether the current character is less than the arguent.
It doesn’t advance @K because we will want to combine it with gte in a moment
to check character ranges.
lte takes one argument, so must call m!
Note there is no “less than or equal” op in b4, but you can simply
add one to calling lt.
> :S :K 'm ..
> # if m0, zap arg and do nothing
> C0 m0 'z gte @M ln ?d /R
ds: [C0 0 0]
> # if m1, run the test
> C1 m1 'a gte @M ln ?d /R # 'm >= 'a so match
ds: [C1 -1 0]
> C2 m1 'z gte @M ln ?d /R # 'm < 'z so fail
ds: [C2 0 0]
> C3 m1 'm gte @M ln ?d /R # 'm = 'm so match
ds: [C3 -1 0]
gte is basically the same as lte but with a different condition.
There is no “greater than or equal” opcode in b4, but you can write
it as “not less than”: lt nt
> :S :K ..
> m0 mf1 ln ?d zp
ds: [0]
> m1 mf1 ln ?d zp
ds: [1]
> :H lb 'A gte lb 'F lte mf1 rt # match on hex digit (range A-F)
> :S :K 'Q
> C0 m1 ^H @M ln ?d /R # Q not in range, so no match
ds: [C0 0 0]
> :S :K 'C
> C1 m1 ^H @M ln ?d /R # C is in the range, so match!
ds: [C1 -1 1]
> :S :K 'C
> C2 m0 ^H @M ln ?d /R # m0 was set so no matching allowed
ds: [C2 0 0]
We can now combine gte, lte and mfw into a sequence to create an idiom
for ranges. Each step already tests the match bit, so there’s no need
to call m! explicitly.
If you’d like to combine this into a “char-in-range” test that takes two
arguments, you might define a dyadic (two-argument) version of m!
(called m!! perhaps).
in fact, let’s just do that.
> :S '5 ..
> # case 0 it won't match if m0
> C0 pre m0 '0 '9 btw @M ln ?d /R
ds: [C0 0 0]
> # case 1 it's in range
> C1 pre m1 '0 '9 btw @M ln ?d /R
ds: [C1 -1 1]
> # case 2 it's above the range
> C2 pre m1 '0 '4 btw @M ln ?d /R
ds: [C2 0 0]
> # case 3 it's below the range
> C3 pre m1 '6 '9 btw @M ln ?d /R
ds: [C3 0 0]
> # case 4 it's the low end of range
> C4 pre m1 '5 '9 btw @M ln ?d /R
ds: [C4 -1 1]
> # case 5 it's the high end of range
> C5 pre m1 '0 '5 btw @M ln ?d /R
ds: [C5 -1 1]
> :S :K "54go"
> :digits ."0123456789"
> :G li `digits chs rt
> # remember the match guard!
> m0 ^G @M ln ?d zp zp
ds: [0 0]
> m1 ^G @M ln ?d zp zp
ds: [-1 1]
> m1 ^G @M ln ?d zp zp
ds: [-1 2]
> ^G @M ln ?d zp zp
ds: [0 2]
chs takes the address of a string of acceptable characters,
and succeeds if any of the charecters match.
Since it takes one argument, chs should call m!.
Note that in this case, the character set to match is a sequential
range, so we could have used the range idiom here as well (and in
fact, it’s probably faster to do so, since gte and lte don’t have to loop)
> :S :K "hello"
> :Y .[ ."he" .] lit rt
> m0 ^Y @M ln ?d zp zp
ds: [0 0]
> m1 ^Y @M ln ?d zp zp
ds: [-1 2]
> @S !K
> # if only partial match, we have to roll @K back
> :N .[ ."help" .] lit rt
> ^N @M ln ?d zp
ds: [0 0]
lit takes a string to match and succeeds if every
character in the input matches exactly.
It’s there to save you from having to write a long
sequence of chr operations.
Note that if lit succeeds, @K advances by the length
of the match, but if it fails, @K must go back
where it started.
> :S :K .. ..
> :A m0 .[ c4 +K zp rt .] try rt
> :B m1 .[ c4 +K zp rt .] try rt
> ^A ln ?d zp
ds: [0]
> @S !K
> ^B ln ?d /q
ds: [4]
We will need lit’s ability to save and restore the
character pointer @K from here on out.
try takes a quotation (the address of a word, usually
assembled either with li `name or using .[ … .].
While the word is running, it backs up @K to the control stack.
After the run, if the match failed, try restores the old @K,
otherwise it discards the backed up value.
> :S 'a 'b ..
> :A lb 'a chr rt
> C0 pre ^A @M ln ?d /R # normal matching, nothing new
ds: [C0 -1 1]
> C1 pre @A lka @M ln ?d /R # should match but not consume
ds: [C1 -1 0]
> :S 'a 'b ..
> :A lb 'a chr rt
> :B lb 'b chr rt
> C0 pre @A neg @M ln ?d /R # `^A` would match, so `@A neg` should fail
ds: [C0 0 0]
> C1 pre @B neg @M ln ?d /R # `@B neg` should match but consumes nothing.
ds: [C1 -1 0]
> :A .[ c2 fw m0 m| c4 fw m0 rt .] try rt
> :B .[ c2 fw m0 m| c4 fw m1 rt .] try rt
> :C .[ c2 fw m1 m| c4 fw m0 rt .] try rt
> :D .[ c2 fw m1 m| c4 fw m1 rt .] try rt
> CA pre ^A @M ln ?d /R
> CB pre ^B @M ln ?d /R
> CC pre ^C @M ln ?d /R
> CD pre ^D @M ln ?d /q
ds: [CA 0 0]
ds: [CB -1 4]
ds: [CC -1 2]
ds: [CD -1 2]
We want to create patterns with multiple rules, so we’ll
use the m| operator to seperate alternatives.
The logic is similar to the reverse of the match guard m!:
- if
@M, exit the sequence (since there’s no need to test alternatives if we already have a match) - otherwise, call
m1to prep the next alternative.
In addition, we need to handle the possibility that we
matched the start of a sequence and then failed to
match the rest. In that case, @K will have moved forward,
so we need to restore the original value.
Since try stores the old value of @K is on the control stack,
this means m| has to modify the control stack, and therefore
it is only ever safe to call m| inside a quotation followed
by try, lka, or not.
> :G .[ lb 'c chr rt .] opt rt
> :S 'c # ^G should match and advance 1
> pre ^G @M ln ?d /R
ds: [-1 1]
> :S 'x # ^G should match but not advance
> pre ^G @M ln ?d /R
ds: [-1 0]
The p in the stack comment indicates that opt takes another
parser as an argument. opt always succeeds but doesn’t
consume any characters unless the underlying parser matches.
> :G .[ lb 'a chr rt .] rep rt
> :S 'c # ^G should fail
> pre ^G @M ln ?d /R
ds: [0 0]
> :S "abc" # ^G should match 1 char
> pre ^G @M ln ?d /R
ds: [-1 1]
> :S "aardvark" # ^G should match 2
> pre ^G @M ln ?d /R
ds: [-1 2]
rep is the + operator in regular expressions. It must match
at least once, then as many times as possible.
> :G .[ lb 'a chr rt .] orp rt
> :S 'c # ^G should succeed but match 0 chars
> C0 pre ^G @M ln ?d /R
ds: [C0 -1 0]
> :S "abc" # ^G should match 1 char (same as before)
> C1 pre ^G @M ln ?d /R
ds: [C1 -1 1]
> :S "aardvark" # ^G should match 2 (same as before)
> C2 pre ^G @M ln ?d /R
ds: [C2 -1 2]
> C3 pre m0 ^G @M ln ?d /R # no match because m0
ds: [C3 0 0]
orp is the combination of opt and rep. It matches its pattern 0 or more times.
Note that orp can never fail to match, so it should always leave @M unchanged.
> :S 20 20 20 ..
> pre ws @M ln ?d /R
ds: [-1 1]
> ws* @M ln ?d
ds: [-1 3]
Whitespace is any character between 01 and 20 inclusive.
ws* is just li `ws orp
> :S "-123" ..
> pre mhex @M ln ?d /R
ds: [-1 4]
> :S "123" ..
> pre mhex @M ln ?d /R
ds: [-1 3]
> :S "-ABCDEF0123456789" ..
> pre mhex @M ln ?d /R
ds: [-1 11]
> # do not match lowercase!
> :S "abcedef01234567890" ..
> pre mhex @M ln ?d /R
ds: [0 0]
> :S "-no" ..
> pre mhex @M ln ?d /R
ds: [0 0]
Putting the pieces together, mhex matches an uppercase hex number,
followed by hex number.
> :S "1234" ..
> C0 pre mhex @M ln @N ?d /R
ds: [C0 -1 4 1234]
> :S "ABCD" ..
> C1 pre mhex @M ln @N ?d /R
ds: [C1 -1 4 ABCD]
In addtion to recognizing that the string is a hex number, it would be nice if we could know what the actual number is.
We will accumulate it in register @N. This register is reserved by the bios for this exact purpose, and so we do not need to preserve any previous value that was stored there.