Skip to content

Latest commit

 

History

History
452 lines (383 loc) · 11.8 KB

File metadata and controls

452 lines (383 loc) · 11.8 KB

pre: parsing and regular expressions

#+tanco-format: 0.2

intro

This is a test suite for pre.b4a, a parsing and regular expression toolkit for the b4 virtual machine.

TEST pre.m : m0 (-) and m1 (-) set the match bit.

> m1 @M ?d zp
ds: [-1]
> m0 @M ?d zp
ds: [0]

We will use the @M register to track whether or not a pattern match has succeeded.

For brevity, the words m0 and m1 set this flag.

m1 also doubles as the command to match an empty string.

TEST pre.ml : ln (-n) : match length

> :S :K .. .. ..
> ln ?d zp
ds: [0]
> :S .. .. .. :K
> ln ?d /q
ds: [3]

@S marks the start of the current token, and @K points to the current input character. Subtracting @S from @K gives us the match length.

TEST pre.fw : fw (n-) : move @K forward n characters

> pre ln ?d zp
ds: [0]
> c4 fw ln ?d /q
ds: [4]

TEST pre.pre : pre (-) : initialize parsing and regular expressions

> c4 !S m0       # set values we DON'T want
> pre @K @M ?d   # init
ds: [4 -1]

The word pre sets @K to @S and @M to -1.

TEST pre.m-guard0 : m\ : niladic match guard

> :A m\ c4 rt
> m0 ^A ?d
ds: []
> m1 ^A ?d
ds: [4]

The word m\ causes a word to exit early if the match flag is not set.

The early exit works by popping one return value off the control stack and then using the rt opcode.

Niladic here means that it doesn’t expect any arguments.

TEST pre.m-gaurd1 : m! (x-x?) : monadic match guard

> :A m! c4 rt
> # when m0 the argument gets dropped
> m0 c2 ^A ?d
ds: []
> m1 c2 ^A ?d
ds: [2 4]

The word m! is like m\ in that it exits early if @M is 0, but it also drops a single argument.

“Monadic” here means “taking one argument”.

TEST b4s.end-any : any and end

> :S 'x 'y ..
> pre
> C0 any @M ln ?d /R
ds: [C0 -1 1]
> C1 any @M ln ?d /R
ds: [C1 -1 2]
> C2 any @M ln ?d /R
ds: [C2 0 2]
> m1 C3 end @M ln ?d /R
ds: [C3 -1 2]

TEST pre.chr : chr (c-) : match exact character

> :S :K "hello"
> :Y lb 'h chr rt # should match and increment k
> # first test the match guard:
> m0 ^Y @M ln ?d zp zp
ds: [0 0]
> m1 ^Y @M ln ?d zp zp
ds: [-1 1]
> @S !K           # reset pointer
> :N lb 'x chr rt # should not match
> ^N @M ln ?d zp zp
ds: [0 0]

chr sets the match bit according to whether or not the current character ch matches the value on the stack.

chr should use m! internally and match only if @M is set.

If so, it increments @K by one position by calling nc.

TEST pre.lte : lte (c-) : matches if ch<=arg (does not consume)

> :S :K 'm ..
> # if m0, zap arg and do nothing
> m0 'z lte @M ln ?d /R
ds: [0 0]
> # if m1, run the test
> m1 'a lte @M ln ?d /R
ds: [0 0]
> m1 'z lte @M ln ?d /R
ds: [-1 0]
> m1 'm lte @M ln ?d /R
ds: [-1 0]

lte lets us tests whether the current character is less than the arguent. It doesn’t advance @K because we will want to combine it with gte in a moment to check character ranges.

lte takes one argument, so must call m!

Note there is no “less than or equal” op in b4, but you can simply add one to calling lt.

TEST pre.gte : gte (c-) : matches if ch>=arg (does not consume)

> :S :K 'm ..
> # if m0, zap arg and do nothing
> C0 m0 'z gte @M ln ?d /R
ds: [C0 0 0]
> # if m1, run the test
> C1 m1 'a gte @M ln ?d /R # 'm >= 'a so match
ds: [C1 -1 0]
> C2 m1 'z gte @M ln ?d /R # 'm < 'z so fail
ds: [C2 0 0]
> C3 m1 'm gte @M ln ?d /R # 'm = 'm so match
ds: [C3 -1 0]

gte is basically the same as lte but with a different condition.

There is no “greater than or equal” opcode in b4, but you can write it as “not less than”: lt nt

TEST pre.mf1 : mf1 advance if match

> :S :K ..
> m0 mf1 ln ?d zp
ds: [0]
> m1 mf1 ln ?d zp
ds: [1]

TEST pre.ranges : character range example

> :H lb 'A gte lb 'F lte mf1 rt  # match on hex digit (range A-F)
> :S :K 'Q
> C0 m1 ^H @M ln ?d /R    # Q not in range, so no match
ds: [C0 0 0]
> :S :K 'C
> C1 m1 ^H @M ln ?d /R    # C is in the range, so match!
ds: [C1 -1 1]
> :S :K 'C
> C2 m0 ^H @M ln ?d /R    # m0 was set so no matching allowed
ds: [C2 0 0]

We can now combine gte, lte and mfw into a sequence to create an idiom for ranges. Each step already tests the match bit, so there’s no need to call m! explicitly.

If you’d like to combine this into a “char-in-range” test that takes two arguments, you might define a dyadic (two-argument) version of m! (called m!! perhaps).

in fact, let’s just do that.

TEST pre.btw : btw to test whether the character is between two ends of the range

> :S '5 ..
> # case 0 it won't match if m0
> C0 pre m0 '0 '9 btw @M ln ?d /R
ds: [C0 0 0]
> # case 1 it's in range
> C1 pre m1 '0 '9 btw @M ln ?d /R
ds: [C1 -1 1]
> # case 2 it's above the range
> C2 pre m1 '0 '4 btw @M ln ?d /R
ds: [C2 0 0]
> # case 3 it's below the range
> C3 pre m1 '6 '9 btw @M ln ?d /R
ds: [C3 0 0]
> # case 4 it's the low end of range
> C4 pre m1 '5 '9 btw @M ln ?d /R
ds: [C4 -1 1]
> # case 5 it's the high end of range
> C5 pre m1 '0 '5 btw @M ln ?d /R
ds: [C5 -1 1]

TEST pre.chs : chs (s-) : choose from character set

> :S :K "54go"
> :digits ."0123456789"
> :G li `digits chs rt
> # remember the match guard!
> m0 ^G @M ln ?d zp zp
ds: [0 0]
> m1 ^G @M ln ?d zp zp
ds: [-1 1]
> m1 ^G @M ln ?d zp zp
ds: [-1 2]
> ^G @M ln ?d zp zp
ds: [0 2]

chs takes the address of a string of acceptable characters, and succeeds if any of the charecters match.

Since it takes one argument, chs should call m!.

Note that in this case, the character set to match is a sequential range, so we could have used the range idiom here as well (and in fact, it’s probably faster to do so, since gte and lte don’t have to loop)

TEST pre.lit : lit (s-) : match literal string

> :S :K "hello"
> :Y .[ ."he" .] lit rt
> m0 ^Y @M ln ?d zp zp
ds: [0 0]
> m1 ^Y @M ln ?d zp zp
ds: [-1 2]
> @S !K
> # if only partial match, we have to roll @K back
> :N .[ ."help" .] lit rt
> ^N @M ln ?d zp
ds: [0 0]

lit takes a string to match and succeeds if every character in the input matches exactly.

It’s there to save you from having to write a long sequence of chr operations.

Note that if lit succeeds, @K advances by the length of the match, but if it fails, @K must go back where it started.

TEST pre.try : try : backtracking sequence matcher

> :S :K .. ..
> :A m0 .[ c4 +K zp rt .] try rt
> :B m1 .[ c4 +K zp rt .] try rt
> ^A ln ?d zp
ds: [0]
> @S !K
> ^B ln ?d /q
ds: [4]

We will need lit’s ability to save and restore the character pointer @K from here on out.

try takes a quotation (the address of a word, usually assembled either with li `name or using .[.].

While the word is running, it backs up @K to the control stack. After the run, if the match failed, try restores the old @K, otherwise it discards the backed up value.

TEST pre.lka : lka (p-) : look ahead match but do not consume

> :S 'a 'b ..
> :A lb 'a chr rt
> C0 pre ^A @M ln ?d /R  # normal matching, nothing new
ds: [C0 -1 1]
> C1 pre @A lka @M ln ?d /R  # should match but not consume
ds: [C1 -1 0]

TEST pre.not : neg (p-) : negative lookahead

> :S 'a 'b ..
> :A lb 'a chr rt
> :B lb 'b chr rt
> C0 pre @A neg @M ln ?d /R  # `^A` would match, so `@A neg` should fail
ds: [C0 0 0]
> C1 pre @B neg @M ln ?d /R  # `@B neg` should match but consumes nothing.
ds: [C1 -1 0]

TEST pre.alt : m| : alt operator

> :A .[ c2 fw m0 m| c4 fw m0 rt .] try rt
> :B .[ c2 fw m0 m| c4 fw m1 rt .] try rt
> :C .[ c2 fw m1 m| c4 fw m0 rt .] try rt
> :D .[ c2 fw m1 m| c4 fw m1 rt .] try rt
> CA pre ^A @M ln ?d /R
> CB pre ^B @M ln ?d /R
> CC pre ^C @M ln ?d /R
> CD pre ^D @M ln ?d /q
ds: [CA 0 0]
ds: [CB -1 4]
ds: [CC -1 2]
ds: [CD -1 2]

We want to create patterns with multiple rules, so we’ll use the m| operator to seperate alternatives.

The logic is similar to the reverse of the match guard m!:

  • if @M, exit the sequence (since there’s no need to test alternatives if we already have a match)
  • otherwise, call m1 to prep the next alternative.

In addition, we need to handle the possibility that we matched the start of a sequence and then failed to match the rest. In that case, @K will have moved forward, so we need to restore the original value.

Since try stores the old value of @K is on the control stack, this means m| has to modify the control stack, and therefore it is only ever safe to call m| inside a quotation followed by try, lka, or not.

TEST pre.opt : opt (p-) : optional match (regex ?)

> :G  .[ lb 'c chr rt .] opt rt
> :S 'c  # ^G should match and advance 1
> pre ^G @M ln ?d /R
ds: [-1 1]
> :S 'x  # ^G should match but not advance
> pre ^G @M ln ?d /R
ds: [-1 0]

The p in the stack comment indicates that opt takes another parser as an argument. opt always succeeds but doesn’t consume any characters unless the underlying parser matches.

TEST pre.rep : rep (p-) : repeat one or more times (regex +)

> :G  .[ lb 'a chr rt .] rep rt
> :S 'c  # ^G should fail
> pre ^G @M ln ?d /R
ds: [0 0]
> :S "abc"  # ^G should match 1 char
> pre ^G @M ln ?d /R
ds: [-1 1]
> :S "aardvark"  # ^G should match 2
> pre ^G @M ln ?d /R
ds: [-1 2]

rep is the + operator in regular expressions. It must match at least once, then as many times as possible.

TEST pre.orp : orp (p-) : optional repetition (regex *)

> :G  .[ lb 'a chr rt .] orp rt
> :S 'c  # ^G should succeed but match 0 chars
> C0 pre ^G @M ln ?d /R
ds: [C0 -1 0]
> :S "abc"  # ^G should match 1 char (same as before)
> C1 pre ^G @M ln ?d /R
ds: [C1 -1 1]
> :S "aardvark"  # ^G should match 2 (same as before)
> C2 pre ^G @M ln ?d /R
ds: [C2 -1 2]
> C3 pre m0 ^G @M ln ?d /R # no match because m0
ds: [C3 0 0]

orp is the combination of opt and rep. It matches its pattern 0 or more times. Note that orp can never fail to match, so it should always leave @M unchanged.

TEST b4s.ws : ws and ws* match whitespace

> :S 20 20 20 ..
> pre ws @M ln ?d /R
ds: [-1 1]
> ws* @M ln ?d
ds: [-1 3]

Whitespace is any character between 01 and 20 inclusive. ws* is just li `ws orp

TEST pre.mhex : mhex : match an uppercase hex number

> :S "-123" ..
> pre mhex @M ln ?d /R
ds: [-1 4]
> :S "123" ..
> pre mhex @M ln ?d /R
ds: [-1 3]
> :S "-ABCDEF0123456789" ..
> pre mhex @M ln ?d /R
ds: [-1 11]
> # do not match lowercase!
> :S "abcedef01234567890" ..
> pre mhex @M ln ?d /R
ds: [0 0]
> :S "-no" ..
> pre mhex @M ln ?d /R
ds: [0 0]

Putting the pieces together, mhex matches an uppercase hex number, followed by hex number.

TEST pre.mhex-n : mhex should populate @N

> :S "1234" ..
> C0 pre mhex @M ln @N ?d /R
ds: [C0 -1 4 1234]
> :S "ABCD" ..
> C1 pre mhex @M ln @N ?d /R
ds: [C1 -1 4 ABCD]

In addtion to recognizing that the string is a hex number, it would be nice if we could know what the actual number is.

We will accumulate it in register @N. This register is reserved by the bios for this exact purpose, and so we do not need to preserve any previous value that was stored there.