Composite Operations:
- Concatentation:
ab
matches the subexpressiona
, thenb
; - Alternation:
a|b
matches either subexpressiona
orb
; - Repetition:
a*
matches the string that repeata
zero or many times; - Quotation:
(a)
quotes a subexpression
Atomic Operations:
- :
x
matches a literal character - :
.
matches any single character
The priority (from high to low) is: 1. Parenthesis, 2. Union, 3. Repetition 4. Concatenation
The above basic operations are able to express all operations that are supported by the google re2 engine.
First, import the regex
module. Initialize a instance by
re = regex(string)
where string
is the regular expression. First, regex
will detect whether the expression is legal with the error
module.
After deeming the expression legal, regex
will then proceed to compile the expression into a NFA. It will pass through 3 stages:
- the
syntax
module will first parse the expression into literal parts - the
syntax
module will then assemble the syntax tree from those parts and detect syntax errors in the process. The syntax trees are stored as tuples:(id, args)
. - the
nfa
module will then construct the nfa that corresponds to the expression.
This finishes the initialization. After initialization the user will be able to pattern match by calling
re.match(text)
where text
is the target string. This function will traverse the NFA via breadth first search with queues.
.*a
matches any string ending witha
th(e|ere|ose)
matchesthe
,there
, andthose
The following picture summarizes my algorithm: