Commit 64637d8

Documentation for regular expressions.
1 parent f5363cc commit 64637d8

3 files changed

+274
-0
lines changed

docs/cpp2/metafunctions.md

+61
@@ -372,6 +372,67 @@ A `cpp1_rule_of_zero` type is one that has no user-written copy/move/destructor
> This is known as "the rule of zero".
> — Stroustrup, Sutter, et al. (C++ Core Guidelines)

#### `regex`

Replaces fields in the class with regular expression objects. Each field whose name starts with `regex` is replaced with a regular expression of the same type.

``` cpp title="Regular expression example"
name_matcher: @regex type
= {
    regex := R"((\w+) (\w+))";
    regex_no_case := R"(/(ab)+/i)";
}

main: (args) = {
    m: name_matcher = ();

    data: std::string = "Donald Duck";
    if args.ssize() >= 2 {
        data = args[1];
    }

    result := m.regex.match(data);
    if result.matched {
        std::cout << "Hello (result.group(2))$, (result.group(1))$!" << std::endl;
    }
    else {
        std::cout << "I only know names of the form: <name> <family name>." << std::endl;
    }

    std::cout << "Case insensitive match: " << m.regex_no_case.search("blubabABblah").group(0) << std::endl;
}
```

cppfront uses [Perl regular expression syntax](https://perldoc.perl.org/perlre). Most of the syntax is available; Unicode characters and the syntax tokens associated with them are not currently supported. All available syntax is listed in [supported features](../other/regex_status.md).

The fields have the type `cpp2::regex::regular_expression`, which is defined in `include/cpp2regex.h2`. The member functions are:
``` cpp title="Member functions for regular expressions"
match: (in this, str: std::string_view) -> search_return;
match: (in this, str: std::string_view, start) -> search_return;
match: (in this, str: std::string_view, start, length) -> search_return;
match: <Iter> (in this, start: Iter, end: Iter) -> search_return;

search: (in this, str: std::string_view) -> search_return;
search: (in this, str: std::string_view, start) -> search_return;
search: (in this, str: std::string_view, start, length) -> search_return;
search: <Iter> (in this, start: Iter, end: Iter) -> search_return;
```
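
The difference between `match` and `search` can be illustrated with a rough analogue in standard C++. This is only a sketch: it uses `std::regex` (ECMAScript grammar, not Perl), and it assumes that `match` requires the pattern to cover the whole input while `search` accepts a match anywhere, as the `"Donald Duck"` example suggests. The helper names `matches_whole` and `matches_anywhere` are hypothetical and not part of cppfront.

```cpp
#include <regex>
#include <string>

// Assumed analogue of `match`: succeeds only if the pattern
// covers the entire input string.
bool matches_whole(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Assumed analogue of `search`: succeeds if the pattern matches
// some substring of the input.
bool matches_anywhere(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}
```

With the pattern `(\w+) (\w+)`, `matches_whole` accepts `"Donald Duck"` but rejects `"Hello Donald Duck!"`, while `matches_anywhere` accepts both.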

The return type `search_return` is defined inside `cpp2::regex::regular_expression` and has the following fields and functions:
``` cpp title="Functions and fields of a regular expression result"
matched: bool;
pos: int;

group_number: (this) -> size_t;
group: (this, g: int) -> std::string;
group_start: (this, g: int) -> int;
group_end: (this, g: int) -> int;

group: (this, g: bstring<CharT>) -> std::string;
group_start: (this, g: bstring<CharT>) -> int;
group_end: (this, g: bstring<CharT>) -> int;
```
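
To make the roles of `group`, `group_start`, and `group_end` concrete, here is a minimal sketch of the same idea using `std::smatch` from standard C++. It assumes that `group_start` and `group_end` are offsets delimiting the half-open range of the captured text; that is an assumption about cppfront, not something stated above. `GroupInfo` and `group_info` are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper type: captured text plus its [start, end) offsets.
struct GroupInfo {
    std::string text;
    std::ptrdiff_t start;
    std::ptrdiff_t end;
};

// Searches `text` for `pattern` and reports capture group `g`.
// Returns {"", -1, -1} if there is no match or no such group.
GroupInfo group_info(const std::string& text, const std::string& pattern, std::size_t g) {
    std::smatch m;
    if (!std::regex_search(text, m, std::regex(pattern)) || g >= m.size()) {
        return {"", -1, -1};
    }
    return {m.str(g), m.position(g), m.position(g) + m.length(g)};
}
```

For the input `"Donald Duck"` and the pattern `(\w+) (\w+)`, group 0 is the whole match, group 1 is `"Donald"` at offsets 0 to 6, and group 2 is `"Duck"` at offsets 7 to 11.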

#### `print`

docs/other/regex_status.md

+211
@@ -0,0 +1,211 @@
# Supported regular expression features

The listings are taken from the [Perl regex docs](https://perldoc.perl.org/perlre). Regular expressions are applied via [metafunctions](../cpp2/metafunctions.md#regex).

## Current status and planned work

### Modifiers
```
- [x] i         Do case-insensitive pattern matching. For example, "A" will match "a" under /i.
- [x] m         Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
- [x] s         Treat the string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
- [x] x and xx  Extend your pattern's legibility by permitting whitespace and comments. Details in "/x and /xx"
- [x] n         Prevent the grouping metacharacters () from capturing. This modifier, new in 5.22, will stop $1, $2, etc... from being filled in.
- [ ] c         Keep the current position during repeated matching
```
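
As an illustration of what the `i` modifier does, the following sketch performs a case-insensitive search in standard C++. It uses the `std::regex::icase` flag as a stand-in for `/i`; `std::regex` uses the ECMAScript grammar rather than Perl syntax, and `first_match_icase` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Returns the first substring of `text` that matches `pattern`
// when letter case is ignored, or an empty string if none matches.
std::string first_match_icase(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern, std::regex::icase);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string{};
}
```

Searching `"blubabABblah"` for `(ab)+` this way yields `"abAB"`, because the second repetition `AB` is accepted despite its case.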

### Escape sequences __(Complete)__
```
- [x] \t           tab (HT, TAB)
- [x] \n           newline (LF, NL)
- [x] \r           return (CR)
- [x] \f           form feed (FF)
- [x] \a           alarm (bell) (BEL)
- [x] \e           escape (think troff) (ESC)
- [x] \x{}, \x00   character whose ordinal is the given hexadecimal number
- [x] \o{}, \000   character whose ordinal is the given octal number
```

### Quantifiers __(Complete)__
```
- [x] *        Match 0 or more times
- [x] +        Match 1 or more times
- [x] ?        Match 1 or 0 times
- [x] {n}      Match exactly n times
- [x] {n,}     Match at least n times
- [x] {,n}     Match at most n times
- [x] {n,m}    Match at least n but not more than m times
- [x] *?       Match 0 or more times, not greedily
- [x] +?       Match 1 or more times, not greedily
- [x] ??       Match 0 or 1 time, not greedily
- [x] {n}?     Match exactly n times, not greedily (redundant)
- [x] {n,}?    Match at least n times, not greedily
- [x] {,n}?    Match at most n times, not greedily
- [x] {n,m}?   Match at least n but not more than m times, not greedily
- [x] *+       Match 0 or more times and give nothing back
- [x] ++       Match 1 or more times and give nothing back
- [x] ?+       Match 0 or 1 time and give nothing back
- [x] {n}+     Match exactly n times and give nothing back (redundant)
- [x] {n,}+    Match at least n times and give nothing back
- [x] {,n}+    Match at most n times and give nothing back
- [x] {n,m}+   Match at least n but not more than m times and give nothing back
```
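
The difference between greedy and non-greedy quantifiers is easiest to see side by side. The sketch below uses standard `std::regex` as a stand-in; its default ECMAScript grammar supports the greedy and non-greedy forms but not the possessive ("give nothing back") forms, so those are not shown. `first_match` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Returns the first substring of `text` that matches `pattern`,
// or an empty string if nothing matches.
std::string first_match(const std::string& text, const std::string& pattern) {
    std::smatch m;
    return std::regex_search(text, m, std::regex(pattern)) ? m.str(0) : std::string{};
}
```

On the input `"<b>bold</b>"`, the greedy pattern `<.+>` consumes as much as possible and matches the whole string, while the non-greedy pattern `<.+?>` stops at the first `>` and matches only `"<b>"`.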

### Character Classes and other Special Escapes __(Complete)__
```
- [x] [...]      [1]  Match a character according to the rules of the
                      bracketed character class defined by the "...".
                      Example: [a-z] matches "a" or "b" or "c" ... or "z"
- [x] [[:...:]]  [2]  Match a character according to the rules of the POSIX
                      character class "..." within the outer bracketed
                      character class. Example: [[:upper:]] matches any
                      uppercase character.
- [x] \g1        [5]  Backreference to a specific or previous group,
- [x] \g{-1}     [5]  The number may be negative indicating a relative
                      previous group and may optionally be wrapped in
                      curly brackets for safer parsing.
- [x] \g{name}   [5]  Named backreference
- [x] \k<name>   [5]  Named backreference
- [x] \k'name'   [5]  Named backreference
- [x] \k{name}   [5]  Named backreference
- [x] \w         [3]  Match a "word" character (alphanumeric plus "_", plus
                      other connector punctuation chars plus Unicode
                      marks)
- [x] \W         [3]  Match a non-"word" character
- [x] \s         [3]  Match a whitespace character
- [x] \S         [3]  Match a non-whitespace character
- [x] \d         [3]  Match a decimal digit character
- [x] \D         [3]  Match a non-digit character
- [x] \v         [3]  Vertical whitespace
- [x] \V         [3]  Not vertical whitespace
- [x] \h         [3]  Horizontal whitespace
- [x] \H         [3]  Not horizontal whitespace
- [x] \1         [5]  Backreference to a specific capture group or buffer.
                      '1' may actually be any positive integer.
- [x] \N         [7]  Any character but \n. Not affected by /s modifier
- [x] \K         [6]  Keep the stuff left of the \K, don't include it in $&
```
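
A numbered backreference such as `\1` matches the same text that the corresponding capture group matched earlier. As a small sketch in standard C++ (using `std::regex`, whose ECMAScript grammar supports `\1` but not the `\g` or `\k` forms), the hypothetical helper `find_doubled_word` below finds a word that is immediately repeated.

```cpp
#include <regex>
#include <string>

// Returns the first word that is immediately followed by a copy of
// itself (separated by one space), or an empty string if none exists.
std::string find_doubled_word(const std::string& text) {
    // (\w+) captures a word; \1 requires the same text again.
    static const std::regex re(R"(\b(\w+) \1\b)");
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(1) : std::string{};
}
```

For `"this is is a test"` the helper returns `"is"`; for `"no repeats here"` it returns an empty string.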

### Assertions
```
- [x] \b   Match a \w\W or \W\w boundary
- [x] \B   Match except at a \w\W or \W\w boundary
- [x] \A   Match only at beginning of string
- [x] \Z   Match only at end of string, or before newline at the end
- [x] \z   Match only at end of string
- [ ] \G   Match only at pos() (e.g. at the end-of-match position
           of prior m//g)
```

### Capture groups __(Complete)__
```
- [x] (...)
```

### Quoting metacharacters __(Complete)__
```
- [x] For ^.[]$()*{}?+|\
```

### Extended Patterns
```
- [x] (?<NAME>pattern)                      Named capture group
- [x] (?#text)                              Comments
- [x] (?adlupimnsx-imnsx)                   Modification for surrounding context
- [x] (?^alupimnsx)                         Modification for surrounding context
- [x] (?:pattern)                           Clustering, does not generate a group index.
- [x] (?adluimnsx-imnsx:pattern)            Clustering, does not generate a group index and modifications for the cluster.
- [x] (?^aluimnsx:pattern)                  Clustering, does not generate a group index and modifications for the cluster.
- [x] (?|pattern)                           Branch reset
- [x] (?'NAME'pattern)                      Named capture group
- [ ] (?(condition)yes-pattern|no-pattern)  Conditional patterns.
- [ ] (?(condition)yes-pattern)             Conditional patterns.
- [ ] (?>pattern)                           Atomic patterns. (Disable backtrack.)
- [ ] (*atomic:pattern)                     Atomic patterns. (Disable backtrack.)
```

### Lookaround Assertions
```
- [x] (?=pattern)                     Positive look ahead.
- [x] (*pla:pattern)                  Positive look ahead.
- [x] (*positive_lookahead:pattern)   Positive look ahead.
- [x] (?!pattern)                     Negative look ahead.
- [x] (*nla:pattern)                  Negative look ahead.
- [x] (*negative_lookahead:pattern)   Negative look ahead.
- [ ] (?<=pattern)                    Positive look behind.
- [ ] (*plb:pattern)                  Positive look behind.
- [ ] (*positive_lookbehind:pattern)  Positive look behind.
- [ ] (?<!pattern)                    Negative look behind.
- [ ] (*nlb:pattern)                  Negative look behind.
- [ ] (*negative_lookbehind:pattern)  Negative look behind.
```
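
A lookahead checks what follows the current position without consuming it. The sketch below shows the `(?=pattern)` and `(?!pattern)` forms with standard `std::regex`, whose ECMAScript grammar supports these two spellings only; the `(*pla:...)` style aliases are Perl syntax and are not used here. `first_match_pos` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the offset of the first match of `pattern` in `text`,
// or -1 if there is no match.
std::ptrdiff_t first_match_pos(const std::string& text, const std::string& pattern) {
    std::smatch m;
    return std::regex_search(text, m, std::regex(pattern)) ? m.position(0) : -1;
}
```

In `"foobaz foobar"`, the pattern `foo(?=bar)` skips the first `foo` and matches at offset 7. In `"foobar foobaz"`, the pattern `foo(?!bar)` also matches at offset 7, because the first `foo` is followed by `bar`.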

### Special Backtracking Control Verbs
```
- [ ] (*PRUNE) (*PRUNE:NAME)     No backtracking over this point.
- [ ] (*SKIP) (*SKIP:NAME)       Start next search here.
- [ ] (*MARK:NAME) (*:NAME)      Place a named mark.
- [ ] (*THEN) (*THEN:NAME)       Like PRUNE.
- [ ] (*COMMIT) (*COMMIT:arg)    Stop searching.
- [ ] (*FAIL) (*F) (*FAIL:arg)   Fail the pattern/branch.
- [ ] (*ACCEPT) (*ACCEPT:arg)    Accept the pattern/subpattern.
```

## Not planned (mainly because of Unicode or Perl specifics)

### Modifiers
```
- [ ] p               Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} are available for use after matching.
- [ ] a, d, l, and u  These modifiers, all new in 5.14, affect which character-set rules (Unicode, etc.) are used, as described below in "Character set modifiers".
- [ ] g               globally match the pattern repeatedly in the string
- [ ] e               evaluate the right-hand side as an expression
- [ ] ee              evaluate the right side as a string then eval the result
- [ ] o               pretend to optimize your code, but actually introduce bugs
- [ ] r               perform non-destructive substitution and return the new value
```

### Escape sequences
```
- [ ] \cK         control char (example: VT)
- [ ] \N{name}    named Unicode character or character sequence
- [ ] \N{U+263D}  Unicode character (example: FIRST QUARTER MOON)
- [ ] \l          lowercase next char (think vi)
- [ ] \u          uppercase next char (think vi)
- [ ] \L          lowercase until \E (think vi)
- [ ] \U          uppercase until \E (think vi)
- [ ] \Q          quote (disable) pattern metacharacters until \E
- [ ] \E          end either case modification or quoted section, think vi
```

### Character Classes and other Special Escapes
```
- [ ] (?[...])  [8]  Extended bracketed character class
- [ ] \pP       [3]  Match P, named property. Use \p{Prop} for longer names
- [ ] \PP       [3]  Match non-P
- [ ] \X        [4]  Match Unicode "eXtended grapheme cluster"
- [ ] \R        [4]  Linebreak
```

### Assertions
```
- [ ] \b{}  Match at Unicode boundary of specified type
- [ ] \B{}  Match where corresponding \b{} doesn't match
```

### Extended Patterns
```
- [ ] (?{ code })                                Perl code execution.
- [ ] (*{ code })                                Perl code execution.
- [ ] (??{ code })                               Perl code execution.
- [ ] (?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)     Recursive subpattern.
- [ ] (?&NAME)                                   Recursive subpattern.
```

### Script runs
```
- [ ] (*script_run:pattern)         All chars in pattern need to be of the same script.
- [ ] (*sr:pattern)                 All chars in pattern need to be of the same script.
- [ ] (*atomic_script_run:pattern)  Without backtracking.
- [ ] (*asr:pattern)                Without backtracking.
```

mkdocs.yml

+2
@@ -68,6 +68,8 @@ nav:
  - 'Cppfront reference':
    - 'Using Cpp1 (today''s syntax) and Cpp2 in the same source file': cppfront/mixed.md
    - 'Cppfront command line options': cppfront/options.md
  - 'Other':
    - 'Regular expression features': other/regex_status.md

markdown_extensions:
  - pymdownx.highlight:
