Commit 254d73e

MaxSagebaum, hsutter, and jarzec authored
Feature/regular expression metafunction (#904)
* Missing trailing '...' in variadic template arguments.
* Regular expressions initial setup.
* Current working status.
* Handling of groups.
* Handling of alternatives.
* Refactor to position based matching.
* Added regular expression class.
* Added class matching.
* Add line start and end match.
* Compatibility fixes.
* Added test file for regular expressions.
* Basic state machine for range matchers.
* Proper range printing and group invalidation on alternatives.
* State management for ranges. No longer invalid groups if last match fails.
* Range check in class_matcher and restore of groups in ranges_matcher.
* Fix for list matcher and state for alternatives.
* Improved handling of empty matches and ranges.
* Bugfix for missing semaphore in typed template parameters.
* Support for posix character classes.
* Fix for missing escape of '\'.
* Missing group clear in alternate matcher.
* Whitespace errors from cppfront.
* Include regular expressions in cpp2utils.hpp.
* Update for tests.
* Remove initialization from context.
* Basic char matching.
* Removed need of end logic.
* Basic group matching.
* Added alternative.
* Added range matcher logic.
* Added remaining regex patterns.
* Fixes for range matcher.
* Improved group handling.
* Proper reset of ranges.
* Refactor.
* Moved begin and end to context.
* Removed Iter as template argument.
* Refactor.
* Refactor of parser.
* Regular expression update.
* Update of test and header files.
* Addressed review comments. In addition:
  - Removed TODO in reflect.h2.
  - Reworked name handling in regex_gen.
* Review updates.
  - Moved 'source/regex.h' to '../include/cpp2regex.h'.
  - Created 'source/cpp2regex.h'
  - Added a utility function for 'is_escaped'.
* Removed static annotation.
* Added greedy matching for alternatives.
* Greedy version of alternative regex.
* Helpers in match return and first shorthand character class.
* Proper error handling.
* Escaped characters from perl.
* Group escapes and named group handling.
* Named group access in regex result.
* Case insensitive matching flag.
* Added modifiers to regular expressions.
* Fixing bugs in implementation.
* Non-greedy and possessive matching.
* Added horizontal and vertical white spaces.
* Additional handling of escapes.
* Update of tests.
* Make regex generation public.
* Performance fixes for greedy matching.
* Added modifiers to matching logic.
* Added (?<mod>) notation.
* Added (?:) notation.
* Added m and s modifiers.
* Bugfix for wrong non-greedy parsing.
* Fixes for regex results.
* Update for tests.
* Added support for 'n' modifier.
* Added prerequisites for Perl syntax parsing.
* Remove direct position handling from parser.
* Added named groups with '.
* Escape of space in character classes.
* Proper handling of x and xx modifier switches.
* Added parsing for comment groups.
* Added branch reset support.
* Update of tests and header.
* Added support for \x.
* Added \000 and \o{000} handles.
* Added lookahead matchers.
* Added stateful match tail.
* Helper functions for match_return creation.
* Refactor of matcher naming and helpers.
* Cleanup of to_string.
* Matcher cleanup.
* Parser refactor.
* Header update and tests update.
* Fix for greedy range matching.
  Greedy range matching is now a recursive call which no longer discards the state of previous iterations of M. This enables the regex to try all alternatives over the bounds of the greedy method.
* Better template arguments for range matchers.
* Update for regex tests.
* Changes for new analysis.
* Remove <format> dependency, get warning-clean build
  Header `<format>` requires newer (2022/2023) compilers, so I'll try to remove depending on it to keep cppfront's own build working to ~2019 compilers
* Create pure2-regex-partial.cpp2
  Feel free to drop this file again: I just added it to show the results of trying to comment out all the cases that cause the metafunction to report an error, in an attempt to get an executable file.
  After commenting these ones, cppfront completes, but I still get strange errors from the Cpp1 compiler, which I've narrowed down to this short repro:
      #define CPP2_IMPORT_STD Yes
      #include "cpp2util.h"
  If the macro is commented out, things compile fine. So there's something about the "import std" path that's going wrong, it seems on both MSVC and GCC 10. I haven't been able to diagnose the problem further than that though.
* Fix for compile time degradation.
  The change for the stateful tail increased the compile time. The tail is now stateless again and we have an extra argument for the end function call.
* Added first conversion to matcher generator.
* Basic generation of char matchers.
* Basic code generation via regex tokens.
* Moved and renamed char token matcher.
* Parsing of ranges.
* Moved code generation to generation context.
* Generation of stateful matcher.
* Cleanup of range matcher implementation.
* Added special range matching.
* Removed old range parsers.
* Moved parse_until into parser context.
* Added to_string output.
* Added group handling.
* Added logic for alternative.
* Added '.' regex expression.
* Added group reference matchers.
* Added anchor matchers.
* Added class matchers.
* Added full group matcher parsing.
* Added basic character escapes.
* Added word boundary matchers.
* Added named start and end line matchers.
* Added named class matchers.
* Added \K token.
* Added hexadecimal token and octal token.
* Added lookahead parsing and matching.
* Group gathering is now done in a set.
* Added modifier handling.
* Removal of unused functionality and proper parsing of global modifiers.
* Moved name group lookup creation.
* Fixes for generation.
* Bugfix for nonconst has in generated flag_enums.
* Removed modifiers from arguments and fixed a few warnings.
* Removed templates for regular arguments.
* Agglomeration of character matchers.
* Refactored function generation to do .. while loop.
* Removed UFCS calls.
* Fixes for new cppfront analysis.
* Fixes for new char matcher logic.
* Added handling of raw strings and using raw strings.
* Escape adaptions for to_string and other to_string fixes.
* General refactor.
* Continuation of cleanup.
* Added namespace to string_util.h.
* Fixes for compiler warnings.
* Added new tests.
* Changes for regression tests.
* Fixes for regex and non-regex tests.
* Remove UFCS from regex and more non-regex test fixes.
* Update for TODOs.
* Update for generated header files.
* Update of generated header files.
* Updates for regression tests.
* Enable modules build on MSVC by removing #includes when using modules
  Also, silence two narrowing errors MSVC reports by adding unsafe_narrow
* Updates for regression tests.
* Changes for regression-tests.
* Updates for regression tests.
* Updates for regression tests.
* Update for \e escape.
* CI update tests
* Update for tests.
* Update for test results.
* Update for regression tests.
* Reran regressions on my box - whitespace changes only
  Probably line-ends
  Plus an MSVC minor version update
  Committing this just-whitespace update to clear the diff list before I make any review changes/renames...
* Move & rename source/regex.h2 to include/cpp2regex.h2
  In this project I'm trying to build *.h2 files in the same directory as the *.h they generate, and keep the same name
  In /include, "cpp2util.h" is named that way because it really is the Cpp2 run-time library... For regex, we could name it regex.h(2) or cpp2regex.h(2)... the argument for using "cpp2" is because it really does include additional run-time support for what will now be one of the Cpp2-built-in metafunctions... anyway we can always revisit that in the future...
* Merge string_util.h into cpp2util.h
  Minus a couple of functions that aren't used
  And minor touchups, mainly int_to_string using more if-constexpr
* Review pass through cpp2regex.h2
  Up to line ~1600
  Looking good, mainly formatting tweaks to follow the repo's style
* Finish tweaking pass through cpp2regex.h2
  From line 1600 onward

---------

Signed-off-by: Max Sagebaum <[email protected]>
Signed-off-by: Herb Sutter <[email protected]>
Co-authored-by: Herb Sutter <[email protected]>
Co-authored-by: jarzec <[email protected]>
1 parent e45eae5 commit 254d73e
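The commit message describes greedy range matching as a recursive call that keeps the state of earlier iterations, so the matcher can fall back to a shorter repetition when the rest of the pattern fails. A minimal C++ sketch of that idea follows. It is not the code in cpp2regex.h; `match_greedy_range` and `matches_a_range_then_ab` are hypothetical names, and the pattern is hard-wired to `a{min,max}ab` for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical helper (not from cpp2regex.h): match `ch` between `min_n` and
// `max_n` times starting at `pos`, then hand over to `tail`.
// Greedy: first recurse to consume one more repetition. Only if that whole
// path fails do we try the tail from the current position, so every shorter
// repetition count is still reachable by backtracking.
inline bool match_greedy_range(std::string_view s, std::size_t pos, char ch,
                               int count, int min_n, int max_n,
                               std::function<bool(std::size_t)> const& tail)
{
    if (count < max_n && pos < s.size() && s[pos] == ch) {
        if (match_greedy_range(s, pos + 1, ch, count + 1, min_n, max_n, tail)) {
            return true;
        }
    }
    return count >= min_n && tail(pos);
}

// Hypothetical driver: does `a{min_n,max_n}ab` match the whole of `s`?
inline bool matches_a_range_then_ab(std::string_view s, int min_n, int max_n)
{
    auto tail = [&](std::size_t pos) {
        return s.substr(pos) == "ab";
    };
    return match_greedy_range(s, 0, 'a', 0, min_n, max_n, tail);
}
```

With `a{1,3}ab` and the input `aaab`, the greedy path first consumes three `a` characters, fails on the tail, and then succeeds after giving one back. A loop that committed to the longest repetition would report no match here.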

542 files changed: +122,033 -300 lines changed

Diff for: include/cpp2regex.h

+4,141
Large diffs are not rendered by default.

Diff for: include/cpp2regex.h2

+2,789
Large diffs are not rendered by default.

Diff for: include/cpp2util.h

+158 -3
@@ -36,10 +36,16 @@
 // because it can't happen; using the name impl::deferred_init directly
 // from program code is not supported.
 //
+// 3) Entities in other subnamespaces, such as cpp2::string_util
+//
+// These are typically metafunction "runtime-library" functions,
+// implementation details called by metafunction-generated code.
+// For example, @regex generates code that uses string_util:: functions.
+//
 //===========================================================================

-#ifndef CPP2_UTIL_H
-#define CPP2_UTIL_H
+#ifndef CPP2_CPP2UTIL_H
+#define CPP2_CPP2UTIL_H

 // If this implementation doesn't support source_location yet, disable it
 #include <version>
@@ -269,14 +275,17 @@
 #include <sstream>
 #include <iterator>
 #include <limits>
+#include <map>
 #include <memory>
 #include <new>
 #include <random>
 #include <optional>
 #if defined(CPP2_USE_SOURCE_LOCATION)
 #include <source_location>
 #endif
+#include <set>
 #include <span>
+#include <sstream>
 #include <string>
 #include <string_view>
 #include <system_error>
@@ -320,7 +329,6 @@
 #define CPP2_CONSTEXPR constexpr
 #endif

-
 namespace cpp2 {


@@ -360,6 +368,152 @@ using _schar = signed char; // normally use i8 instead
 using _uchar = unsigned char; // normally use u8 instead


+//-----------------------------------------------------------------------
+//
+// String utilities
+//
+
+namespace string_util {
+
+// From https://stackoverflow.com/questions/216823/how-to-trim-a-stdstring
+
+// Trim from start (in place)
+inline void ltrim(std::string &s) {
+    s.erase(
+        s.begin(),
+        std::find_if(s.begin(), s.end(), [](unsigned char ch) { return !std::isspace(ch); })
+    );
+}
+
+// Trim from end (in place)
+inline void rtrim(std::string &s) {
+    s.erase(
+        std::find_if(s.rbegin(), s.rend(), [](unsigned char ch) { return !std::isspace(ch); }).base(),
+        s.end()
+    );
+}
+
+// Trim from both ends (in place)
+inline void trim(std::string &s) {
+    rtrim(s);
+    ltrim(s);
+}
+
+// Trim from both ends (copying)
+inline std::string trim_copy(std::string_view s) {
+    std::string t(s);
+    trim(t);
+    return t;
+}
+
+// From https://oleksandrkvl.github.io/2021/04/02/cpp-20-overview.html#nttp
+
+template<typename CharT, std::size_t N>
+struct fixed_string {
+    constexpr fixed_string(const CharT (&s)[N+1]) {
+        std::copy_n(s, N + 1, c_str);
+    }
+    constexpr const CharT* data() const {
+        return c_str;
+    }
+    constexpr std::size_t size() const {
+        return N;
+    }
+
+    constexpr auto str() const {
+        return std::basic_string<CharT>(c_str);
+    }
+
+    CharT c_str[N+1];
+};
+
+template<typename CharT, std::size_t N>
+fixed_string(const CharT (&)[N])->fixed_string<CharT, N-1>;
+
+// Other string utility functions.
+
+inline bool is_escaped(std::string_view s) {
+    return
+        s.starts_with("\"")
+        && s.ends_with("\"")
+        ;
+}
+
+inline bool string_to_int(std::string const& s, int& v, int base = 10) {
+    try {
+        v = stoi(s, nullptr, base);
+        return true;
+    }
+    catch (std::invalid_argument const&)
+    {
+        return false;
+    }
+    catch (std::out_of_range const&)
+    {
+        return false;
+    }
+}
+
+template<int Base = 10>
+inline std::string int_to_string(int i) {
+    if constexpr (8 == Base) {
+        std::ostringstream oss;
+        oss << std::oct << i;
+        return oss.str();
+    }
+    else if constexpr (10 == Base) {
+        return std::to_string(i);
+    }
+    else if constexpr (16 == Base) {
+        std::ostringstream oss;
+        oss << std::hex << i;
+        return oss.str();
+    }
+    else {
+        [] <bool flag = false>() {
+            static_assert(flag, "Unsupported int_to_string Base");
+        }();
+    }
+}
+
+inline char safe_toupper(char ch) {
+    return static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
+}
+
+inline char safe_tolower(char ch) {
+    return static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
+}
+
+inline std::string replace_all(
+    std::string str,
+    const std::string& from,
+    const std::string& to
+)
+{
+    size_t start_pos = 0;
+    while((start_pos = str.find(from, start_pos)) != std::string::npos) {
+        str.replace(start_pos, from.length(), to);
+        start_pos += to.length(); // safe also when 'to' is a substring of 'from'
+    }
+    return str;
+}
+
+template<typename List>
+inline std::string join(List const& list) {
+    std::string r = "";
+    std::string sep = "";
+
+    for (auto const& cur : list) {
+        r += sep + cur;
+        sep = ", ";
+    }
+
+    return r;
+}
+
+} // namespace string_util
+
+
 //-----------------------------------------------------------------------
 //
 // Conveniences for expressing Cpp1 references (rarely useful)
@@ -2365,6 +2519,7 @@ inline constexpr auto as_() -> decltype(auto)

 }

+#include "cpp2regex.h"

 using cpp2::cpp2_new;

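To make the behavior of two of the new `string_util` helpers concrete, here is a small standalone sketch. The functions are re-declared in a hypothetical `demo` namespace so the snippet compiles without cpp2util.h. `replace_all` follows the diff line for line, while `trim_copy` is rewritten in a non-mutating form that should give the same results.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical namespace; in the commit these helpers live in cpp2::string_util.
namespace demo {

// Replace every occurrence of `from` with `to`, scanning left to right.
inline std::string replace_all(std::string str, const std::string& from, const std::string& to)
{
    std::size_t start_pos = 0;
    while ((start_pos = str.find(from, start_pos)) != std::string::npos) {
        str.replace(start_pos, from.length(), to);
        start_pos += to.length();  // resume after the inserted text, never inside it
    }
    return str;
}

// Return a copy of `s` without leading and trailing whitespace.
inline std::string trim_copy(std::string_view s)
{
    auto not_space = [](unsigned char ch) { return !std::isspace(ch); };
    auto first = std::find_if(s.begin(), s.end(), not_space);
    auto last  = std::find_if(s.rbegin(), s.rend(), not_space).base();
    return first < last ? std::string(first, last) : std::string();
}

} // namespace demo
```

Because the search resumes after the inserted text, replacing `\` with `\\` (or `x` with `yx`) terminates even though the replacement contains the search string. One caveat of this loop shape: an empty `from` would never terminate, so callers are assumed to pass a non-empty search string.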
Diff for: regression-tests/pure2-regex_01_char_matcher.cpp2

+185
@@ -0,0 +1,185 @@
+create_result: (resultExpr: std::string, r) -> std::string = {
+  result: std::string = "";
+
+  get_next := :(iter) -> _ = {
+    start := std::distance(resultExpr&$*.cbegin(), iter);
+    firstDollar := resultExpr&$*.find("$", start);
+    firstAt := resultExpr&$*.find("@", start);
+
+    end := std::min(firstDollar, firstAt);
+    if end != std::string::npos {
+      return resultExpr&$*.cbegin() + end;
+    }
+    else {
+      return resultExpr&$*.cend();
+    }
+  };
+  extract_group_and_advance := :(inout iter) -> _ = {
+    start := iter;
+
+    while std::isdigit(iter*) next iter++ {}
+
+    return std::stoi(std::string(start, iter));
+  };
+  extract_until := :(inout iter, to: char) -> _ = {
+    start := iter;
+
+    while (to != iter*) next iter++ {} // TODO: Without bracket: error: postfix unary * (dereference) cannot be immediately followed by a (, identifier, or literal - add whitespace before * here if you meant binary * (multiplication)
+
+    return std::string(start, iter);
+  };
+
+  iter := resultExpr.begin();
+
+  while iter != resultExpr.end() {
+    next := get_next(iter);
+
+    if next != iter {
+      result += std::string(iter, next);
+    }
+    if next != resultExpr.end() {
+      if next* == '$' {
+        next++;
+
+        if next* == '&' {
+          next++;
+          result += r.group(0);
+        }
+        else if next* == '-' || next* == '+' {
+          is_start := next* == '-';
+          next++;
+          if next* == '{' {
+            next++; // Skip {
+            group := extract_until(next, '}');
+            next++; // Skip }
+            result += r.group(group);
+          }
+          else if next* == '[' {
+            next++; // Skip [
+            group := extract_group_and_advance(next);
+            next++; // Skip ]
+
+            if is_start {
+              result += std::to_string(r.group_start(group));
+            }
+            else {
+              result += std::to_string(r.group_end(group));
+            }
+          }
+          else {
+            // Return max group
+            result += r.group(r.group_number() - 1);
+          }
+        }
+        else if std::isdigit(next*) {
+          group := extract_group_and_advance(next);
+          result += r.group(group);
+        }
+        else {
+          std::cerr << "Not implemented";
+        }
+      }
+      else if next* == '@' {
+        next++;
+
+        if next* == '-' || next* == '+' {
+          i := 0;
+          while i < cpp2::unsafe_narrow<int>(r.group_number()) next i++ {
+            pos := 0;
+            if next* == '-' {
+              pos = r.group_start(i);
+            }
+            else {
+              pos = r.group_end(i);
+            }
+            result += std::to_string(pos);
+          }
+        }
+        else {
+          std::cerr << "Not implemented";
+        }
+      }
+      else {
+        std::cerr << "Not implemented.";
+      }
+    }
+    iter = next;
+  }
+
+  return result;
+}
+
+test: <M> (regex: M, id: std::string, regex_str: std::string, str: std::string, kind: std::string, resultExpr: std::string,
+           resultExpected: std::string) = {
+
+  warning: std::string = "";
+  if regex.to_string() != regex_str {
+    warning = "Warning: Parsed regex does not match.";
+  }
+
+  status: std::string = "OK";
+
+  r := regex.search(str);
+
+  if "y" == kind || "yM" == kind || "yS" == kind || "yB" == kind {
+    if !r.matched {
+      status = "Failure: Regex should apply.";
+    }
+    else {
+      // Have a match check the result
+
+      result := create_result(resultExpr, r);
+
+      if result != resultExpected {
+        status = "Failure: Result is wrong. (is: (result)$)";
+      }
+    }
+  }
+  else if "n" == kind {
+    if r.matched {
+      status = "Failure: Regex should not apply. Result is '(r.group(0))$'";
+    }
+  } else {
+    status = "Unknown kind '(kind)$'";
+  }
+
+  if !warning.empty() {
+    warning += " ";
+  }
+  std::cout << "(id)$_(kind)$: (status)$ (warning)$regex: (regex_str)$ parsed_regex: (regex.to_string())$ str: (str)$ result_expr: (resultExpr)$ expected_results (resultExpected)$" << std::endl;
+}
+
+
+test_tests_01_char_matcher: @regex type = {
+  regex_01 := R"(abc)";
+  regex_02 := R"(abc)";
+  regex_03 := R"(abc)";
+  regex_04 := R"(abc)";
+  regex_05 := R"(abc)";
+  regex_06 := R"(abc)";
+  regex_07 := R"(abc)";
+  regex_08 := R"(abc)";
+  regex_09 := R"(abc)";
+  regex_10 := R"(abc)";
+  regex_11 := R"(abc)";
+  regex_12 := R"(abc)";
+  run: (this) = {
+    std::cout << "Running tests_01_char_matcher:"<< std::endl;
+    test(regex_01, "01", R"(abc)", "abc", "y", R"($&)", "abc");
+    test(regex_02, "02", R"(abc)", "abc", "y", R"($-[0])", "0");
+    test(regex_03, "03", R"(abc)", "abc", "y", R"($+[0])", "3");
+    test(regex_04, "04", R"(abc)", "xbc", "n", R"(-)", "-");
+    test(regex_05, "05", R"(abc)", "axc", "n", R"(-)", "-");
+    test(regex_06, "06", R"(abc)", "abx", "n", R"(-)", "-");
+    test(regex_07, "07", R"(abc)", "xabcy", "y", R"($&)", "abc");
+    test(regex_08, "08", R"(abc)", "xabcy", "y", R"($-[0])", "1");
+    test(regex_09, "09", R"(abc)", "xabcy", "y", R"($+[0])", "4");
+    test(regex_10, "10", R"(abc)", "ababc", "y", R"($&)", "abc");
+    test(regex_11, "11", R"(abc)", "ababc", "y", R"($-[0])", "2");
+    test(regex_12, "12", R"(abc)", "ababc", "y", R"($+[0])", "5");
+    std::cout << std::endl;
+  }
+}
+main: () = {
+  test_tests_01_char_matcher().run();
+}
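The result expressions in these tests follow Perl conventions: `$&` is the matched text, `$-[0]` the start offset of the whole match, and `$+[0]` the offset one past its end. As a cross-check that does not depend on cppfront, the same expectations can be reproduced with the standard library's `std::regex`. The sketch below is an analogy only. `first_match` and `search_first` are hypothetical names, not part of the `@regex` interface.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result type: what the tests call $&, $-[0] and $+[0].
struct first_match {
    bool        matched = false;
    std::string text;        // $&
    long        start = -1;  // $-[0]
    long        end   = -1;  // $+[0]
};

// Search `subject` for the first occurrence of `pattern` using std::regex.
inline first_match search_first(std::string const& pattern, std::string const& subject)
{
    std::smatch m;
    first_match r;
    if (std::regex_search(subject, m, std::regex(pattern))) {
        r.matched = true;
        r.text    = m.str(0);
        r.start   = static_cast<long>(m.position(0));
        r.end     = r.start + static_cast<long>(m.length(0));
    }
    return r;
}
```

For example, `search_first("abc", "ababc")` reports the match `abc` at offsets 2 to 5, which lines up with tests 10 to 12, and `search_first("abc", "abx")` reports no match, as test 06 expects.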
