[SUGGESTION] Simpler interpolated raw string literals #648

msadeqhe · 2023-03-28T09:38:59Z

Preface

I think I should explain in detail which options are possible for string literals.

If you require " to be written as an escape sequence in character literals (i.e. programmers have to write '\"' instead of '"'), and in addition disallow empty content between multiple single quotes ('...', ''...'', '''...''' and so on), then the following quotes are all available as string literal syntaxes without any conflict or ambiguity for either the C++2 compiler or programmers:

1)  'text'
2)  ''text''
3)  '''text'''
4)  ''''...

5)  "text"

6)  '"text"'

7)  '?"text"?'
8)  '?!"text"!?'
9)  '?!@"text"@!?'
10) '?!@#...

The ?, !, @ and # in the last four lines can be any character except ', ", \ and new-line: allowing ' would conflict with ''...'', allowing " would conflict with '"..."', and allowing \ would be ambiguous with \" in character literals. If a character is an opening bracket in the opening quote, it must be the corresponding closing bracket in the closing quote, and the order of the characters has to be reversed in the closing quote, e.g. 'x["text"]x'.

If you disallow placing an empty string literal side by side with another string literal, and in addition disallow empty content between (at least triple) double quotes ("""...""", """"..."""", """""...""""" and so on), then the following quotes are also available as string literal syntaxes without any conflict or ambiguity for either the C++2 compiler or programmers:

11) """text"""
12) """"text""""
13) """""text"""""
14) """"""...

What is the current status of the above quotes?

Number (1) is already used for character literals, so I leave it alone.
Number (5) is already used for interpolated non-raw string literals, so I leave it alone too.
The other numbers, (2) to (4) and (6) to (14), are not yet taken by any language feature.

First let's consider numbers (2) to (4) (e.g. ''...'') and (11) to (14) (e.g. """..."""):

They cannot be empty, e.g. '''' is not an empty ''...'', but it's just an opening quote with four 's.
They cannot contain a single character of the same quote, e.g. ''''' doesn't contain ' inside ''...'', but it's just an opening quote with five 's.
They cannot contain a single character of the same quote at the beginning or ending in their content, e.g. '''text''' doesn't contain 'text' inside ''...'', but it contains text inside '''...'''.

Therefore, to work around the above limits, we may allow optional whitespace around the content of string literals. Alternatively, we can explore other quotes.

Now, numbers (6) (e.g. '"..."') and (7) to (10) (e.g. '?"..."?') are left for us. The good news about these quotes is that their opening syntax (e.g. '"...) is different from their closing syntax (e.g. ..."'), so:

They can be empty, e.g. '""' is an empty '"..."'.
They can contain a single character of the same quote, e.g. '"""' contains " inside '"..."', and '"'"' contains ' inside '"..."'.
They can contain a single character of the same quote at the beginning or ending of their content, e.g. '""text""' contains "text" inside '"..."', and '"'text'"' contains 'text' inside '"..."'.
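The three properties above follow from the opening quote '" being different from the closing quote "'. A small scanner can illustrate this. It is only a sketch in Python (not Cpp2, and not how cppfront lexes anything): the name read_raw_literal is hypothetical, and it assumes the literal ends at the first "' found after the opening quote.

```python
OPENING = "'" + '"'   # the two characters '"
CLOSING = '"' + "'"   # the two characters "'


def read_raw_literal(source: str, start: int = 0) -> tuple[str, int]:
    """Read a '"..."' literal that begins at source[start].

    Returns (content, index just past the closing quote).
    Assumption: the literal ends at the first closing quote after the opening one.
    """
    if not source.startswith(OPENING, start):
        raise ValueError("expected the opening quote")
    body_start = start + len(OPENING)
    end = source.find(CLOSING, body_start)
    if end == -1:
        raise ValueError("unterminated literal")
    return source[body_start:end], end + len(CLOSING)


if __name__ == "__main__":
    # The examples from the list above, written as Python strings.
    examples = [
        OPENING + CLOSING,             # '""'
        OPENING + '"' + CLOSING,       # '"""'
        OPENING + "'" + CLOSING,       # '"'"'
        OPENING + '"text"' + CLOSING,  # '""text""'
        OPENING + "'text'" + CLOSING,  # '"'text'"'
    ]
    for literal in examples:
        content, _ = read_raw_literal(literal)
        print(literal, "->", repr(content))
```

Under that assumption, the content can start or end with either quote character, and only the two-character sequence "' terminates it.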

NOTE 1

Numbers (7) to (10) (e.g. '?"..."?') can have additional characters in the opening and closing quotes, other than that they are similar to number (6) (e.g. '"..."'). These additional characters are similar to R"?(...)?" in C++1, except:

The order of characters in the opening quote has to be reversed in the closing quote, e.g. 'abc"text"cba' contains text inside 'abc"..."cba'. Alternatively, identifiers and numbers within the closing quote may keep their order as described in this comment (recommended, as it improves readability).
If the character is an opening bracket at the opening quote, it should be the corresponding closing bracket at the closing quote, e.g. '[(<{"text"}>)]' contains text inside '[(<{"..."}>)]'.
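The two rules above (reverse the order, mirror the brackets) can be sketched as a function that derives the closing quote from the extra characters of the opening quote. This is a hedged Python illustration, not part of any Cpp2 implementation: closing_quote, BRACKETS and FORBIDDEN are hypothetical names, and characters that are already closing brackets are left unchanged because the suggestion does not say how to treat them.

```python
# Opening brackets and the closing brackets they turn into.
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}

# Characters the suggestion excludes from the extra delimiter characters.
FORBIDDEN = set("'\"\\\n")


def closing_quote(extra: str) -> str:
    """Build the closing quote for an opening quote of the form '<extra>".

    The extra characters are reversed and every opening bracket is replaced
    by its corresponding closing bracket.
    """
    for ch in extra:
        if ch in FORBIDDEN:
            raise ValueError(f"character {ch!r} is not allowed in a quote")
    mirrored = [BRACKETS.get(ch, ch) for ch in reversed(extra)]
    return '"' + "".join(mirrored) + "'"


if __name__ == "__main__":
    for extra in ["", "x[", "abc", "[(<{"]:
        print("'" + extra + '"', "...", closing_quote(extra))
```

For the opening quote 'x[" this yields the closing quote "]x', which matches the 'x["text"]x' example from the preface.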

Suggestion Detail

This is not a new issue from me. I had a similar issue before, but it was cluttered among many replies in this issue, so I felt I should summarize my suggestion here.

I have to mention that my suggestion ...

doesn't introduce any new string format.
doesn't introduce any new keyword or new symbol.
doesn't introduce any new semantics.
introduces a new syntax for string literals.

Currently the $R prefix is used to quote interpolated raw string literals in C++2, e.g. $R"?(text)?", while the R prefix is used to quote non-interpolated raw string literals, e.g. R"?(text)?". I suggest completely removing the prefixes, e.g. '?"text"?'.

$R"?(...)?" is a powerful way to have interpolated raw string literals, but it's possible to go further and make its syntax simpler and smaller. The whole purpose of my suggestion is to transform $R"?(...)?" into '?"..."?' without any additional changes (see NOTE 1):

// := $R"(Username: (user)$)";
x0 := '"Username: (user)$"';

// := $R"x[(It's the message: "(message)$")x[";
x1 := 'x["It's the message: "(message)$""]x';

Why do I suggest this change?

$R"(...)" is a little verbose for the common case where we just want to disable escape sequences and be able to simply write single quotes ' and double quotes " inside string literals. '"..."' is more readable and more convenient, with less typing than $R"(...)", for starting an interpolated raw string literal.

I have to mention that programmers are familiar with writing strings in quotes such as '"..."', but $R"(...)" goes a little further than that: they must learn why the parentheses are not part of the content, what a prefix is, and how it can be combined with Unicode prefixes.

Is there any experience, data or working implementation available?

My suggestion is a small change. It is almost $R"?(...)?" without $R prefix (see NOTE 1).

Is there any additional suggestion?

I additionally suggest unifying interpolated and non-interpolated string literals. Instead of introducing a different string literal for each of them, I suggest having a way to disable captures within string literals. The pattern (...)$ captures a variable in string literals. It is complex enough that we don't often need to disable it, therefore we don't need to devote a different string literal to it.

To disable the capture pattern (expr)$, I introduce a new False Capture pattern (expr)...\$ that doesn't capture anything. We can add a back-slash before the dollar sign, so the value of "(...)\$" is equal to (...)$. We can also add more back-slashes before the dollar sign, so the value of "(...)\\$" is equal to (...)\$. Each additional back-slash we write adds one back-slash to the value. Programmers are already familiar with escape sequences, and this is similar to an escape sequence \$. However, I should mention that the escape sequence \\ (and other escape sequences too) has no meaning inside the false capture pattern "(...)\\$", therefore each additional back-slash is added to the value exactly as written.

a := 0;

// The value is 0
x0 := "(a)$";

// The value is (a)$
x1 := "(a)\$";

// The value is (a)\$
x2 := "(a)\\$";

// The value is (a)\\$
x3 := "(a)\\\$";

In a nutshell, C++2 will have the following patterns in string literals:

Capture: (expr)$ is equal to the value of expr.
False Capture: (expr)...\$ is equal to the value (expr)...$. Only back-slash is allowed in place of ... after ) and before $. If you add any other character except back-slash in place of ..., then the whole pattern is violated, and it will not be a capture or false capture.
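The two patterns above can be modelled with a single regular expression substitution. The following is a minimal Python sketch, not how cppfront processes captures: the function name expand and the variables dictionary are hypothetical, expr is assumed to be a plain variable name, and nested parentheses inside the pattern are not handled.

```python
import re

# "(expr)" followed by zero or more back-slashes and then a dollar sign.
# Assumption: expr contains no parentheses of its own.
PATTERN = re.compile(r"\(([^()]*)\)(\\*)\$")


def expand(text: str, variables: dict[str, object]) -> str:
    """Expand captures and false captures in the content of a string literal."""

    def replace(match: re.Match) -> str:
        expr, slashes = match.group(1), match.group(2)
        if not slashes:
            # Capture: (expr)$ becomes the value of expr.
            return str(variables[expr])
        # False capture: drop exactly one back-slash, keep the rest as written.
        return "(" + expr + ")" + slashes[1:] + "$"

    return PATTERN.sub(replace, text)


if __name__ == "__main__":
    values = {"a": 0}
    for content in ["(a)$", r"(a)\$", r"(a)\\$", r"(a)\\\$"]:
        print(content, "->", expand(content, values))
```

Any other character between ) and $ makes the regular expression fail to match, so such text is left untouched, which corresponds to the rule that the whole pattern is then neither a capture nor a false capture.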

Finally there will be two string literals in C++2:

"..." for non-raw string literals. It supports escape sequences.
'"..."' for raw string literals, as well as '?"..."?', '?!"..."!?', '?!@"..."@!?' and so on, where ?, !, @ and the following characters can be any character except ', ", \ and new-line. It doesn't support escape sequences; on the other hand, its content can be broken across multiple lines (see NOTE 1).

And we can both capture and not capture in the same string literal:

// := $R"((user)$ is a capture, but )" + R"((user)$ is not a capture.)";
x0 := '"(user)$ is a capture, but (user)\$ is not a capture."';

In the above example, a programmer has to determine whether a string literal is interpolated or non-interpolated (as in the first line) before they can decide whether (user)$ is a capture. Using false captures (as in the second line) makes it obvious that (user)\$ is not a capture.

This is a regular expression example:

// := R"(^("hi"|"hey"|"hello")$)";
x1 := '"^("hi"|"hey"|"hello")\$"';

As you see in the above example, without a back-slash before the dollar sign (e.g. ("hello")$), a programmer may think it's a capture in C++2, but in fact it's a capture in regular expressions. Therefore using false captures (e.g. ("hello")\$) helps programmers to easily distinguish captures in C++2 from captures in regular expressions, and it makes code that deals with regular expressions more readable.

I mean that completely disabling captures via non-interpolated string literals may lead to less readable code, because a programmer has to determine whether a string literal is interpolated or non-interpolated before they can think about how to read its content.

In this way, C++2 only has two string literals, and we can control capturing at any point within a single string literal.

Edits

English is not my native language. Sorry if False Capture is not a proper name for it.
I've added a regular expression example and the reason why false captures are more readable and more obvious than non-interpolated string literals.

AbhinavK00 · 2023-03-28T13:19:10Z

I think I fully support this change. I would like to hear others' opinions on this.

msadeqhe (Author) · 2023-03-29T06:41:15Z

Thanks @AbhinavK00.

I need to explain that non-interpolated string literals also go somewhat against the goal of the general capture syntax, as mentioned in @hsutter's issue comment:

For now I'm planning to stick to the experiment of the general capture syntax (thing)$ everywhere in the language (not just string literals, but also contracts and lambdas) for the reasons given in Design Note: Capture. I get the argument for consistency among string literals, but I'm currently putting heavier weight on seeing if a consistency of all capture across the language pans out.

Because non-interpolated string literals will break the rule of (thing)$ everywhere in the language:

// This R"(...)" breaks the rule of (thing)$ everywhere.
// := R"((user)$ is not a capture.)";

// But (user)\$ doesn't break the rule of (thing)$ everywhere.
x0 := '"(user)\$ is not a capture."';

False capture (user)\$ doesn't break the rule, because (user)\$ is not a capture pattern, and that is obvious from the pattern itself.

JohelEGP · 2023-03-29T15:32:25Z

I like how '""' is much simpler than C++1's R"()". It's less cluttered and easier to reason about.

'"(x)\$"' would be printed as (x)$, right?

This made me wonder how you'd interpolate in injected code.

x:=0;
-> {
  x:=1;
  z:='"(x)$ is 0 and (x)\$ is 1"';
};

msadeqhe (Author) · 2023-03-29T18:52:14Z

'"(x)\$"' would be printed as (x)$, right?

Yes.

This made me wonder how you'd interpolate in injected code.

If I understand your example correctly, it will be evaluated only once, so:

x := 0;
y := :(str :std::string) = {
    x := 1;

    // The value of local variable is:
    // 0 is 0 and (x)$ is (x)$
    z := str;
};

// The value of function parameter is:
// 0 is 0 and (x)$ is (x)$
y('"(x)$ is 0 and (x)\$ is (x)\$"');

JohelEGP · 2023-05-01T02:40:12Z

Makes me think if raw string literals could just be backslash-escaped (see #302).
-- #392 (comment)

@JohelEGP, Good idea but in my opinion, raw (non-interpolated) string literals break the general capture syntax (thing)$ everywhere in the language. Also string literals without prefix or suffix will make it possible to have operator'' and operator"" or Tagged Template Strings or any other versatile syntax in Cpp2.
-- #392 (comment)

E.g., \"(x)$",

\"Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum."

After actually writing it down, I realize it's silly.
It would end at the first ".

msadeqhe (Author) · 2023-05-01T05:18:20Z

If we put another \ at the end, it would end at the first "\, but \"(x)$"\ doesn't look good. Using a backtick ` instead of ' gives another option, `"text"`. However, the backtick is not yet used, and Cpp2 could reserve it for new features in the future (it's a possibility to consider).

Alternatively, the syntax ''"text"'' seems good, based on the idea of having non-interpolated raw string literals with new quotes in addition to '"text"'. Now the categories of string literals will be like this:

| New Syntax | Old Syntax | Escape Sequences | Interpolated | Description |
|---|---|---|---|---|
| "text" | "text" | Yes | Yes | String Literal |
| '"text"' | $R"(text)" | --- | Yes | Raw String Literal |
| ''"text"'' | R"(text)" | --- | --- | Non-interpolated Raw String Literal |

Consider how string literals in the old syntax look different from each other because of parentheses and prefixes, while they look uniform in the new syntax (this would make them easier to teach, because no one would ask why the parentheses are not part of the string or what the prefixes mean):

Old Syntax ->   "text"   |   $R"(text)"   |   R"(text)"
New Syntax ->   "text"   |     '"text"'   |   ''"text"''

It's a simple rule to teach new programmers: each additional ' around "" makes the string literal less sensitive to its content (first escape sequences are disabled, then interpolation). Of course ''"text"'' is similar to '"text"' except that it's non-interpolated. Therefore it's also possible to have additional characters between '' and " to make the quotes even more unique:

x: = ''[{"Here (variable)$ is not interpolated!"}]'';

So ''"text"'' has the power of the R"(text)" syntax. It looks like triple quotes in other programming languages (e.g. '''text''' or """text""" in Python and C#), but it doesn't share the following problems with them:

First let's consider numbers (2) to (4) (e.g. ''...'') and (11) to (14) (e.g. """..."""):

They cannot be empty, e.g. '''' is not an empty ''...'', but it's just an opening quote with four 's.

They cannot contain a single character of the same quote, e.g. ''''' doesn't contain ' inside ''...'', but it's just an opening quote with five 's.

They cannot contain a single character of the same quote at the beginning or ending in their content, e.g. '''text''' doesn't contain 'text' inside ''...'', but it contains text inside '''...'''.

Therefore to solve the above limits, we may allow optional white-spaces around the content of string literals.

I have to mention that the syntax ''"text"'' will make the syntax ''text'' unavailable for future use, but the following syntaxes remain available (unlike R"ABC(text)ABC", 'ABC"text"CBA' or ''ABC"text"CBA'', these quotes cannot have additional characters within them to make complex quotes):

  '''text'''
 ''''text''''
'''''text'''''
     ...

  """text"""
 """"text""""
"""""text"""""
     ...

In my opinion, (something)...\$ (aka False Capture) is still helpful to occasionally evade capturing inside interpolated string literals, in addition to ''"text"'' (Non-interpolated Raw String Literals).

msadeqhe (Author) · 2023-05-02T08:49:46Z

Now the categories of string literals will be like this:

| New Syntax | Old Syntax | Escape Sequences | Interpolated | Description |
|---|---|---|---|---|
| "text" | "text" | Yes | Yes | String Literal |
| '"text"' | $R"(text)" | --- | Yes | Raw String Literal |
| ''"text"'' | R"(text)" | --- | --- | Non-interpolated Raw String Literal |

Raw string literals end at )" in the old syntax, and they would end at "' or "'' in the new syntax. I've searched GitHub for those ending quotes; here are the numbers of results found in source code:

For interpolated raw string literals (noticeable difference!):
- )" has about 157'000'000 results.
- "' has about 28'600'000 results.
For non-interpolated raw string literals (a huge difference!):
- )" has about 157'000'000 results.
- "'' has about 922'000 results.

Comparing the ending quotes, we can see that "' and "'' are less common than )" in source code. Therefore '"text"' and ''"text"'' are better options than R"(text)" for quoting raw string literals that contain source code (aka template string literals).

1 reply

msadeqhe Sep 7, 2023
Author

These statistics mean that the quotes '"..."' and ''"..."'' are enough in most cases, and additional characters are rarely needed to make the quotes more complex.

msadeqhe (Author) · 2023-06-17T11:21:21Z

NOTE 1

Numbers (7) to (10) (e.g. '?"..."?') can have additional characters in the opening and closing quotes, other than that they are similar to number (6) (e.g. '"..."'). These additional characters are similar to R"?(...)?" in C++1, except:

The order of characters in the opening quote has to be reversed in the closing quote, e.g. 'abc"text"cba' contains text inside 'abc"..."cba'.

If the character is an opening bracket at the opening quote, it should be the corresponding closing bracket at the closing quote, e.g. '[(<{"text"}>)]' contains text inside '[(<{"..."}>)]'.

Alternatively, sequences of letters and digits (identifiers and numbers) in the opening quote must keep the same order in the closing quote:

x: = 'something^([{another100<"THIS IS THE STRING VALUE!">another100}])^something';
// x == "THIS IS THE STRING VALUE!"

That way, identifiers and numbers within the opening and closing quotes stay readable. Compare how much more readable this is than $R"(...)" when the corresponding brackets and symbols are in reversed order:

// $R"abc{xyz[100(...)abc{xyz[100";
x: = 'abc{xyz[100"..."100]xyz}abc';

// $R"{[abc100(...){[abc100";
y: = '{[abc100"..."abc100]}';

// $R"abc*#[(...)abc*#[";
z: = 'abc*#["..."]#*abc';

For example, abc shouldn't be in reversed order, because that would make it hard for programmers to construct the closing quote. I've added this alternative approach to NOTE 1.
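The alternative rule can be sketched by splitting the extra characters into tokens before reversing them. This is a hedged Python illustration with hypothetical names (closing_quote_keeping_words, BRACKETS); it assumes that "identifiers and numbers" means runs of ASCII letters and digits, and that every other character is a token of its own.

```python
import re

# Opening brackets and the closing brackets they turn into.
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def closing_quote_keeping_words(extra: str) -> str:
    """Build the closing quote for an opening quote of the form '<extra>".

    Runs of letters and digits keep their internal order; the sequence of
    tokens is reversed and opening brackets are mirrored.
    """
    # Assumption: a token is a run of ASCII letters/digits or one other character.
    tokens = re.findall(r"[A-Za-z0-9]+|.", extra)
    mirrored = [BRACKETS.get(token, token) for token in reversed(tokens)]
    return '"' + "".join(mirrored) + "'"


if __name__ == "__main__":
    for extra in ["abc{xyz[100", "{[abc100", "abc*#["]:
        print("'" + extra + '"', "...", closing_quote_keeping_words(extra))
```

With this tokenization the three opening quotes from the examples above produce the closing quotes "100]xyz}abc', "abc100]}' and "]#*abc'.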

Edits

corrected spelling (~~"reserved"~~ to "reversed")
