Improvements in regular expression doc #114357
The check about the f argument type was removed in this commit: python@2c94aa5. Thanks to Pedro Arthur Duarte (pedroarthur.jedi at gmail.com) for the help with this bug.
…#106335) Remove private _PyThreadState and _PyInterpreterState C API functions: move them to the internal C API (pycore_pystate.h and pycore_interp.h). Don't export most of these functions anymore, but still export functions used by tests. Remove _PyThreadState_Prealloc() and _PyThreadState_Init() from the C API, but keep it in the stable API.
This PR does 3 things.
- Add headers. I have thought to propose the same. Please add 1 more at 320, something like
.. _re_extension_notation:
Extension notation
^^^^^^^^^^^^^^^^^^
CHANGE
- Add double backticks, either new or extending single backticks. The existing text always put backticks on REs and sometimes on text matched. The PR makes that (nearly, 2 exceptions noted) always on matches. Defensible since this seems the majority of existing cases. CHANGE
- Add 'only' in several places. I am not sure these are needed, but I see existing similar uses.
@serhiy-storchaka I want to finish this RE doc change. Any additional comments from you?
Doc/library/re.rst
Outdated
only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
matches 'foo2' normally, but ``'foo1'`` in :const:`MULTILINE` mode; searching
To be consistent with other additions, 'foo' above and 'foo2' here should be backticked. But see review summary.
Done.
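For readers following the ``foo.$`` wording under review, here is a minimal sketch of the behaviour the paragraph describes. The variable names are illustrative only; the pattern and the searched string come from the quoted documentation text.

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: $ matches only at the end of the string (or just before a
# trailing newline), so the search succeeds on the last line.
default_match = re.search(r'foo.$', text).group()

# MULTILINE mode: $ also matches just before every newline, so the search
# succeeds earlier, on the first line.
multiline_match = re.search(r'foo.$', text, re.MULTILINE).group()

# findall lists every non-overlapping match in MULTILINE mode.
all_multiline = re.findall(r'foo.$', text, re.MULTILINE)

print(default_match)    # foo2
print(multiline_match)  # foo1
print(all_multiline)    # ['foo1', 'foo2']
```

This is why a later revision of the sentence says "also ``'foo1'``": in MULTILINE mode both lines are candidates, and ``search`` simply returns the first one.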
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
I am not sure that there is a need for these changes.
Hi, @terryjreedy. Thank you for your review and comments. Items 1 and 2 are done. Concern 3: the idea is to make the
Without re.ASCII:
>>> import re
>>> re.findall(r'\d+', '567abc123٠١٢٣٤٥٦٧٨٩')
['567', '123٠١٢٣٤٥٦٧٨٩']
However, with re.ASCII:
>>> import re
>>> re.findall(r'\d+', '567abc123٠١٢٣٤٥٦٧٨٩', re.ASCII)
['567', '123']
This could be a start toward adding more in-line examples, as is in progress for strings (#119445).
Thanks for making the requested changes! @terryjreedy: please review the changes made to this pull request.
I appreciate the subheadings to divide up the syntax section.
I agree with adding only to clarify when ASCII mode matches less than previously described. But the cases with complemented sets don’t have this problem, and I think adding only to them only hurts.
\D:
Matches any character which is not a decimal digit. This is the opposite of \d. Matches only [^0-9] if the ASCII flag is used.
reads as “Matches only the universe except zero to nine”?
Doc/library/re.rst
Outdated
@@ -514,6 +529,9 @@ The special characters are:

.. _re-special-sequences:

Special sequences
^^^^^^^^^^^^^^^^^
Can we call them all escape sequences? Differentiates better from the multi-character “special character” sequences above.
I suggest changing the next heading to something like String literal escapes, and changing this heading from Special sequences to Escape sequences.
These are the types of the special characters I can think of for REs:
- The single-character metacharacters: $, *, [, ], \, etc, as listed in the how-to https://cpython-previews--114357.org.readthedocs.build/en/114357/howto/regex.html#matching-characters
- Multicharacter syntax built with the metacharacters, like *?, {m,n} and the bracketed extension notation (?. . .)
- “Special sequences” a.k.a. escape sequences, which begin with a backslash. These could be subdivided into
  - Non-alphanumeric, for escaping metacharacters and other syntax: \$, \*, \\, \', \", etc
  - Group references \1–\99
  - Alphanumeric sequences that specify locations to match, or categories of characters: \A, \b, \d, etc
  - String literal escapes: \n, \\, \N{. . .}, \0–\777, etc. Excludes \b and \<newline>.
- Characters only special in “verbose” expressions: whitespace and #
- Additional backslash sequence for re.sub templates: \g<. . .>
- Special characters inside square-bracketed classes/sets [. . .], especially -, ^, ], \b, and reserved [, &&, etc
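To make the categories in the list above concrete, here is a small sketch with one example per category. The sample strings and variable names are made up for illustration; only standard ``re`` calls are used.

```python
import re

# Escaped metacharacter: \$ matches a literal dollar sign.
escaped = re.search(r'\$\d+', 'cost: $42').group()

# Group reference: \1 must repeat whatever group 1 matched.
doubled = re.search(r'(\w)\1', 'hello').group()

# Verbose mode: unescaped whitespace and # comments in the pattern are ignored.
verbose = re.search(r'''
    \d+   # one or more digits
''', 'abc 123', re.VERBOSE).group()

# re.sub template: \g<name> inserts the text of a named group.
wrapped = re.sub(r'(?P<word>\w+)', r'<\g<word>>', 'a b')

# Inside a set, ^ - ] are special unless escaped, and \b means backspace.
in_set = re.findall(r'[\^\-\]]', 'a^b-c]')
backspace_found = re.search(r'[\b]', 'a\bc') is not None

print(escaped)          # $42
print(doubled)          # ll
print(verbose)          # 123
print(wrapped)          # <a> <b>
print(in_set)           # ['^', '-', ']']
print(backspace_found)  # True
```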
matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
matches
only ``'foo'``. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
matches ``'foo2'`` normally, but also ``'foo1'`` in :const:`MULTILINE` mode; searching
I thought the original was easier to read, with the full string being searched given in a different font from the substrings that are found
Firstly, it was inconsistent with the "(In the rest of this section, we’ll write RE’s in this special style, usually without quotes, and strings to be matched 'in single quotes'.)"
However, you highlighted that 'strings to be matched' is different from 'the matches'. On the other hand, both are literal strings, and this is a common pattern around all docs.
I would like some more opinions here.
Doc/library/re.rst
Outdated
many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
by any number of 'b's.
many repetitions as are possible. ``ab*`` will match ``'a'``, ``'ab'``, or
``'a'`` followed by any number of ``'b'`` s.
The -s signifying plural has become disconnected from the b
Waiting for a decision on: #114357 (comment)
Not that I think it is great formatting, but looking at the markup under *+ you might join the s on with
``'a'`` followed by any number of ``'b'`` s.
``'a'`` followed by any number of ``'b'``\ s.
Done
Doc/library/re.rst
Outdated
.. index:: single: + (plus); in regular expressions

``+``
Causes the resulting RE to match 1 or more repetitions of the preceding RE.
``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
match just 'a'.
``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'`` s; it
-s disconnected again
Done
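As a quick check of the ``ab*`` and ``ab+`` descriptions whose markup is being adjusted in these threads, the following sketch compares the two patterns on a few made-up candidate strings. ``re.fullmatch`` is used so that the whole string has to match.

```python
import re

candidates = ['a', 'ab', 'abbb', 'b']

# ab* : 'a' followed by zero or more 'b's.
star_matches = [s for s in candidates if re.fullmatch(r'ab*', s)]

# ab+ : 'a' followed by one or more 'b's, so a bare 'a' is rejected.
plus_matches = [s for s in candidates if re.fullmatch(r'ab+', s)]

print(star_matches)  # ['a', 'ab', 'abbb']
print(plus_matches)  # ['ab', 'abbb']
```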
I've reverted this change.
Hmm... It is curious. Even without the only, this can sound weird, mainly if you don't read the first paragraph. Maybe this would be "Matches all the ASCII universe except zero to nine."
I have made the requested changes; please review again
Thanks for making the requested changes! @terryjreedy: please review the changes made to this pull request.
\D: Matches [^0-9] if the ASCII flag is used.
I don’t understand what is weird. It matches all Unicode characters, not just ASCII, except for ASCII zero to nine. The only suggestion I can think of is saying “Equivalent to [^0-9]” rather than “Matches”. Maybe that is clearer to you? (Although the equivalency doesn’t quite work when \D is already inside a square-bracket character class/set.)
Oh, my goodness. I had a misconception about how re.ASCII works. I thought it was like a filter: "filter out all the ASCII characters and then match against the RE". So, necessarily, with the ASCII flag, the matches would contain only ASCII characters. But that is not true, especially with the negative sets of characters. In this case, we might need to improve the re.ASCII definition to avoid this misconception.
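The point clarified in this exchange can be checked with a short sketch. The sample string is made up for illustration; it mixes an ASCII letter, an ARABIC-INDIC DIGIT THREE, an ASCII digit, and an accented letter.

```python
import re

sample = 'a٣5é'

# Unicode mode (the default for str patterns): \d covers every Unicode
# decimal digit, so \D rejects both '٣' and '5'.
unicode_non_digits = re.findall(r'\D', sample)

# ASCII mode: \d shrinks to [0-9], so \D grows and now accepts '٣'.
# Non-ASCII characters are still matched; the flag does not filter the input.
ascii_non_digits = re.findall(r'\D', sample, re.ASCII)

print(unicode_non_digits)  # ['a', 'é']
print(ascii_non_digits)    # ['a', '٣', 'é']
```

So with re.ASCII the complemented classes match more characters than before, which is why adding "only" to their descriptions reads oddly.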
Yeah. With the correct understanding of how ASCII works, it sounds better. Can I change this in all occurrences like that?
Thank you @adorilson for your patience and effort on this PR. @terryjreedy I'm going through and triaging a bunch of docs PRs. If you have time, please review this one again. Thanks.
📚 Documentation preview 📚: https://cpython-previews--114357.org.readthedocs.build/