
Improvements in regular expression doc #114357

Open
wants to merge 34 commits into
base: main
817b3f3
Doc: Fix the array.fromfile method doc
adorilson Sep 1, 2020
6b53456
gh-106320: Remove private _PyInterpreterState functions (#106335)
vstinner Jul 2, 2023
1b4d152
[Doc] Divide RE Syntax in subsections
adorilson Jan 20, 2024
6ad009c
[DOC] Add crasis surrounding some RE-matched words
adorilson Jan 20, 2024
94f765f
[DOC] Make clearer what will be matched with a RE
adorilson Jan 20, 2024
292672b
Doc: minor change
adorilson Dec 30, 2023
65b4278
Merge branch 'python:main' into re_improvements
adorilson Feb 3, 2024
fe7389a
Merge branch 'python:main' into re_improvements
adorilson Feb 4, 2024
8394cd3
Merge branch 'python:main' into re_improvements
adorilson Feb 5, 2024
e2023e0
Doc: Put PatternError's attributes inside a table instead of regular …
adorilson Feb 5, 2024
cdaa9ae
Doc: Fix PatternError's attributes
adorilson Feb 5, 2024
bb98dad
Doc: fix lint issue
adorilson Feb 5, 2024
22ffed7
Merge branch 'main' into re_improvements
adorilson Feb 25, 2024
6a1e74e
Merge branch 'python:main' into re_improvements
adorilson Sep 25, 2024
6b357af
Doc: Add extension notation header
adorilson Sep 25, 2024
8f7356d
Doc: Add some more backticks
adorilson Sep 25, 2024
6ed5109
Merge branch 'python:main' into re_improvements
adorilson Sep 26, 2024
9c17aa8
Doc: Fix malformed hyperlink target
adorilson Sep 26, 2024
acb2e38
Merge branch 'main' into re_improvements
adorilson Sep 26, 2024
4d3b8dd
Merge branch 'python:main' into re_improvements
adorilson Oct 1, 2024
643070c
Merge branch 'main' into re_improvements
adorilson Oct 3, 2024
17baf98
Docs: add a 'also' for $ special character and RE examples reference …
adorilson Oct 3, 2024
4e12f7c
Docs: add some RE raw string notation references
adorilson Oct 3, 2024
a09a187
Merge branch 'python:main' into re_improvements
adorilson Oct 20, 2024
625a5cf
Revert "[DOC] Make clearer what will be matched with a RE"
adorilson Oct 20, 2024
12ecb3a
Doc: Put some subheadings at Special Character section
adorilson Oct 20, 2024
f576282
Doc: Fix raw string notation reference
adorilson Oct 20, 2024
337e4b4
Merge branch 'python:main' into re_improvements
adorilson Oct 28, 2024
0e0e082
Doc: Include "Python's" to a link text in RE module
adorilson Oct 28, 2024
f094a90
Doc: Add some backticks in re.IGNORECASE section
adorilson Oct 28, 2024
fd24e0f
Merge branch 'main' into re_improvements
adorilson Nov 2, 2024
a8c44e1
Merge branch 'main' into re_improvements
adorilson Nov 21, 2024
f970235
Doc: rename some heading in RE
adorilson Mar 15, 2025
8d52469
Doc: Connect some s in RE
adorilson Mar 15, 2025
101 changes: 71 additions & 30 deletions Doc/library/re.rst
@@ -33,8 +33,9 @@ usage of the backslash in string literals now generate a :exc:`SyntaxWarning`
and in the future this will become a :exc:`SyntaxError`. This behaviour
will happen even if it is a valid escape sequence for a regular expression.

-The solution is to use Python's raw string notation for regular expression
-patterns; backslashes are not handled in any special way in a string literal
+The solution is to use :ref:`Python's raw string notation
+for regular expression patterns <raw-string-notation>`; backslashes are not
+handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
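
A minimal sketch of the raw-string point made in this hunk. The variable names (`raw`, `cooked`, `raw_match`, `cooked_match`) are illustrative and not part of the patch; the example only relies on `re.fullmatch` from the standard library.

```python
import re

raw = r"\n"      # raw literal: a backslash followed by 'n'
cooked = "\n"    # ordinary literal: a single newline character

print(len(raw), len(cooked))

# The regex engine interprets the two-character sequence \n itself,
# so both spellings of the pattern match a newline.
raw_match = re.fullmatch(raw, "\n")
cooked_match = re.fullmatch(cooked, "\n")
print(raw_match is not None, cooked_match is not None)
```

The raw form is usually preferable because it behaves the same way for sequences that Python's own parser does not recognize.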
@@ -83,6 +84,12 @@ characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)


+.. _re-special-characters:
+
+Special characters
+^^^^^^^^^^^^^^^^^^

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.
@@ -93,7 +100,6 @@ directly nested. This avoids ambiguity with the non-greedy modifier suffix
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
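
To illustrate the context lines above, here is a short sketch of applying an outer repetition to an inner one through a non-capturing group. The helper names `is_multiple_of_six_as` and `compiles` are hypothetical and exist only for this example.

```python
import re

SIX_AS = re.compile(r"(?:a{6})*")


def is_multiple_of_six_as(text):
    """Return True if text consists of 0, 6, 12, ... 'a' characters."""
    return SIX_AS.fullmatch(text) is not None


def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(is_multiple_of_six_as("a" * 12))  # grouped repetition applied twice
print(is_multiple_of_six_as("a" * 7))   # seven is not a multiple of six
print(compiles(r"a{6}*"))               # directly nested qualifiers are rejected
```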


The special characters are:

.. index:: single: . (dot); in regular expressions
@@ -114,31 +120,33 @@ The special characters are:
``$``
Matches the end of the string or just before the newline at the end of the
string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
-matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
-only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
-matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
+matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
+matches
+only ``'foo'``. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
+matches ``'foo2'`` normally, but also ``'foo1'`` in :const:`MULTILINE` mode; searching

Review comment (Member):

I thought the original was easier to read, with the full string being searched given in a different font from the substrings that are found

Reply (Contributor Author):

Firstly, it was inconsistent with the "(In the rest of this section, we’ll write RE’s in this special style, usually without quotes, and strings to be matched 'in single quotes'.)"

However, you highlighted that 'strings to be matched' is different from 'the matches'. On the other hand, both are literal strings, and this is a common pattern around all docs.

I would like some more opinions here.

+for
a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
the newline, and one at the end of the string.
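
The behaviour of ``$`` described in this entry can be checked with a short sketch. The variable names are illustrative; the calls used are `re.search`, `re.finditer` and the `re.MULTILINE` flag.

```python
import re

text = "foo1\nfoo2\n"

# Without MULTILINE, $ only matches at the end of the string
# (or just before its trailing newline).
default_hit = re.search(r"foo.$", text).group()

# With MULTILINE, $ also matches before every newline, so the first
# line is found first.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A bare $ finds two empty matches in 'foo\n'.
empty_spans = [m.span() for m in re.finditer(r"$", "foo\n")]

print(default_hit, multiline_hit, empty_spans)
```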

.. index:: single: * (asterisk); in regular expressions

``*``
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
-many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
-by any number of 'b's.
+many repetitions as are possible. ``ab*`` will match ``'a'``, ``'ab'``, or
+``'a'`` followed by any number of ``'b'``\ s.

.. index:: single: + (plus); in regular expressions

``+``
Causes the resulting RE to match 1 or more repetitions of the preceding RE.
-``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
-match just 'a'.
+``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'``\ s; it
+will not match just ``'a'``.

.. index:: single: ? (question mark); in regular expressions

``?``
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
-``ab?`` will match either 'a' or 'ab'.
+``ab?`` will match either ``'a'`` or ``'ab'``.
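
The three quantifiers touched by this hunk can be compared side by side. This is a small sketch; `SAMPLES` and `whole_string_matches` are hypothetical names, and `re.fullmatch` is used so that each sample must match in its entirety.

```python
import re

SAMPLES = ["a", "ab", "abbb"]


def whole_string_matches(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]


star = whole_string_matches(r"ab*")      # zero or more 'b'
plus = whole_string_matches(r"ab+")      # one or more 'b'
optional = whole_string_matches(r"ab?")  # zero or one 'b'

print(star, plus, optional)
```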

.. index::
single: *?; in regular expressions
@@ -224,7 +232,8 @@ The special characters are:
``'*'``, ``'?'``, and so forth), or signals a special sequence; special
sequences are discussed below.

-If you're not using a raw string to express the pattern, remember that Python
+If you're not using a :ref:`raw string to express the
+pattern<raw-string-notation>`, remember that Python
also uses the backslash as an escape sequence in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and subsequent
character are included in the resulting string. However, if Python would
@@ -315,6 +324,12 @@ The special characters are:
special sequence, described below. To match the literals ``'('`` or ``')'``,
use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.
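
Both ways of matching literal parentheses mentioned above can be sketched as follows. The sample text and variable names are made up for illustration.

```python
import re

text = "f(x)"

# Escape the parentheses with a backslash ...
escaped = re.findall(r"\(|\)", text)

# ... or place each one inside a character class.
in_class = re.findall(r"[(]|[)]", text)

print(escaped, in_class)
```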


+.. _re_extension_notation:
+
+Extension notation
+""""""""""""""""""

.. index:: single: (?; in regular expressions

``(?...)``
@@ -514,6 +529,9 @@ The special characters are:

.. _re-special-sequences:

+Escape sequences
+""""""""""""""""

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
@@ -660,6 +678,12 @@ character ``'$'``.
``\Z``
Matches only at the end of the string.
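
As a brief sketch of how ``\Z`` differs from ``$``: ``$`` tolerates a single trailing newline, while ``\Z`` does not. The variable names are illustrative.

```python
import re

text = "foo\n"

# $ matches just before the trailing newline.
dollar_match = re.search(r"foo$", text)

# \Z requires the very end of the string, so the newline gets in the way.
z_match_with_newline = re.search(r"foo\Z", text)
z_match_without_newline = re.search(r"foo\Z", "foo")

print(dollar_match, z_match_with_newline, z_match_without_newline)
```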


+.. _re-escape-sequences:
+
+String literal escapes
+""""""""""""""""""""""

.. index::
single: \a; in regular expressions
single: \b; in regular expressions
@@ -771,11 +795,11 @@ Flags

Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
-letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
-letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
-'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
-If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
-and 'A' to 'Z' are matched.
+letters and 4 additional non-ASCII letters: ``'İ'`` (U+0130, Latin capital
+letter I with dot above), ``'ı'`` (U+0131, Latin small letter dotless i),
+``'ſ'`` (U+017F, Latin small letter long s) and ``'K'`` (U+212A, Kelvin sign).
+If the :py:const:`~re.ASCII` flag is used, only letters ``'a'`` to ``'z'``
+and ``'A'`` to ``'Z'`` are matched.
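
The note above can be exercised with a small sketch that tests the four listed code points against ``[a-z]``. The names `EXTRA_LETTERS`, `unicode_hits` and `ascii_hits` are hypothetical; the code points are written as escapes so the example does not depend on the file encoding.

```python
import re

# The four non-ASCII letters listed in the note.
EXTRA_LETTERS = ["\u0130", "\u0131", "\u017f", "\u212a"]

# Unicode (default) matching with IGNORECASE.
unicode_hits = [
    c for c in EXTRA_LETTERS if re.fullmatch(r"[a-z]", c, re.IGNORECASE)
]

# Adding the ASCII flag restricts case-insensitive matching to ASCII letters.
ascii_hits = [
    c for c in EXTRA_LETTERS
    if re.fullmatch(r"[a-z]", c, re.IGNORECASE | re.ASCII)
]

print(len(unicode_hits), len(ascii_hits))
```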

.. data:: L
LOCALE
@@ -1191,25 +1215,26 @@ Exceptions
error if a string contains no match for a pattern. The ``PatternError`` instance has
the following additional attributes:

-.. attribute:: msg
+.. list-table::
+:header-rows: 1

-The unformatted error message.
+* - Attribute
+- Meaning

-.. attribute:: pattern
+* - .. attribute:: msg
+- The unformatted error message.

-The regular expression pattern.
+* - .. attribute:: pattern
+- The regular expression pattern.

-.. attribute:: pos
+* - .. attribute:: pos
+- The index in *pattern* where compilation failed (may be ``None``).

-The index in *pattern* where compilation failed (may be ``None``).
+* - .. attribute:: lineno
+- The line corresponding to *pos* (may be ``None``).

-.. attribute:: lineno
-
-The line corresponding to *pos* (may be ``None``).
-
-.. attribute:: colno
-
-The column corresponding to *pos* (may be ``None``).
+* - .. attribute:: colno
+- The column corresponding to *pos* (may be ``None``).

.. versionchanged:: 3.5
Added additional attributes.
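
A minimal sketch of reading the attributes listed in this hunk from a failed compilation. The helper `describe_pattern_error` is hypothetical. The example catches `re.error`, which is kept as an alias of `PatternError` in versions that define the newer name, so the code also runs on older interpreters.

```python
import re


def describe_pattern_error(pattern):
    """Compile pattern; return the error attributes, or None if it is valid."""
    try:
        re.compile(pattern)
    except re.error as exc:  # alias of re.PatternError where that name exists
        return {
            "msg": exc.msg,
            "pattern": exc.pattern,
            "pos": exc.pos,
            "lineno": exc.lineno,
            "colno": exc.colno,
        }
    return None


# An unterminated group fails to compile.
info = describe_pattern_error("(abc")
print(info)
```

Because `pos`, `lineno` and `colno` may be ``None``, callers would normally check for that before using them to point at the offending part of the pattern.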
@@ -1578,6 +1603,8 @@ Regular Expression Examples
---------------------------


+.. _checking-for-a-pair:

Checking for a Pair
^^^^^^^^^^^^^^^^^^^

@@ -1632,6 +1659,8 @@ To find out what card the pair consists of, one could use the
'a'


+.. _simulating-scanf:

Simulating scanf()
^^^^^^^^^^^^^^^^^^

@@ -1719,6 +1748,8 @@ beginning with ``'^'`` will match at the beginning of each line. ::
<re.Match object; span=(4, 5), match='X'>


+.. _making-a-phonebook:

Making a Phonebook
^^^^^^^^^^^^^^^^^^

@@ -1780,6 +1811,8 @@ house number from the street name:
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]


+.. _text-munging:

Text Munging
^^^^^^^^^^^^

@@ -1800,6 +1833,8 @@ in each word of a sentence except for the first and last characters::
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'


+.. _finding-all-adverbs:

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

@@ -1813,6 +1848,8 @@ the following manner::
['carefully', 'quickly']


+.. _finding-all-adverbs-and-their-positions:

Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -1829,6 +1866,8 @@ to find all of the adverbs *and their positions* in some text, they would use
40-47: quickly


+.. _raw-string-notation:

Raw String Notation
^^^^^^^^^^^^^^^^^^^

@@ -1853,6 +1892,8 @@ functionally identical::
<re.Match object; span=(0, 1), match='\\'>


+.. _writing-a-tokenizer:

Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^
