
gh-130167: Improve speed of difflib.IS_LINE_JUNK by replacing re #130170


Open · wants to merge 10 commits into main

Conversation

@donBarbos (Contributor) commented Feb 16, 2025

Now we no longer import re globally, and there is no longer an extra (undocumented) function parameter (see the documentation).
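For reference, a minimal before/after sketch of the change; the "before" regex is the one shown in the benchmark below, and the "after" form follows the strip-based check discussed later in the thread (the exact final diff may differ):

import re

# Before: module-level re dependency plus the undocumented `pat` parameter.
def IS_LINE_JUNK(line, pat=re.compile(r"\s*(?:#\s*)?$").match):
    return pat(line) is not None

# After: plain string methods; re is no longer needed at import time.
def IS_LINE_JUNK(line):
    return line.strip() in ('', '#')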

How did I benchmark the new version?

I wrote the following Python script, difflib_bench.py:

import re
import timeit

def is_line_junk_regex(line, pat=re.compile(r"\s*(?:#\s*)?$").match):
    # Current difflib implementation: precompiled regex, bound as a default arg.
    return pat(line) is not None

def is_line_junk_no_regex(line):
    # Proposed implementation: plain string methods, no re dependency.
    return line.strip() == "" or line.lstrip().rstrip() == "#"

test_cases = [
    "    ", "    #", "   # comment", "code line", " 123 ", " 123 #",
    " 123 # hi", " 123 #comment", "\n", "  #   \n", "hello\n", "", " ", "\t",
    "#", " #", "# ", "   #   ", "text", "text # comment", "#text", "##", "#\t",
    "   ", " #\t ",
]

def benchmark(func, cases, n=100000):
    return timeit.timeit(lambda: [func(line) for line in cases], number=n)

regex_time = benchmark(is_line_junk_regex, test_cases)
no_regex_time = benchmark(is_line_junk_no_regex, test_cases)

print(f"Regex time: {regex_time:.6f} sec")
print(f"No regex time: {no_regex_time:.6f} sec")

timeit result: 1.33s -> 0.64s, i.e. 2.08x as fast:

$ ./python -B difflib_bench.py
Regex time: 1.330593 sec
No regex time: 0.638559 sec

donBarbos and others added 2 commits February 16, 2025 07:20
Co-authored-by: Adam Turner <[email protected]>
@terryjreedy (Member) left a comment

One oddity of difflib is that re is imported at module level for IS_LINE_JUNK and again at class level in class _mdiff (line 1378). It is used at line 1381 to define the class name change_re = re.compile(r'(\++|\-+|\^+)'). To avoid importing re when importing difflib, the second import would have to be moved within a function. This is possible since change_re is only used within the _make_line method that immediately follows (line 1386).

I have not looked at how commonly this method is used.
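A minimal, self-contained sketch of the deferral being suggested (the real _mdiff/_make_line code in difflib has different signatures; this only illustrates the pattern):

def _make_line(markers):
    import re  # deferred: importing the enclosing module no longer imports re
    change_re = re.compile(r'(\++|\-+|\^+)')
    return change_re.findall(markers)

print(_make_line('++--^^'))  # ['++', '--', '^^']

Compiling inside the function does add a small per-call cost; caching the compiled pattern in a module-level variable on first use would avoid that.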

@bedevere-app (bot) commented Feb 16, 2025

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@terryjreedy (Member) commented

I checked correctness of the return expression replacement, using test_cases above, with

for s in test_cases:
    (pat(s) is not None) is (s.strip() in ('', '#'))

and all are True.

@donBarbos (Contributor, Author) commented

> I checked correctness of the return expression replacement, using test_cases above, with

OK, thank you, but these test_cases are too small a sample, so I supplemented my script with the following checks:

import string
import random
from itertools import product

# First Checks:
for case in test_cases:
    assert is_line_junk_regex(case) == is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"

def random_string(length=10):
    chars = string.printable + " " * 5 + "#" * 5
    return "".join(random.choice(chars) for _ in range(length))

# Second Checks:
for _ in range(10000):
    case = random_string(random.randint(0, 20))
    assert is_line_junk_regex(case) == is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"

# Third Checks:
chars = " #\tabc"
for length in range(6):
    for case in map("".join, product(chars, repeat=length)):
        assert is_line_junk_regex(case) == is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"

Then I ran the tests in Lib/test/test_difflib.py and everything passed 👍

@donBarbos (Contributor, Author) commented

I have made the requested changes; please review again

@bedevere-app (bot) commented Feb 16, 2025

Thanks for making the requested changes!

@terryjreedy, @AA-Turner: please review the changes made to this pull request.

@donBarbos requested a review from picnixz February 16, 2025 14:48
@picnixz (Member) left a comment

Since we're improving performance, we don't want to recompute line.strip()
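Presumably the shape being asked for is to strip once and reuse the result, something like:

def is_line_junk(line):
    stripped = line.strip()  # strip once, not twice
    return stripped == '' or stripped == '#'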

Co-authored-by: Bénédikt Tran <[email protected]>
@donBarbos (Contributor, Author) commented

> Since we're improving performance, we don't want to recompute line.strip()

Sorry, I suspected this but hoped that there was a cache mechanism here :)

@serhiy-storchaka (Member) left a comment

I am not sure that the performance of this function is worth changing the code. In any case, the relative performance depends on the input -- for a long string with a single space character at one end (e.g. 'x'*10000+' '), the non-regex version is more than 10 times slower than the regex version. For length 100000 the difference is more than 100x, and for length 1000000 -- more than 1000x.

Hmm, it seems that that implementation has more than quadratic complexity. Of course, most data for which this function is used is short strings, but such behavior can be considered a security threat.

@terryjreedy (Member) commented Feb 16, 2025

> I checked correctness of the return expression replacement, using test_cases above, with
>
>     for s in test_cases:
>         (pat(s) is not None) is (s.strip() in ('', '#'))
>
> and all are True.

I am going to be neutral on whether the basic change should be merged. If it is, I personally think the 'pat' parameter can be considered private and eliminated. But I would defer the back-compatibility question to others, including 'encukou'. The only documented use of the function, which is marked as a constant, is to be passed as an argument. The argument is always passed on to SequenceMatcher, which calls it once with 1 argument, which would be a line.
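For illustration, the documented usage pattern Terry describes -- IS_LINE_JUNK passed as the linejunk filter, e.g. to ndiff (the sample inputs here are made up):

import difflib

a = ['code\n', '   \n', 'more\n']
b = ['code\n', ' # \n', 'more\n']
for line in difflib.ndiff(a, b, linejunk=difflib.IS_LINE_JUNK):
    print(line, end='')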

@donBarbos (Contributor, Author) commented Feb 16, 2025

@serhiy-storchaka

> In any case, the relative performance depends on the input -- for a long string with a single space character at one end (e.g. 'x'*10000+' '), the non-regex version is more than 10 times slower than the regex version. For length 100000 the difference is more than 100x, and for length 1000000 -- more than 1000x.

I tested your cases but didn't get the result you expected (maybe I did something wrong?). I ran the following script (I accidentally mixed up the dividend and divisor, but it doesn't matter):

import re
import timeit

def is_line_junk_regex(line, pat=re.compile(r"\s*(?:#\s*)?$").match):
    return pat(line) is not None

def is_line_junk_no_regex(line):
    return line.strip() == "" or line.lstrip().rstrip() == "#"

def benchmark(func, cases, n=1_000):
    return timeit.timeit(lambda: [func(line) for line in cases], number=n)

lengths = [10_000, 100_000, 1_000_000]
for length in lengths:
    test_string = "x" * length + " "
    regex_time = benchmark(is_line_junk_regex, test_string)
    no_regex_time = benchmark(is_line_junk_no_regex, test_string)

    print(f"String length: {length}")
    print(f"Regex version: {regex_time:.6f} seconds")
    print(f"Non-regex version: {no_regex_time:.6f} seconds")
    print(f"Speedup: {no_regex_time / regex_time:.2f}x\n")

and got the following results:

$ ./python -B bench.py
String length: 10000
Regex version: 4.220341 seconds
Non-regex version: 1.983180 seconds
Speedup: 0.47x

String length: 100000
Regex version: 42.698992 seconds
Non-regex version: 20.897327 seconds
Speedup: 0.49x

String length: 1000000
Regex version: 516.844790 seconds
Non-regex version: 262.632697 seconds
Speedup: 0.51x

@picnixz (Member) commented Feb 16, 2025

Aren't you iterating over single characters instead of full lines:

[func(line) for line in cases]

You should pass [test_string] instead of just test_string (sorry if I'm wrong, I'm tired)
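That is, per picnixz's suggestion the calls would become:

regex_time = benchmark(is_line_junk_regex, [test_string])
no_regex_time = benchmark(is_line_junk_no_regex, [test_string])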

@AA-Turner (Member) commented

I was the one who originally requested restoring pat, but I do find Terry's argument compelling, and I wouldn't oppose removing pat.

I think we should restore this PR to the simpler previous version (the differences in micro-benchmarks between string comparisons are so small they will be swallowed by noise). The class-level re import should be resolved before merge, though.

A

Co-authored-by: Hugo van Kemenade <[email protected]>
@serhiy-storchaka (Member) left a comment

I was wrong; the complexity is not quadratic, but linear.

Anyway, the relative performance depends on the input. For some inputs the regexp version is faster; for others it is slower. Claiming that this improves speed for any input is bold.

@@ -0,0 +1,2 @@
Improve speed of :func:`difflib.IS_LINE_JUNK` by replacing :mod:`re` with string methods.
@serhiy-storchaka (Member) commented on the news entry:

Using re or built-in string methods is an implementation detail. This does not concern users and is not of interest to them.

@donBarbos (Contributor, Author) commented Feb 17, 2025

@serhiy-storchaka OK, so should I remove the NEWS.d entry or replace it with something else? I just took as a sample the news entries from PRs related to the issue on improving importtime.

@rhettinger requested a review from tim-one February 18, 2025 18:16
@rhettinger (Contributor) commented

@tim-one Do you have a reason to want to keep the regex?

@tim-one (Member) commented Feb 18, 2025

Can't say I much care, but I don't get it. What's the point? It's dubious on the face of it to build an entirely new string object just to do simple lexical analysis. If this is some "micro-optimization" thing, instead of

        stripped = line.strip()
        return stripped == '' or stripped == '#'

try

        return line.strip() in '#' # empty or single '#'

?
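For reference, the substring test works because '' is a substring of every string and '#' is a substring of '#', while any longer stripped result cannot be:

assert '' in '#'        # blank line: strip() gives ''
assert '#' in '#'       # lone '#' line
assert 'x' not in '#'   # real content is not junk
assert '##' not in '#'  # longer strings cannot be substrings of '#'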

@donBarbos (Contributor, Author) commented Feb 18, 2025

> Can't say I much care, but I don't get it. What's the point? It's dubious on the face of it to build an entirely new string object just to do simple lexical analysis. If this is some "micro-optimization" thing, instead of
>
>     stripped = line.strip()
>     return stripped == '' or stripped == '#'
>
> try
>
>     return line.strip() in '#' # empty or single '#'
>
> ?

That's how it was originally :)

I'll probably restore it so as not to overcomplicate things.

Co-authored-by: Tim Peters <[email protected]>
@donBarbos (Contributor, Author) commented

@rhettinger

> @tim-one Do you have a reason to want to keep the regex?

Tim doesn't seem to mind
