Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up the generation of no-member suggestions #10277

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

correctmost
Copy link
Contributor

Type of Changes

Type
🔨 Refactoring

Description

Previously, edit distances were calculated for strings that had length differences greater than the maximum suggestion threshold.

Related PR:

Stats

I profiled yt-dlp (commit bd0a6681) with the following .pylintrc file:

[MAIN]
jobs=1

[MESSAGES CONTROL]
disable=all
enable=no-member

[REPORTS]
reports=no
score=no

Before

Command Mean [s] Min [s] Max [s] Relative
pylint --recursive=y . 39.161 ± 0.175 38.929 39.548 1.00
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   103292    3.706    0.000    5.653    0.000 pylint/checkers/typecheck.py:153(_string_distance)
 19623350    1.938    0.000    1.938    0.000 {built-in method builtins.min}
      245    0.039    0.000    5.747    0.023 pylint/checkers/typecheck.py:171(_similar_names)

After

Command Mean [s] Min [s] Max [s] Relative
pylint --recursive=y . 37.009 ± 0.249 36.650 37.333 1.00
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    15159    0.460    0.000    0.704    0.000 pylint/checkers/typecheck.py:153(_string_distance)
  2429889    0.244    0.000    0.244    0.000 {built-in method builtins.min}
      245    0.035    0.000    0.807    0.003 pylint/checkers/typecheck.py:175(_similar_names)

Previously, edit distances were calculated for strings that had
length differences greater than the maximum suggestion threshold.
Copy link

codecov bot commented Mar 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.87%. Comparing base (512c8be) to head (c73bc96).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main   #10277   +/-   ##
=======================================
  Coverage   95.86%   95.87%           
=======================================
  Files         175      175           
  Lines       19072    19081    +9     
=======================================
+ Hits        18284    18293    +9     
  Misses        788      788           
Files with missing lines Coverage Δ
pylint/checkers/typecheck.py 96.67% <100.00%> (+0.03%) ⬆️
🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

This comment has been minimized.

Copy link
Contributor

🤖 According to the primer, this change has no effect on the checked open source code. 🤖🎉

This comment was generated for commit c73bc96

Copy link
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, a 5+% improvement ! That's nice.

It seems like a well known problems that we should not have to optimize ourselves though.

Would you mind comparing this with using difflib.SequenceMatcher ? I suggested something but it might not directly work. And I don't like adding dependencies but maybe using https://pypi.org/project/edlib/ would be worth it (it seems it's faster than difflib)..

@@ -182,11 +186,20 @@ def _similar_names(
possible_names: list[tuple[str, int]] = []
names = _node_names(owner)

attr_str = attrname or ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
attr_str = attrname or ""
return difflib.get_close_matches(attrname, possible_names, max_choices, distance_threshold)

@correctmost
Copy link
Contributor Author

Would you mind comparing this with using difflib.SequenceMatcher ?

difflib.get_close_matches is about 200ms faster than the PR code.

One drawback is that it may be hard to map existing missing-member-hint-distance values to cutoff ratios because of the different distance algorithms and because of floating point math. For example, I would expect "buff" not to match "buffer" in the first two cases:

import difflib

difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.75)
-> ['buffer']

difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.80)
-> ['buffer']

difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.80000001)
-> []

Here's a separate case where I would expect a match but don't see one:

import difflib

difflib.get_close_matches("x", ["z"], n=3, cutoff=0.00001)
-> []

Python's own attribute suggestion code does not seem to use difflib:
https://github.com/python/cpython/blob/a931a8b32415f311008dbb3f09079aae1e6d7a3d/Lib/traceback.py#L1538-L1545


Here's the patch I used:
diff --git a/pylint/checkers/typecheck.py b/pylint/checkers/typecheck.py
index edef5188b..ab280e38b 100644
--- a/pylint/checkers/typecheck.py
+++ b/pylint/checkers/typecheck.py
@@ -6,9 +6,8 @@
 
 from __future__ import annotations
 
-import heapq
+import difflib
 import itertools
-import operator
 import re
 import shlex
 import sys
@@ -170,7 +169,7 @@ def _string_distance(seq1: str, seq2: str) -> int:
 
 def _similar_names(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> list[str]:
@@ -179,31 +178,24 @@ def _similar_names(
     The similar names are searched given a distance metric and only
     a given number of choices will be returned.
     """
-    possible_names: list[tuple[str, int]] = []
-    names = _node_names(owner)
+    if not attrname:
+        return []
 
-    for name in names:
-        if name == attrname:
-            continue
+    names = _node_names(owner)
+    possible_names = [name for name in names if name != attrname]
 
-        distance = _string_distance(attrname or "", name)
-        if distance <= distance_threshold:
-            possible_names.append((name, distance))
+    ratio = distance_threshold / len(attrname)
+    if distance_threshold == 1 and ratio == 1:
+        ratio = 0.5
 
-    # Now get back the values with a minimum, up to the given
-    # limit or choices.
-    picked = [
-        name
-        for (name, _) in heapq.nsmallest(
-            max_choices, possible_names, key=operator.itemgetter(1)
-        )
-    ]
-    return sorted(picked)
+    threshold = 1 - ratio - 0.0001
+    choices = difflib.get_close_matches(attrname, possible_names, max_choices, threshold)
+    return sorted(choices)
 
 
 def _missing_member_hint(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> str:
@@ -1209,7 +1201,7 @@ accessed. Python regular expressions are accepted.",
             if self.linter.config.missing_member_hint:
                 hint = _missing_member_hint(
                     owner,
-                    node.attrname,
+                    node.attrname or "",
                     self.linter.config.missing_member_hint_distance,
                     self.linter.config.missing_member_max_choices,
                 )

@Pierre-Sassoulas
Copy link
Member

Neat.

It seems traceback use levenshtein distance while difflib use an algo that "does not yield minimal edit sequences, but does tend to yield matches that “look right” to people." (can't link the part of the doc specifically; https://docs.python.org/3/library/difflib.html)

Which would make https://pypi.org/project/edlib/ a better candidate as it's an optimized levenshtein distance ? Or we can copy paste the traceback code and not deal with a dependency if there isn't a big difference in performance ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants