Speed up the generation of no-member suggestions #10277
base: main
Conversation
Previously, edit distances were calculated even for strings whose length difference exceeded the maximum suggestion threshold.
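A minimal sketch of the idea behind this change (the helper name below is illustrative, not the PR's actual code): the Levenshtein distance between two strings is at least the difference of their lengths, so any candidate whose length differs from the attribute name by more than the threshold can be rejected before computing the distance at all.

def worth_computing(attrname: str, candidate: str, distance_threshold: int) -> bool:
    # Levenshtein(attrname, candidate) >= abs(length difference), so a pair
    # failing this cheap check can never be within the threshold.
    return abs(len(candidate) - len(attrname)) <= distance_threshold

print(worth_computing("buf", "buffer", distance_threshold=2))    # False -> skip the distance call
print(worth_computing("bufer", "buffer", distance_threshold=2))  # True -> compute the distance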
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##            main   #10277   +/-   ##
=======================================
  Coverage   95.86%   95.87%
=======================================
  Files         175      175
  Lines       19072    19081      +9
=======================================
+ Hits        18284    18293      +9
  Misses        788      788
🤖 According to the primer, this change has no effect on the checked open source code. 🤖🎉
This comment was generated for commit c73bc96
Wow, a 5+% improvement! That's nice.
It seems like a well-known problem that we should not have to optimize ourselves, though.
Would you mind comparing this with using difflib.SequenceMatcher? I suggested something below, but it might not directly work. And I don't like adding dependencies, but maybe using https://pypi.org/project/edlib/ would be worth it (it seems it's faster than difflib).
@@ -182,11 +186,20 @@ def _similar_names(
     possible_names: list[tuple[str, int]] = []
     names = _node_names(owner)

+    attr_str = attrname or ""
Suggested change:
-    attr_str = attrname or ""
+    return difflib.get_close_matches(attrname, possible_names, max_choices, distance_threshold)
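For reference, a minimal illustration of the difflib API suggested above (illustrative only, not code from the PR): SequenceMatcher scores similarity as 2*M/T, where M is the number of matching characters and T is the combined length of both strings, and get_close_matches keeps only candidates whose ratio is at least the cutoff.

import difflib

# "buff" vs "buffer": 4 matching characters out of 10 total -> ratio 0.8.
print(difflib.SequenceMatcher(None, "buff", "buffer").ratio())  # 0.8

# get_close_matches wraps SequenceMatcher and filters candidates by the cutoff ratio.
print(difflib.get_close_matches("bufer", ["buffer", "flush", "close"], n=3, cutoff=0.6))
# ['buffer']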
One drawback is that it may be hard to map the existing distance threshold onto difflib's cutoff:

import difflib

difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.75)
-> ['buffer']
difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.80)
-> ['buffer']
difflib.get_close_matches("buff", ["buffer"], n=3, cutoff=0.80000001)
-> []

Here's a separate case where I would expect a match but don't see one:

import difflib

difflib.get_close_matches("x", ["z"], n=3, cutoff=0.00001)
-> []

Python's own attribute suggestion code does not seem to use difflib.

Here's the patch I used:

diff --git a/pylint/checkers/typecheck.py b/pylint/checkers/typecheck.py
index edef5188b..ab280e38b 100644
--- a/pylint/checkers/typecheck.py
+++ b/pylint/checkers/typecheck.py
@@ -6,9 +6,8 @@
from __future__ import annotations
-import heapq
+import difflib
import itertools
-import operator
import re
import shlex
import sys
@@ -170,7 +169,7 @@ def _string_distance(seq1: str, seq2: str) -> int:
 def _similar_names(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> list[str]:
@@ -179,31 +178,24 @@ def _similar_names(
     The similar names are searched given a distance metric and only
     a given number of choices will be returned.
     """
-    possible_names: list[tuple[str, int]] = []
-    names = _node_names(owner)
+    if not attrname:
+        return []
-    for name in names:
-        if name == attrname:
-            continue
+    names = _node_names(owner)
+    possible_names = [name for name in names if name != attrname]
-        distance = _string_distance(attrname or "", name)
-        if distance <= distance_threshold:
-            possible_names.append((name, distance))
+    ratio = distance_threshold / len(attrname)
+    if distance_threshold == 1 and ratio == 1:
+        ratio = 0.5
-    # Now get back the values with a minimum, up to the given
-    # limit or choices.
-    picked = [
-        name
-        for (name, _) in heapq.nsmallest(
-            max_choices, possible_names, key=operator.itemgetter(1)
-        )
-    ]
-    return sorted(picked)
+    threshold = 1 - ratio - 0.0001
+    choices = difflib.get_close_matches(attrname, possible_names, max_choices, threshold)
+    return sorted(choices)
 def _missing_member_hint(
     owner: SuccessfulInferenceResult,
-    attrname: str | None,
+    attrname: str,
     distance_threshold: int,
     max_choices: int,
 ) -> str:
@@ -1209,7 +1201,7 @@ accessed. Python regular expressions are accepted.",
         if self.linter.config.missing_member_hint:
             hint = _missing_member_hint(
                 owner,
-                node.attrname,
+                node.attrname or "",
                 self.linter.config.missing_member_hint_distance,
                 self.linter.config.missing_member_max_choices,
             )
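For context on the two surprising results shown earlier in this thread (a property of difflib, not something stated above): get_close_matches only keeps a candidate whose SequenceMatcher ratio (2*M/T) is at least the cutoff. "buff" vs "buffer" scores exactly 0.8, so any cutoff strictly above 0.8, such as 0.80000001, rejects it; and "x" vs "z" share no characters at all, so their ratio is 0.0, which fails every strictly positive cutoff, even 0.00001.

import difflib

print(difflib.SequenceMatcher(None, "buff", "buffer").ratio())  # 0.8 -> rejected by cutoff 0.80000001
print(difflib.SequenceMatcher(None, "x", "z").ratio())          # 0.0 -> rejected by any cutoff > 0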
Neat. It seems traceback uses Levenshtein distance, while difflib uses an algorithm that "does not yield minimal edit sequences, but does tend to yield matches that “look right” to people" (can't link to that part of the doc specifically; https://docs.python.org/3/library/difflib.html). Which would make https://pypi.org/project/edlib/ a better candidate, as it's an optimized Levenshtein distance? Or we could copy-paste the traceback code and not deal with a dependency, if there isn't a big difference in performance?
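If the thread ends up copying the stdlib approach instead of adding a dependency, this is roughly what a capped Levenshtein distance can look like (an illustrative sketch in the spirit of CPython's suggestion code, not a copy of it):

def capped_distance(a: str, b: str, max_cost: int) -> int:
    # Give up immediately when the length difference alone exceeds the cap.
    if abs(len(a) - len(b)) > max_cost:
        return max_cost + 1
    # Standard dynamic-programming Levenshtein, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        # Row minima never decrease, so once a whole row is above the cap
        # the final distance must be above it too.
        if min(cur) > max_cost:
            return max_cost + 1
        prev = cur
    return prev[-1]

print(capped_distance("buff", "buffer", max_cost=2))  # 2
print(capped_distance("x", "zzzzzz", max_cost=2))     # 3, i.e. "further than the cap"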
Type of Changes

Description

Previously, edit distances were calculated even for strings whose length difference exceeded the maximum suggestion threshold.

Related PR: #9962 (no-member / c-extension-no-member)

Stats

I profiled yt-dlp (commit bd0a6681) with the following .pylintrc file:

Before

pylint --recursive=y .

After

pylint --recursive=y .
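One possible way to reproduce this kind of before/after measurement (a hypothetical sketch, not taken from the PR):

import subprocess
import time

# Time a full lint run of the target repository; run once on the old code
# and once on the patched code to compare wall-clock time.
start = time.perf_counter()
subprocess.run(["pylint", "--recursive=y", "."], check=False)
print(f"elapsed: {time.perf_counter() - start:.1f}s")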