Fuzzy matching for identifiers #14370
Since this is all about generating better error diagnostics, my feeling is that we do want to do this.
When we are already about to abort the program with an error, performance may not be a big deal. We're already about to complete a lot faster than the user was anticipating, after all. :) However I do think that we can probably make it faster in the event that no error message is generated. What do you think about importing difflib only once an error branch is taken?
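A minimal sketch of the deferred import suggested above, assuming a hypothetical `check_kwarg` helper and a made-up `KNOWN_KWARGS` list (neither is taken from the meson code base). The point is only that `difflib` is imported inside the error branch, so a successful lookup never pays for it:

```python
import typing as T

# Hypothetical list of accepted keyword arguments, for illustration only.
KNOWN_KWARGS = ['sources', 'dependencies', 'include_directories', 'install']


def check_kwarg(name: str, known: T.Sequence[str]) -> None:
    """Raise ValueError with a suggestion if `name` is not a known keyword."""
    if name in known:
        return  # Fast path: difflib is never imported here.

    # Error branch: we are about to abort anyway, so the import cost is fine.
    import difflib

    matches = difflib.get_close_matches(name, known, n=1)
    msg = f'Got unknown keyword argument "{name}"'
    if matches:
        msg += f', did you mean "{matches[0]}"?'
    raise ValueError(msg)
```

With this shape, `check_kwarg('sources', KNOWN_KWARGS)` returns without touching `difflib`, while `check_kwarg('srcs', KNOWN_KWARGS)` raises an error that suggests `sources`.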
def _get_meson_modules(module_path: Path) -> T.List[str]:
    return [mod.name for mod in pkgutil.iter_modules(module_path) if not mod.name.startswith('_')]
This function is only ever used in exactly one place, once. I would inline it (and import pkgutil at the inline call site in the error branch).
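One way the inlined version could look, as a sketch only: `unknown_module_message` and the temporary module directory in `demo` are hypothetical names used for illustration, and the real call site in meson may be structured differently. Both `pkgutil` and `difflib` are imported inside the function that builds the error message:

```python
import tempfile
from pathlib import Path


def unknown_module_message(name: str, module_dir: Path) -> str:
    """Build an error message for a module name that failed to resolve."""
    # Imported at the call site so the non-error path never loads them.
    import difflib
    import pkgutil

    # pkgutil.iter_modules takes an iterable of path strings, hence the list.
    available = [mod.name for mod in pkgutil.iter_modules([str(module_dir)])
                 if not mod.name.startswith('_')]
    matches = difflib.get_close_matches(name, available, n=1)
    msg = f'Module "{name}" does not exist'
    if matches:
        msg += f', did you mean "{matches[0]}"?'
    return msg


def demo() -> str:
    # Hypothetical module directory populated with empty files.
    with tempfile.TemporaryDirectory() as tmp:
        for fname in ('python.py', 'pkgconfig.py', '_private.py'):
            Path(tmp, fname).touch()
        return unknown_module_message('pyhton', Path(tmp))
```

Modules whose names start with an underscore are filtered out, matching the list comprehension in the diff above.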
Thanks for the feedback. I'll implement your suggestions ASAP. In the interim I realized that I missed a case: a misspelled argument to
Regarding perf, I was worried as the algorithm difflib uses isn't great. All of the examples that I could find of others doing this kind of thing (clang, gcc, git, cpython) use Levenshtein or some variation thereof, but there isn't an implementation of that in Python's stdlib that I am aware of. I'm happy enough to write one for meson, but only if the maintainers think it's necessary. Incidentally, is there a project that meson devs use as a benchmark? The largest I can think of is maybe qemu or mesa.
While reading #14383, I had a thought that once the command line syntax without explicit
I'd prefer to just use the stdlib for this. Although I wonder if given that CPython already has an impl, maybe they would consider exposing it and/or speeding up difflib with it?
Those are good examples of large projects; so is GStreamer (as a single mega-project, which is how it's available in Git).
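For comparison with the discussion above, here is a minimal sketch of what a hand-written Levenshtein matcher might look like. It is not part of this PR and not taken from clang, gcc, git, or CPython; the function names and the `max_distance=2` threshold are assumptions chosen for illustration:

```python
import typing as T


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # Keep the inner row as short as possible.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def closest(word: str, candidates: T.Iterable[str],
            max_distance: int = 2) -> T.Optional[str]:
    """Return the nearest candidate, or None if it is too far away."""
    best = min(candidates, key=lambda c: levenshtein(word, c), default=None)
    if best is not None and levenshtein(word, best) <= max_distance:
        return best
    return None
```

With a fixed threshold like this, `licence` is one edit away from `license` and is matched, but `srcs` is three insertions away from `sources` and is rejected, so the threshold would need tuning (for example, scaling it with word length) before it behaved comparably to a ratio-based cutoff.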
In the case of an unknown keyword argument passed to a function or method, attempt to find a closest match to present to the user in the error message.
If the user misspells a meson function then suggest the closest match in the error message.
If the user misspells an object method name then suggest the closest match in the error message.
As in e0f892c, if the user misspells a module method name then suggest the closest match in the error message.
If the user misspells a variable name then suggest the closest match. Try to catch as many ways a user could reference a variable:
1. As a regular variable
2. As an argument to get_variable/unset_variable
3. As an fstring identifier
4. As a variable in a subproject
If the user misspells a module name then suggest the closest match, using pkgutil to enumerate the available modules.
If the user misspells a subproject name then suggest the closest match.
I frequently make mistakes like this:
note the misspelling of licence, or even:
note the use of srcs instead of sources. I imagine that it's not just me that makes these kinds of silly mistakes. Currently in such cases, meson will raise an error that the misspelled keyword, or variable, or method, etc. does not exist.
Other tools with a similar remit (of parsing structured text with known identifiers), such as gcc, have adopted a fuzzy matching approach in order to suggest potential corrections inline with the error. So instead of
we get
This PR implements this for meson. It uses difflib.get_close_matches, whose underlying algorithm is Ratcliff/Obershelp, as its matching engine. Currently the following types of identifier lookups support this fuzzy matching:
- keyword arguments to functions and methods
- function names
- object method names
- module method names
- module names
- subproject names
- variable names (regular variables, arguments to get_variable/unset_variable, fstring identifiers, subproject variables)
So, a couple of questions: first and foremost, is this a good idea? Personally I think it is, but as I have already noted, I frequently make mistakes ;) So if the meson maintainers and community don't agree, I understand.
Secondly, if it's a good idea, then with regards to the current implementation there are some debatable things:
- difflib is probably not the most performant thing to use here. I used it as a starting point to get a proof of concept up. Profiling needs to be done.
- The available modules are enumerated using pkgutil, which I am not sure is a good idea.
Anyway, that's enough for now. Let me know what you think.
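As a rough starting point for the profiling mentioned above, the sketch below exposes the similarity score that difflib.get_close_matches relies on (SequenceMatcher.ratio, with a default cutoff of 0.6) and times a lookup against a synthetic candidate list. The helper names, the generated `variable_N` identifiers, and the candidate counts are all assumptions for illustration, not measurements from a real project:

```python
import difflib
import timeit


def similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp-style score in [0, 1]: 2 * matches / total length."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def time_lookup(n_candidates: int, repeat: int = 5) -> float:
    """Best-of-`repeat` wall time (seconds) for one fuzzy lookup."""
    # Synthetic identifiers; a real benchmark would use names from a project.
    candidates = [f'variable_{i}' for i in range(n_candidates)]
    timer = timeit.Timer(
        lambda: difflib.get_close_matches('varaible_42', candidates))
    return min(timer.repeat(repeat=repeat, number=1))


if __name__ == '__main__':
    for pair in (('licence', 'license'), ('srcs', 'sources')):
        print(pair, round(similarity(*pair), 3))
    for n in (100, 1000):
        print(n, 'candidates:', time_lookup(n), 'seconds')
```

Both misspellings from the description score above the 0.6 cutoff (roughly 0.86 for licence/license and 0.73 for srcs/sources), which is why get_close_matches suggests them with its default settings. The timing numbers depend on the machine and should be treated only as a way to compare candidate-list sizes.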