Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzy matching for identifiers #14370

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

amcn
Copy link
Contributor

@amcn amcn commented Mar 14, 2025

I frequently make mistakes like this:

project('myproject', licence: 'GPL')

note the misspelling of licence, or even:

myexe = executable('myexe',
  srcs: myexe_srcs
)

note the use of srcs instead of sources. I imagine that it's not just me that makes these kinds of silly mistakes. Currently in such cases, meson will raise an error that the misspelled keyword, or variable, or method, etc. does not exist.

Other tools with a similar remit (of parsing structured text with known identifiers), such as gcc, have adopted a fuzzy matching approach in order to suggest potential corrections inline with the error. So instead of

meson.build:1:0: ERROR: project got unknown keyword arguments "licence"

we get

meson.build:1:0: ERROR: project got unknown keyword arguments "licence" (did you mean "license"?)

This PR implements this for meson. It uses difflib.get_close_matches whose underlying algorithm is Ratcliff/Obershelp as its matching engine. Currently the following types of identifier lookups support this fuzzy matching:

  1. Keyword arguments
  2. Variables (regular variables, arguments to get_variable/unset_variable, fstring identifiers, subproject variables)
  3. Function and method names
  4. Module names and module methods

So, a couple of questions: first and foremost, is this a good idea? Personally I think it is, but as I have already noted, I frequently make mistakes ;) So if the meson maintainers and community don't agree I understand.

Secondly, if it's a good idea, then with regards to the current implementation there are some debatable things:

  1. difflib is probably not the most performant thing to use here. I used it as a starting point to get a proof of concept up. Profiling needs to be done.
  2. There are almost certainly cases of identifier lookups that I have missed.
  3. The module lookup logic uses pkgutils which I am not sure is a good idea.
  4. Commit history and tests could probably be better.

Anyway, that's enough for now. Let me know what you think.

@amcn amcn requested a review from jpakkane as a code owner March 14, 2025 23:26
@amcn amcn force-pushed the amcn/did-you-mean branch from 9fed9cc to c835037 Compare March 15, 2025 11:54
@eli-schwartz
Copy link
Member

So, a couple of questions: first and foremost, is this a good idea? Personally I think it is, but as I have already noted, I frequently make mistakes ;) So if the meson maintainers and community don't agree I understand.

Since this is all about generating better error diagnostics, my feeling is that we do want to do this.

difflib is probably not the most performant thing to use here. I used it as a starting point to get a proof of concept up. Profiling needs to be done.

When we are already about to abort the program with an error, performance may not be a big deal. We're already about to complete a lot faster than the user was anticipating, after all. :)

However I do think that we can probably make it faster in the event that no error message is generated. What do you think about importing difflib only once an error branch is taken?

  • usually we don't need to import difflib as we won't use it
  • although importing at the time of use has a performance penalty in that each time you re-import the same module you perform duplicative work, if you import immediately before raising an error then you already know you won't ever re-import it, which is a bit of a sneaky performance trick

Comment on lines 145 to 147
def _get_meson_modules(module_path: Path) -> T.List[str]:
return [mod.name for mod in pkgutil.iter_modules(module_path) if not mod.name.startswith('_')]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is only ever used in exactly one place, once. I would inline it (and import pkgutil at the inline call site in the error branch).

@amcn
Copy link
Contributor Author

amcn commented Mar 24, 2025

Thanks for the feedback. I'll implement your suggestions asap. In the interim I realized that I missed a case: a misspelled argument to subproject. I'll add that too.

Regarding perf, I was worried as the algorithm difflib uses isn't great. All of the examples that I could find of others doing this kind of thing(clang, gcc, git, cpython) use levenshtein or some variation thereof, but there isn't an implementation of that in Python's stdlib that I am aware of. I'm happy enough to write one for meson, but only if the maintainers think it's necessary. Incidentally, is there a project that meson devs use as a benchmark? The largest I can think of is maybe qemu or mesa.

While reading #14383, I had a thought that once the command line syntax without explicit setup is fully retired, this approach can be applied to the command line arguments as well.

@eli-schwartz
Copy link
Member

eli-schwartz commented Mar 24, 2025

but there isn't an implementation of that in Python's stdlib that I am aware of. I'm happy enough to write one for meson, but only if the maintainers think it's necessary.

I'd prefer to just use the stdlib for this. Although I wonder if given that CPython already has an impl, maybe they would consider exposing it and/or speeding up difflib with it?

Incidentally, is there a project that meson devs use as a benchmark? The largest I can think of is maybe qemu or mesa.

Those are good examples of large projects, so is GStreamer (as a single mega-project, which is how it's available in Git).

amcn added 5 commits March 25, 2025 23:16
In the case of an unknown keyword argument passed to a function or
method, attempt to find a closest match to present to the user in the
error message.
If the user misspells a meson function then suggest the closest
match in the error message.
If the user misspells an object method name then suggest the closest
match in the error message.
As in e0f892c, if the user misspells a module method name then suggest
the closest match in the error message.
If the user misspells a variable name then suggest the
closest match. Try to catch as many ways a user could reference a
variable:

1. As a regular variable
2. As an argument to get_variable/unset_variable
3. As an fstring identifier
4. As a variable in a subproject
@amcn amcn force-pushed the amcn/did-you-mean branch from c835037 to 94e031c Compare March 25, 2025 22:47
amcn added 2 commits March 25, 2025 23:51
If the user misspells a module name then suggest the closest
match, using pkgutil to enumerate the available modules
If the user misspells a subproject name then suggest the closest
match.
@amcn amcn force-pushed the amcn/did-you-mean branch from 94e031c to c904c00 Compare March 25, 2025 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants