Skip to content

Fuzzy matching for identifiers #14370

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

amcn
Copy link
Contributor

@amcn amcn commented Mar 14, 2025

I frequently make mistakes like this:

project('myproject', licence: 'GPL')

note the misspelling of licence, or even:

myexe = executable('myexe',
  srcs: myexe_srcs
)

note the use of srcs instead of sources. I imagine that it's not just me that makes these kinds of silly mistakes. Currently in such cases, meson will raise an error that the misspelled keyword, or variable, or method, etc. does not exist.

Other tools with a similar remit (of parsing structured text with known identifiers), such as gcc, have adopted a fuzzy matching approach in order to suggest potential corrections inline with the error. So instead of

meson.build:1:0: ERROR: project got unknown keyword arguments "licence"

we get

meson.build:1:0: ERROR: project got unknown keyword arguments "licence" (did you mean "license"?)

This PR implements this for meson. It uses difflib.get_close_matches whose underlying algorithm is Ratcliff/Obershelp as its matching engine. Currently the following types of identifier lookups support this fuzzy matching:

  1. Keyword arguments
  2. Variables (regular variables, arguments to get_variable/unset_variable, fstring identifiers, subproject variables)
  3. Function and method names
  4. Module names and module methods

So, a couple of questions: first and foremost, is this a good idea? Personally I think it is, but as I have already noted, I frequently make mistakes ;) So if the meson maintainers and community don't agree I understand.

Secondly, if it's a good idea, then with regards to the current implementation there are some debatable things:

  1. difflib is probably not the most performant thing to use here. I used it as a starting point to get a proof of concept up. Profiling needs to be done.
  2. There are almost certainly cases of identifier lookups that I have missed.
  3. The module lookup logic uses pkgutils which I am not sure is a good idea.
  4. Commit history and tests could probably be better.

Anyway, that's enough for now. Let me know what you think.

@amcn amcn requested a review from jpakkane as a code owner March 14, 2025 23:26
@amcn amcn force-pushed the amcn/did-you-mean branch from 9fed9cc to c835037 Compare March 15, 2025 11:54
@eli-schwartz
Copy link
Member

So, a couple of questions: first and foremost, is this a good idea? Personally I think it is, but as I have already noted, I frequently make mistakes ;) So if the meson maintainers and community don't agree I understand.

Since this is all about generating better error diagnostics, my feeling is that we do want to do this.

difflib is probably not the most performant thing to use here. I used it as a starting point to get a proof of concept up. Profiling needs to be done.

When we are already about to abort the program with an error, performance may not be a big deal. We're already about to complete a lot faster than the user was anticipating, after all. :)

However I do think that we can probably make it faster in the event that no error message is generated. What do you think about importing difflib only once an error branch is taken?

  • usually we don't need to import difflib as we won't use it
  • although importing at the time of use has a performance penalty in that each time you re-import the same module you perform duplicative work, if you import immediately before raising an error then you already know you won't ever re-import it, which is a bit of a sneaky performance trick

@amcn
Copy link
Contributor Author

amcn commented Mar 24, 2025

Thanks for the feedback. I'll implement your suggestions asap. In the interim I realized that I missed a case: a misspelled argument to subproject. I'll add that too.

Regarding perf, I was worried as the algorithm difflib uses isn't great. All of the examples that I could find of others doing this kind of thing(clang, gcc, git, cpython) use levenshtein or some variation thereof, but there isn't an implementation of that in Python's stdlib that I am aware of. I'm happy enough to write one for meson, but only if the maintainers think it's necessary. Incidentally, is there a project that meson devs use as a benchmark? The largest I can think of is maybe qemu or mesa.

While reading #14383, I had a thought that once the command line syntax without explicit setup is fully retired, this approach can be applied to the command line arguments as well.

@eli-schwartz
Copy link
Member

eli-schwartz commented Mar 24, 2025

but there isn't an implementation of that in Python's stdlib that I am aware of. I'm happy enough to write one for meson, but only if the maintainers think it's necessary.

I'd prefer to just use the stdlib for this. Although I wonder if given that CPython already has an impl, maybe they would consider exposing it and/or speeding up difflib with it?

Incidentally, is there a project that meson devs use as a benchmark? The largest I can think of is maybe qemu or mesa.

Those are good examples of large projects, so is GStreamer (as a single mega-project, which is how it's available in Git).

amcn added 5 commits March 25, 2025 23:16
In the case of an unknown keyword argument passed to a function or
method, attempt to find a closest match to present to the user in the
error message.
If the user misspells a meson function then suggest the closest
match in the error message.
If the user misspells an object method name then suggest the closest
match in the error message.
As in e0f892c, if the user misspells a module method name then suggest
the closest match in the error message.
If the user misspells a variable name then suggest the
closest match. Try to catch as many ways a user could reference a
variable:

1. As a regular variable
2. As an argument to get_variable/unset_variable
3. As an fstring identifier
4. As a variable in a subproject
@amcn amcn force-pushed the amcn/did-you-mean branch from c835037 to 94e031c Compare March 25, 2025 22:47
amcn added 2 commits March 25, 2025 23:51
If the user misspells a module name then suggest the closest
match, using pkgutil to enumerate the available modules
If the user misspells a subproject name then suggest the closest
match.
@amcn amcn force-pushed the amcn/did-you-mean branch from 94e031c to c904c00 Compare March 25, 2025 22:52
@amcn
Copy link
Contributor Author

amcn commented Mar 26, 2025

I've updated the PR to take into account feedback from @eli-schwartz: all imports necessary for matching are localized to their error paths. This does lead to some extra code duplication.

I've also tried manually inserting errors in the build definitions of mesa and gstreamer. Performance of this matching is in the noise. I'll add the caveat that it certainly helps that both of these projects use subprojects extensively since the input set of possible matches in the case of variables is limited to those local to a given project.

I'd prefer to just use the stdlib for this. Although I wonder if given that CPython already has an impl, maybe they would consider exposing it and/or speeding up difflib with it?

Yeah, it might be worth exploring if CPython's implementation could be exposed for this, either via difflib or otherwise. In any case, I believe it's a recent addition to CPython and meson supports older interpreters.

I think these changes are probably small enough that they don't need a release note but I'll add one if you think it's necessary.

Cheers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants