Fuzzy matching for identifiers #14370

amcn · 2025-03-14T23:26:17Z

I frequently make mistakes like this:

project('myproject', licence: 'GPL')

note the misspelling of licence, or even:

myexe = executable('myexe',
  srcs: myexe_srcs
)

note the use of srcs instead of sources. I imagine that it's not just me that makes these kinds of silly mistakes. Currently in such cases, meson will raise an error that the misspelled keyword, or variable, or method, etc. does not exist.

Other tools with a similar remit (of parsing structured text with known identifiers), such as gcc, have adopted a fuzzy matching approach in order to suggest potential corrections inline with the error. So instead of

meson.build:1:0: ERROR: project got unknown keyword arguments "licence"

we get

meson.build:1:0: ERROR: project got unknown keyword arguments "licence" (did you mean "license"?)

This PR implements this for meson. It uses difflib.get_close_matches, whose underlying algorithm is Ratcliff/Obershelp, as its matching engine. Currently the following types of identifier lookups support this fuzzy matching:

Keyword arguments
Variables (regular variables, arguments to get_variable/unset_variable, fstring identifiers, subproject variables)
Function and method names
Module names and module methods
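
For readers unfamiliar with difflib, here is a minimal sketch of how get_close_matches could produce the "did you mean" suffix shown above. The helper names (suggest, unknown_kwarg_message) and the keyword list are assumptions made for illustration; they are not the code in this PR, and the list is not meson's real keyword table.

```python
import difflib


def suggest(word, candidates, cutoff=0.6):
    """Return the closest candidate to `word`, or None if nothing is close enough."""
    # n=1 keeps only the best match; cutoff=0.6 is difflib's default threshold.
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def unknown_kwarg_message(func_name, kwarg, known_kwargs):
    """Build an error message in the style shown above (hypothetical helper)."""
    msg = f'{func_name} got unknown keyword arguments "{kwarg}"'
    match = suggest(kwarg, known_kwargs)
    if match is not None:
        msg += f' (did you mean "{match}"?)'
    return msg


# Illustrative subset of keyword names, not meson's real tables.
PROJECT_KWARGS = ["version", "license", "meson_version", "default_options"]
print(unknown_kwarg_message("project", "licence", PROJECT_KWARGS))
```

When no candidate clears the cutoff, the message is left unchanged, so an unrelated typo does not produce a misleading suggestion.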

So, a couple of questions: first and foremost, is this a good idea? Personally I think it is, but as I have already noted, I frequently make mistakes ;) So if the meson maintainers and community don't agree I understand.

Secondly, if it's a good idea, then with regard to the current implementation there are some debatable things:

difflib is probably not the most performant thing to use here. I used it as a starting point to get a proof of concept up. Profiling needs to be done.
There are almost certainly cases of identifier lookups that I have missed.
The module lookup logic uses pkgutil, which I am not sure is a good idea.
Commit history and tests could probably be better.
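
To make the pkgutil point in the list above concrete, this is a rough sketch of enumerating a package's submodules and matching a misspelled name against them. It uses the stdlib json package as a stand-in so that it runs without meson installed; the function names are hypothetical and the PR's actual lookup logic may differ.

```python
import difflib
import importlib
import pkgutil


def available_submodules(package_name):
    """List the names of the direct submodules of an importable package."""
    package = importlib.import_module(package_name)
    # iter_modules walks the package's search path without importing submodules.
    return [info.name for info in pkgutil.iter_modules(package.__path__)]


def suggest_module(name, package_name):
    """Return the closest submodule name, or None if nothing is close enough."""
    candidates = available_submodules(package_name)
    matches = difflib.get_close_matches(name, candidates, n=1)
    return matches[0] if matches else None


# "json" is only a stand-in package for this example.
print(suggest_module("decodr", "json"))
```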

Anyway, that's enough for now. Let me know what you think.

eli-schwartz · 2025-03-24T07:08:43Z

> So, a couple of questions: first and foremost, is this a good idea? Personally I think it is, but as I have already noted, I frequently make mistakes ;) So if the meson maintainers and community don't agree I understand.

Since this is all about generating better error diagnostics, my feeling is that we do want to do this.

> difflib is probably not the most performant thing to use here. I used it as a starting point to get a proof of concept up. Profiling needs to be done.

When we are already about to abort the program with an error, performance may not be a big deal. We're already about to complete a lot faster than the user was anticipating, after all. :)

However I do think that we can probably make it faster in the event that no error message is generated. What do you think about importing difflib only once an error branch is taken?

Usually we don't need to import difflib, as we won't use it.
Importing at the time of use normally has a performance penalty, because each time you re-import the same module you perform duplicative work. But if you import immediately before raising an error, then you already know you won't ever re-import it, which is a bit of a sneaky performance trick.
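
A small sketch of the deferred-import idea, under the assumption that the check lives in a helper like the hypothetical check_kwargs below. InvalidArguments here is a local stand-in exception class so the example is self-contained; it is not the reviewed code from interpreter.py.

```python
class InvalidArguments(Exception):
    """Stand-in for the interpreter's error type (defined here for the example)."""


def check_kwargs(func_name, kwargs, known_kwargs):
    """Raise InvalidArguments for the first unknown keyword argument."""
    for name in kwargs:
        if name in known_kwargs:
            continue
        # Error path only: difflib is imported here, so a successful run
        # never pays for loading it, and this line runs at most once
        # because we raise immediately afterwards.
        import difflib
        msg = f'{func_name} got unknown keyword arguments "{name}"'
        matches = difflib.get_close_matches(name, known_kwargs, n=1)
        if matches:
            msg += f' (did you mean "{matches[0]}"?)'
        raise InvalidArguments(msg)


# Illustrative keyword list, not meson's real one.
EXECUTABLE_KWARGS = ["sources", "dependencies", "install"]
check_kwargs("executable", {"sources": []}, EXECUTABLE_KWARGS)  # passes silently
```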

mesonbuild/interpreter/interpreter.py

amcn · 2025-03-24T12:27:31Z

Thanks for the feedback. I'll implement your suggestions asap. In the interim I realized that I missed a case: a misspelled argument to subproject. I'll add that too.

Regarding perf, I was worried as the algorithm difflib uses isn't great. All of the examples that I could find of others doing this kind of thing (clang, gcc, git, cpython) use Levenshtein or some variation thereof, but there isn't an implementation of that in Python's stdlib that I am aware of. I'm happy enough to write one for meson, but only if the maintainers think it's necessary. Incidentally, is there a project that meson devs use as a benchmark? The largest I can think of is maybe qemu or mesa.
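
For reference, a pure-Python Levenshtein distance is short. The sketch below is a textbook two-row dynamic programming version, not something from this PR, and the max_distance threshold in closest is an arbitrary assumption for illustration.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def closest(word, candidates, max_distance=2):
    """Return the candidate with the smallest distance, if within max_distance."""
    best = min(candidates, key=lambda c: levenshtein(word, c), default=None)
    if best is not None and levenshtein(word, best) <= max_distance:
        return best
    return None


print(levenshtein("licence", "license"))
```

One design point worth noting: "srcs" is three edits away from "sources", so a fixed threshold of 2 would not suggest it, whereas the ratio-based difflib match does. Any switch of algorithm would need its threshold tuned against cases like that.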

While reading #14383, I had a thought that once the command line syntax without explicit setup is fully retired, this approach can be applied to the command line arguments as well.

eli-schwartz · 2025-03-24T13:20:00Z

> but there isn't an implementation of that in Python's stdlib that I am aware of. I'm happy enough to write one for meson, but only if the maintainers think it's necessary.

I'd prefer to just use the stdlib for this. Although I wonder if given that CPython already has an impl, maybe they would consider exposing it and/or speeding up difflib with it?

> Incidentally, is there a project that meson devs use as a benchmark? The largest I can think of is maybe qemu or mesa.

Those are good examples of large projects, so is GStreamer (as a single mega-project, which is how it's available in Git).

If the user misspells a variable name then suggest the closest match. Try to catch as many ways a user could reference a variable:

1. As a regular variable
2. As an argument to get_variable/unset_variable
3. As an fstring identifier
4. As a variable in a subproject


amcn · 2025-03-26T12:57:29Z

I've updated the PR to take into account feedback from @eli-schwartz: all imports necessary for matching are localized to their error paths. This does lead to some extra code duplication.

I've also tried manually inserting errors in the build definitions of mesa and gstreamer. Performance of this matching is in the noise. I'll add the caveat that it certainly helps that both of these projects use subprojects extensively since the input set of possible matches in the case of variables is limited to those local to a given project.

> I'd prefer to just use the stdlib for this. Although I wonder if given that CPython already has an impl, maybe they would consider exposing it and/or speeding up difflib with it?

Yeah, it might be worth exploring if CPython's implementation could be exposed for this, either via difflib or otherwise. In any case, I believe it's a recent addition to CPython and meson supports older interpreters.

I think these changes are probably small enough that they don't need a release note but I'll add one if you think it's necessary.

Cheers.

amcn requested a review from jpakkane as a code owner March 14, 2025 23:26

amcn force-pushed the amcn/did-you-mean branch from 9fed9cc to c835037 Compare March 15, 2025 11:54

eli-schwartz reviewed Mar 24, 2025

mesonbuild/interpreter/interpreter.py (outdated)

amcn added 5 commits March 25, 2025 23:16

Misspellings: add closest match for function and method kwargs

5850b51

In the case of an unknown keyword argument passed to a function or method, attempt to find a closest match to present to the user in the error message.

Misspellings: add closest match for function names

0cf42f4

If the user misspells a meson function then suggest the closest match in the error message.

Misspellings: add closest match for object method names

e630ef6

If the user misspells an object method name then suggest the closest match in the error message.

Misspellings: add closest match for module method names

6aa2cf4

As in e0f892c, if the user misspells a module method name then suggest the closest match in the error message.

amcn force-pushed the amcn/did-you-mean branch from c835037 to 94e031c Compare March 25, 2025 22:47

amcn added 2 commits March 25, 2025 23:51

Misspellings: add closest match for module names

1681c09

If the user misspells a module name then suggest the closest match, using pkgutil to enumerate the available modules

Misspellings: add closest match for subproject names

c904c00

If the user misspells a subproject name then suggest the closest match.

amcn force-pushed the amcn/did-you-mean branch from 94e031c to c904c00 Compare March 25, 2025 22:52
