enhance search API #3658

matthias314 · 2025-02-08T03:09:57Z

This PR is based on #3575 ~~and therefore a draft at present~~. It has the following components:

additional methods for searching text, for example methods that also return matched capturing groups,
a new type RegexpGroup that combines a regexp with its padded versions, as used in "match beginning and end of line correctly" (#3575),
process Deltas in ExecuteTextEvent in reverse order. This makes replaceall easier to implement,
new functions LocVoid() and Loc.IsVoid() to deal with unused submatches.

The new types and functions are as follows (UPDATED):

// NewRegexpGroup creates a RegexpGroup from a string
func NewRegexpGroup(s string) (RegexpGroup, error)

// FindDown returns a slice containing the start and end positions
// of the first match of `rgrp` between `start` and `end`, or nil
// if no match exists.
func (b *Buffer) FindDown(rgrp RegexpGroup, start, end Loc) []Loc

// FindDownSubmatch returns a slice containing the start and end positions
// of the first match of `rgrp` between `start` and `end` plus those
// of all submatches (capturing groups), or nil if no match exists.
func (b *Buffer) FindDownSubmatch(rgrp RegexpGroup, start, end Loc) []Loc

// FindUp returns a slice containing the start and end positions
// of the last match of `rgrp` between `start` and `end`, or nil
// if no match exists.
func (b *Buffer) FindUp(rgrp RegexpGroup, start, end Loc) []Loc

// FindUpSubmatch returns a slice containing the start and end positions
// of the last match of `rgrp` between `start` and `end` plus those
// of all submatches (capturing groups), or nil if no match exists.
func (b *Buffer) FindUpSubmatch(rgrp RegexpGroup, start, end Loc) []Loc

// FindAllFunc calls the function `f` once for each match between `start`
// and `end` of the regexp given by `s`. The argument of `f` is the slice
// containing the start and end positions of the match. FindAllFunc returns
// the number of matches plus any error that occurred when compiling the regexp.
func (b *Buffer) FindAllFunc(s string, start, end Loc, f func([]Loc)) (int, error)

// FindAll returns a slice containing the start and end positions of all
// matches between `start` and `end` of the regexp given by `s`, plus any
// error that occurred when compiling the regexp. If no match is found, the
// slice returned is nil.
func (b *Buffer) FindAll(s string, start, end Loc) ([][]Loc, error)

// FindAllSubmatchFunc calls the function `f` once for each match between
// `start` and `end` of the regexp given by `s`. The argument of `f` is the
// slice containing the start and end positions of the match and all submatches
// (capturing groups). FindAllSubmatchFunc returns the number of matches plus
// any error that occurred when compiling the regexp.
func (b *Buffer) FindAllSubmatchFunc(s string, start, end Loc, f func([]Loc)) (int, error)

// FindAllSubmatch returns a slice containing the start and end positions of
// all matches and all submatches (capturing groups) between `start` and `end`
// of the regexp given by `s`, plus any error that occurred when compiling
// the regexp. If no match is found, the slice returned is nil.
func (b *Buffer) FindAllSubmatch(s string, start, end Loc) ([][]Loc, error)

// ReplaceAll replaces all matches of the regexp `s` in the given area. The
// new text is obtained from `template` by replacing each variable with the
// corresponding submatch as in `Regexp.Expand`. The function returns the
// number of replacements made, the new end position and any error that
// occurred during regexp compilation
func (b *Buffer) ReplaceAll(s string, start, end Loc, template []byte) (int, Loc, error)

// ReplaceAllLiteral replaces all matches of the regexp `s` with `repl` in
// the given area. The function returns the number of replacements made, the
// new end position and any error that occurred during regexp compilation
func (b *Buffer) ReplaceAllLiteral(s string, start, end Loc, repl []byte) (int, Loc, error)

// ReplaceAllFunc replaces all matches of the regexp `s` with `repl(match)`
// in the given area, where `match` is the slice containing start and end
// positions of the match. The function returns the number of replacements
// made, the new end position and any error that occurred during regexp
// compilation
func (b *Buffer) ReplaceAllFunc(s string, start, end Loc, repl func(match []Loc) []byte) (int, Loc, error)

// ReplaceAllSubmatchFunc replaces all matches of the regexp `s` with
// `repl(match)` in the given area, where `match` is the slice containing
// start and end positions of the match and all submatches. The function
// returns the number of replacements made, the new end position and any
// error that occurred during regexp compilation
func (b *Buffer) ReplaceAllSubmatchFunc(s string, start, end Loc, repl func(match []Loc) []byte) (int, Loc, error)

// MatchedStrings converts a slice containing start and end positions of
// matches or submatches to a slice containing the corresponding strings.
func (b *Buffer) MatchedStrings(locs []Loc) []string

// LocVoid returns a Loc strictly smaller than any valid buffer location
func LocVoid() Loc

// IsVoid returns true if the location l is void
func (l Loc) IsVoid() bool

The method FindNext is kept. ReplaceRegex is removed in favor of ReplaceAll. The latter is easier to use in Lua scripts.

Currently the simple search functions (FindDown etc.) take a RegexpGroup as argument to avoid recompiling the regexps. In contrast, FindAll, ReplaceAll and friends take a string argument. Many other variants would be possible. Also, the new search functions ignore the ignorecase setting of the buffer and don't wrap around when they hit the end of the search region. I think they are more useful this way in Lua scripts.

You will see that many new internal functions use callback functions. This avoids code duplication. (One has to somehow switch between (*regexp.Regexp).FindIndex() and (*regexp.Regexp).FindSubmatchIndex() in the innermost function that searches each line of the buffer.)

As said before, many details could be modified, but overall I think these functions are very useful for writing scripts. Please let me know what you think.

matthias314 · 2025-02-09T17:12:39Z

I've rebased the PR onto master and added NewRegexpGroup to the documentation.

matthias314 · 2025-02-09T22:22:08Z

The latest commit fixes a subtle bug related to the padding of the search region: In the presence of combining characters, one could end up with an infinite loop. (Try searching backwards for . in the line x⃗y⃗z⃗.)

This bug is also present in #3575, hence in master. If you want, I can backport 88f3cf5 to master. This would require some modification of the commit, so let me know if that's necessary.

matthias314 · 2025-02-22T21:43:03Z

I've force-pushed a polished version and updated the list of functions at the top of this page. It still fixes the bug mentioned above. Also, locations returned for matches and submatches are now guaranteed to include the runes that matched. The underlying Go regexp functions match runes, which may be part of combining characters like x⃗. The start and end locations now are such that the characters between them include all matching runes. This is not the case on master. (Search backwards for . in a row consisting of many x⃗ to see the difference.)
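A standalone sketch of the rune-versus-character issue, using only the standard library: `.` matches a single rune, so in x⃗y⃗z⃗ the last match is the combining arrow alone. The hypothetical helper `expandToCharacter` widens such a byte range so that it covers the whole character. This is an illustration of the idea, not the code in this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
	"unicode/utf8"
)

// expandToCharacter widens the byte range [start, end) of line so that it
// begins at a base rune and includes all combining marks that follow the
// last matched rune. Hypothetical helper for illustration only.
func expandToCharacter(line []byte, start, end int) (int, int) {
	// Move start left while it points at a combining mark.
	for start > 0 {
		r, _ := utf8.DecodeRune(line[start:])
		if !unicode.IsMark(r) {
			break
		}
		_, size := utf8.DecodeLastRune(line[:start])
		start -= size
	}
	// Move end right over any trailing combining marks.
	for end < len(line) {
		r, size := utf8.DecodeRune(line[end:])
		if !unicode.IsMark(r) {
			break
		}
		end += size
	}
	return start, end
}

func main() {
	// "x", "y", "z", each followed by U+20D7 (combining right arrow above).
	line := []byte("x\u20d7y\u20d7z\u20d7")
	all := regexp.MustCompile(`.`).FindAllIndex(line, -1)
	last := all[len(all)-1]
	fmt.Println(len(all), last) // 6 [9 12]: the last match is only the mark

	s, e := expandToCharacter(line, last[0], last[1])
	fmt.Println(s, e) // 8 12: now covers "z" plus its mark
}
```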

This PR introduces many new functions. Maybe we don't need all of them. For example, do we need a submatch version on top of each non-submatch search or replace function? If we keep only the submatch version, we wouldn't lose any functionality because the additional elements in the slice for each match can be ignored. I nevertheless added all functions to this PR to show what's possible.

matthias314 · 2025-03-01T16:23:18Z

@dmaluka Any chance to move forward with this PR?

Apart from a detailed review, some quick feedback would be helpful:

Do we need all search and replace functions I have defined? For example, we could drop the non-submatch functions (like FindAll) and rename the submatch versions (FindAll instead of FindAllSubmatch) without losing any functionality. (Is there a significant performance difference between the two?)
For the search functions accepting a regexp (or rather a RegexpGroup), should there be companion functions accepting strings? For example, FindDown could be renamed FindDownRegexpGroup, and a new function FindDown would accept a regexp given by a string. That could be convenient in Lua scripts. EDIT: In the latest version, all new search and replace functions allow the regexp to be either a string or RegexpGroup.
Currently the search and replace functions try to be smart, already the existing FindNext: If the end of the search region precedes the start, then they are swapped. This is convenient when searching/replacing within the cursor selection. However, should general-purpose functions better be simple and "dumb"?

dmaluka · 2025-03-01T19:20:32Z

Haven't had much time to look into it yet, sorry.

matthias314 · 2025-03-03T20:39:47Z

To see some of the new search functions in action, check out my LaTeX plugin. The way the new ReplaceAll function works is also useful for my bookmark plugin because it allows the simplified bookmark handling implemented there to work with undoing replacement operations. (Both plugins are under development and need a custom version of micro.)

matthias314 · 2025-03-05T00:41:42Z

In the latest version, the regexp for all new search and replace functions can be specified either as a string or as a RegexpGroup. The latter is better for performance (because it avoids compiling the same regexp multiple times) while the former is easier to use in Lua scripts.

dmaluka · 2025-03-09T12:56:21Z

// NewRegexpGroup creates a RegexpGroup from a string
func NewRegexpGroup(s string) (RegexpGroup, error)

I'm worried that we are exposing such implementation details as padded regexps as a part of the API. I think we should try and make it a bit more abstract and future-proof, e.g. something like:

type RegexpSearch struct {
	// We want "^" and "$" to match only the beginning/end of a line [...]
	regex [4]*regexp.Regexp
}

func NewRegexpSearch(s string) (*RegexpSearch, error)

In the latest version, all new search and replace functions allow the regexp to be either a string or RegexpGroup.

And IMHO this implicit polymorphism is messy.

I'm thinking of something like:

func (b *Buffer) FindDown(s string, start, end Loc, useRegex bool, ignoreCase bool) []Loc
func (b *Buffer) FindUp(s string, start, end Loc, useRegex bool, ignoreCase bool) []Loc

which internally use:

func NewRegexpSearch(s string) (*RegexpSearch, error)
func (b *Buffer) FindRegexpDown(search *RegexpSearch, start, end Loc) []Loc
func (b *Buffer) FindRegexpUp(search *RegexpSearch, start, end Loc) []Loc

we could drop the non-submatch functions (like FindAll) and rename the submatch versions (FindAll instead of FindAllSubmatch) without losing any functionality.

Seems reasonable. The caller can ignore the returned submatches if it doesn't need them.

Currently the search and replace functions try to be smart, already the existing FindNext: If the end of the search region precedes the start, then they are swapped. This is convenient when searching/replacing within the cursor selection. However, should general-purpose functions better be simple and "dumb"?

Seems reasonable. The intuitively expected behavior would be: if start is greater than end, treat the range as empty and thus return no matches.

And I'm not even sure why exactly we currently swap them. From looking at the code it seems like we already make sure to always pass start less than or equal to end (except findUp(), where we on the contrary always pass start greater than or equal to end, so we might want to swap them just in findUp(), and unconditionally?).

matthias314 · 2025-03-09T13:24:14Z

The intuitively expected behavior would be: if start is greater than end, treat the range as empty and thus return no matches.

Exactly. Another option would be to combine FindUp and FindDown into a single function Find(search string, start, end Loc). The search would be downwards if start is less than or equal to end and otherwise upwards. This would reduce the number of methods we define, but may be too "smart". What do you think?

dmaluka · 2025-03-09T13:30:17Z

Yes, it would be too smart.

dmaluka · 2025-03-01T13:23:00Z

internal/buffer/eventhandler.go

@@ -113,25 +113,24 @@ func (eh *EventHandler) DoTextEvent(t *TextEvent, useUndo bool) {
 }

 // ExecuteTextEvent runs a text event
+// The deltas are processed in reverse order and afterwards reversed


Add an explanation why it is needed?

Also, this is rather an implementation detail, so maybe this comment should be inside the function?

I wouldn't say that it's an implementation detail. When you create a TextEvent t, you have to know in which order the elements of t.Deltas are processed because that changes the meaning of the locations. To keep it less technical, we could say that the locations of the various Deltas have to be (non-overlapping and) in increasing order. The old comment could then indeed move inside the function.
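Why reverse order helps can be seen on a plain string: if the edits are non-overlapping and sorted by increasing position, applying them from last to first means that no edit shifts the offsets of the edits still to come. The sketch below uses a hypothetical `edit` type on a Go string; it is not micro's `Delta` or `TextEvent`.

```go
package main

import (
	"fmt"
	"regexp"
)

// edit replaces the bytes [start, end) with text. Hypothetical type,
// loosely analogous to a delta with locations in the original text.
type edit struct {
	start, end int
	text       string
}

// applyReverse applies non-overlapping edits given in increasing order.
// Going from last to first keeps all remaining offsets valid.
func applyReverse(s string, edits []edit) string {
	for i := len(edits) - 1; i >= 0; i-- {
		e := edits[i]
		s = s[:e.start] + e.text + s[e.end:]
	}
	return s
}

func main() {
	s := "foo bar foo"
	var edits []edit
	// All match positions refer to the original, unmodified string.
	for _, m := range regexp.MustCompile(`foo`).FindAllStringIndex(s, -1) {
		edits = append(edits, edit{m[0], m[1], "quux"})
	}
	fmt.Println(applyReverse(s, edits)) // quux bar quux
}
```

Applied front to back, the second edit's offsets (8 to 11) would point one byte too far left after the first replacement lengthened the string.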

dmaluka · 2025-03-01T13:24:32Z

internal/util/util.go

@@ -59,6 +59,20 @@ func init() {
 	Stdout = new(bytes.Buffer)
 }

+// RangeMap returns the slice obtained from applying the given function


Why not SliceMap?

And I'm not sure we even need this helper. I see there is exactly one usage of it, and it doesn't look very convincing. And I'm not sure we should create a precedent of using generics unless we really find it useful.

I called it RangeMap instead of SliceMap because the function f does not only receive the slice element, but also the position, as in a range.

You are right, this helper function is not used elsewhere at present. Maybe the reason is that it needs type parameters, and before the recent bump from Go 1.17 to 1.19 we didn't have them. I myself am new to Go, and I find it annoying that such basic functionality is not included directly in Go. I'm sure that if we looked through the code for micro, we would find places where RangeMap would be useful. I would be optimistic that there are other uses in the future. I'm using it in another PR that I haven't submitted yet because it depends on the present one. But it's up to you. If you want me to delete it, I'll do it.
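For context, a generic helper of this kind could look roughly as follows. This is a guess at the shape based on the description above (the callback receives position and element, as in a `range` loop), not the code from the PR, and it needs Go 1.18 or later for type parameters.

```go
package main

import "fmt"

// rangeMap returns the slice obtained by applying f to each index and
// element of from. Guessed signature, for illustration only.
func rangeMap[T, V any](from []T, f func(int, T) V) []V {
	to := make([]V, len(from))
	for i, x := range from {
		to[i] = f(i, x)
	}
	return to
}

func main() {
	words := []string{"a", "b", "c"}
	labels := rangeMap(words, func(i int, s string) string {
		return fmt.Sprintf("%d:%s", i, s)
	})
	fmt.Println(labels) // [0:a 1:b 2:c]
}
```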

Now I actually regret we bumped it from 1.17 to 1.19. We'd have an easy compelling answer to questions "why not use generics", "why not use any", "why not use another shiny new feature X".

dmaluka · 2025-03-09T13:20:07Z

internal/buffer/search.go

+// ReplaceAllLiteral replaces all matches of the regexp `s` with `repl` in
+// the given area. The function returns the number of replacements made, the
+// new end position and any error that occurred during regexp compilation
+func (b *Buffer) ReplaceAllLiteral(s string, start, end Loc, repl []byte) (int, Loc, error) {


Why not pass literal as a boolean argument to ReplaceAll()?

Sure, I can change that. The reason I had chosen ReplaceAllLiteral was to imitate Go's regexp API.

Heh, didn't realize that. Anyway, we don't need to replicate Go's API precisely, we can define whatever API is more convenient for us to use.

What I like about ReplaceAllLiteral is that I don't have to remember whether a second argument true to ReplaceAll means "use as ~~regexp~~ template" or "use literally". (I believe that in almost all practical purposes, this argument would be a constant for each invocation of ReplaceAll, not some variable whose value is not known in advance.)

dmaluka · 2025-03-09T13:27:06Z

internal/buffer/loc.go

+func (l Loc) IsVoid() bool {
+	return l == LocVoid()
+}
+


WTF is this, sorry.

I was about to suggest something like:

const InvalidLoc = Loc{-1, -1}

but then I recalled that Go doesn't support constant structs.

So I think we should just keep using directly Loc{-1, -1} (and explicit checks like loc == Loc{-1, -1}, without helpers), there's nothing terrible about it.

I also thought first about constant structs.

The reason that I made this change in this PR is that the "internal" value Loc{-1, -1} is now exposed to Lua scripts: If a submatch is not filled in a match, then we need a way to indicate that. An example would be searching for "a([xy])|b([uv])" in "ax". The first submatch would be "x" and the second one would be void. (In Go, the indices of a void submatch are -1.) I thought that something like loc:IsVoid() looks cleaner in a Lua script.

Does this convince you, or do you still want me to remove LocVoid and IsVoid?

First, the use of the word "void" here is very confusing, isn't it? Why "void"? (Now I understand it refers to this specific use case of "a submatch is not filled in a match", but how is a casual person supposed to guess that, and why limit the API to this narrow use case?)

What about just:

func (l Loc) IsValid() bool { return l.X >= 0 && l.Y >= 0 }

?
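A small sketch of the alternatives mentioned in this thread, with a stand-in `Loc` type. Since constants cannot be of struct type in Go, a sentinel has to be a package-level variable; the name `InvalidLoc` is taken from the suggestion above and is hypothetical. Note also that a direct comparison with a composite literal inside an `if` condition needs parentheses, as in `if loc == (Loc{-1, -1})`.

```go
package main

import "fmt"

// Loc is a stand-in for a buffer location.
type Loc struct{ X, Y int }

// InvalidLoc has to be a variable: Go constants cannot be structs.
var InvalidLoc = Loc{-1, -1}

// IsValid reports whether both coordinates are non-negative.
func (l Loc) IsValid() bool { return l.X >= 0 && l.Y >= 0 }

func main() {
	for _, l := range []Loc{{3, 0}, InvalidLoc} {
		fmt.Println(l, l.IsValid(), l == InvalidLoc)
	}
}
```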

matthias314 · 2025-03-09T19:01:43Z

type RegexpSearch struct {

The struct is a good idea. Are you attached to the name RegexpSearch? I find that such a struct is not more related to searching than a single Regexp. I don't want to claim that RegexpGroup is the ideal name, but it conveys the idea that several regexps are grouped together.

func (b *Buffer) FindDown(s string, start, end Loc, useRegex bool, ignoreCase bool) []Loc

I wonder how convenient the arguments useRegex and ignoreCase would be in Lua scripts. (My general approach is that the API should be easy to use from Lua.) If one has an explicit regexp, then one can modify it directly. Moreover, ignoreCase may often just be the buffer setting. I have a draft PR where I use the new search functions in the rest of micro. (This makes the code simpler and shorter.) There I define the function

// RegexpString converts a search string into a string that can be compiled
// to a regexp. It can quote special characters and switch to case-insensitive
// search if that is the setting for the buffer.
func (b *Buffer) RegexpString(s string, isRegexp bool) string {

Such a function might cover most uses of useRegex and ignoreCase. I'm asking myself whether these arguments to FindDown will be more of a help to Lua script writers or a burden.

dmaluka · 2025-03-09T20:48:05Z

I don't want to claim that RegexpGroup is the ideal name, but it conveys the idea that several regexps are grouped together.

That is exactly the kind of details that I'd prefer to hide, not expose.

matthias314 · 2025-03-09T21:41:09Z

That is exactly the kind of details that I'd prefer to hide, not expose.

Fair enough. What about RegexpData? In the case of RegexpSearch one may wonder what the Search part means. Nobody would be puzzled about Data.

matthias314 marked this pull request as draft February 8, 2025 03:10

This was referenced Feb 8, 2025

match beginning and end of line correctly #3575

Merged

FindNextSubmatch: return submatches when searching #3552

Closed

process Deltas in ExecuteTextEvent in reverse order

1eff47d

matthias314 force-pushed the m3/find-func branch 2 times, most recently from 8b80291 to 92b6fba Compare February 9, 2025 17:11

matthias314 marked this pull request as ready for review February 9, 2025 17:40

matthias314 force-pushed the m3/find-func branch from 92b6fba to 72fcf50 Compare February 9, 2025 22:12

matthias314 force-pushed the m3/find-func branch 2 times, most recently from 0d14eae to 88f3cf5 Compare February 10, 2025 00:05

matthias314 added 2 commits February 21, 2025 20:22

added util.RangeMap

7609b6f

changed util.isMark to public IsMark

61fb82d

matthias314 force-pushed the m3/find-func branch from 88f3cf5 to 1c1a35a Compare February 22, 2025 21:24

matthias314 added 3 commits March 4, 2025 19:38

modified search and replace methods

bc32a51

added LocVoid() and Loc.IsVoid()

82e77d0

made search and replace functions accept RegexpGroup argument

8683a1a

matthias314 force-pushed the m3/find-func branch from 1c1a35a to 8683a1a Compare March 5, 2025 00:38

dmaluka reviewed Mar 9, 2025

matthias314 mentioned this pull request Mar 9, 2025

adjust selection after replaceall #3623

Draft

matthias314 mentioned this pull request Mar 18, 2025

BUG: micro crashes if search query is \Q #3700

Open
