feat(file-indexer): regex support by AntoineGS · Pull Request #568 · vicinaehq/vicinae

AntoineGS · 2025-10-28T14:00:45Z

Adds regex support to the file search. A trigram index narrows down the candidates first, so the regex does not have to walk the whole database.

This is still an early draft, but it runs.
I still need to review and rework regex-utils: in its current state it was written by Claude AI to pass the series of tests, so I am sure it has some issues.

Some edge cases also need more thought. Right now a word of at least 3 characters must be extracted from the regex to run the trigram match; otherwise there are no results (the alternative would be to execute the regex on the whole dataset, which would be slow).
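A minimal sketch of the three-character rule, assuming the literals have already been extracted from the regex. `buildTrigramMatch` is a hypothetical helper and not the code in this PR: it drops literals that are too short for the trigram index, joins the rest into an FTS5 MATCH expression, and returns `std::nullopt` when nothing usable remains (the "no results" case).

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: turns literals extracted from a regex into an FTS5
// MATCH expression. Length is counted in bytes here for simplicity; a real
// implementation would count characters.
std::optional<std::string> buildTrigramMatch(const std::vector<std::string> &literals) {
  std::string expr;
  for (const auto &lit : literals) {
    if (lit.size() < 3) continue; // too short to produce a trigram
    if (!expr.empty()) expr += " AND ";
    expr += '"';
    for (char c : lit) {
      if (c == '"') expr += '"'; // FTS5 escapes a double quote by doubling it
      expr += c;
    }
    expr += '"';
  }
  if (expr.empty()) return std::nullopt; // nothing the index can use
  return expr;
}
```

With this sketch, the literals `ab`, `report` and `pdf` would produce `"report" AND "pdf"`, while a regex that only yields `ab` produces no query at all.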

One question I have though, based on what I can see, the trigram algorithm could be a better fit compared to the unicode one, as it allows in-word matching instead of only beginning-of-word matching. Could we maybe replace the base tokenizer too?
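To illustrate why trigrams allow in-word matching, here is a small sketch that splits a name into overlapping three-byte windows and checks whether a needle's trigrams all occur in it. The function names are hypothetical, ASCII input is assumed, and this is not how SQLite's tokenizer is implemented. The set-based check is only a necessary condition for a substring match; a real index would typically also track positions.

```cpp
#include <set>
#include <string>

// Splits a name into overlapping 3-byte windows (ASCII assumed).
std::set<std::string> trigrams(const std::string &s) {
  std::set<std::string> out;
  for (std::size_t i = 0; i + 3 <= s.size(); ++i) out.insert(s.substr(i, 3));
  return out;
}

// True when every trigram of `needle` also occurs in `haystack`.
// Needles shorter than 3 bytes have no trigrams and always pass.
bool mayContain(const std::string &haystack, const std::string &needle) {
  const auto hay = trigrams(haystack);
  for (const auto &t : trigrams(needle))
    if (!hay.count(t)) return false;
  return true;
}
```

Here `port` is found inside `annual_report.pdf` even though it does not start a word, which a beginning-of-word tokenizer would miss.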

Fixes #551

aurelleb · 2025-10-28T16:36:03Z

Is there any reason you implemented your own regexp utils instead of using the regexp implementation from std?

> One question I have though, based on what I can see, the trigram algorithm could be a better fit compared to the unicode one, as it allows in-word matching instead of only beginning-of-word matching. Could we maybe replace the base tokenizer too?

I don't really remember why I went for unicode instead of trigram (I considered both) but I think trigram would indeed probably be a better fit here.

AntoineGS · 2025-10-28T17:13:58Z

Are you referring to regex-utils?
If so, that unit is more of an anti-regex: it extracts the characters that the regular expression guarantees will be present, which can then be used in the SQL MATCH clause. It does not run the regex against a string.
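A deliberately simplified sketch of that idea, to make "guaranteed characters" concrete. `guaranteedLiterals` is a hypothetical function, not the regex-utils code from this PR: it only understands plain characters, escapes, `.`, character classes and the quantifiers `*`, `+`, `?` and `{...}`, and it gives up (empty result) as soon as it sees a group or an alternation.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Returns literal runs that every match of `re` must contain.
// Conservative: anything not understood breaks the current run.
std::vector<std::string> guaranteedLiterals(const std::string &re) {
  std::vector<std::string> runs;
  std::string cur;
  auto flush = [&] {
    if (!cur.empty()) runs.push_back(cur);
    cur.clear();
  };
  std::size_t i = 0;
  while (i < re.size()) {
    char c = re[i++];
    bool literal = true;
    if (c == '(' || c == ')' || c == '|') return {}; // not analysed here
    if (c == '\\' && i < re.size()) {
      c = re[i++];
      // \d, \w, \b are classes or assertions; \. or \- are plain characters
      literal = !std::isalnum(static_cast<unsigned char>(c));
    } else if (c == '[') {
      // Skip the whole character class, honouring escapes inside it.
      while (i < re.size() && re[i] != ']') {
        if (re[i] == '\\') ++i;
        ++i;
      }
      if (i < re.size()) ++i;
      literal = false;
    } else if (c == '.' || c == '^' || c == '$') {
      literal = false;
    }
    // Look at the quantifier that follows; {n,m} is treated as optional.
    const char q = i < re.size() ? re[i] : '\0';
    const bool optional = q == '*' || q == '?' || q == '{';
    const bool repeated = q == '+';
    if (optional || repeated) {
      if (q == '{') while (i < re.size() && re[i] != '}') ++i;
      if (i < re.size()) ++i;                 // consume the quantifier
      if (i < re.size() && re[i] == '?') ++i; // lazy variant, e.g. *?
    }
    if (literal && !optional) cur += c;
    if (!literal || optional || repeated) flush();
  }
  flush();
  return runs;
}
```

For `report.*\.pdf` this yields `report` and `.pdf`; for `colou?r` it yields `colo` and `r`, since the `u` may be absent.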

AntoineGS · 2025-10-28T17:16:01Z

> I don't really remember why I went for unicode instead of trigram (I considered both) but I think trigram would indeed probably be a better fit here.

When you say 'here', are you referring to the query when using regex or also for the existing code?
Just so I don't go making changes to the existing code if that is not what you meant 😅

aurelleb · 2025-10-28T18:05:39Z

> Are you referring to regex-utils?
> If so, that unit is more of an anti-regex: it extracts the characters that the regular expression guarantees will be present, which can then be used in the SQL MATCH clause. It does not run the regex against a string.

Okay got it (haven't reviewed the code so I wasn't sure)

> When you say 'here', are you referring to the query when using regex or also for the existing code?
> Just so I don't go making changes to the existing code if that is not what you meant 😅

Both actually. I could end up migrating the main index to trigram as well.

AntoineGS · 2025-10-28T19:10:55Z

Awesome, personally I prefer the looser matching of trigram

AntoineGS · 2025-10-29T00:20:57Z

So the reason you did not go this route is probably because it does not support most special characters, with the most unfortunate one being the 'dot'.
I feel like there might be a solution to get around this but for now I will leave it to a later modification and wrap up the ongoing changes.

On that note, I am pretty happy with the current PR so ready for an official review :)

quadratech188 · 2025-10-29T10:47:07Z

I get these errors in console and no results:

[19:54:55.121] INFO   -  Final query string: "\n    SELECT f.path, tri_idx.rank FROM indexed_file f \n        JOIN tri_idx ON tri_idx.rowid = f.id\n        WHERE tri_idx MATCH 'abcd' AND f.name REGEXP :pattern\n        ORDER BY f.relevancy_score, tri_idx.rank\n        LIMIT :limit\n        OFFSET :offset\n  " (file-indexer-db.cpp:253)
[19:54:55.121] INFO   -  regexString: "abcd" (file-indexer-db.cpp:261)
[19:54:55.121] INFO   -  SearchQuery: "abcd" (file-indexer-db.cpp:262)
[19:54:55.121] WARN   -  Search query failed QSqlError("", "Parameter count mismatch", "") (file-indexer-db.cpp:264)

AntoineGS · 2025-10-29T13:26:34Z

Whoops, I had not added the migration file to the qrc file, so you were probably missing the table.

~~Update: I noticed an error in the logs on deletion, I am unsure yet if it is related but I suspect it has something to do with the after delete trigger.~~ resolved

aurelleb · 2025-10-29T17:39:56Z

> On that note, I am pretty happy with the current PR so ready for an official review :)

Thank you for all your work. I'm currently working on getting the Vicinae extension store released (v0.16.0); I will review the big things after that.

AntoineGS · 2025-10-29T19:11:50Z

No worries!

AntoineGS · 2025-11-03T21:40:00Z

Since the default search is case-insensitive, would we also want to make the RegEx search case-insensitive? (I have it set to the default case-sensitive at the moment).
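For reference, a minimal sketch of how the choice could be made switchable if the pattern is evaluated with `std::regex`. `nameMatches` is a hypothetical helper; the PR may evaluate the regex differently.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: matches a file name against a pattern, optionally
// ignoring case, using std::regex with ECMAScript syntax.
bool nameMatches(const std::string &name, const std::string &pattern, bool caseSensitive) {
  auto flags = std::regex::ECMAScript;
  if (!caseSensitive) flags |= std::regex::icase;
  return std::regex_search(name, std::regex(pattern, flags));
}
```

With `caseSensitive` set to `false`, the pattern `readme` matches `README.md`; with `true`, it does not.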

aurelleb · 2025-11-09T06:34:23Z

Minor conflict here.
Also fails to build due to the explicit sqlite3 requirement.
I'm not a fan of calling sqlite3 functions directly, as we might replace it with https://github.com/sqlcipher/sqlcipher in the future.

AntoineGS · 2025-11-09T14:43:45Z

Yeah, that makes sense, though unfortunately sqlite3 has no built-in implementation of REGEXP.
I saw that there is also a C extension that comes with sqlite3 but it needs compiling anyways so I did not see a benefit.

I could move the sqlite3-specific code to its own file which would make it easier to migrate (no need to hunt down db engine specific code).

I don't really have other ideas to decouple the db code 🫤

AntoineGS · 2025-11-10T00:50:31Z

There are no more conflicts, though I did not make changes to the sqlite3 requirement.

aurelleb · 2025-11-12T21:17:54Z

I'm still not sure what to make of the sqlite requirement. I'm also not a fan of mixing native sqlite calls with the Qt driver for sqlite. I'm going to think about it; honestly, I don't know. I'm going to review the fanotify stuff soon though. That's not a requirement for it, right?

AntoineGS · 2025-11-13T01:56:03Z

I played around with some alternatives and came up with two different ones, maybe it can spark a better solution.
Alternative 1 declares the needed functions from sqlite3.h locally. While a bit hacky it prevents having to require sqlite3 directly, but leaves a code dependency on sqlite3.
Alternative 2 runs the trigram query and then applies the regex to the result set. This has some issues, like requiring enough characters to match the trigram index for results to show up, and having to fetch more data from the database.
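A rough sketch of Alternative 2, with the database left out: `candidates` stands in for the names returned by the trigram query, and `filterByRegex` is a hypothetical helper. Because the regex runs in application code, the limit can only be applied after filtering, which is why more rows have to be fetched than are eventually shown.

```cpp
#include <regex>
#include <string>
#include <vector>

// Applies the regex to rows already returned by the trigram query and keeps
// at most `limit` matches, preserving the order of the candidates.
std::vector<std::string> filterByRegex(const std::vector<std::string> &candidates,
                                       const std::string &pattern, std::size_t limit) {
  const std::regex re(pattern);
  std::vector<std::string> out;
  for (const auto &name : candidates) {
    if (out.size() == limit) break;
    if (std::regex_search(name, re)) out.push_back(name);
  }
  return out;
}
```

For example, the pattern `report_\d+\.pdf` keeps `report_2024.pdf` but discards `report_final.pdf` and `reporting.txt`, all of which share the trigrams of `report`.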

I have not yet found a perfect solution, I'll keep on digging!

As for the fanotify changes, no this is not a requirement!
FYI I have created a PR upstream for the fanotify changes, but have not had any feedback at this time.

aurelleb · 2025-11-14T12:08:46Z

So I was playing with the shutdown sequence of vicinae, and I'm starting to think we should move everything related to file indexing into its own subproject and spawn it as a separate process, to keep things clean and separate from Qt land. Especially since we are mixing std::thread with Qt stuff, which works but can be pretty confusing as we use the QSQL stuff.

As I see it, we would let the main vicinae server spawn a file indexing process which would expose a query API using a unix socket or standard file descriptors (communicating through protobuf or JSON). The file indexer codebase would then be free of all Qt code, and use the sqlite library directly.

I think it's the best path forward. What do you guys think, @AntoineGS @quadratech188?
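Whichever transport is picked, a byte stream needs some way to delimit messages. The sketch below assumes a wire format that is not specified anywhere in this thread: a 4-byte big-endian length followed by the payload (for instance a JSON document). `encodeFrame` and `decodeFrame` are hypothetical names.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Assumed wire format (not from this PR): 4-byte big-endian length, then payload.
std::string encodeFrame(const std::string &payload) {
  const auto n = static_cast<std::uint32_t>(payload.size());
  std::string frame;
  for (int shift = 24; shift >= 0; shift -= 8)
    frame += static_cast<char>((n >> shift) & 0xFF);
  return frame + payload;
}

// Pops one complete frame from the front of `buffer`, or returns nullopt when
// more bytes still need to be read from the socket or pipe.
std::optional<std::string> decodeFrame(std::string &buffer) {
  if (buffer.size() < 4) return std::nullopt;
  std::uint32_t n = 0;
  for (int i = 0; i < 4; ++i)
    n = (n << 8) | static_cast<unsigned char>(buffer[i]);
  if (buffer.size() - 4 < n) return std::nullopt;
  std::string payload = buffer.substr(4, n);
  buffer.erase(0, 4 + static_cast<std::size_t>(n));
  return payload;
}
```

The reader appends whatever it receives to `buffer` and calls `decodeFrame` in a loop until it returns `std::nullopt`, so partial reads are handled naturally.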

AntoineGS · 2025-11-14T12:26:21Z

I am not familiar enough with Qt or this project as a whole to have a strong opinion on this, but having it run as a separate service could have other benefits, like the ability to run it with elevated rights for fanotify without running the whole vicinae server with them.
If it allows for more freedom, I do think it is a good idea.
Playing devil's advocate, I guess the downside would be the added overhead/complexity of inter-process communication, with which I have little experience, so I can't really speak to that.

Would you keep both in the same repo and still package them together?

In any case I am still interested in helping out with the file indexing so whichever direction you take I will help out 🙂

aurelleb · 2025-11-14T12:29:49Z

@AntoineGS the overhead of IPC in this case is negligible, it's definitely not a deal breaker.

The file indexing code would remain in this repo. In fact, we would still spawn it with the main vicinae binary using a special argument like `vicinae file-indexer` or something like that. It's already what we are doing with the wlr-clipboard server.

I can draft a simple boilerplate with the IPC setup, and then if that's the route we want to go we can start migrating over from the Qt stuff.

AntoineGS · 2025-11-14T12:34:25Z

I don't really see a downside except the initial effort of separating the processes in that case.
It would also make it easier to track down CPU spikes / high memory usage if it's a different process, which file indexing is more prone to.

aurelleb · 2025-11-14T12:38:37Z

I'm going to work on a draft, I may have something out later today, I will let you know about it.

Resolved conflicts:
- vicinae/CMakeLists.txt: Combined library dependencies (sqlite3 + LibXml2)
- vicinae/src/services/files-service/file-indexer/file-indexer.cpp: Integrated regex support with new query engine
- vicinae/include/search-files-view.hpp: Removed (moved to new location in main)

Additional fixes for compilation:
- Added QJsonArray include to extension-manifest.cpp
- Fixed aggregate initialization in extension-manifest.cpp
- Added QApplication include to vlist.cpp

AntoineGS · 2025-12-02T19:51:50Z

No pressure as I also juggle a few projects but just checking in on the discussed changes :)

aurelleb · 2025-12-27T00:51:15Z

So I've been experimenting a little bit with a file indexer that doesn't use sqlite as a backend. It's not even ready to be shared yet, but I think it's probably going to be the best way to achieve the level of file search quality I want.
The scanning logic will remain mostly the same but the indexing will be able to benefit from proper tokenization, which is one of the limiting factors with sqlite. This will also allow me to make file-search specific optimizations because it only needs to be good at that unlike a general purpose database.

I don't think it will support regexp though, because proper regexp on file search doesn't really seem feasible given the size of the dataset (or it should only apply to a list of prefiltered candidates).

As soon as I have something ready I will share it.

AntoineGS · 2025-12-28T17:04:44Z

Alright sounds good! Thanks for the update

AntoineGS force-pushed the feat_search_regex branch from f5fb607 to a111da1 Compare October 28, 2025 14:10

AntoineGS force-pushed the feat_search_regex branch from a111da1 to 351ffdd Compare October 29, 2025 00:20

AntoineGS force-pushed the feat_search_regex branch from 351ffdd to 6cff3a8 Compare October 29, 2025 00:21

AntoineGS marked this pull request as ready for review October 29, 2025 00:21

AntoineGS force-pushed the feat_search_regex branch 2 times, most recently from 65a8374 to 955d4bb Compare October 29, 2025 13:20

AntoineGS force-pushed the feat_search_regex branch from 955d4bb to 9a62ef7 Compare October 29, 2025 14:03

AntoineGS force-pushed the feat_search_regex branch from 9a62ef7 to e1c6982 Compare November 3, 2025 16:47

AntoineGS force-pushed the feat_search_regex branch 2 times, most recently from 7fcdc03 to 6bce339 Compare November 8, 2025 22:00

feat(file-search): add support for regex

e68181d

AntoineGS force-pushed the feat_search_regex branch from 6bce339 to e68181d Compare November 10, 2025 00:49

aurelleb force-pushed the main branch 2 times, most recently from 4b8065c to c26ad02 Compare November 13, 2025 15:12

aurelleb force-pushed the main branch from 85dadc8 to e305f46 Compare December 1, 2025 11:28


3 participants
