
feat(file-indexer): regex support #568

Open
AntoineGS wants to merge 2 commits into vicinaehq:main from AntoineGS:feat_search_regex


@AntoineGS
Contributor

@AntoineGS AntoineGS commented Oct 28, 2025

Support for regex in the file search, which is sped up using a trigram index to prevent walking the whole database.

This is still an early draft but it runs.
I still need to review and rework regex-utils: in its current state it was written by Claude AI to pass the test suite, so I am sure it has some issues.

Some edge cases also need some thought. Right now, at least one literal run of 3 or more characters must be extractable from the regex to drive the trigram match; otherwise there are no results (the alternative would be to execute the regex on the whole dataset, which would be slow).

One question I have though, based on what I can see, the trigram algorithm could be a better fit compared to the unicode one, as it allows in-word matching instead of only beginning-of-word matching. Could we maybe replace the base tokenizer too?

Fixes #551
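
To make the in-word matching point above concrete, here is a minimal sketch of trigram-based candidate selection. It is not SQLite's FTS5 trigram tokenizer and not code from this PR; `trigrams` and `mayContain` are hypothetical names, and the sketch works on bytes, so it assumes ASCII file names.

```cpp
#include <set>
#include <string>

// Hypothetical helper: collect every 3-character window of `text`.
std::set<std::string> trigrams(const std::string &text) {
    std::set<std::string> out;
    for (std::size_t i = 0; i + 3 <= text.size(); ++i) {
        out.insert(text.substr(i, 3));
    }
    return out;
}

// A name is a candidate when it contains every trigram of the query.
// This is a necessary condition for a substring match, not a sufficient one,
// so candidates still need a final check. A query shorter than 3 characters
// has no trigrams, so every name passes.
bool mayContain(const std::string &name, const std::string &query) {
    const std::set<std::string> nameTrigrams = trigrams(name);
    for (const std::string &t : trigrams(query)) {
        if (nameTrigrams.count(t) == 0) {
            return false;
        }
    }
    return true;
}
```

With these helpers, `mayContain("file-indexer.cpp", "index")` holds even though "index" sits in the middle of a word, which is the case a beginning-of-word tokenizer would miss.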

@aurelleb
Contributor

Is there any reason you implemented your own regexp utils instead of using the regexp implementation from std?

> One question I have though, based on what I can see, the trigram algorithm could be a better fit compared to the unicode one, as it allows in-word matching instead of only beginning-of-word matching. Could we maybe replace the base tokenizer too?

I don't really remember why I went for unicode instead of trigram (I considered both) but I think trigram would indeed probably be a better fit here.

@AntoineGS
Contributor Author

Are you referring to regex-utils?
If so, that unit is more of an anti-regex: it extracts the characters that are guaranteed to appear in any match of the regular expression, so they can be used in the SQL MATCH clause. It does not run the regex against a string.
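
As an illustration of the "anti-regex" idea described above, here is a deliberately conservative sketch. It is not the regex-utils code from this PR; `guaranteedLiterals` and its `minLen` parameter are hypothetical. It assumes ASCII patterns and simply gives up on alternation and groups rather than reasoning about them.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical sketch: returns literal runs of at least `minLen` characters
// that every match of `pattern` must contain. Returns nothing when the
// pattern uses alternation or groups.
std::vector<std::string> guaranteedLiterals(const std::string &pattern,
                                            std::size_t minLen = 3) {
    std::vector<std::string> runs;
    std::string current;
    auto flush = [&] {
        if (current.size() >= minLen) runs.push_back(current);
        current.clear();
    };

    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') {
            if (i + 1 >= pattern.size()) { flush(); break; }
            const char next = pattern[++i];
            if (std::isalnum(static_cast<unsigned char>(next))) {
                flush();                  // \d, \w, \b, ...: class or assertion
            } else {
                current.push_back(next);  // \. or \*: escaped literal
            }
        } else if (c == '|' || c == '(' || c == ')') {
            return {};                    // too complex for this sketch
        } else if (c == '[') {
            flush();                      // a character class ends the run
            ++i;
            if (i < pattern.size() && pattern[i] == '^') ++i;
            if (i < pattern.size() && pattern[i] == ']') ++i;
            while (i < pattern.size() && pattern[i] != ']') {
                if (pattern[i] == '\\') ++i;
                ++i;
            }
        } else if (c == '*' || c == '?' || c == '{') {
            // The preceding character may be absent or repeated: drop it.
            if (!current.empty()) current.pop_back();
            flush();
            if (c == '{') {
                while (i < pattern.size() && pattern[i] != '}') ++i;
            }
        } else if (c == '+' || c == '.' || c == '^' || c == '$') {
            flush();                      // run ends, earlier characters stay
        } else {
            current.push_back(c);
        }
    }
    flush();
    return runs;
}
```

For the pattern `report_[0-9]+\.pdf` this yields the runs `report_` and `.pdf`, which could then feed the trigram match; a pattern such as `a.c` yields nothing, which corresponds to the "no results" edge case mentioned in the PR description.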

@AntoineGS
Contributor Author

AntoineGS commented Oct 28, 2025

> I don't really remember why I went for unicode instead of trigram (I considered both) but I think trigram would indeed probably be a better fit here.

When you say 'here', are you referring to the query when using regex or also for the existing code?
Just so I don't go making changes to the existing code if that is not what you meant 😅

@aurelleb
Contributor

> Are you referring to regex-utils?
> If so, that unit is more of an anti-regex: it extracts the characters that are guaranteed to appear in any match of the regular expression, so they can be used in the SQL MATCH clause. It does not run the regex against a string.

Okay got it (haven't reviewed the code so I wasn't sure)

> When you say 'here', are you referring to the query when using regex or also for the existing code?
> Just so I don't go making changes to the existing code if that is not what you meant 😅

Both actually. I could end up migrating the main index to trigram as well.

@AntoineGS
Contributor Author

Awesome, personally I prefer the looser matching of trigram

@AntoineGS
Contributor Author

So the reason you did not go this route is probably because it does not support most special characters, with the most unfortunate one being the 'dot'.
I feel like there might be a solution to get around this but for now I will leave it to a later modification and wrap up the ongoing changes.

On that note, I am pretty happy with the current PR so ready for an official review :)

@AntoineGS AntoineGS marked this pull request as ready for review October 29, 2025 00:21
@quadratech188
Contributor

quadratech188 commented Oct 29, 2025

I get these errors in console and no results:

[19:54:55.121] INFO   -  Final query string: "\n    SELECT f.path, tri_idx.rank FROM indexed_file f \n        JOIN tri_idx ON tri_idx.rowid = f.id\n        WHERE tri_idx MATCH 'abcd' AND f.name REGEXP :pattern\n        ORDER BY f.relevancy_score, tri_idx.rank\n        LIMIT :limit\n        OFFSET :offset\n  " (file-indexer-db.cpp:253)
[19:54:55.121] INFO   -  regexString: "abcd" (file-indexer-db.cpp:261)
[19:54:55.121] INFO   -  SearchQuery: "abcd" (file-indexer-db.cpp:262)
[19:54:55.121] WARN   -  Search query failed QSqlError("", "Parameter count mismatch", "") (file-indexer-db.cpp:264)

@AntoineGS AntoineGS force-pushed the feat_search_regex branch 2 times, most recently from 65a8374 to 955d4bb on October 29, 2025 13:20
@AntoineGS
Contributor Author

AntoineGS commented Oct 29, 2025

Whoops, I had not added the migration file to the qrc file, so you were probably missing the table.

Update: I noticed an error in the logs on deletion. I am unsure yet whether it is related, but I suspect it has something to do with the after-delete trigger. (Resolved.)

@aurelleb
Contributor

aurelleb commented Oct 29, 2025

> On that note, I am pretty happy with the current PR so ready for an official review :)

Thank you for all your work. I'm currently working on getting the Vicinae extension store released (v0.16.0); I will review the big things after that.

@AntoineGS
Contributor Author

No worries!

@AntoineGS
Contributor Author

Since the default search is case-insensitive, would we also want to make the RegEx search case-insensitive? (I have it set to the default case-sensitive at the moment).

@AntoineGS AntoineGS force-pushed the feat_search_regex branch 2 times, most recently from 7fcdc03 to 6bce339 on November 8, 2025 22:00
@aurelleb
Contributor

aurelleb commented Nov 9, 2025

minor conflict here.
Also fails to build due to the explicit sqlite3 requirement.
I'm not a fan of calling sqlite3 functions directly, as we might replace it with https://github.com/sqlcipher/sqlcipher in the future.

@AntoineGS
Contributor Author

Yeah that makes sense, though unfortunately sqlite3 has no built-in method for this.
I saw that there is also a C extension that comes with sqlite3 but it needs compiling anyways so I did not see a benefit.

I could move the sqlite3-specific code to its own file which would make it easier to migrate (no need to hunt down db engine specific code).

I don't really have other ideas to decouple the db code 🫤

@AntoineGS
Contributor Author

There are no more conflicts, though I did not make changes to the sqlite3 requirement.

@aurelleb
Contributor

aurelleb commented Nov 12, 2025

I'm still not sure what to make of the sqlite requirement. I'm also not a fan of mixing native sqlite calls with the Qt driver for sqlite. I'm going to think about it; honestly, I don't know. I'm going to review the fanotify stuff soon though. That's not a requirement for it, right?

@AntoineGS
Contributor Author

I played around with some alternatives and came up with two different ones, maybe it can spark a better solution.
Alternative 1 declares the needed functions from sqlite3.h locally. While a bit hacky, it avoids having to require sqlite3 directly, but leaves a code dependency on sqlite3.
Alternative 2 runs the trigram query and then applies the regex to the result set. This has some issues, such as requiring enough characters to match the trigram index for results to show up, and having to fetch more data from the database.
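
A minimal sketch of the second step of Alternative 2, assuming the trigram query has already produced a list of candidate names. `filterByRegex` and its `caseInsensitive` flag are hypothetical and use `std::regex` rather than any code from this PR; the flag only illustrates how the case-sensitivity question raised earlier could be exposed.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch: `candidates` stands in for the rows returned by the
// trigram MATCH query. The regex is applied in application code, so no
// REGEXP function has to be registered with sqlite.
std::vector<std::string> filterByRegex(const std::vector<std::string> &candidates,
                                       const std::string &pattern,
                                       bool caseInsensitive = false) {
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (caseInsensitive) {
        flags |= std::regex::icase;
    }
    const std::regex re(pattern, flags);  // throws std::regex_error if invalid

    std::vector<std::string> matches;
    for (const std::string &name : candidates) {
        // regex_search matches anywhere in the name, like an unanchored REGEXP.
        if (std::regex_search(name, re)) {
            matches.push_back(name);
        }
    }
    return matches;
}
```

The trade-off noted above still applies: LIMIT and OFFSET can no longer be pushed down to the database as-is, because rows are discarded after they have been fetched.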

I have not yet found a perfect solution, I'll keep on digging!

As for the fanotify changes, no this is not a requirement!
FYI I have created a PR upstream for the fanotify changes, but have not had any feedback at this time.

@aurelleb aurelleb force-pushed the main branch 2 times, most recently from 4b8065c to c26ad02 on November 13, 2025 15:12
@aurelleb
Contributor

aurelleb commented Nov 14, 2025

So I was playing with the shutdown sequence of vicinae and I'm starting to think we should probably move everything related to file indexing into its own subproject, and spawn it as a separate process to keep things clean and separate from Qt land. Especially since we are mixing std::thread with Qt stuff, which works but can be pretty confusing as we use the QSQL stuff.

As I see it, we would let the main vicinae server spawn a file indexing process which would expose a query api using a unix socket or standard file descriptors (communicating through protobuf or json). The file indexer codebase would then be free of all QT code, and use the sqlite library directly.

I think it's the best path forward, what do you guys think @AntoineGS @quadratech188
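
The proposal above could look roughly like the following round trip. This is a sketch under several assumptions, not the planned design: `queryIndexer` is a hypothetical name, the "indexer" is a forked child rather than a separately spawned binary, the JSON reply shape is invented for illustration, the query is assumed to be shorter than 256 bytes, and no JSON escaping is done.

```cpp
#include <string>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical sketch: the parent sends one query over a unix socket pair
// and a forked child, standing in for the file indexer, sends a reply back.
std::string queryIndexer(const std::string &query) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return {};
    }

    const pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return {};
    }

    if (pid == 0) {  // child: the stand-in indexer process
        close(fds[0]);
        char buf[256];
        const ssize_t n = read(fds[1], buf, sizeof(buf));
        if (n > 0) {
            // Invented reply shape; a real indexer would run the search here.
            const std::string reply = "{\"query\":\"" +
                std::string(buf, static_cast<std::size_t>(n)) +
                "\",\"results\":[]}";
            (void)write(fds[1], reply.data(), reply.size());
        }
        close(fds[1]);
        _exit(0);
    }

    // parent: the stand-in for the main server
    close(fds[1]);
    (void)write(fds[0], query.data(), query.size());
    shutdown(fds[0], SHUT_WR);  // signal that the request is complete

    std::string reply;
    char buf[256];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof(buf))) > 0) {
        reply.append(buf, static_cast<std::size_t>(n));
    }
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    return reply;
}
```

A long-lived indexer would need message framing (for example a length prefix) instead of relying on end-of-stream, since the socket would stay open across queries.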

@AntoineGS
Contributor Author

I am not familiar enough with Qt or this project as a whole to have a strong opinion on this, but having it run as a separate service could have other benefits, like the ability to run it with elevated rights for fanotify without running the whole vicinae server with them.
If it allows for more freedom I do think it is a good idea.
Playing devil's advocate, I guess the downside would be the added overhead/complexity of the inter-process communication, with which I have little experience, so I can't really speak to that.

Would you keep both in the same repo and still package them together?

In any case I am still interested in helping out with the file indexing so whichever direction you take I will help out 🙂

@aurelleb
Contributor

@AntoineGS the overhead of IPC in this case is negligible, it's definitely not a deal breaker.

The file indexing code would remain in this repo, in fact we would still spawn it with the main vicinae binary using a special argument like vicinae file-indexer or something like that. It's already what we are doing with the wlr-clipboard server.

I can draft a simple boilerplate with the IPC setup and then, if that's the route we want to go, we can start migrating over from the Qt stuff.

@AntoineGS
Contributor Author

AntoineGS commented Nov 14, 2025

I don't really see a downside except the initial effort of separating the processes in that case.
It would also make it easier to track down CPU spikes / high memory usage if it's a different process, which file indexing is more prone to.

@aurelleb
Contributor

I'm going to work on a draft, I may have something out later today, I will let you know about it.

Resolved conflicts:
- vicinae/CMakeLists.txt: Combined library dependencies (sqlite3 + LibXml2)
- vicinae/src/services/files-service/file-indexer/file-indexer.cpp: Integrated regex support with new query engine
- vicinae/include/search-files-view.hpp: Removed (moved to new location in main)

Additional fixes for compilation:
- Added QJsonArray include to extension-manifest.cpp
- Fixed aggregate initialization in extension-manifest.cpp
- Added QApplication include to vlist.cpp
@AntoineGS
Contributor Author

No pressure as I also juggle a few projects but just checking in on the discussed changes :)

@aurelleb
Contributor

So I've been working a little bit on an experimental file indexer that doesn't use sqlite as a backend. It's not even ready to be shared yet, but I think it's probably going to be the best way to achieve the level of file search quality I want.
The scanning logic will remain mostly the same but the indexing will be able to benefit from proper tokenization, which is one of the limiting factors with sqlite. This will also allow me to make file-search specific optimizations because it only needs to be good at that unlike a general purpose database.

I don't think it will support regexp tho, because proper regexp on file search doesn't really seem feasible given the size of the dataset (or it should only apply to a list of prefiltered candidates).

As soon as I have something ready I will share it.

@AntoineGS
Contributor Author

Alright sounds good! Thanks for the update
