Conversation
f5fb607 to
a111da1
Compare
|
Is there any reason you implemented your own regexp utils instead of using the regexp implementation from std ?
I don't really remember why I went for unicode instead of trigram (I considered both) but I think trigram would indeed probably be a better fit here. |
|
Are you referring to regex-utils? |
When you say 'here', are you referring to the query when using regex or also for the existing code? |
Okay got it (haven't reviewed the code so I wasn't sure)
Both actually. I could end up migrating the main index to trigram as well. |
|
Awesome, personally I prefer the looser matching of trigram |
a111da1 to
351ffdd
Compare
|
So the reason you did not go this route is probably because it does not support most special characters, with the most unfortunate one being the 'dot'. On that note, I am pretty happy with the current PR so ready for an official review :) |
351ffdd to
6cff3a8
Compare
|
I get these errors in console and no results: |
65a8374 to
955d4bb
Compare
|
Woups I had not added the migration file to the qrc file so you were probably missing the table.
|
955d4bb to
9a62ef7
Compare
Thank you for all your work, I'm currently working on getting the Vicinae extension store released (v0.16.0) I will review the big things after that. |
|
No worries! |
9a62ef7 to
e1c6982
Compare
|
Since the default search is case-insensitive, would we also want to make the RegEx search case-insensitive? (I have it set to the default case-sensitive at the moment). |
7fcdc03 to
6bce339
Compare
|
minor conflict here. |
|
Yeah that makes sense, though unfortunately sqlite3 has no built-in method for this. I could move the sqlite3-specific code to its own file which would make it easier to migrate (no need to hunt down db engine specific code). I don't really have other ideas to decouple the db code 🫤 |
6bce339 to
e68181d
Compare
|
There are no more conflicts though I did not make changes to the sqlite3 requirement |
|
I'm still not sure what to make of the sqlite requirement, also I'm not a fan of mixing the use of native sqlite stuff and the QT driver for sqlite. I'm going to think about it, honestly I don't know. I'm going to review the fanotify stuff soon though, that's not a requirement for it, right? |
|
I played around with some alternatives and came up with two different ones, maybe it can spark a better solution. I have not yet found a perfect solution, I'll keep on digging! As for the fanotify changes, no this is not a requirement! |
4b8065c to
c26ad02
Compare
|
So I was playing with the shutdown sequence of vicinae and I'm starting to think we should probably move everything related to file indexing in its own subproject, and spawn it as a separate process to keep things clean and separate from QT land. Especially since we are mixing std::thread with QT stuff, which works but can be pretty confusing as we use the QSQL stuff. As I see it, we would let the main vicinae server spawn a file indexing process which would expose a query api using a unix socket or standard file descriptors (communicating through protobuf or json). The file indexer codebase would then be free of all QT code, and use the sqlite library directly. I think it's the best path forward, what do you guys think @AntoineGS @quadratech188 |
|
I am not familiar enough with Qt or this project as a whole to have a strong opinion on this but having it run as a separate service could have other benefits like the ability to run it with elevated rights for fanotify without running the whole vicinae server with them. Would you keep both in the same repo and still package them together? In any case I am still interested in helping out with the file indexing so whichever direction you take I will help out 🙂 |
|
@AntoineGS the overhead of IPC in this case is negligible, it's definitely not a deal breaker. The file indexing code would remain in this repo, in fact we would still spawn it with the main I can draft a simple boilerplate with the IPC setup and then if that's the route we want to go we can start migrating over from the QT stuff. |
|
I don't really see a downside except the initial effort of separating the processes in that case. |
|
I'm going to work on a draft, I may have something out later today, I will let you know about it. |
Resolved conflicts: - vicinae/CMakeLists.txt: Combined library dependencies (sqlite3 + LibXml2) - vicinae/src/services/files-service/file-indexer/file-indexer.cpp: Integrated regex support with new query engine - vicinae/include/search-files-view.hpp: Removed (moved to new location in main) Additional fixes for compilation: - Added QJsonArray include to extension-manifest.cpp - Fixed aggregate initialization in extension-manifest.cpp - Added QApplication include to vlist.cpp
|
No pressure as I also juggle a few projects but just checking in on the discussed changes :) |
|
So I've been experimenting a little bit with an experimental file indexer that doesn't use sqlite as a backend. It's not ready to even be shared yet but I think it's probably going to be the best way to achieve the level of file search quality I want. I don't think it will support regexp tho, because proper regexp on file search doesn't really seem feasible given the size of the dataset (or it should only apply to a list of prefiltered candidates). As soon as I have something ready I will share it. |
|
Alright sounds good! Thanks for the update |
Support for regex in the file search, which is sped up using a trigram index to prevent walking the whole database.
This is still an early draft but it runs.
I still need to review and rework regex-utils as in its current state it was written by Claude AI to pass the series of tests so I am sure it has some issues.
Some edge cases also need some thinking, right now it requires that a minimum of a 3 character word is extracted from the regex to run the trigram match, otherwise there are no results (the alternative would be to execute the regex on the whole dataset which would be slow).
One question I have though, based on what I can see, the trigram algorithm could be a better fit compared to the unicode one, as it allows in-word matching instead of only beginning-of-word matching. Could we maybe replace the base tokenizer too?
Fixes #551