Optimize search algorithm + documentation of prediction.cppm #9
moonbeamcelery wants to merge 10 commits into florisboard:main from
Conversation
patrickgold
left a comment
Thanks a lot for your changes. Had a look on your [Bug?] labels and commented my thoughts on each of them separately.
nlpcore/src/latin/prediction.cppm (outdated)
```cpp
auto prefixCost = 0.0; // Is initialized in next line only if isWordPrefix results in true
// TODO: improve prefix searching performance (run time and stop detection)
// Eligible as prefix candidate? searching for prefix, current word and query word same until `token_index`, and cost at word size within bound.
// [Bug?] editDistanceAt at this stage is either 0 or leftover from last time we searched till word.size().
```
Normally it shouldn't be a bug and the edit distance should be the current one, but I am not 100% sure.
nlpcore/src/latin/prediction.cppm (outdated)
```diff
 } else {
-    // We have an n-gram
+    // We have an n-gram (n > 1)
+    // Only search the user's dictionary for n-grams for now. [Bug?]
```
Not a bug currently; it behaves as described. Only in the future, if we also want to query the language pack dictionaries, would this become a bug. We should probably slap a TODO label on it.
nlpcore/src/latin/prediction.cppm (outdated)
```cpp
/* Compute frequency and merged properties from word-level, n-gram level, and shortcut level.
 * Returns pair:
 * - merged_properties: absolute score = sum frequencies from all types, offensive/hidden = or(that of each n)
 * - frequency: average of (smoothed) frequencies of all types [Bug?] Should add all n = 1..N. should normalize by (n-1)-gram's frequency, not the root's.
```
No, normally P and EntryType are synced, so if I query n-grams I get the merged frequency for n-grams only.
patrickgold
left a comment
Thanks a lot for the changes, really appreciate it!
I've now run your changes using the nlptools binary on PC and have compared the old and new results for suggestion queries. Here are some observations:
A. Parallel dictionary loading seems to run very smoothly and is barely noticeable, except if one immediately starts typing before everything is loaded; on mobile that's more than enough.
B. In the old implementation the confidence value was normalized between 0 and 1; using the new algorithm, however, the confidence values are all extremely low. The question is why this occurs and how it could be fixed.
C. Sometimes the confidence value drops below 0, which should not happen.
D. The current implementation breaks the nlptool actions `prep` and `train` due to API changes in nlpcore; please fix these.
nlpcore/src/common/dictionary.cppm (outdated)
```cpp
~WrappedJThread() {
    thread.join();
}
}
```
```diff
-}
+};
```
You missed a semicolon here
How did this get past compilation...
nlpcore/src/latin/dictionary.cppm (outdated)
```diff
 LatinDictId dict_id_;
-std::shared_ptr<LatinTrieNode> data_;
+// all operations on data_->first need to acquire lock on data_->second please.
+std::shared_ptr<std::pair<LatinTrieNode, std::shared_mutex>> data_;
```
Regarding the `std::pair<LatinTrieNode, std::shared_mutex>`: what I would suggest here is to declare a new struct named `LatinTrieNodeWithLock`, which looks something like this:

```cpp
struct LatinTrieNodeWithLock {
    LatinTrieNode node;
    std::shared_mutex lock;
};
```

This way we make code using this more readable and avoid having to use `first` and `second`, which are not verbose at all.
Then we can write:

```diff
-std::shared_ptr<std::pair<LatinTrieNode, std::shared_mutex>> data_;
+std::shared_ptr<LatinTrieNodeWithLock> data_;
```

which is far more readable.
```diff
 inline LatinTrieNode* insertNgram(std::span<const fl::str::UniString> ngram) noexcept {
-    return algorithms::insertNgram(data_.get(), dict_id_, ngram);
+    std::scoped_lock<std::shared_mutex> lock(data_->second);
+    return algorithms::insertNgram(&(data_->first), dict_id_, ngram);
 }

-inline void forEachEntry(algorithms::EntryAction& action) {
-    algorithms::forEachEntry(data_.get(), dict_id_, action);
+inline void forEachEntryReadSafe(algorithms::EntryAction& action) {
+    std::shared_lock<std::shared_mutex> lock(data_->second);
+    algorithms::forEachEntry(&(data_->first), dict_id_, action);
 }
```
Is there a reason why you sometimes use `scoped_lock` and sometimes `shared_lock`?
Because `scoped_lock` is a non-shared (write) lock, while `shared_lock` is a shared (read) lock. I'd use `scoped_shared_lock` if there were one, but there isn't.
nlpcore/src/latin/prediction.cppm (outdated)
```cpp
similarity = -cost;
double confidence = (w1 * similarity + w2 * frequency) / (w1 + w2);
```
Why assign negative cost to similarity?
It's from the "just get it working" mindset. The similarity does not really need to be in 0-1 for suggestions; it just needs to be correctly ordered so that better matches score higher. I'll work out the math in the next update.
Thanks for the review, Patrick.

A. It would only be noticeable if the user types an uncommon word straight away; otherwise it should be equivalent.

B. I have no idea what's going on; it should not be so small. I'll take a look.

C. I haven't given the score much attention, thinking that as long as it works, it's fine. Right now it's not constrained to 0-1 because that would make it hard to keep the cost non-decreasing during the A* search. I'll make changes so the math is more formal. However, please note that instead of a 0-1 score, we can simply say the score is the log confidence, ranging from -infinity to 0. Log probability is adopted in some libraries, including KenLM mentioned somewhere here. Subtracting a number from the log confidence is just multiplying the confidence by something smaller than 1; for example, subtracting one edit distance from the log confidence is equivalent to multiplying the confidence by 1/e. I'll probably do something along this line and return an exponent at the end.

D. I'm on Debian so I couldn't build nlptools. I'll ping you to get your setup.
It seems that merging in the latest main has messed up the branch, and the diff now shows changes which aren't yours, making PR review pretty hard. Could you maybe undo the merge and properly rebase on main?
Signed-off-by: moonbeamcelery <moonbeamcelery@proton.me>
Hey Patrick,

Added documentation as I try to understand the prediction implementation. Please also check the items labeled as [Bug?] and see if they are actually bugs. I'll try optimizing the search in another PR.