Optimize search algorithm + documentation of prediction.cppm #9
moonbeamcelery wants to merge 10 commits into florisboard:main from
Conversation
patrickgold
left a comment
Thanks a lot for your changes. Had a look on your [Bug?] labels and commented my thoughts on each of them separately.
nlpcore/src/latin/prediction.cppm (outdated)
```cpp
auto prefixCost = 0.0; // Is initialized in next line only if isWordPrefix results in true
// TODO: improve prefix searching performance (run time and stop detection)
// Eligible as prefix candidate? searching for prefix, current word and query word same until `token_index`, and cost at word size within bound.
// [Bug?] editDistanceAt at this stage is either 0 or leftover from last time we searched till word.size().
```
Normally it shouldn't be a bug and the edit distance should be the current one, but I am not 100% sure.
nlpcore/src/latin/prediction.cppm (outdated)
```diff
 } else {
-    // We have an n-gram
+    // We have an n-gram (n > 1)
+    // Only search the user's dictionary for n-grams for now. [Bug?]
```
Not a bug currently; it behaves as described. Only in the future, if we also want to query the language pack dictionaries, would this become a bug. We should probably slap a TODO label on it.
nlpcore/src/latin/prediction.cppm (outdated)
```cpp
/* Compute frequency and merged properties from word-level, n-gram level, and shortcut level.
 * Returns pair:
 * - merged_properties: absolute score = sum frequencies from all types, offensive/hidden = or(that of each n)
 * - frequency: average of (smoothed) frequencies of all types [Bug?] Should add all n = 1..N. should normalize by (n-1)-gram's frequency, not the root's.
```
No, normally P and EntryType are synced, so if I query n-grams I get the merged frequency for n-grams only.
patrickgold
left a comment
Thanks a lot for the changes, really appreciate it!
I've now run your changes using the nlptools binary on PC and have compared the old and new results for suggestion queries. Here are some observations:
A. Parallel dictionary loading seems to run very smoothly and is barely noticeable, except if one immediately starts typing before everything is loaded; on mobile that's more than enough.
B. In the old implementation the confidence value was normalized between 0 and 1; using the new algorithm, however, the confidence values are all extremely low. The question is why this occurs and how it could be fixed.
C. Sometimes the confidence value drops below 0, which should not happen.
D. The current implementation breaks the nlptool actions `prep` and `train` due to API changes in nlpcore; please fix these.
nlpcore/src/common/dictionary.cppm (outdated)
```cpp
~WrappedJThread() {
    thread.join();
}
}
```
```diff
-}
+};
```
You missed a semicolon here
How did this get past compilation...
nlpcore/src/latin/dictionary.cppm (outdated)
```diff
 LatinDictId dict_id_;
-std::shared_ptr<LatinTrieNode> data_;
+// all operations on data_->first need to acquire lock on data_->second please.
+std::shared_ptr<std::pair<LatinTrieNode, std::shared_mutex>> data_;
```
Regarding the `std::pair<LatinTrieNode, std::shared_mutex>`: what I would suggest here is to declare a new struct named `LatinTrieNodeWithLock`, which looks something like this:

```cpp
struct LatinTrieNodeWithLock {
    LatinTrieNode node;
    std::shared_mutex lock;
};
```

This way we make code using this more readable and avoid having to use `first` and `second`, which are not verbose at all.
Then we can write:

```diff
-std::shared_ptr<std::pair<LatinTrieNode, std::shared_mutex>> data_;
+std::shared_ptr<LatinTrieNodeWithLock> data_;
```

which is far more readable.
```diff
 inline LatinTrieNode* insertNgram(std::span<const fl::str::UniString> ngram) noexcept {
-    return algorithms::insertNgram(data_.get(), dict_id_, ngram);
+    std::scoped_lock<std::shared_mutex> lock(data_->second);
+    return algorithms::insertNgram(&(data_->first), dict_id_, ngram);
 }

-inline void forEachEntry(algorithms::EntryAction& action) {
-    algorithms::forEachEntry(data_.get(), dict_id_, action);
+inline void forEachEntryReadSafe(algorithms::EntryAction& action) {
+    std::shared_lock<std::shared_mutex> lock(data_->second);
+    algorithms::forEachEntry(&(data_->first), dict_id_, action);
 }
```
Is there a reason why you sometimes use `scoped_lock` and sometimes `shared_lock`?
Because `scoped_lock` is a non-shared (write) lock, while `shared_lock` is a shared (read) lock. I'd use `scoped_shared_lock` if there were one, but there isn't.
nlpcore/src/latin/prediction.cppm (outdated)
```cpp
similarity = -cost;
double confidence = (w1 * similarity + w2 * frequency) / (w1 + w2);
```
Why assign negative cost to similarity?
It's from the "just get it working" mindset. The similarity does not really need to be in 0-1 for suggestions; it just needs to be correctly ordered so that better matches score higher. I'll work out the math in the next update.
Thanks for the review, Patrick.

A. It would only be noticeable if the user types an uncommon word straight away; otherwise it should be equivalent.

B. I have no idea what's going on; it should not be so small. I'll take a look.

C. I haven't given the score much attention, thinking that as long as it works, it's fine. Right now it's not constrained to 0-1 because that would make it hard to keep the cost non-decreasing during the A* search. I'll make changes so the math is more formal. However, please note that instead of a 0-1 score, we can simply say the score is the log confidence, ranging from -infinity to 0. Log probability is adopted in some libraries, including KenLM mentioned somewhere here. Subtracting a number from the log confidence is just multiplying the confidence by something smaller than 1; for example, subtracting one edit distance from the log confidence is equivalent to multiplying the confidence by 1/e. I'll probably do something along this line and return an exponent at the end.

D. I'm on Debian so I couldn't build nlptools. I'll ping you to get your setup.
It seems that merging in the latest main has messed up the branch, and the diff now shows changes which aren't yours, making PR review pretty hard. Could you maybe undo the merge and properly rebase on main?
Signed-off-by: moonbeamcelery <moonbeamcelery@proton.me>
Hey Patrick,

Added documentation as I try to understand the prediction implementation. Please also check the items labeled as [Bug?] and see if they are actually bugs. I'll try optimizing the search in another PR.