feat(core)!: more explicit handling of case-sensitivity in dictionaries#2630
feat(core)!: more explicit handling of case-sensitivity in dictionaries#263086xsk wants to merge 88 commits intoAutomattic:masterfrom
Conversation
`SpellCheck` shouldn't handle capitalization if `OrthographicConsistency` is going to do it anyway.
- Splits `WordId` into `CanonicalWordId` and `CaseFoldedWordId`. - Updates dictionary functions to more explicitly handle casing. There are now functions to get a specific word case-sensitively, multiple words case-insensitively, and ditto but merge all metadata (which was the old behavior). - Fixes issue where `SpellCheck` would sometimes mark words as incorrect if an identical entry with different casing existed in the dictionary (e.g. OS, PR, etc.). - Makes `SpellCheck` no longer care about casing, since that is handled by `OrthographicConsistency`.
…mattic#2476)" This partially reverts commit 5230d6a. Returns the word to the dictionary, since removing it should no longer be necessary.
Since casing-related issues are now handled by `OrthographicConsistency`, not `SpellCheck`.
Expands the criteria in which `OrthographicConsistency` will lint for incorrect capitalization. Makes the `no_improper_suggestion_for_macos` test pass.
Allow all word casing/orthography that are defined in the dictionary. If the dictionary contains the exact word, `OrthographicConsistency` will skip it.
Both variants are defined in the dictionary, and appear to be valid in this case.
The test would expect 'Al' to be linted by `OrthographicConsistency` for not being all-caps.
Remove needless borrow.
|
Does it handle the same word having multiple This happens a lot more than expected and causes some bugs of this type to crop up. I had a PR for it for many months that just tracked them but didn't really do anything with them as I wasn't sure what was and wasn't planned etc. |
|
I don't recall everything perfectly at the moment but both I think both cases are possible and only one is handled:
I believe you're talking about 1. which works but is behind a few bugs. But 2. is distinct. I found my old PR: #1035 Apologies if I've mixed up any concepts myself - it's been a while (-: |
|
Oh, I see what you mean now. Yeah no, at the moment I think this PR mostly sticks with the old behavior where once a I did change the field to store a I'll take another look though. I wasn't too certain with that change as is, and the PR you mentioned does bring up an issue I haven't even considered. |
If I recall correctly the old behaviour is for each new one to blindly stomp on the previous
Yes I never took on the case folding as I was worried about stepping on the toes of the main devs thinking it might've been an intentional design choice plus I didn't find out it was definitely case folding until months after I did the multi-
It looks like you're fixing up a whole bunch of sticky issues at once - good stuff! |
Support a word having multiple words that it's derived from.
I could be wrong, but I believe it will just keep the first value that was set, since in derived_from: self.derived_from.or(other.derived_from),...it sets the I threw together the changes so that it now supports a word having multiple |
Cool. I remember when I was working on that stuff that it turned out I missed some cases where a word could have multiple |
Avoid unnecessary cloning when using `DictWordMetadata::append`. This reduces the performance regression introduced by storing a Vec inside `DictWordMetadata`.
Move the implementations from `MutableDictionary` into `WordMap`.
Change the way the curated `WordMap` is constructed to more closely align with the way curated dictionaries are constructed. Also, this makes the curated `WordMap` publicly accessible.
Use `WordMap` instead of `FstDictionary` in `MergeableWords`. This avoids dynamic dispatch, and is more fitting since the features of `FstDictionary` aren't used by `MergeableWords`.
Avoid back-and-forth conversion between string and char array in `PhrasalVerbAsCompoundNoun`.
This works with an iterator directly, avoiding an intermediate allocating for `found_words`.
Fixes a warning related to having a `len` function without corresponding `is_empty` function.
Use `WordMap` directly in place of a dictionary when the features of said dictionary aren't necessary nor beneficial for the use case.
Create a new trait to contain `Dictionary` functions without requiring dyn-compatibility; move a significant portion of existing `Dictionary` functions to this trait. Aside from functions needing dyn-incompatible features (like generics), this trait also contains functions that don't and are unlikely to derive any benefit from being overridable (at least at the moment).
Avoid generics where they're not beneficial, and use `WordMap` directly where possible.
Make `MutableDictionary` a type alias for `WordMap`.
Replace `From<MutableDictionary> for FstDictionary` with `WordMap::to_fst`.
Replace `WordMap::extend` with `Extend<WordMapEntry>` impl for additional flexibility.
Rename private members and reorganize imports.
Take `WordMap` instead of `Vec<WordMapEntry>`. Since `WordMap` is already guaranteed to contain unique words only, we can skip deduping.

Issues
Fixes #2585
Related to #1688
Related to #2411
(Probably a few more, this has been a long-running issue.)
Description
WordIdwithCanonicalWordIdandCaseFoldedWordId.CanonicalWordIdhashes the input word as is, without lowercasing or normalization.CaseFoldedWordIdlowercases and normalizes the word before hashing it (this is the current behavior ofWordId).WordIdPairandto make it easier to work withEitherWordIdCanonicalWordId/CaseFoldedWordId.Vecof words, not an individual word.get_word_metadata_combinedfunctions.SpellCheckdespite being present in the dictionary.SpellCheckno longer care about capitalization, fully transferring that responsibility toOrthographicConsistency(which already handles it anyway for the most part).derived_fromfield inDictWordMetadatato contain multiple words.DerivedFromstruct to contain this data without exposing implementation details. The current implementation uses a Vec which allows for the insertion of unique values only.Running
cargo bench --bench parse_essayshowed a performance regression of about ~3% across the board with these changes. I didn't notice any performance issues when testing the modified build ofharper-ls.How Has This Been Tested?
cargo testChecklist