Dialect prototyping #925
packages/web/src/routes/docs/integrations/language-server/+page.md
I'll look at the actual changes in a minute but I just wanted to write the first things I notice here:
@@ -4653,7 +4653,7 @@ Hymen/2M
 Hyperion/2M
 Hyundai/1M
 Hz/M
-I/8~sf
+I/8~f
"I" is a subject pronoun (`s`), we just haven't started building any logic on that fact yet.
I removed the `s` because it was overriding the `f`.
harper-core/dictionary.dict
@@ -8483,7 +8482,7 @@ Ritz/2M
 Rivas/2M
 Rivera/2M
 Rivers/2M
-Riverside/2M
+Riverside/2M@
A bunch of what look like generic place names and random words are getting marked as Canadian (`@`). Is this right? (See also `Vallejo`, `Villa`, `abridge`, etc.) Maybe I just can't see the pattern yet?
To add the dictionaries of other languages I:
- Diffed the original American English dictionary with the dialect's version of the dictionary.
- Added all words from the dialect's side of the diff into the dictionary, marking each (automatically) as being from that dialect.
For some reason, Riverside (and friends) wasn't in the American dictionary we started with. I'll go through and manually check each one.
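The two steps above can be sketched roughly as follows. This is only an illustration, not the script used for the PR: `merge_dialect` and the sample entries are hypothetical, and the only detail taken from this thread is that `@` is the marker reviewers saw on Canadian entries.

```python
def merge_dialect(american_lines, dialect_lines, dialect_flag):
    """Return the entries found only in the dialect list, tagged with dialect_flag.

    Hypothetical helper: entries are compared as whole lines (word plus
    annotations), which is an assumption about how the diff was taken.
    """
    american = set(american_lines)
    added = []
    for entry in dialect_lines:
        if entry in american:
            continue  # already present in the American dictionary
        # Append the flag to the existing annotations, or start an annotation list.
        if "/" in entry:
            added.append(entry + dialect_flag)
        else:
            added.append(entry + "/" + dialect_flag)
    return added


# Illustrative data only. "Riverside/2M" is missing from the American side,
# so it gets tagged as Canadian even though it is not dialect-specific.
american = ["Rivera/2M", "Rivers/2M"]
canadian = ["Rivera/2M", "Rivers/2M", "Riverside/2M"]
print(merge_dialect(american, canadian, "@"))
```

A purely automatic pass like this cannot tell a genuinely dialect-specific word from one that was simply absent from the starting dictionary, which is why the tagged entries need a manual check.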
> To add the dictionaries of other languages I:
> - Diffed the original American English dictionary with the dialect's version of the dictionary.
> - Added all words from the dialect's side of the diff into the dictionary, marking each (automatically) as being from that dialect.
>
> For some reason, Riverside (and friends) wasn't in the American dictionary we started with. I'll go through and manually check each one.

Ah OK. I looked at most of the new and changed lines that were between old lines, but I didn't go through the section with lots of new material appended at the end to see how any of these patterns played out. I think there will be curation work to do, as there already of course was.
I think I have one outstanding dictionary curation PR that got held up because there was some code work in the same PR that got messy due to conflicts. I'm pretty sure I managed to clean it all up, so when that is merged and the dialect stuff is merged I'll try to start some more dictionary cleanup aimed at making curation easier. I won't do any new dictionary curation PRs in the meantime.
@@ -49062,7 +49062,7 @@ wayside/15SM
 wayward/~5PY
 waywardness/1M
 wazoo/1S
-we/~sf:
+we/~:
Are we dropping the pronoun properties intentionally? I see they're still in `affixes.json`.
No. I'm not sure why that happened...
> No. I'm not sure why that happened...

I would love to enhance the syntax highlighters for the dictionary format for a kind of "intellisense" that can tell what each annotation means etc. But I think that requires an LSP... Or maybe I could do a specialized VS Code extension for it. But I think everyone else uses Neovim...
> But I think everyone else uses Neovim...

Actually, I think @grantlemons and I are the only consistent committers who use Neovim. I've seen a growing number of commits from people (you and @mcecode) who use VS Code.
If you think it would be genuinely useful, go for it.
> But I think that requires an LSP...

I remember a discussion about making a tree-sitter grammar for the dictionary. If that ever came to fruition, it would make making a language server and providing intellisense much easier since you wouldn't need to do the parsing yourself.
> > But I think that requires an LSP...
>
> I remember a discussion about making a tree-sitter grammar for the dictionary. If that ever came to fruition, it would make making a language server and providing intellisense much easier since you wouldn't need to do the parsing yourself.

Yes, I made a Tree-sitter grammar, and it uncovered a bug in Zed when I tried to use that grammar. It's on GitHub in two unfinished versions due to the trouble I had tracking down the Zed bug. They decided not to accept my PR and addressed it in a different way, by not allowing grammars with hyphens in their names... other than the `Tree-sitter-` part.
I also haven't updated it to use the `#` delimiter for comments that we decided to go with.
I loaded up some docs on LSPs but haven't ingested much yet. Do they rely on Tree-sitter? I ask because I know VS Code doesn't use it for syntax highlighting. A curated dictionary LSP would only need "hover" support as far as I can see.
Go ahead and open a feature request for a `dictionary.dict` LSP so we can continue this discussion there.
> Do they rely on Tree-sitter

Not necessarily, but I assume you'd need some sort of parsing mechanism, which is why I brought it up.

> I know VS Code doesn't use it for syntax highlighting.

Yup, it uses TextMate grammars (for now, but they're experimenting with using tree-sitter), and I remember you made one of those before as well.

> A curated dictionary LSP would only need "hover" support as far as I can see.

If that's how it is, then maybe you won't need tree-sitter as much. I can imagine splitting the symbols after `/` wouldn't be too difficult and may not require a full parser.
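As a rough sketch of that idea: a hover provider could split each line at the first `/` and look up every annotation character in a table. Everything here is hypothetical. `describe_entry` is not part of Harper, the one-character-per-annotation rule and the `#` comment handling are assumptions based on this thread, and the descriptions would have to come from somewhere like `affixes.json`.

```python
def describe_entry(line, descriptions):
    """Split a dictionary line into (word, [(annotation, description), ...]).

    Returns None for blank or comment-only lines. Assumes '#' starts a
    comment and that each character after the first '/' is one annotation.
    """
    entry = line.split("#", 1)[0].strip()
    if not entry:
        return None
    word, _, flags = entry.partition("/")
    described = [(flag, descriptions.get(flag, "unknown annotation")) for flag in flags]
    return word, described


# Placeholder table: only '@' is described, using the meaning reviewers
# inferred in this thread. Real descriptions are not reproduced here.
hypothetical_descriptions = {"@": "Canadian dialect"}
print(describe_entry("Riverside/2M@", hypothetical_descriptions))
```

For hover-only support, string splitting of this kind may be enough. A real parser would matter more if annotations ever became multi-character or context-dependent.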
> Go ahead and open a feature request for a `dictionary.dict` LSP so we can continue this discussion there.

I think a discussion would be more appropriate for now.
@@ -83,6 +83,18 @@
 "default": false,
 "description": "Make code actions appear in \"stable\" positions by placing code actions that should always be available, like adding misspelled words in the dictionary, first."
 },
+"harper.dialect": {
Does this mean I should have a "Dialect" setting in VS Code? I can't seem to find it.
Since these changes aren't merged and released yet, you'd need to run the extension within an extension host for development to see this new setting.
> Since these changes aren't merged and released yet, you'd need to run the extension within an extension host for development to see this new setting.

Oh thanks. I'll see if I can get it working...
This is actually quite confusing to me. I've done a tiny bit of work on VS Code extensions before so I know about the extension host. But my brain is having trouble resolving that with how I usually hack on and test Harper. I'll try to provide more constructive feedback that might help to improve the docs there in case there are other dummies like me (-:
> This is actually quite confusing to me.

No worries! VS Code extension things were confusing to me too when I started working on them (and they still sometimes are). I think your workflow mainly follows the VS Code section in our Author a Rule guide, am I right? I think making some clarifications to that section and linking to the VS Code guide would make things much clearer on how things work. I'll try to make a PR for that and other docs updates that I think would be good when I find the time.
Co-authored-by: mcecode
@hippietrail, that's actually because the diff between the source dictionaries for American English and Australian English had zero lines of difference. I'm not super familiar with Australian English. Would you be able to take up the necessary changes to the dictionary?
That makes sense. I started to think that might be why. I know vocabulary differences but I can't think of any spelling differences. I might do a linter for wrong dialect vocabulary at some point, but I am getting a bit out of touch as I get older and find other Aussies using more and more words I used to only hear on American media.
Alright. I'm going to merge this. Most of the remaining work for this falls into the category of "dictionary curation," which should take place in separate PRs.
Referenced from a Renovate Bot MR updating Automattic/harper/harper-ls from `v0.24.0` to `v0.26.0`; the `v0.26.0` release notes list "Dialect prototyping by @elijah-potter in Automattic/harper#925".
Issues
#345
Description
This is a large patch, so there's a lot to cover:
- […] `Dialect` type to represent the 4 major English dialects.
- […] `WordMetadata`, `Token`, and `TokenKind` to no longer be `Copy`. This affected performance by about -8%.
- […] `MutableDictionary` (and by extension the `FstDictionary`) to use word hashes as keys in a single map.
- […] `derived_from` element to the `WordMetadata`. This allows linters to determine the base word (the base word for `bananas` is `banana`) for an affixed word.
- […] `hunspell` module to `rune`. This module will slowly evolve to contain a more bespoke dictionary file format. It has already evolved from the hunspell format sufficiently for a name change.
- […] `MutableDictionary` at runtime using Rune files.
- […] `en-GB`, `en-CA` and `en-AU`.

Next Steps Before Merge
Before I can feel comfortable merging this PR, there are a couple things that have to happen:
Rune needs to be extended to be able to describe the relationships between words of various dialects (`color` -> `colour`).
Edit: the built-in spell check seems to convert between dialects quite well on its own. There isn't any immediate need for manual tagging.
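To make the description's "word hashes as keys in a single map" and `derived_from` items more concrete, here is a small Python sketch. It is not Harper's Rust implementation: `HashedDictionary`, `WordEntry`, and `word_id` are hypothetical names, the hash function is a stand-in, and storing the base word's hash in `derived_from` is an assumption, since the PR text does not say how the link is represented.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


def word_id(word: str) -> int:
    """Stable 64-bit hash of a word (a stand-in for whatever hash is really used)."""
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


@dataclass
class WordEntry:
    word: str
    derived_from: Optional[int] = None  # hash of the base word, if affixed (assumed)
    dialect: Optional[str] = None       # e.g. "en-GB"; None means unmarked


class HashedDictionary:
    """A single map from word hash to entry."""

    def __init__(self) -> None:
        self._entries: Dict[int, WordEntry] = {}

    def insert(self, word: str, derived_from: Optional[str] = None,
               dialect: Optional[str] = None) -> None:
        base = word_id(derived_from) if derived_from is not None else None
        self._entries[word_id(word)] = WordEntry(word, base, dialect)

    def base_word(self, word: str) -> Optional[str]:
        """Follow derived_from one step to recover the base word, if any."""
        entry = self._entries.get(word_id(word))
        if entry is None or entry.derived_from is None:
            return None
        base = self._entries.get(entry.derived_from)
        return base.word if base is not None else None


d = HashedDictionary()
d.insert("banana")
d.insert("bananas", derived_from="banana")
print(d.base_word("bananas"))
```

Keeping the link as a hash means a linter can reach the base word with one more lookup in the same map, rather than re-running affix rules.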