feat: split, sort, and deduplicate curated dictionary #594


hippietrail
Contributor

@hippietrail hippietrail commented Feb 5, 2025

Splits the curated dictionary using one blank line between the last original/hunspell entry "zymurgy" and entries added for Harper.

Entries added for Harper sorted so it's easier to spot duplicates or near duplicates.

Removed the one duplicate entry, "TODO", which sorting revealed.
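For anyone who wants to reproduce the split, sort, and dedupe step, here is a rough sketch. It is not the script used for this PR. It assumes one entry per line with an optional `/FLAGS` suffix, and `sort_added_entries` and `word_of` are hypothetical names made up for illustration.

```rust
/// Returns the word part of an entry, i.e. everything before the
/// optional `/FLAGS` suffix. (Hypothetical helper for this sketch.)
fn word_of(line: &str) -> &str {
    line.split('/').next().unwrap_or(line)
}

/// Keeps everything up to and including the first entry whose word equals
/// `marker` untouched, then sorts and deduplicates the entries after it.
/// If `marker` is not found, the input is returned unchanged.
fn sort_added_entries(lines: &[&str], marker: &str) -> Vec<String> {
    // Index just past the marker entry, or the end if there is no marker.
    let split_at = lines
        .iter()
        .position(|l| word_of(l) == marker)
        .map(|i| i + 1)
        .unwrap_or(lines.len());

    let (original, added) = lines.split_at(split_at);

    // Byte-wise sort, then drop exact duplicate lines (now adjacent).
    let mut added: Vec<String> = added.iter().map(|s| s.to_string()).collect();
    added.sort();
    added.dedup();

    let mut out: Vec<String> = original.iter().map(|s| s.to_string()).collect();
    out.extend(added);
    out
}
```

Only exact duplicate lines are dropped here. Near duplicates (same word, different flags) stay in, but end up next to each other after sorting, which is what makes them easy to spot.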

The name 'affixes' turns out to be a misnomer since about a quarter or a third of them don't add any prefix or suffix but change a property such as the part-of-speech, or singular vs plural.

Another approach would be to switch from `json` to either [`JSON5`](https://json5.org/) or [`Hjson`](https://hjson.github.io/)
@hippietrail hippietrail changed the title feat: adds a brief helpful comment to each entry describing its function feat: split, sort, and deduplicate curated dictionary Feb 5, 2025
@hippietrail hippietrail marked this pull request as draft February 5, 2025 09:52
@hippietrail
Contributor Author

Oh well. I did test this, but I don't think I have all of the tools installed that a complete `just precommit` needs, so I missed that the one blank line breaks it. I've turned this into a draft as it may still be of interest to people.

Removing the one blank line dividing the old from the new will probably make it pass precommit tests.

@elijah-potter
Collaborator

Just let me know when you want me to review this.

@hippietrail
Contributor Author

> Just let me know when you want me to review this.

Well, it turns out that having a blank line breaks something in the building and/or testing, even though it runs fine.

But you're reminding me that I had another idea. To have two curated dictionary files in the source that simply get concatenated into one as a build step to make the FstDictionary.

We could make the rule that domain-specific proper nouns such as tech product names go into another file, and that anything needing more than just `/M` (which gives MyProduct + MyProduct's) qualifies for the normal-word dictionary.
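A minimal sketch of how that rule and the concatenation step might look. None of this exists in the repo: `fits_product_file` and `concat_dictionaries` are hypothetical names, and the sketch assumes neither source file carries a word-count header line.

```rust
/// Returns true when an entry has no flags at all, or exactly `M`,
/// which is the proposed test for the product-name file.
/// (Hypothetical helper; it checks flags only, not capitalization.)
fn fits_product_file(entry: &str) -> bool {
    match entry.split_once('/') {
        None => true,
        Some((_, flags)) => flags == "M",
    }
}

/// Joins two dictionary sources into a single newline-terminated string,
/// general entries first. Works whether or not each input already ends
/// with a trailing newline.
fn concat_dictionaries(general: &str, products: &str) -> String {
    let mut out = String::new();
    for line in general.lines().chain(products.lines()) {
        out.push_str(line);
        out.push('\n');
    }
    out
}
```

The combined string could then be handed to whatever currently reads the single curated file, so nothing downstream of the build step would need to know about the split.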

@hippietrail
Contributor Author

> Just let me know when you want me to review this.

I made a PR that will support blank lines in dictionary.dict and also end-of-line comments.
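To illustrate the idea (this is not the code from that PR), a tolerant line parser could look roughly like the following. The `#` comment marker is an assumption made for this sketch, and `parse_entries` is a hypothetical name.

```rust
/// Parses dictionary source text into `(word, flags)` pairs, skipping
/// blank lines and stripping end-of-line comments.
/// ASSUMPTION: comments start with `#`; the real syntax may differ.
fn parse_entries(source: &str) -> Vec<(String, String)> {
    source
        .lines()
        .filter_map(|line| {
            // Drop everything from the first `#` onward, then trim whitespace.
            let content = line.split('#').next().unwrap_or("").trim();
            if content.is_empty() {
                return None; // blank line or comment-only line
            }
            // Split into the word and its optional flags.
            let (word, flags) = content.split_once('/').unwrap_or((content, ""));
            Some((word.to_string(), flags.to_string()))
        })
        .collect()
}
```

With something like this, a blank line or a comment-only line could act as the visible divider between the original entries and the ones added for Harper.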

This would allow me to try a second time at separating the dictionary into "normal" evergreen dictionary words and "speciality" words such as domain-specific company and product names, which tend to vary in terms of capitalization, hyphenation, etc, anyway.

(And also a VS Code syntax highlighter extension that supports it all. But not in a PR as of yet.)

One thing holding it up is whether we officially support entries with spaces. So far there is exactly one. (I'm thinking more and more a separate curated dictionary for multi-word lexemes/listemes would be the best way forward.)
