Add sesotho Stemmer #260
base: master
full_build.log
Outdated
| @@ -0,0 +1,138 @@ | |||
| libstemmer/mkalgorithms.pl algorithms.mk libstemmer/modules.txt | |||
This log shouldn't be in git.
algorithms/sesotho.sbl
Outdated
| do remove_nominal_suffixes | ||
| do remove_verb_suffixes | ||
| do remove_noun_prefixes | ||
| ) No newline at end of file |
Just a minor nit, but please include a newline character at the end of the final line (GitHub's red icon here means there isn't one). It's better to include one, as some text processing tools can behave unhelpfully without it, and some editors will automatically add one, which can create noise in future PRs.
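One way to check for a missing final newline before pushing is sketched below in Python. The helper names are hypothetical and not part of Snowball's tooling, and an empty file is treated as acceptable, which is an assumption.

```python
from pathlib import Path


def ends_with_newline(data: bytes) -> bool:
    """Return True if the data is empty or its last byte is a line feed."""
    # Assumption: an empty file is treated as fine, since there is no
    # unterminated final line to complain about.
    return not data or data.endswith(b"\n")


def file_ends_with_newline(path: str) -> bool:
    """Read the file as bytes and apply the check above."""
    return ends_with_newline(Path(path).read_bytes())
```

For example, `file_ends_with_newline("algorithms/sesotho.sbl")` would return `False` for the version of the file reviewed here.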
libstemmer/modules.txt
Outdated
| tamil UTF_8 tamil,ta,tam | ||
| turkish UTF_8 turkish,tr,tur | ||
| yiddish UTF_8 yiddish,yi,yid | ||
| sesotho UTF_8 sesotho,st,sot |
Judging from the stemmer code, this language seems to use only ASCII characters so we can put UTF_8,ISO_8859_1 here.
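A rough way to double-check the ASCII-only observation is to scan the string literals in the `.sbl` source. The Python sketch below is an illustration only: the function name is hypothetical, the regular expression is a simplification that assumes literals are plain single-quoted strings, and it does not expand `stringdef` escapes (which can denote non-ASCII characters) or skip apostrophes inside comments.

```python
import re

# Simplified pattern: a single-quoted run with no embedded quote.
# Assumption: this is good enough for literals such as 'nyana' or 'ana'.
LITERAL_RE = re.compile(r"'([^']*)'")


def non_ascii_literals(source: str) -> list[str]:
    """Return the single-quoted literals that contain non-ASCII characters."""
    return [lit for lit in LITERAL_RE.findall(source) if not lit.isascii()]
```

An empty result suggests the literals fit in both UTF_8 and ISO_8859_1, subject to the caveats above.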
Thanks for submitting this. There's a bit of a queue of stemmers waiting to be reviewed currently but I'll at least do a quick initial review. The CI failures look to be due to the test data - I'll comment on the snowball-data PR about that.
…e modules text file
| [substring] among( | ||
| 'nyana' /* diminutive form */ | ||
| 'ana' /* diminutive form */ | ||
| 'ano' |
Was this change intended? (Asking because it wasn't covered by the commit message)
Yes, it was intentional.
This is a stemmer for Sesotho (Southern Sotho), a Bantu language spoken in South Africa and Lesotho.