Is it possible to add official support for a Slovene stemming algorithm to Snowball?
Martin Porter started working on a Slovene stemmer in 2005, but never finished it because it had some problems. That stemmer could probably be used as a starting point.
I found some papers about Slovene stemming that might be useful:
- Stemming of Slovenian library science texts: https://www.researchgate.net/publication/50392133_Stemming_of_Slovenian_library_science_texts
- The effectiveness of stemming for natural-language access to Slovene textual data: https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291097-4571%28199206%2943%3A5%3C384%3A%3AAID-ASI6%3E3.0.CO%3B2-L
- Processing of documents and queries in a Slovene language free text retrieval system: https://academic.oup.com/dsh/article-abstract/5/2/182/943275 (this one is actually referenced in the Snowball introduction)
I'm not familiar with how Snowball algorithms work, but here are my suggestions for some of the questions raised about the original algorithm:
> Would not sloven (or slov), be a more desirable stem in this case?
I think "sloven" would be the most appropriate.
> Another point. I notice a common -ah suffix, which you have not removed, as for example here [...] besedah besedah [...] Could this be added to the list of suffixes?
I don't know what else that would affect, but the -ah suffix should be removed in cases like "besedah".
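
To make the -ah suggestion concrete, here is a rough sketch in Python. It is not Snowball code and not how the 2005 draft stemmer works; the function name `strip_ah` and the minimum stem length of 4 are my own assumptions, picked only so that very short words are left alone.

```python
def strip_ah(word: str, min_stem_len: int = 4) -> str:
    """Remove a trailing -ah when the remaining stem is long enough.

    min_stem_len is a hypothetical guard, not a value taken from the
    existing stemmer; it only keeps very short words from being cut.
    """
    suffix = "ah"
    if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
        return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["besedah", "beseda", "ah"]:
        print(w, "->", strip_ah(w))
```

With this rule "besedah" becomes "besed", while words that do not end in -ah, or that are too short, come back unchanged. A real stemmer would also have to decide how this rule interacts with the other suffixes in the list.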