Spun off from #165 (comment)
@pramsey reported that the two plurals of bus are not conflated:
$ printf 'bus\nbuses\nbusses\n'|./stemwords -l en -p
bus -> bus
buses -> buse
busses -> buss
There is a cost to exceptions, especially ones that need to be checked for every word stemmed, so we don't generally worry about irregular cases if we're no worse off than we would be without any stemming. However, if the irregular forms are conflated with words which have different (or different enough) meanings, then that's a different matter.
Here the only potentially unwanted conflation seems to be with buss (an archaic word for a kiss/to kiss). If we're worrying about buss, then busses is also the plural of that noun and the third person singular of that verb, so it's inherently ambiguous.
If this is part of a wider pattern which we can come up with a sensible rule for then it might be worth an exception. So far I've spotted these other words ending -s which add -es for the plural and can double the s or not:
- The plural of bias can be biases (stem bias) or biasses (stem biass, same as biassed and biassing)
- The plural of gas can be gases (stem gase) or gasses (stem gass, same as gassed and gassing)
- The plural of yes (as a noun) can be yeses (stem yese) or yesses (stem yess)
(We already have an exceptional invariant entry for bias to prevent us removing the s.)
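To make the conflation picture easier to see, here is a small sketch that groups the words above by stem. The stems are copied from this issue rather than recomputed, and `REPORTED_STEMS` and `group_by_stem` are names made up for this illustration, not part of Snowball.

```python
from collections import defaultdict

# Stems as reported in this issue for the English stemmer (not recomputed here).
REPORTED_STEMS = {
    "bus": "bus", "buses": "buse", "busses": "buss",
    "bias": "bias", "biases": "bias",
    "biasses": "biass", "biassed": "biass", "biassing": "biass",
    "gas": "gas", "gases": "gase",
    "gasses": "gass", "gassed": "gass", "gassing": "gass",
    "yeses": "yese", "yesses": "yess",
}


def group_by_stem(stems):
    """Map each stem to the list of words that share it."""
    groups = defaultdict(list)
    for word, stem in stems.items():
        groups[stem].append(word)
    return dict(groups)


GROUPS = group_by_stem(REPORTED_STEMS)
for stem, words in sorted(GROUPS.items()):
    print(f"{stem}: {', '.join(words)}")
```

Any stem listed with a single word is a form that is not conflated with anything else in this set; only bias and biases end up together with their singular.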
It seems any new rule for this can't just look at the ending since the current handling of e.g. vases->vase and masses->mass is what we want.
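As a rough illustration of why the ending alone isn't enough, consider a deliberately naive rule that only looks at the suffix. `naive_es_rule` is hypothetical and is not something proposed for the stemmer; it just shows the collateral damage.

```python
def naive_es_rule(word):
    """Hypothetical ending-only rule: map -sses and -ses to -s."""
    if word.endswith("sses"):
        return word[:-3]  # drop "ses", leaving a single s
    if word.endswith("ses"):
        return word[:-2]  # drop "es", leaving the s
    return word


for w in ["buses", "busses", "gases", "gasses", "vases", "masses"]:
    print(w, "->", naive_es_rule(w))
```

This conflates buses/busses and gases/gasses as hoped, but it also turns vases into vas and masses into mas, where the current vase and mass are what we want.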
I checked the mailing list archives and gas/gases/gasses has been noted at least twice before (and gas was improved so that it no longer stems to ga). Martin summarised that change (probably the last in this area):
-s removal has been changed. You now need a vowel somewhere before the letter before the s. So 'gas', 'this', 'has', 'was' keep the s, 'dogs', 'cats', 'woos', 'kiwis' lose the s. Usefully, the s is not removed from non-words like 'cvs', 'spss', 'lms' etc.
In general there is a problem identifying plurals of words ending Xs, where
X is vowel other than e. As you know, porter2 leaves -us alone but removes s
after a,i,o. This works fairly well.
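The -s removal condition Martin describes can be sketched roughly as follows. This is only an approximation of that one condition in isolation, not the actual Snowball code: `remove_final_s` is a made-up name, it ignores the other suffix rules and the exception list, and it treats y as a vowel everywhere, which is a simplification.

```python
VOWELS = set("aeiouy")  # simplification: y is always counted as a vowel here


def remove_final_s(word):
    """Drop a final s only if a vowel occurs before the letter preceding it."""
    if not word.endswith("s"):
        return word
    if word.endswith(("us", "ss")):
        return word  # -us and -ss are left alone
    if any(ch in VOWELS for ch in word[:-2]):
        return word[:-1]
    return word


for w in ["gas", "this", "has", "was", "dogs", "cats", "woos", "kiwis",
          "cvs", "spss", "lms", "bus", "bias"]:
    print(w, "->", remove_final_s(w))
```

Under this sketch bias would lose its s, since the i comes before the letter preceding the s, which is consistent with bias needing an invariant entry.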