Howdy, I have my own grapheme implementation and I'm adding this library to my benchmarks.
As part of this I've been checking consistency between the two. One issue I've found is that negative-index slicing isn't consistently supported. In `grapheme_slice` it doesn't work, because a negative start position is clamped to 0:
ugrapheme/ugrapheme/ugrapheme.pyx, lines 633 to 634 in 8ba5a96:

    if startpos < 0:
        startpos = 0

but in `gslice` it does:

ugrapheme/ugrapheme/graphemes.pyx, lines 2625 to 2628 in 8ba5a96:

    if pos < 0:
        pos += sgl
        if pos < 0:
            pos = 0

In terms of benchmarks, this implementation looks to be between 2x and 10x faster than mine, so I might switch to it if I can't improve mine. My benchmark is pretty hacky, but if you're interested you're welcome to try it, or I can share some results. It generates random strings of ~1000 clusters from the example test cases and times each implementation, giving fine-grained results for different use cases.
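For a rough idea of the shape of such a benchmark, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: `SAMPLE_CLUSTERS` is a small hand-picked list rather than the real example test cases, and `make_text`, `time_call` and `bench` are hypothetical helper names, not the actual benchmark code.

```python
import random
import timeit

# Hypothetical sample clusters; a real run would draw from the example test cases.
SAMPLE_CLUSTERS = [
    "a",
    "e\u0301",                      # e + combining acute accent
    "\r\n",                         # CR LF
    "\U0001F1E6\U0001F1FA",         # regional indicator pair (flag)
    "\U0001F469\u200D\U0001F4BB",   # emoji ZWJ sequence
]


def make_text(n_clusters: int = 1000, seed: int = 0) -> str:
    """Concatenate randomly chosen sample clusters into one test string."""
    rng = random.Random(seed)
    return "".join(rng.choice(SAMPLE_CLUSTERS) for _ in range(n_clusters))


def time_call(func, text: str, repeat: int = 5, number: int = 10) -> float:
    """Best-of-`repeat` time in seconds for a single call of func(text)."""
    runs = timeit.repeat(lambda: func(text), repeat=repeat, number=number)
    return min(runs) / number


def bench(implementations: dict, text: str) -> dict:
    """Time every implementation on the same text."""
    return {name: time_call(func, text) for name, func in implementations.items()}


# Usage with stand-in callables; swap in the real cluster-splitting functions.
results = bench({"list": list, "len": len}, make_text())
```

Seeding the generator keeps the input identical across implementations, so timing differences come from the code under test rather than from the data.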
In my implementation I translate each code point into a single character that stands for its character category, then run a regex over the result. It's interesting to see the different approaches everyone takes. Mine was the fastest I'd found for most things until now, even though it's pure Python, no Cython. That said, I don't pre-parse any of the tables, so the first use has to parse the data files and cache the results, which is a bit slow but easier to read.
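A heavily simplified sketch of that category-string idea, for illustration only. The classification in `category_char` is a toy approximation (a few hard-coded ranges plus `unicodedata.category`), not the real Unicode property tables, and the regex covers only a handful of the segmentation rules. The names `category_char`, `CLUSTER_RE` and `split_clusters` are hypothetical.

```python
import re
import unicodedata


def category_char(ch: str) -> str:
    """Map one code point to a one-letter category (toy approximation)."""
    cp = ord(ch)
    if ch == "\r":
        return "R"                      # CR
    if ch == "\n":
        return "N"                      # LF
    if cp == 0x200D:
        return "Z"                      # zero width joiner
    if 0x1F3FB <= cp <= 0x1F3FF:
        return "E"                      # emoji skin-tone modifiers extend
    if 0x1F1E6 <= cp <= 0x1F1FF:
        return "I"                      # regional indicators
    if 0x1F300 <= cp <= 0x1FAFF:
        return "P"                      # rough stand-in for pictographic emoji
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Me"):
        return "E"                      # rough stand-in for Extend
    if cat == "Cc":
        return "C"                      # other control characters
    return "O"                          # everything else


# The regex runs over the category string, one letter per code point.
CLUSTER_RE = re.compile(
    r"""
      RN                          # CR LF stays together
    | [RNC]                       # any other control stands alone
    | II [EZ]*                    # a pair of regional indicators
    | P E* (?: Z P E* )* [EZ]*    # emoji joined by ZWJ
    | [^RNC] [EZ]*                # anything else plus trailing Extend/ZWJ
    """,
    re.VERBOSE,
)


def split_clusters(text: str) -> list:
    """Split text using match spans found on the category string."""
    cats = "".join(category_char(ch) for ch in text)
    return [text[m.start():m.end()] for m in CLUSTER_RE.finditer(cats)]
```

Because the category string has exactly one character per code point, the match offsets can be used directly as offsets into the original text.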