Description
pythainlp.util.collate() returns a wrong ordering,
because the current implementation ignores tone marks and symbols when sorting.
Try this code:
from pythainlp.util import collate
collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"])
Expected results
Ordering according to Thai dictionary
['กวย', 'ก่วย', 'ก่วย', 'ก้วย', 'ก้วย', 'ก๊วย', 'ก๋วย']
Current results
['ก้วย', 'ก๋วย', 'ก่วย', 'ก้วย', 'ก่วย', 'ก๊วย', 'กวย']
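One possible direction, shown only as a minimal sketch and not as the library's actual implementation: sort on a two-part key, where the first part is the word with tone marks removed and the second part is the removed marks, used as a tiebreaker. The names `thai_sort_key` and `collate_sketch` are hypothetical. Treating the whole range U+0E47 to U+0E4E as secondary, and swapping a leading vowel with the consonant that follows it, are simplifying assumptions about Thai dictionary order.

```python
import re

# Assumption: tone marks and similar above-base signs (U+0E47..U+0E4E)
# are ignored on the first comparison and only break ties.
_MARKS = re.compile("[\u0e47-\u0e4e]")

# Assumption: a leading vowel (U+0E40..U+0E44) sorts after the
# consonant (U+0E01..U+0E2E) that it is written in front of.
_LEADING_VOWEL = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word):
    """Return a (base, marks) tuple to use as a sort key (hypothetical helper)."""
    swapped = _LEADING_VOWEL.sub(r"\2\1", word)
    base = _MARKS.sub("", swapped)            # primary: letters without marks
    marks = "".join(_MARKS.findall(swapped))  # secondary: the marks, in order
    return (base, marks)


def collate_sketch(words, reverse=False):
    """Sort Thai words with tone marks as a tiebreaker (illustrative only)."""
    return sorted(words, key=thai_sort_key, reverse=reverse)


print(collate_sketch(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]))
# ['กวย', 'ก่วย', 'ก่วย', 'ก้วย', 'ก้วย', 'ก๊วย', 'ก๋วย']
```

The tiebreaker relies on the four tone marks having code points in the same order as the dictionary expects (mai ek, mai tho, mai tri, mai chattawa), so no explicit weight table is needed for this case.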
Your environment
Files
pythainlp/util/collate.py
Proposed test case
import unittest

from pythainlp.util import collate


class TestUtilPackage(unittest.TestCase):
    # ### pythainlp.util.collate
    def test_collate(self):
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "กวย", "ก่วย", "ก๊วย"]),
            collate(["ก๋วย", "ก่วย", "ก้วย", "ก๊วย", "กวย"]),
        )  # should guarantee same order
        # Tone marks should follow Thai dictionary order
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]),
            ["กวย", "ก่วย", "ก่วย", "ก้วย", "ก้วย", "ก๊วย", "ก๋วย"],
        )