Use KoreanAnalyzer for Korean language (ko) #2174
sudokim wants to merge 1 commit into castorini:master
Hi @sudokim, thanks for the PR! Do you have any idea whether effectiveness improves as a result of switching the analyzer, e.g., on MIRACL or Mr.Tydi?
Hi @lintool, here is the comparison result between
It seems that
Great! Do you happen to have MRR scores? And also results on MIRACL? (Which will give us nDCG scores.)
Sure! Here are the results:
Mr.Tydi v1.1
MIRACL
Awesome, that's great! We'll get this merged in... but it triggers a long dependency chain... we need to fix the regression... we also need to fix the pre-built indexes for Pyserini, etc. Let me queue this up and figure out the cleanest way to do this. In the meantime, would you be willing to add a test case that confirms tokenization is done "correctly"?
This PR enables the use of KoreanAnalyzer, an analyzer specialized for Korean.
The previous CJKAnalyzer only splits sequences into bi-grams, while KoreanAnalyzer splits a sentence into morphemes.
LUCENE-8231
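To make the bi-gram behavior described above concrete, here is a minimal Python sketch of overlapping character bi-grams. It is an approximation for illustration only, not Lucene's implementation: the function name `cjk_style_bigrams` is hypothetical, and the sketch assumes whitespace-delimited input with no punctuation, lowercasing, or stop-word handling.

```python
from typing import List


def cjk_style_bigrams(text: str) -> List[str]:
    """Approximate bi-gram tokenization (hypothetical helper, not a Lucene API).

    Each whitespace-delimited word is split into overlapping pairs of
    characters; a single-character word is kept as a unigram.
    """
    tokens: List[str] = []
    for word in text.split():
        if len(word) == 1:
            # A lone character has no neighbor to pair with.
            tokens.append(word)
        else:
            # Slide a window of width 2 across the word.
            tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
    return tokens


if __name__ == "__main__":
    # "Seoul National University library"
    print(cjk_style_bigrams("서울대학교 도서관"))
    # ['서울', '울대', '대학', '학교', '도서', '서관']
```

Pairs such as "울대" straddle word-internal boundaries and carry no meaning of their own, which may be one reason a morpheme-based analyzer can retrieve more precisely. A morphological analyzer would instead consult a dictionary to find meaningful units, so its output cannot be reproduced by a fixed-width window like this one.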