This repository contains the source code of Kss, a representative Korean sentence segmentation toolkit. I also conduct ongoing research on Korean sentence segmentation algorithms and report the results in this repository. If you have good ideas about Korean sentence segmentation, please feel free to discuss them in the issue tracker.
- December 19, 2022 Released Kss 4.0 Python.
- May 5, 2022 Released Kss Flutter.
- August 25, 2021 Released Kss Java.
- August 18, 2021 Released Kss 3.0 Python.
- December 21, 2020 Released Kss 2.0 Python.
- August 16, 2019 Released Kss 1.0 C++.
Kss can be easily installed using the pip package manager.
pip install kss
To make Kss much faster, please install mecab or konlpy.tag.Mecab.
- mecab (Linux/MacOS): https://github.com/hyunwoongko/python-mecab-kor
- mecab (Windows): https://cleancode-ws.tistory.com/97
- konlpy.tag.Mecab (Linux/MacOS): https://konlpy.org/en/latest/api/konlpy.tag/#mecab-class
- konlpy.tag.Mecab (Windows): https://uwgdqo.tistory.com/363
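As a quick way to see which analyzer the backend='auto' fallback (mecab → konlpy.tag.Mecab → pecab) would likely pick on your machine, you can probe the imports yourself. This helper is only an illustrative sketch, not part of the Kss API:

```python
def available_backend():
    """Probe analyzers in the same order Kss's backend='auto' tries them.

    Illustrative sketch only; the module name `mecab` is the one provided
    by python-mecab-kor (an assumption about your install).
    """
    try:
        import mecab  # noqa: F401  (python-mecab-kor)
        return "mecab"
    except ImportError:
        pass
    try:
        from konlpy.tag import Mecab  # noqa: F401
        return "konlpy.tag.Mecab"
    except ImportError:
        return "pecab"  # pure-Python fallback installed with Kss

print(available_backend())
```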
from kss import split_sentences
split_sentences(
text: Union[str, List[str], Tuple[str]],
backend: str = "auto",
num_workers: Union[int, str] = "auto",
strip: bool = True,
)
Parameters
- text: String or List/Tuple of strings
- string: single text segmentation
- list/tuple of strings: batch texts segmentation
- backend: Morpheme analyzer backend
  - backend='auto': find mecab → konlpy.tag.Mecab → pecab and use the first analyzer found (default)
  - backend='mecab': find mecab → konlpy.tag.Mecab and use the first analyzer found
  - backend='pecab': use the pecab analyzer
- num_workers: The number of multiprocessing workers
  - num_workers='auto': use multiprocessing with the maximum number of workers if possible (default)
  - num_workers=1: don't use multiprocessing
  - num_workers=2~N: use multiprocessing with the specified number of workers
- strip: Whether to strip() all output sentences or not
  - strip=True: strip() all output sentences (default)
  - strip=False: do not strip() output sentences
Usages
- Single text segmentation

import kss

text = "νμ¬ λλ£ λΆλ€κ³Ό λ€λ μλλ° λΆμκΈ°λ μ’κ³ μμλ λ§μμμ΄μ λ€λ§, κ°λ¨ ν λΌμ μ΄ κ°λ¨ μμλ²κ±° 골λͺ©κΈΈλ‘ μ μ¬λΌκ°μΌ νλλ° λ€λ€ μμλ²κ±°μ μ νΉμ λμ΄κ° λ» νλ΅λλ€ κ°λ¨μ λ§μ§ ν λΌμ μ μΈλΆ λͺ¨μ΅."
kss.split_sentences(text)
# ['νμ¬ λλ£ λΆλ€κ³Ό λ€λ μλλ° λΆμκΈ°λ μ’κ³ μμλ λ§μμμ΄μ', 'λ€λ§, κ°λ¨ ν λΌμ μ΄ κ°λ¨ μμλ²κ±° 골λͺ©κΈΈλ‘ μ μ¬λΌκ°μΌ νλλ° λ€λ€ μμλ²κ±°μ μ νΉμ λμ΄κ° λ» νλ΅λλ€', 'κ°λ¨μ λ§μ§ ν λΌμ μ μΈλΆ λͺ¨μ΅.']
- Batch texts segmentation

import kss

texts = [
    "νμ¬ λλ£ λΆλ€κ³Ό λ€λ μλλ° λΆμκΈ°λ μ’κ³ μμλ λ§μμμ΄μ λ€λ§, κ°λ¨ ν λΌμ μ΄ κ°λ¨ μμλ²κ±° 골λͺ©κΈΈλ‘ μ μ¬λΌκ°μΌ νλλ° λ€λ€ μμλ²κ±°μ μ νΉμ λμ΄κ° λ» νλ΅λλ€",
    "κ°λ¨μ λ§μ§ ν λΌμ μ μΈλΆ λͺ¨μ΅. κ°λ¨ ν λΌμ μ 4μΈ΅ 건물 λ μ±λ‘ μ΄λ£¨μ΄μ Έ μμ΅λλ€.",
    "μμ ν λΌμ λ³Έ μ λ΅μ£ ?γ γ γ 건물μ ν¬μ§λ§ κ°νμ΄ μκΈ° λλ¬Έμ μ§λμΉ μ μμΌλ μ‘°μ¬νμΈμ κ°λ¨ ν λΌμ μ λ΄λΆ μΈν 리μ΄.",
]
kss.split_sentences(texts)
# [['νμ¬ λλ£ λΆλ€κ³Ό λ€λ μλλ° λΆμκΈ°λ μ’κ³ μμλ λ§μμμ΄μ', 'λ€λ§, κ°λ¨ ν λΌμ μ΄ κ°λ¨ μμλ²κ±° 골λͺ©κΈΈλ‘ μ μ¬λΌκ°μΌ νλλ° λ€λ€ μμλ²κ±°μ μ νΉμ λμ΄κ° λ» νλ΅λλ€'],
#  ['κ°λ¨μ λ§μ§ ν λΌμ μ μΈλΆ λͺ¨μ΅.', 'κ°λ¨ ν λΌμ μ 4μΈ΅ 건물 λ μ±λ‘ μ΄λ£¨μ΄μ Έ μμ΅λλ€.'],
#  ['μμ ν λΌμ λ³Έ μ λ΅μ£ ?γ γ γ ', '건물μ ν¬μ§λ§ κ°νμ΄ μκΈ° λλ¬Έμ μ§λμΉ μ μμΌλ μ‘°μ¬νμΈμ', 'κ°λ¨ ν λΌμ μ λ΄λΆ μΈν 리μ΄.']]
- Remain all prefixes/suffixes space characters for original text recoverability

import kss

text = "νμ¬ λλ£ λΆλ€κ³Ό λ€λ μλλ° λΆμκΈ°λ μ’κ³ μμλ λ§μμμ΄μ\nλ€λ§, κ°λ¨ ν λΌμ μ΄ κ°λ¨ μμλ²κ±° 골λͺ©κΈΈλ‘ μ μ¬λΌκ°μΌ νλλ° λ€λ€ μμλ²κ±°μ μ νΉμ λμ΄κ° λ» νλ΅λλ€ κ°λ¨μ λ§μ§ ν λΌμ μ μΈλΆ λͺ¨μ΅."
kss.split_sentences(text)
# ['νμ¬ λλ£ λΆλ€κ³Ό λ€λ μλλ° λΆμκΈ°λ μ’κ³ μμλ λ§μμμ΄μ\n', 'λ€λ§, κ°λ¨ ν λΌμ μ΄ κ°λ¨ μμλ²κ±° 골λͺ©κΈΈλ‘ μ μ¬λΌκ°μΌ νλλ° λ€λ€ μμλ²κ±°μ μ νΉμ λμ΄κ° λ» νλ΅λλ€ ', 'κ°λ¨μ λ§μ§ ν λΌμ μ μΈλΆ λͺ¨μ΅.']
Performance Analysis
You can reproduce all the following analyses using the source code and datasets in the ./bench/ directory; the benchmark source code was copied from here.
Note that Baseline is a regex-based segmentation method: re.split(r"(?<=[.!?])\s", text).
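For reference, the Baseline above is small enough to reproduce in a few lines (an English toy input is used here purely for illustration):

```python
import re

def baseline_split(text):
    # Split after a final symbol (., !, ?) that is followed by whitespace.
    return re.split(r"(?<=[.!?])\s", text)

print(baseline_split("Hello world. How are you? Fine!"))
# ['Hello world.', 'How are you?', 'Fine!']
```

Because it relies entirely on final symbols, it cannot split sentences that end without punctuation, which is exactly where it fails in the analyses below.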
Name | Command (in root directory) |
---|---|
Baseline | python3 ./bench/test_baseline.py ./bench/testset/*.txt |
Kiwi | python3 ./bench/test_kiwi.py ./bench/testset/*.txt |
Koalanlp | python3 ./bench/test_koalanlp.py ./bench/testset/*.txt --backend=OKT/HNN/KMR/RHINO/EUNJEON/ARIRANG/KKMA |
Kss (ours) | python3 ./bench/test_kss.py ./bench/testset/*.txt --backend=mecab/pecab |
I used the following 6 evaluation datasets for the analyses. Thanks to Minchul Lee for creating various sentence segmentation datasets.
Name | Descriptions | The number of sentences | Creator |
---|---|---|---|
blogs_lee | Dataset for testing blog style text segmentation | 170 | Minchul Lee |
blogs_ko | Dataset for testing blog style text segmentation, which is harder than Lee's blog dataset | 336 | Hyunwoong Ko |
tweets | Dataset for testing Twitter style text segmentation | 178 | Minchul Lee |
nested | Dataset for testing segmentation of texts that contain parentheses and quotation marks | 91 | Minchul Lee |
v_ending | Dataset for testing difficult eomi segmentation; it contains various dialect sentences | 30 | Minchul Lee |
sample | An example used in README.md (κ°λ¨ ν λΌμ ) | 41 | Isaac, modified by Hyunwoong Ko |
Note that I modified the labels of two sentences in sample.txt made by Isaac, because the original blog post was written like the following:
(image: excerpt from the original blog post)
But Isaac's labels were:
(image: Isaac's original labels)
In fact, μ¬μ€ μ κ³ κΈ°λ₯Ό μ λ¨Ήμ΄μ λ¬΄μ¨ λ§μΈμ§ λͺ¨λ₯΄κ² μ§λ§..
and (λ¬Όλ‘ μ μ λ¨Ήμμ§λ§
are embraced sentences (μκΈ΄λ¬Έμ₯), not independent sentences, so sentence segmentation tools should not split those parts.
The following table shows the segmentation performance based on exact match (EM). If you are unfamiliar with the EM and F1 scores, please refer to this. Kss performed best in most cases, and Kiwi also performed well. Both the baseline and Koalanlp performed poorly.
Name | Library version | Backend | blogs_lee | blogs_ko | tweets | nested | v_ending | sample | Average |
---|---|---|---|---|---|---|---|---|---|
Baseline | N/A | N/A | 0.53529 | 0.44940 | 0.51124 | 0.68132 | 0.00000 | 0.34146 | 0.41987 |
Koalanlp | 2.1.7 | OKT | 0.53529 | 0.44940 | 0.53371 | 0.79121 | 0.00000 | 0.36585 | 0.44591 |
Koalanlp | 2.1.7 | HNN | 0.54118 | 0.44345 | 0.54494 | 0.78022 | 0.00000 | 0.34146 | 0.44187 |
Koalanlp | 2.1.7 | KMR | 0.51176 | 0.39583 | 0.42135 | 0.79121 | 0.00000 | 0.26829 | 0.39807 |
Koalanlp | 2.1.7 | RHINO | 0.52941 | 0.40774 | 0.39326 | 0.79121 | 0.00000 | 0.29268 | 0.40238 |
Koalanlp | 2.1.7 | EUNJEON | 0.51176 | 0.37500 | 0.38202 | 0.70330 | 0.00000 | 0.21951 | 0.36526 |
Koalanlp | 2.1.7 | ARIRANG | 0.51176 | 0.41071 | 0.44382 | 0.79121 | 0.00000 | 0.29268 | 0.40836 |
Koalanlp | 2.1.7 | KKMA | 0.52941 | 0.45238 | 0.38202 | 0.58242 | 0.06667 | 0.31707 | 0.38832 |
Kiwi | 0.14.0 | N/A | 0.78235 | 0.60714 | 0.66292 | 0.83516 | 0.20000 | 0.90244 | 0.66500 |
Kss (ours) | 4.0.0 | pecab | 0.86471 | 0.82440 | 0.71910 | 0.87912 | 0.36667 | 0.95122 | 0.76753 |
Kss (ours) | 4.0.0 | mecab | 0.86471 | 0.82440 | 0.73034 | 0.87912 | 0.36667 | 0.95122 | 0.76941 |
You can also compare the performances with the following graphs.
The evaluation source code, which I copied from kiwipiepy, also provides an F1 score (dice similarity), and the F1 scores of Kss are also the best among the segmentation tools. However, I don't believe F1 is a proper metric for measuring sentence segmentation performance. For example, the EM score of text.split(" ")
on tweets.txt
is 0.06742, which means it is a terrible segmentation method for Twitter style text. However, its F1 score on tweets.txt
is 0.54083, similar to the F1 score of the Koalanlp KKMA backend (0.56832).
What I want to say is that the actual segmentation performance could be vastly different even if the F1 scores are similar.
You can reproduce this with python3 ./bench/test_word_split.py ./bench/testset/tweets.txt
, and here is a segmentation example from both methods.
Input:
κΈ°μ΅ν΄. λ κ·Έ μ μ μΉκ΅¬μΌ. λ€κ° μ£½μΌλ©΄ λ§ λ€λ λκ° νν μΈ κ±°μΌ. λΉ μ²΄λ μ¬νΌνκ² μ§. μ΄ μμ νλ₯Ό λΌ κ±°μΌ. λ©μ΄ μλ μ΄μ©λ©΄ μ‘°κΈμ μκ° ν΄ μ£Όμ§ μμκΉ. μ€μν 건 그건 λ€κ° μ§ν€κ³ μΆμ΄ νλ μ¬λλ€μ΄μμ. μ΄μ κ°.
Method: Koalanlp KKMA backend
EM score: 0.38202
F1 score: 0.56832
Output:
κΈ°μ΅ν΄. λ κ·Έ μ μ μΉκ΅¬μΌ.
λ€κ° μ£½μΌλ©΄ λ§ λ€λ λκ° νν μΈ κ±°μΌ.
λΉ μ²΄λ μ¬νΌνκ² μ§.
μ΄ μμ νλ₯Ό λΌ κ±°μΌ.
λ©μ΄ μλ μ΄μ©λ©΄ μ‘°κΈμ μκ° ν΄ μ£Όμ§ μμκΉ.
μ€μν 건 그건 λ€κ° μ§ν€κ³ μΆμ΄ νλ μ¬λλ€μ΄μμ.
μ΄μ κ°.
Method: text.split(" ")
EM score: 0.06742
F1 score: 0.54083
Output:
κΈ°μ΅ν΄.
λ
κ·Έ
μ μ
μΉκ΅¬μΌ.
λ€κ°
μ£½μΌλ©΄
λ§λ€λ λκ°
νν
μΈκ±°μΌ.
λΉμ²΄λ
μ¬νΌνκ² μ§.
μ΄μμ
νλ₯Ό
λΌκ±°μΌ.
λ©μ΄μλ
μ΄μ©λ©΄
μ‘°κΈμ
μκ°
ν΄μ£Όμ§
μμκΉ.
μ€μν건
그건
λ€κ°
μ§ν€κ³
μΆμ΄νλ
μ¬λλ€μ΄μμ.
μ΄μ
κ°.
This means that the F1 score gives a huge advantage to methods that cut sentences too finely. Of course, measuring the performance of a sentence segmentation algorithm is difficult, and we need to think more about metrics. However, the character-level F1 score may cause users to misunderstand a tool's real performance. So I have more confidence in the EM score, which is a somewhat clunky but safe metric.
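This bias toward over-splitting can be demonstrated with a toy pair of metrics. Note that em and char_f1 below are my own simplified definitions for illustration, not the exact scoring code in ./bench/: a word-level "segmenter" scores 0.0 EM but above 0.9 on character-level F1.

```python
from collections import Counter

def em(pred, gold):
    # Toy exact match: fraction of gold sentences reproduced verbatim.
    return len(set(pred) & set(gold)) / len(gold)

def char_f1(pred, gold):
    # Toy dice similarity over character multisets: boundaries are ignored.
    p, g = Counter("".join(pred)), Counter("".join(gold))
    overlap = sum((p & g).values())
    return 2 * overlap / (sum(p.values()) + sum(g.values()))

gold = ["I ate.", "You slept."]
word_split = "I ate. You slept.".split(" ")  # absurd over-splitting
print(em(word_split, gold), round(char_f1(word_split, gold), 3))
# prints: 0.0 0.933
```

Almost every character survives no matter how finely you cut, so character-overlap metrics barely penalize the word splitter, while EM collapses to zero.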
Simply comparing the numbers is not enough; I strongly recommend looking at the actual segmentation results.
Let's take blogs_ko
samples as examples and compare the performance of each library.
For this, I will use the best backend of each library (Kss=mecab, Koalanlp=KKMA), because showing the results of all backends would be tedious.
- Input text
κ±°μ λ΄λ €κ°λ κΈΈμ ν΄κ²μλ₯Ό λ€λ Έλλ° μλ‘ μκ²Όλ보λλΌκ΅¬μ!? λ¨νΈκ³Ό μ , λ λ€ λΉ΅λ¬λ²λΌ μ§λμΉ μ μμ΄ κ΅¬λ§€ν΄ λ¨Ήμ΄λ΄€λ΅λλΉπ 보μ±λ
Ήμ°¨ν΄κ²μ μμΌλ‘ λ€μ΄μ€μλ©΄ λ± κ°μ΄λ° μμΉν΄ μμ΄μγ
γ
κ·Έλμ μ΄λ λ¬ΈμΌλ‘λΌλ λ€μ΄μ€μ
λ κ°κΉλ΅λλ€π λ©λ΄νμ μ΄λ κ³ , κ°κ²©μ 2000μ~3000μ μ¬μ΄μ νμ± λμ΄ μμ΄μ! μ΄λ°κ±° νλνλ λ§λ³΄λκ±° λ무 μ’μνλλ°... μ§μ νκ³ μλ―Έλ―Έ λ¨ν₯λΉ΅ νλ, μ₯μμ μΉμ¦λΉ΅ νλ, ꡬ리볼 νλ 골λμ΅λλ€! λ€μμ κ°λ©΄ κ°λ콩μ΄λ λ°€ κΌ λ¨Ήμ΄λ΄μΌκ² μ΄μπ
- Label
κ±°μ λ΄λ €κ°λ κΈΈμ ν΄κ²μλ₯Ό λ€λ Έλλ° μλ‘ μκ²Όλ보λλΌκ΅¬μ!?
λ¨νΈκ³Ό μ , λ λ€ λΉ΅λ¬λ²λΌ μ§λμΉ μ μμ΄ κ΅¬λ§€ν΄ λ¨Ήμ΄λ΄€λ΅λλΉπ
보μ±λ
Ήμ°¨ν΄κ²μ μμΌλ‘ λ€μ΄μ€μλ©΄ λ± κ°μ΄λ° μμΉν΄ μμ΄μγ
γ
κ·Έλμ μ΄λ λ¬ΈμΌλ‘λΌλ λ€μ΄μ€μ
λ κ°κΉλ΅λλ€π
λ©λ΄νμ μ΄λ κ³ , κ°κ²©μ 2000μ~3000μ μ¬μ΄μ νμ± λμ΄ μμ΄μ!
μ΄λ°κ±° νλνλ λ§λ³΄λκ±° λ무 μ’μνλλ°... μ§μ νκ³ μλ―Έλ―Έ λ¨ν₯λΉ΅ νλ, μ₯μμ μΉμ¦λΉ΅ νλ, ꡬ리볼 νλ 골λμ΅λλ€!
λ€μμ κ°λ©΄ κ°λ콩μ΄λ λ°€ κΌ λ¨Ήμ΄λ΄μΌκ² μ΄μπ
- Source
https://hi-e2e2.tistory.com/193
- Output texts
Baseline:
κ±°μ λ΄λ €κ°λ κΈΈμ ν΄κ²μλ₯Ό λ€λ Έλλ° μλ‘ μκ²Όλ보λλΌκ΅¬μ!?
λ¨νΈκ³Ό μ , λ λ€ λΉ΅λ¬λ²λΌ μ§λμΉ μ μμ΄ κ΅¬λ§€ν΄ λ¨Ήμ΄λ΄€λ΅λλΉπ 보μ±λ
Ήμ°¨ν΄κ²μ μμΌλ‘ λ€μ΄μ€μλ©΄ λ± κ°μ΄λ° μμΉν΄ μμ΄μγ
γ
κ·Έλμ μ΄λ λ¬ΈμΌλ‘λΌλ λ€μ΄μ€μ
λ κ°κΉλ΅λλ€π λ©λ΄νμ μ΄λ κ³ , κ°κ²©μ 2000μ~3000μ μ¬μ΄μ νμ± λμ΄ μμ΄μ!
μ΄λ°κ±° νλνλ λ§λ³΄λκ±° λ무 μ’μνλλ°...
μ§μ νκ³ μλ―Έλ―Έ λ¨ν₯λΉ΅ νλ, μ₯μμ μΉμ¦λΉ΅ νλ, ꡬ리볼 νλ 골λμ΅λλ€!
λ€μμ κ°λ©΄ κ°λ콩μ΄λ λ°€ κΌ λ¨Ήμ΄λ΄μΌκ² μ΄μπ
Baseline separates the input text into 5 sentences. The first sentence was separated well because it ends with final symbols. However, since these final symbols don't appear from the second sentence on, you can see that the remaining sentences were not separated well.
Koalanlp (KKMA):
κ±°μ λ΄λ €κ°λ κΈΈμ ν΄κ² μλ₯Ό λ€λ Έλλ° μλ‘ μκ²Όλ
보λλΌκ΅¬μ!?
λ¨νΈκ³Ό μ , λ λ€ λΉ΅ λ¬λ²λΌ μ§λμΉ μ μμ΄ κ΅¬λ§€ν΄ λ¨Ήμ΄ λ΄€λ΅λλΉ
π λ³΄μ± λ
Ήμ°¨ ν΄κ²μ μμΌλ‘ λ€μ΄μ€μλ©΄ λ± κ°μ΄λ° μμΉν΄ μμ΄μ
γ
γ
κ·Έλμ μ΄λ λ¬ΈμΌλ‘ λΌλ λ€μ΄μ€μ
λ κ°κΉλ΅λλ€
π λ©λ΄νμ μ΄λ κ³ , κ°κ²©μ 2000μ ~3000 μ μ¬μ΄μ νμ± λμ΄ μμ΄μ!
μ΄λ° κ±° νλνλ λ§λ³΄λ κ±° λ무 μ’μνλλ°... μ§μ νκ³ μλ―Έ λ―Έ λ¨ν₯λΉ΅ νλ, μ₯μμ μΉμ¦ λΉ΅ νλ, ꡬ리 λ³Ό νλ 골λμ΅λλ€!
λ€μμ κ°λ©΄ κ°λ콩μ΄λ λ°€ κΌ λ¨Ήμ΄λ΄μΌκ² μ΄μπ
Koalanlp splits sentences better than the baseline because it uses morphological information. It splits the input text into 8 sentences in total.
But many mispartitions still exist. The first thing that catches your eye is the immature emoji handling.
People usually put emojis at the end of a sentence, and in this case, the emojis should be included in the sentence.
The second thing is the mispartition between μκ²Όλ
and 보λλΌκ΅¬μ!?
.
Probably this is because the KKMA morpheme analyzer recognized μκ²Όλ
as a final eomi (μ’
κ²°μ΄λ―Έ), but it's actually a connecting eomi (μ°κ²°μ΄λ―Έ).
This is due to the limited performance of the morpheme analyzer; the baseline is actually a little safer in this area.
Kiwi:
κ±°μ λ΄λ €κ°λ κΈΈμ ν΄κ²μλ₯Ό λ€λ Έλλ° μλ‘ μκ²Όλ보λλΌκ΅¬μ!?
λ¨νΈκ³Ό μ , λ λ€ λΉ΅λ¬λ²λΌ μ§λμΉ μ μμ΄ κ΅¬λ§€ν΄ λ¨Ήμ΄λ΄€λ΅λλΉπ
보μ±λ
Ήμ°¨ν΄κ²μ μμΌλ‘ λ€μ΄μ€μλ©΄ λ± κ°μ΄λ° μμΉν΄ μμ΄μγ
γ
κ·Έλμ μ΄λ λ¬ΈμΌλ‘λΌλ λ€μ΄μ€μ
λ κ°κΉλ΅λλ€π λ©λ΄νμ μ΄λ κ³ , κ°κ²©μ 2000μ~3000μ μ¬μ΄μ νμ± λμ΄ μμ΄μ!
μ΄λ°κ±° νλνλ λ§λ³΄λκ±° λ무 μ’μνλλ°...
μ§μ νκ³ μλ―Έλ―Έ λ¨ν₯λΉ΅ νλ, μ₯μμ μΉμ¦λΉ΅ νλ, ꡬ리볼 νλ 골λμ΅λλ€!
λ€μμ κ°λ©΄ κ°λ콩μ΄λ λ°€ κΌ λ¨Ήμ΄λ΄μΌκ² μ΄μπ
Kiwi shows better performance than Koalanlp. It splits the input text into 7 sentences.
Most sentences are split well, but it doesn't split between κ°κΉλ΅λλ€π
and λ©λ΄νμ
.
It also separates μ’μνλλ°...
and μ§μ νκ³
.
This part may be recognized as an independent sentence depending on the viewer,
but the author of the original article didn't write it as an independent sentence, but as an embraced sentence (μκΈ΄λ¬Έμ₯).
The original article was written like:
Kss (mecab):
κ±°μ λ΄λ €κ°λ κΈΈμ ν΄κ²μλ₯Ό λ€λ Έλλ° μλ‘ μκ²Όλ보λλΌκ΅¬μ!?
λ¨νΈκ³Ό μ , λ λ€ λΉ΅λ¬λ²λΌ μ§λμΉ μ μμ΄ κ΅¬λ§€ν΄ λ¨Ήμ΄λ΄€λ΅λλΉπ
보μ±λ
Ήμ°¨ν΄κ²μ μμΌλ‘ λ€μ΄μ€μλ©΄ λ± κ°μ΄λ° μμΉν΄ μμ΄μγ
γ
κ·Έλμ μ΄λ λ¬ΈμΌλ‘λΌλ λ€μ΄μ€μ
λ κ°κΉλ΅λλ€π
λ©λ΄νμ μ΄λ κ³ , κ°κ²©μ 2000μ~3000μ μ¬μ΄μ νμ± λμ΄ μμ΄μ!
μ΄λ°κ±° νλνλ λ§λ³΄λκ±° λ무 μ’μνλλ°... μ§μ νκ³ μλ―Έλ―Έ λ¨ν₯λΉ΅ νλ, μ₯μμ μΉμ¦λΉ΅ νλ, ꡬ리볼 νλ 골λμ΅λλ€!
λ€μμ κ°λ©΄ κ°λ콩μ΄λ λ°€ κΌ λ¨Ήμ΄λ΄μΌκ² μ΄μπ
The result of Kss is the same as the gold label. Notably, it successfully separates κ°κΉλ΅λλ€π
and λ©λ΄νμ
. In fact, this part is a final eomi (μ’
κ²°μ΄λ―Έ), but many morpheme analyzers confuse the final eomi (μ’
κ²°μ΄λ―Έ) with the connecting eomi (μ°κ²°μ΄λ―Έ). Actually, the mecab and pecab morpheme analyzers, which are the backends of Kss, also recognize this part as a connecting eomi (μ°κ²°μ΄λ―Έ). For this reason, Kss has a feature that detects wrongly recognized connecting eomis (μ°κ²°μ΄λ―Έ) and corrects them, so it is able to separate this part effectively. Next, Kss doesn't split between μ’μνλλ°...
and μ§μ νκ³
because μ’μνλλ°...
is not an independent sentence, but an embraced sentence (μκΈ΄λ¬Έμ₯). This means that, unlike the baseline, Kss doesn't split sentences simply because a .
appears. In most cases, .
can be a sentence delimiter, but there are many exceptions.
- Input text
μ΄λνμ°½νλ μΆκ·Όμ μ λ무μΌμ°μΌμ΄λ λ²λ Έμ (μΆκ·Όμκ° 19μ) ν κΊΌλμκ³ ν΄μ μΉ΄νλ₯Ό μ°Ύμ μλ΄λ‘ λκ°μ μλ‘μκΈ΄κ³³μ μ¬μ₯λμ΄ μ»€νΌμ μμΈμ§ 컀νΌλ°μ¬λΌκ³ ν΄μ κ°μ μ€ννμ§ μΌλ§μλμ κ·Έλ°μ§ μλμ΄ μΌλ§μμμ μ‘°μ©νκ³ μ’λ€λ©° μ’μνλκ±ΈμμΌμ ν
λΌμ€μ μμ κ·Όλ° μ‘°μ©νλ μΉ΄νκ° μ°λ§ν΄μ§ μ리μ μΆμ²λ μΉ΄μ΄ν°μμ(ν
λΌμ€κ° μΉ΄μ΄ν° λ°λ‘μ) λ€μλΌκ³ λ€μκ² μλλΌ κ·λ μ΄λ €μμΌλ λ£κ²λ λμ¬.
- Label
μ΄λνμ°½νλ μΆκ·Όμ μ λ무μΌμ°μΌμ΄λ λ²λ Έμ (μΆκ·Όμκ° 19μ)
ν κΊΌλμκ³ ν΄μ μΉ΄νλ₯Ό μ°Ύμ μλ΄λ‘ λκ°μ
μλ‘μκΈ΄κ³³μ μ¬μ₯λμ΄ μ»€νΌμ μμΈμ§ 컀νΌλ°μ¬λΌκ³ ν΄μ κ°μ
μ€ννμ§ μΌλ§μλμ κ·Έλ°μ§ μλμ΄ μΌλ§μμμ
μ‘°μ©νκ³ μ’λ€λ©° μ’μνλκ±ΈμμΌμ ν
λΌμ€μ μμ
κ·Όλ° μ‘°μ©νλ μΉ΄νκ° μ°λ§ν΄μ§
μ리μ μΆμ²λ μΉ΄μ΄ν°μμ(ν
λΌμ€κ° μΉ΄μ΄ν° λ°λ‘μ)
λ€μλΌκ³ λ€μκ² μλλΌ κ·λ μ΄λ €μμΌλ λ£κ²λ λμ¬.
- Source
https://mrsign92.tistory.com/6099371
- Output texts
Baseline:
μ΄λνμ°½νλ μΆκ·Όμ μ λ무μΌμ°μΌμ΄λ λ²λ Έμ (μΆκ·Όμκ° 19μ) ν κΊΌλμκ³ ν΄μ μΉ΄νλ₯Ό μ°Ύμ μλ΄λ‘ λκ°μ μλ‘μκΈ΄κ³³μ μ¬μ₯λμ΄ μ»€νΌμ μμΈμ§ 컀νΌλ°μ¬λΌκ³ ν΄μ κ°μ μ€ννμ§ μΌλ§μλμ κ·Έλ°μ§ μλμ΄ μΌλ§μμμ μ‘°μ©νκ³ μ’λ€λ©° μ’μνλκ±ΈμμΌμ ν
λΌμ€μ μμ κ·Όλ° μ‘°μ©νλ μΉ΄νκ° μ°λ§ν΄μ§ μ리μ μΆμ²λ μΉ΄μ΄ν°μμ(ν
λΌμ€κ° μΉ΄μ΄ν° λ°λ‘μ) λ€μλΌκ³ λ€μκ² μλλΌ κ·λ μ΄λ €μμΌλ λ£κ²λ λμ¬.
Baseline doesn't split any sentences because there are no .!?
characters in the input text.
Koalanlp (KKMA)
μ΄λ νμ°½ν λ μΆκ·Ό μ μ λ무 μΌμ° μΌμ΄λ λ²λ Έμ ( μΆκ·Όμκ° 19μ) ν κΊΌλ μκ³ ν΄μ μΉ΄νλ₯Ό μ°Ύμ μλ΄λ‘ λκ°μ μλ‘ μκΈ΄ κ³³μ μ¬μ₯λμ΄ μ»€νΌμ μμΈμ§ 컀νΌλ°μ¬λΌκ³ ν΄μ κ°μ μ€ννμ§ μΌλ§ μ λ μ κ·Έλ°μ§ μλμ΄ μΌλ§ μμμ μ‘°μ©νκ³ μ’λ€λ©° μ’μνλ κ±Έ μμΌμ ν
λΌμ€μ μμ κ·Όλ° μ‘°μ©νλ μΉ΄νκ° μ°λ§ ν΄μ§ μ리μ μΆμ²λ μΉ΄μ΄ν°μμ( ν
λΌμ€κ° μΉ΄μ΄ν° λ°λ‘ μ) λ€μλΌκ³
λ€μ κ² μλλΌ κ·λ μ΄λ € μμΌλ λ£κ² λ λμ¬.
Koalanlp separates λ€μλΌκ³
and λ€μ
, but that is not a correct split point.
Also, it doesn't seem to consider the predicative use of eomi transferred from noun (λͺ
μ¬ν μ μ±μ΄λ―Έμ μμ μ μ©λ²).
Kiwi
μ΄λνμ°½νλ μΆκ·Όμ μ λ무μΌμ°μΌμ΄λ λ²λ Έμ (μΆκ·Όμκ° 19μ) ν κΊΌλμκ³ ν΄μ μΉ΄νλ₯Ό μ°Ύμ μλ΄λ‘ λκ°μ μλ‘μκΈ΄κ³³μ μ¬μ₯λμ΄ μ»€νΌμ μμΈμ§ 컀νΌλ°μ¬λΌκ³ ν΄μ κ°μ μ€ννμ§ μΌλ§μλμ κ·Έλ°μ§ μλμ΄ μΌλ§μμμ μ‘°μ©νκ³ μ’λ€λ©° μ’μνλκ±ΈμμΌμ ν
λΌμ€μ μμ κ·Όλ° μ‘°μ©νλ μΉ΄νκ° μ°λ§ν΄μ§ μ리μ μΆμ²λ μΉ΄μ΄ν°μμ(ν
λΌμ€κ° μΉ΄μ΄ν° λ°λ‘μ) λ€μλΌκ³ λ€μκ² μλλΌ κ·λ μ΄λ €μμΌλ λ£κ²λ λμ¬.
Kiwi doesn't separate any sentence, similar to the baseline. Likewise, it doesn't consider the predicative use of eomi transferred from noun (λͺ μ¬ν μ μ±μ΄λ―Έμ μμ μ μ©λ²).
Kss (Mecab)
μ΄λνμ°½νλ μΆκ·Όμ μ λ무μΌμ°μΌμ΄λ λ²λ Έμ (μΆκ·Όμκ° 19μ)
ν κΊΌλμκ³ ν΄μ μΉ΄νλ₯Ό μ°Ύμ μλ΄λ‘ λκ°μ
μλ‘μκΈ΄κ³³μ μ¬μ₯λμ΄ μ»€νΌμ μμΈμ§ 컀νΌλ°μ¬λΌκ³ ν΄μ κ°μ
μ€ννμ§ μΌλ§μλμ κ·Έλ°μ§ μλμ΄ μΌλ§μμμ
μ‘°μ©νκ³ μ’λ€λ©° μ’μνλκ±ΈμμΌμ ν
λΌμ€μ μμ
κ·Όλ° μ‘°μ©νλ μΉ΄νκ° μ°λ§ν΄μ§ μ리μ μΆμ²λ μΉ΄μ΄ν°μμ(ν
λΌμ€κ° μΉ΄μ΄ν° λ°λ‘μ)
λ€μλΌκ³ λ€μκ² μλλΌ κ·λ μ΄λ €μμΌλ λ£κ²λ λμ¬.
The result of Kss is very similar to the gold label; Kss considers the predicative use of eomi transferred from noun (λͺ
μ¬ν μ μ±μ΄λ―Έμ μμ μ μ©λ²).
But Kss couldn't split between μ°λ§ν΄μ§
and μ리μ
. That part is a correct split point, but it was blocked by one of the exceptions I built to prevent wrong segmentation. Splitting eomi transferred from noun (λͺ
μ¬ν μ μ±μ΄λ―Έ) is an unsafe and difficult task, so Kss has many exceptions to prevent wrong segmentation.
- Input text
μ±
μκ°μ μ΄κ±΄ μμ€μΈκ° μ€μ μΈκ°λΌλ 문ꡬλ₯Ό λ³΄κ³ μ¬λ°κ² λ€ μΆμ΄ λ³΄κ² λμλ€. 'λ°μΉ΄λΌ'λΌλ λλ°μ 2μ₯μ μΉ΄λ ν©μ΄ λμ μ¬λμ΄ μ΄κΈ°λ κ²μμΌλ‘ μμ£Ό λ¨μν κ²μμ΄λ€. μ΄λ°κ² μ€λ
μ΄ λλ? μΆμλλ° μ΄ μ±
μ΄ λ°μΉ΄λΌμ λΉμ·ν λ§€λ ₯μ΄ μλ€ μκ°λ€μλ€. λ΄μ©μ΄ μ€νΌλνκ² μ§νλκ³ λ§νλ ꡬκ°μμ΄ μ½νλκ² λλ λͺ¨λ₯΄κ² νμ΄μ§λ₯Ό μ₯μ₯ λκΈ°κ³ μμλ€. λ¬Όλ‘ μ½μμΌλ‘μ¨ ν° λμ λ²μ§ μμ§λ§ μ΄λ° μ€νΌλν¨μ λλ λͺ¨λ₯΄κ² κ³μ κ²μμ μ°Έμ¬νκ² λκ³ λμ€λ νμ΄λ°μ μ‘μ§ λͺ»ν΄ λΉ μ§μ§ μμμκΉ? λΌλ μκ°μ νκ² λλ€. μ΄ μ±
μμ νμ§μ κΏμ κ°κ²©νλ₯Ό λ³΄μ§ μλ μΆμ΄λΌ νλ€. μ΄ λΆλΆμ μ½κ³ λλλ°! λΌλ μκ°νλ©΄μ μκ° λλ°μ΄λΌλκ±Έλ‘λΌλ λμ λ§μ΄ λ²μλ νμ§κ° λΆλ¬μ λ€. κ·Έλ¬λ©΄μ λ΄κ° λλ°μ νλ€λ©΄?λΌλ μμμ ν΄λ΄€λ€. κ·Έλ¦¬κ³ μ΄λ° μμμ ν μ μκ² λ§λ€μ΄μ€μ μ΄ μ±
μ΄ λ μ¬λ°κ² λ€κ°μλ€. μΌμμ μ§λ£¨ν¨μ λκ»΄ λλ°κ°μ μΆμ μ΄κ³ μΆλ€λ©΄ λλ°νμ§λ§κ³ μ°¨λΌλ¦¬ μ΄ μ±
μ 보길^^γ
- Label
μ±
μκ°μ μ΄κ±΄ μμ€μΈκ° μ€μ μΈκ°λΌλ 문ꡬλ₯Ό λ³΄κ³ μ¬λ°κ² λ€ μΆμ΄ λ³΄κ² λμλ€.
'λ°μΉ΄λΌ'λΌλ λλ°μ 2μ₯μ μΉ΄λ ν©μ΄ λμ μ¬λμ΄ μ΄κΈ°λ κ²μμΌλ‘ μμ£Ό λ¨μν κ²μμ΄λ€.
μ΄λ°κ² μ€λ
μ΄ λλ? μΆμλλ° μ΄ μ±
μ΄ λ°μΉ΄λΌμ λΉμ·ν λ§€λ ₯μ΄ μλ€ μκ°λ€μλ€.
λ΄μ©μ΄ μ€νΌλνκ² μ§νλκ³ λ§νλ ꡬκ°μμ΄ μ½νλκ² λλ λͺ¨λ₯΄κ² νμ΄μ§λ₯Ό μ₯μ₯ λκΈ°κ³ μμλ€.
λ¬Όλ‘ μ½μμΌλ‘μ¨ ν° λμ λ²μ§ μμ§λ§ μ΄λ° μ€νΌλν¨μ λλ λͺ¨λ₯΄κ² κ³μ κ²μμ μ°Έμ¬νκ² λκ³ λμ€λ νμ΄λ°μ μ‘μ§ λͺ»ν΄ λΉ μ§μ§ μμμκΉ? λΌλ μκ°μ νκ² λλ€.
μ΄ μ±
μμ νμ§μ κΏμ κ°κ²©νλ₯Ό λ³΄μ§ μλ μΆμ΄λΌ νλ€.
μ΄ λΆλΆμ μ½κ³ λλλ°! λΌλ μκ°νλ©΄μ μκ° λλ°μ΄λΌλκ±Έλ‘λΌλ λμ λ§μ΄ λ²μλ νμ§κ° λΆλ¬μ λ€.
κ·Έλ¬λ©΄μ λ΄κ° λλ°μ νλ€λ©΄?λΌλ μμμ ν΄λ΄€λ€.
κ·Έλ¦¬κ³ μ΄λ° μμμ ν μ μκ² λ§λ€μ΄μ€μ μ΄ μ±
μ΄ λ μ¬λ°κ² λ€κ°μλ€.
μΌμμ μ§λ£¨ν¨μ λκ»΄ λλ°κ°μ μΆμ μ΄κ³ μΆλ€λ©΄ λλ°νμ§λ§κ³ μ°¨λΌλ¦¬ μ΄ μ±
μ 보길^^γ
- Source
https://hi-e2e2.tistory.com/63
- Output texts
Baseline:
μ±
μκ°μ μ΄κ±΄ μμ€μΈκ° μ€μ μΈκ°λΌλ 문ꡬλ₯Ό λ³΄κ³ μ¬λ°κ² λ€ μΆμ΄ λ³΄κ² λμλ€.
'λ°μΉ΄λΌ'λΌλ λλ°μ 2μ₯μ μΉ΄λ ν©μ΄ λμ μ¬λμ΄ μ΄κΈ°λ κ²μμΌλ‘ μμ£Ό λ¨μν κ²μμ΄λ€.
μ΄λ°κ² μ€λ
μ΄ λλ?
μΆμλλ° μ΄ μ±
μ΄ λ°μΉ΄λΌμ λΉμ·ν λ§€λ ₯μ΄ μλ€ μκ°λ€μλ€.
λ΄μ©μ΄ μ€νΌλνκ² μ§νλκ³ λ§νλ ꡬκ°μμ΄ μ½νλκ² λλ λͺ¨λ₯΄κ² νμ΄μ§λ₯Ό μ₯μ₯ λκΈ°κ³ μμλ€.
λ¬Όλ‘ μ½μμΌλ‘μ¨ ν° λμ λ²μ§ μμ§λ§ μ΄λ° μ€νΌλν¨μ λλ λͺ¨λ₯΄κ² κ³μ κ²μμ μ°Έμ¬νκ² λκ³ λμ€λ νμ΄λ°μ μ‘μ§ λͺ»ν΄ λΉ μ§μ§ μμμκΉ?
λΌλ μκ°μ νκ² λλ€.
μ΄ μ±
μμ νμ§μ κΏμ κ°κ²©νλ₯Ό λ³΄μ§ μλ μΆμ΄λΌ νλ€.
μ΄ λΆλΆμ μ½κ³ λλλ°!
λΌλ μκ°νλ©΄μ μκ° λλ°μ΄λΌλκ±Έλ‘λΌλ λμ λ§μ΄ λ²μλ νμ§κ° λΆλ¬μ λ€.
κ·Έλ¬λ©΄μ λ΄κ° λλ°μ νλ€λ©΄?λΌλ μμμ ν΄λ΄€λ€.
κ·Έλ¦¬κ³ μ΄λ° μμμ ν μ μκ² λ§λ€μ΄μ€μ μ΄ μ±
μ΄ λ μ¬λ°κ² λ€κ°μλ€.
μΌμμ μ§λ£¨ν¨μ λκ»΄ λλ°κ°μ μΆμ μ΄κ³ μΆλ€λ©΄ λλ°νμ§λ§κ³ μ°¨λΌλ¦¬ μ΄ μ±
μ 보길^^γ
Baseline separates the input text into 13 sentences. You can see that it can't distinguish the final eomi (μ’
κ²°μ΄λ―Έ) from the connecting eomi (μ°κ²°μ΄λ―Έ); for example, it splits μ΄λ°κ² μ€λ
μ΄ λλ?
and μΆμλλ°
, but λλ?
is a connecting eomi (μ°κ²°μ΄λ―Έ). And here's one more problem: it doesn't recognize embraced sentences (μκΈ΄λ¬Έμ₯). For example, it splits λͺ»ν΄ λΉ μ§μ§ μμμκΉ?
and λΌλ μκ°μ νκ² λλ€.
Koalanlp (KKMA)
μ±
μκ°μ μ΄κ±΄ μμ€μΈκ° μ€μ μΈκ°λΌλ 문ꡬλ₯Ό λ³΄κ³ μ¬λ°κ² λ€ μΆμ΄ λ³΄κ² λμλ€.
' λ°μΉ΄λΌ' λΌλ λλ°μ 2 μ₯μ μΉ΄λ ν©μ΄ λμ μ¬λμ΄ μ΄κΈ°λ κ²μμΌλ‘ μμ£Ό λ¨μν κ²μμ΄λ€.
μ΄λ° κ² μ€λ
μ΄ λλ?
μΆμλλ° μ΄ μ±
μ΄ λ°μΉ΄λΌμ λΉμ·ν λ§€λ ₯μ΄ μλ€ μκ° λ€μλ€.
λ΄μ©μ΄ μ€νΌλνκ² μ§νλκ³ λ§νλ κ΅¬κ° μμ΄ μ½νλ κ² λλ λͺ¨λ₯΄κ² νμ΄μ§λ₯Ό μ₯μ₯ λκΈ°κ³ μμλ€.
λ¬Όλ‘ μ½μμΌλ‘μ¨ ν° λμ λ²μ§ μμ§λ§ μ΄λ° μ€νΌλν¨μ λλ λͺ¨λ₯΄κ² κ³μ κ²μμ μ°Έμ¬νκ² λκ³ λμ€λ νμ΄λ°μ μ‘μ§ λͺ»ν΄ λΉ μ§μ§ μμμκΉ?
λΌλ μκ°μ νκ² λλ€.
μ΄ μ±
μμ νμ§μ κΏμ κ°κ²©νλ₯Ό λ³΄μ§ μλ μΆμ΄λΌ νλ€.
μ΄ λΆλΆμ μ½κ³ λλλ°!
λΌλ μκ°νλ©΄μ μκ° λλ°μ΄λΌλ κ±Έλ‘λΌλ λμ λ§μ΄ λ²μλ νμ§κ° λΆλ¬μ λ€.
κ·Έλ¬λ©΄μ λ΄κ° λλ°μ νλ€λ©΄? λΌλ μμμ ν΄λ΄€λ€.
κ·Έλ¦¬κ³ μ΄λ° μμμ ν μ μκ² λ§λ€μ΄ μ€μ μ΄ μ±
μ΄ λ μ¬λ°κ² λ€κ°μλ€.
μΌμμ μ§λ£¨ν¨μ λκ»΄ λλ° κ°μ μΆμ μ΄κ³ μΆλ€λ©΄ λλ°νμ§ λ§κ³ μ°¨λΌλ¦¬ μ΄ μ±
μ 보길 ^^ γ
The result of Koalanlp is really similar to the baseline; the two problems (final/connecting eomi distinction and embraced sentence recognition) still exist.
Kiwi
μ±
μκ°μ μ΄κ±΄ μμ€μΈκ° μ€μ μΈκ°
λΌλ 문ꡬλ₯Ό λ³΄κ³ μ¬λ°κ² λ€ μΆμ΄ λ³΄κ² λμλ€.
'λ°μΉ΄λΌ'λΌλ λλ°μ 2μ₯μ μΉ΄λ ν©μ΄ λμ μ¬λμ΄ μ΄κΈ°λ κ²μμΌλ‘ μμ£Ό λ¨μν κ²μμ΄λ€.
μ΄λ°κ² μ€λ
μ΄ λλ?
μΆμλλ° μ΄ μ±
μ΄ λ°μΉ΄λΌμ λΉμ·ν λ§€λ ₯μ΄ μλ€ μκ°λ€μλ€.
λ΄μ©μ΄ μ€νΌλνκ² μ§νλκ³ λ§νλ ꡬκ°μμ΄ μ½νλκ² λλ λͺ¨λ₯΄κ² νμ΄μ§λ₯Ό μ₯μ₯ λκΈ°κ³ μμλ€.
λ¬Όλ‘ μ½μμΌλ‘μ¨ ν° λμ λ²μ§ μμ§λ§ μ΄λ° μ€νΌλν¨μ λλ λͺ¨λ₯΄κ² κ³μ κ²μμ μ°Έμ¬νκ² λκ³ λμ€λ νμ΄λ°μ μ‘μ§ λͺ»ν΄ λΉ μ§μ§ μμμκΉ?
λΌλ μκ°μ νκ² λλ€.
μ΄ μ±
μμ νμ§μ κΏμ κ°κ²©νλ₯Ό λ³΄μ§ μλ μΆμ΄λΌ νλ€.
μ΄ λΆλΆμ μ½κ³ λλλ°!
λΌλ μκ°νλ©΄μ μκ° λλ°μ΄λΌλκ±Έλ‘λΌλ λμ λ§μ΄ λ²μλ νμ§κ° λΆλ¬μ λ€.
κ·Έλ¬λ©΄μ λ΄κ° λλ°μ νλ€λ©΄?
λΌλ μμμ ν΄λ΄€λ€.
κ·Έλ¦¬κ³ μ΄λ° μμμ ν μ μκ² λ§λ€μ΄μ€μ μ΄ μ±
μ΄ λ μ¬λ°κ² λ€κ°μλ€.
μΌμμ μ§λ£¨ν¨μ λκ»΄ λλ°κ°μ μΆμ μ΄κ³ μΆλ€λ©΄ λλ°νμ§λ§κ³ μ°¨λΌλ¦¬ μ΄ μ±
μ 보길^^γ
The two problems also appear in the result of Kiwi. In addition, it splits μ€μ μΈκ°
and λΌλ
, but μ΄κ±΄ μμ€μΈκ° μ€μ μΈκ°
is not an independent sentence, but an embraced sentence (μκΈ΄λ¬Έμ₯).
Kss (Mecab)
μ±
μκ°μ μ΄κ±΄ μμ€μΈκ° μ€μ μΈκ°λΌλ 문ꡬλ₯Ό λ³΄κ³ μ¬λ°κ² λ€ μΆμ΄ λ³΄κ² λμλ€.
'λ°μΉ΄λΌ'λΌλ λλ°μ 2μ₯μ μΉ΄λ ν©μ΄ λμ μ¬λμ΄ μ΄κΈ°λ κ²μμΌλ‘ μμ£Ό λ¨μν κ²μμ΄λ€.
μ΄λ°κ² μ€λ
μ΄ λλ? μΆμλλ° μ΄ μ±
μ΄ λ°μΉ΄λΌμ λΉμ·ν λ§€λ ₯μ΄ μλ€ μκ°λ€μλ€.
λ΄μ©μ΄ μ€νΌλνκ² μ§νλκ³ λ§νλ ꡬκ°μμ΄ μ½νλκ² λλ λͺ¨λ₯΄κ² νμ΄μ§λ₯Ό μ₯μ₯ λκΈ°κ³ μμλ€.
λ¬Όλ‘ μ½μμΌλ‘μ¨ ν° λμ λ²μ§ μμ§λ§ μ΄λ° μ€νΌλν¨μ λλ λͺ¨λ₯΄κ² κ³μ κ²μμ μ°Έμ¬νκ² λκ³ λμ€λ νμ΄λ°μ μ‘μ§ λͺ»ν΄ λΉ μ§μ§ μμμκΉ? λΌλ μκ°μ νκ² λλ€.
μ΄ μ±
μμ νμ§μ κΏμ κ°κ²©νλ₯Ό λ³΄μ§ μλ μΆμ΄λΌ νλ€.
μ΄ λΆλΆμ μ½κ³ λλλ°! λΌλ μκ°νλ©΄μ μκ° λλ°μ΄λΌλκ±Έλ‘λΌλ λμ λ§μ΄ λ²μλ νμ§κ° λΆλ¬μ λ€.
κ·Έλ¬λ©΄μ λ΄κ° λλ°μ νλ€λ©΄?λΌλ μμμ ν΄λ΄€λ€.
κ·Έλ¦¬κ³ μ΄λ° μμμ ν μ μκ² λ§λ€μ΄μ€μ μ΄ μ±
μ΄ λ μ¬λ°κ² λ€κ°μλ€.
μΌμμ μ§λ£¨ν¨μ λκ»΄ λλ°κ°μ μΆμ μ΄κ³ μΆλ€λ©΄ λλ°νμ§λ§κ³ μ°¨λΌλ¦¬ μ΄ μ±
μ 보길^^γ
The result of Kss is the same as the gold label, which means that Kss handles both problems. Of course, it's not easy to detect those parts while splitting sentences, so Kss performs one more step after splitting: a postprocessing step that corrects errors in the segmentation results. For example, a Korean sentence doesn't generally start with a josa (μ‘°μ¬). Therefore, if a segmented sentence starts with a josa (μ‘°μ¬), Kss recognizes it as an embraced sentence (μκΈ΄λ¬Έμ₯) and attaches it to the previous sentence. For your information, Kss has many more powerful postprocessing algorithms that correct wrong segmentation results like this.
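The josa-attachment idea can be sketched roughly as follows. This is a hypothetical simplification: JOSA_PREFIXES is an invented string list, and Kss's real postprocessing works on morpheme analysis results rather than raw string prefixes.

```python
# Hypothetical, simplified version of the josa-attachment postprocessing rule.
# A real implementation would tag the first morpheme instead of matching strings.
JOSA_PREFIXES = ("은 ", "는 ", "이 ", "가 ", "을 ", "를 ", "라는 ")  # invented list

def merge_embraced(sentences):
    merged = []
    for sent in sentences:
        if merged and sent.startswith(JOSA_PREFIXES):
            # A sentence should not start with a josa: treat it as an
            # embraced sentence and attach it to the previous one.
            merged[-1] = merged[-1] + " " + sent
        else:
            merged.append(sent)
    return merged
```

For example, a wrongly split pair like ["이건 소설인가 실화인가?", "라는 문구를 보고..."] would be merged back into one sentence by this rule.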
In conclusion, Kss considers more aspects of Korean sentences than the other libraries, and these considerations led to the difference in performance.
I also measured the speed of the tools to compare their computational efficiency. The following table shows the computation time of each tool when it splits sample.txt
(41 sentences).
It is a single blog post, so you can expect similar times when you split a blog post into sentences.
Since the computation time may vary depending on the current CPU status, I measured 5 times and calculated the average.
Note that every experiment was conducted in a single thread / single process environment on my M1 MacBook Pro (2021, 13-inch).
Name | Library version | Backend | Average time (msec) |
---|---|---|---|
Baseline | N/A | N/A | 0.22 |
koalanlp | 2.1.7 | OKT | 27.37 |
koalanlp | 2.1.7 | HNN | 50.39 |
koalanlp | 2.1.7 | KMR | 757.08 |
koalanlp | 2.1.7 | RHINO | 978.53 |
koalanlp | 2.1.7 | EUNJEON | 881.24 |
koalanlp | 2.1.7 | ARIRANG | 1415.53 |
koalanlp | 2.1.7 | KKMA | 1971.31 |
Kiwi | 0.14.0 | N/A | 36.41 |
Kss (ours) | 4.0.0 | pecab | 6929.27 |
Kss (ours) | 4.0.0 | mecab | 43.80 |
You can also compare the speed of the tools with the following graphs.
You can also compare the faster tools (under 100 msec) with the following graphs.
The baseline was the fastest (it's just a regex function), followed by Koalanlp (OKT backend), Kiwi, and Kss (mecab backend). The slowest library was Kss (pecab backend), which was about 160 times slower than its mecab backend. Mecab and Kiwi are written in C++, all Koalanlp backends are written in Java, and Pecab is written in pure Python. I think this difference is caused by the speed of each implementation language. Therefore, if you can install mecab, it makes the most sense to use the Kss mecab backend.
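The 5-run averaging used above can be reproduced with a small helper like this (a sketch; the real measurement scripts live in ./bench/, and split_fn stands for any splitter such as kss.split_sentences):

```python
import time

def avg_split_time_msec(split_fn, text, runs=5):
    # Average wall-clock time in milliseconds over several runs,
    # mitigating noise from the current CPU state.
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        split_fn(text)
        total += (time.perf_counter() - start) * 1000
    return total / runs
```

For example, avg_split_time_msec(kss.split_sentences, blog_post_text) would give a figure comparable to the table above (on your own hardware, with kss installed).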
- For Linux/MacOS users: Kss tries to install python-mecab-kor when you install kss, so you can use the mecab backend very easily. But if the installation failed, please install mecab yourself to use the mecab backend.
- For Windows users: Kss supports mecab-ko-msvc (mecab for Microsoft Visual C++) and its konlpy wrapper. To use the mecab backend, you need to install either mecab or konlpy.tag.Mecab on your machine. There is plenty of information on the internet about installing mecab on Windows, like the following:
  - mecab: https://cleancode-ws.tistory.com/97
  - konlpy.tag.Mecab: https://uwgdqo.tistory.com/363
I've measured the performance of Kss and other libraries using 6 evaluation datasets, and also measured their speed. In terms of segmentation performance, Kss performed best on most datasets. In terms of speed, the baseline was the fastest, followed by Koalanlp (OKT backend) and Kiwi, but Kss (mecab backend) also showed competitive speed.
Although much progress has been made by Kiwi and Kss, there are still many difficulties and limitations in Korean sentence segmentation libraries. In fact, this is partly because very few people tackle this task. If anyone wants to discuss Korean sentence segmentation algorithms with me or contribute to my work, feel free to send an email to kevin.ko@tunib.ai or let me know on the GitHub issue page.
from kss import split_morphemes
split_morphemes(
text: Union[str, List[str], Tuple[str]],
backend: str = "auto",
num_workers: Union[int, str] = "auto",
drop_space: bool = True,
)
Parameters
- text: String or List/Tuple of strings
- string: single text segmentation
- list/tuple of strings: batch texts segmentation
- backend: Morpheme analyzer backend
  - backend='auto': find mecab → konlpy.tag.Mecab → pecab and use the first analyzer found (default)
  - backend='mecab': find mecab → konlpy.tag.Mecab and use the first analyzer found
  - backend='pecab': use the pecab analyzer
- num_workers: The number of multiprocessing workers
  - num_workers='auto': use multiprocessing with the maximum number of workers if possible (default)
  - num_workers=1: don't use multiprocessing
  - num_workers=2~N: use multiprocessing with the specified number of workers
- drop_space: Whether to drop all space characters or not
  - drop_space=True: drop all space characters from the output (default)
  - drop_space=False: keep all space characters in the output
Usages
- Single text segmentation

import kss

text = "νμ¬ λλ£ λΆλ€κ³Ό λ€λ μλλ° λΆμκΈ°λ μ’κ³ μμλ λ§μμμ΄μ λ€λ§, κ°λ¨ ν λΌμ μ΄ κ°λ¨ μμλ²κ±° 골λͺ©κΈΈλ‘ μ μ¬λΌκ°μΌ νλλ° λ€λ€ μμλ²κ±°μ μ νΉμ λμ΄κ° λ» νλ΅λλ€ κ°λ¨μ λ§μ§ ν λΌμ μ μΈλΆ λͺ¨μ΅."
kss.split_morphemes(text)
# [('νμ¬', 'NNG'), ('λλ£', 'NNG'), ('λΆ', 'NNB'), ('λ€', 'XSN'), ('κ³Ό', 'JKB'), ('λ€λ μ', 'VV+EP'), ('λλ°', 'EC'), ('λΆμκΈ°', 'NNG'), ('λ', 'JX'), ('μ’', 'VA'), ('κ³ ', 'EC'), ('μμ', 'NNG'), ('λ', 'JX'), ('λ§μ', 'VA'), ('μ', 'EP'), ('μ΄μ', 'EF'), ('λ€λ§', 'MAJ'), (',', 'SC'), ('κ°λ¨', 'NNP'), ('ν λΌ', 'NNG'), ('μ ', 'NNG'), ('μ΄', 'JKS'), ('κ°λ¨', 'NNP'), ('μμ', 'MAG'), ('λ²κ±°', 'NNG'), ('골λͺ©κΈΈ', 'NNG'), ('λ‘', 'JKB'), ('μ', 'MAG'), ('μ¬λΌκ°', 'VV'), ('μΌ', 'EC'), ('ν', 'VV'), ('λλ°', 'EC'), ('λ€', 'MAG'), ('λ€', 'XSN'), ('μμ', 'MAG'), ('λ²κ±°', 'NNG'), ('μ', 'JKG'), ('μ νΉ', 'NNG'), ('μ', 'JKB'), ('λμ΄κ°', 'VV+ETM'), ('λ»', 'NNB'), ('ν', 'VV+EP'), ('λ΅λλ€', 'EC'), ('κ°λ¨μ', 'NNP'), ('λ§μ§', 'NNG'), ('ν λΌ', 'NNG'), ('μ μ', 'NNG'), ('μΈλΆ', 'NNG'), ('λͺ¨μ΅', 'NNG'), ('.', 'SF')]
- Batch texts segmentation

import kss

texts = [
    "νμ¬ λλ£ λΆλ€κ³Ό λ€λ μλλ° λΆμκΈ°λ μ’κ³ μμλ λ§μμμ΄μ λ€λ§, κ°λ¨ ν λΌμ μ΄ κ°λ¨ μμλ²κ±° 골λͺ©κΈΈλ‘ μ μ¬λΌκ°μΌ νλλ° λ€λ€ μμλ²κ±°μ μ νΉμ λμ΄κ° λ» νλ΅λλ€",
    "κ°λ¨μ λ§μ§ ν λΌμ μ μΈλΆ λͺ¨μ΅. κ°λ¨ ν λΌμ μ 4μΈ΅ 건물 λ μ±λ‘ μ΄λ£¨μ΄μ Έ μμ΅λλ€.",
    "μμ ν λΌμ λ³Έ μ λ΅μ£ ?γ γ γ 건물μ ν¬μ§λ§ κ°νμ΄ μκΈ° λλ¬Έμ μ§λμΉ μ μμΌλ μ‘°μ¬νμΈμ κ°λ¨ ν λΌμ μ λ΄λΆ μΈν 리μ΄.",
]
kss.split_morphemes(texts)
# [[('νμ¬', 'NNG'), ('λλ£', 'NNG'), ('λΆ', 'NNB'), ('λ€', 'XSN'), ('κ³Ό', 'JKB'), ('λ€λ μ', 'VV+EP'), ('λλ°', 'EC'), ('λΆμκΈ°', 'NNG'), ('λ', 'JX'), ('μ’', 'VA'), ('κ³ ', 'EC'), ('μμ', 'NNG'), ('λ', 'JX'), ('λ§μ', 'VA'), ('μ', 'EP'), ('μ΄μ', 'EF'), ('λ€λ§', 'MAJ'), (',', 'SC'), ('κ°λ¨', 'NNP'), ('ν λΌ', 'NNG'), ('μ ', 'NNG'), ('μ΄', 'JKS'), ('κ°λ¨', 'NNP'), ('μμ', 'MAG'), ('λ²κ±°', 'NNG'), ('골λͺ©κΈΈ', 'NNG'), ('λ‘', 'JKB'), ('μ', 'MAG'), ('μ¬λΌκ°', 'VV'), ('μΌ', 'EC'), ('ν', 'VV'), ('λλ°', 'EC'), ('λ€', 'MAG'), ('λ€', 'XSN'), ('μμ', 'MAG'), ('λ²κ±°', 'NNG'), ('μ', 'JKG'), ('μ νΉ', 'NNG'), ('μ', 'JKB'), ('λμ΄κ°', 'VV+ETM'), ('λ»', 'NNB'), ('ν', 'VV+EP'), ('λ΅λλ€', 'EC')],
#  [('κ°λ¨μ', 'NNP'), ('λ§μ§', 'NNG'), ('ν λΌ', 'NNG'), ('μ μ', 'NNG'), ('μΈλΆ', 'NNG'), ('λͺ¨μ΅', 'NNG'), ('.', 'SF'), ('κ°λ¨', 'NNP'), ('ν λΌ', 'NNG'), ('μ μ', 'NNP'), ('4', 'SN'), ('μΈ΅', 'NNG'), ('건물', 'NNG'), ('λ μ±', 'NNG'), ('λ‘', 'JKB'), ('μ΄λ£¨μ΄μ Έ', 'VV+EC'), ('μ', 'VX'), ('μ΅λλ€', 'EF'), ('.', 'SF')],
#  [('μμ', 'MAJ'), ('ν λΌ', 'NNG'), ('μ ', 'NNG'), ('λ³Έ', 'VV+ETM'), ('μ ', 'NNB'), ('λ΅', 'MAG+VCP'), ('μ£ ', 'EF'), ('?', 'SF'), ('γ ', 'IC'), ('γ ', 'NNG'), ('γ ', 'IC'), ('건물', 'NNG'), ('μ', 'JX'), ('ν¬', 'VA'), ('μ§λ§', 'EC'), ('κ°ν', 'NNG'), ('μ΄', 'JKS'), ('μ', 'VA'), ('κΈ°', 'ETN'), ('λλ¬Έ', 'NNB'), ('μ', 'JKB'), ('μ§λμΉ ', 'VV+ETM'), ('μ', 'NNB'), ('μ', 'VV'), ('μΌλ', 'EC'), ('μ‘°μ¬', 'NNG'), ('ν', 'XSV'), ('μΈμ', 'EP+EF'), ('κ°λ¨', 'NNP'), ('ν λΌ', 'NNG'), ('μ μ', 'NNG'), ('λ΄λΆ', 'NNG'), ('μΈν 리μ΄', 'NNG'), ('.', 'SF')]]
- Remain space characters for original text recoverability

import kss

text = "νμ¬ λλ£ λΆλ€κ³Ό λ€λ μλλ° λΆμκΈ°λ μ’κ³ μμλ λ§μμμ΄μ\nλ€λ§,\tκ°λ¨ ν λΌμ μ΄ κ°λ¨ μμλ²κ±° 골λͺ©κΈΈλ‘ μ μ¬λΌκ°μΌ νλλ° λ€λ€ μμλ²κ±°μ μ νΉμ λμ΄κ° λ» νλ΅λλ€ κ°λ¨μ λ§μ§ ν λΌμ μ μΈλΆ λͺ¨μ΅."
kss.split_morphemes(text, drop_space=False)
# [('νμ¬', 'NNG'), (' ', 'SP'), ('λλ£', 'NNG'), (' ', 'SP'), ('λΆ', 'NNB'), ('λ€', 'XSN'), ('κ³Ό', 'JKB'), (' ', 'SP'), ('λ€λ μ', 'VV+EP'), ('λλ°', 'EC'), (' ', 'SP'), ('λΆμκΈ°', 'NNG'), ('λ', 'JX'), (' ', 'SP'), ('μ’', 'VA'), ('κ³ ', 'EC'), (' ', 'SP'), ('μμ', 'NNG'), ('λ', 'JX'), (' ', 'SP'), ('λ§μ', 'VA'), ('μ', 'EP'), ('μ΄μ', 'EF'), ('\n', 'SP'), ('λ€λ§', 'MAJ'), (',', 'SC'), ('\t', 'SP'), ('κ°λ¨', 'NNP'), (' ', 'SP'), ('ν λΌ', 'NNG'), ('μ ', 'NNG'), ('μ΄', 'JKS'), (' ', 'SP'), ('κ°λ¨', 'NNP'), (' ', 'SP'), ('μμ', 'MAG'), ('λ²κ±°', 'NNG'), (' ', 'SP'), ('골λͺ©κΈΈ', 'NNG'), ('λ‘', 'JKB'), (' ', 'SP'), ('μ', 'MAG'), (' ', 'SP'), ('μ¬λΌκ°', 'VV'), ('μΌ', 'EC'), (' ', 'SP'), ('ν', 'VV'), ('λλ°', 'EC'), (' ', 'SP'), ('λ€', 'MAG'), ('λ€', 'XSN'), (' ', 'SP'), ('μμ', 'MAG'), ('λ²κ±°', 'NNG'), ('μ', 'JKG'), (' ', 'SP'), ('μ νΉ', 'NNG'), ('μ', 'JKB'), (' ', 'SP'), ('λμ΄κ°', 'VV+ETM'), (' ', 'SP'), ('λ»', 'NNB'), (' ', 'SP'), ('ν', 'VV+EP'), ('λ΅λλ€', 'EC'), (' ', 'SP'), ('κ°λ¨μ', 'NNP'), (' ', 'SP'), ('λ§μ§', 'NNG'), (' ', 'SP'), ('ν λΌ', 'NNG'), ('μ μ', 'NNG'), (' ', 'SP'), ('μΈλΆ', 'NNG'), (' ', 'SP'), ('λͺ¨μ΅', 'NNG'), ('.', 'SF')]
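Because drop_space=False keeps every whitespace token (tagged SP), the original text can be recovered by concatenating the surface forms. Here is a minimal sketch, assuming the (surface, tag) pair format shown above (ASCII placeholder tokens are used purely for illustration):

```python
def recover_text(morphemes):
    # Rebuild the original string from (surface, tag) pairs produced with
    # drop_space=False; space characters are ordinary tokens tagged 'SP'.
    return "".join(surface for surface, _ in morphemes)

print(recover_text([("a", "NNG"), (" ", "SP"), ("b", "NNG")]))
# prints: a b
```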
Kss is available in various programming languages.
If you find this toolkit useful, please consider citing:
@misc{kss,
author = {Ko, Hyunwoong and Park, Sang-kil},
title = {Kss: A Toolkit for Korean sentence segmentation},
howpublished = {\url{https://github.com/hyunwoongko/kss}},
year = {2021},
}
Kss project is licensed under the terms of the BSD 3-Clause "New" or "Revised" License.
Copyright 2021 Hyunwoong Ko and Sang-kil Park. All Rights Reserved.