Releases: hyunwoongko/kss
v3.2.0
- Change the default value of use_quotes_brackets_processing to False.
- This is for speed. If you want quote and bracket processing, set it to True.
- Make the preprocessing step parallelizable.
- Preprocessing is much faster now :-)
v3.1.0
- Fix default rule using morpheme features.
- previous version segmented "μλ€ κ±°λ" -> ["μλ€", "κ±°λ"]
- current version doesn't segment these cases.
- Remove the `none` backend option.
- Segment error rate: 3.5%+ -> 1.3% (mecab) / 2.3% (pynori)
v3.0.3
v3.0.2
- Hotfix for logging bugs with longer text.
- Add memoization with an LRU cache for quote calibration.
- The quote calibration algorithm has a time complexity of O(2^N).
- That is very slow, so I applied memoization with caching.
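To illustrate why memoization helps with an exponential recursion, here is a minimal sketch. It does not reproduce the quote calibration code in kss; `count_naive` and `count_cached` are hypothetical functions that count the ways to consume N tokens one or two at a time, chosen only because the naive recursion revisits the same subproblems exponentially often.

```python
from functools import lru_cache


def count_naive(n: int) -> int:
    """Hypothetical example: ways to consume n tokens taking 1 or 2 per step.

    Every call branches twice, so the running time grows exponentially in n.
    """
    if n <= 1:
        return 1
    return count_naive(n - 1) + count_naive(n - 2)


@lru_cache(maxsize=1024)  # assumed cache size, not taken from kss
def count_cached(n: int) -> int:
    """Same recursion, but each distinct n is computed only once."""
    if n <= 1:
        return 1
    return count_cached(n - 1) + count_cached(n - 2)
```

With the cache in place, `count_cached(30)` evaluates 31 distinct arguments instead of more than a million recursive calls, and `count_cached.cache_info()` reports the hit and miss counts.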
v3.0.1
1. Use morpheme features
- Unlike 2.xx, unspecified eomi can also be segmented. (default backend is `pynori`)
- e.g. ~μμ(κ²½μ΄), ~μΈμ©(μ μ‘°μ΄), ~νμ/μ(μ μ±μ΄λ―Έ) ~ꡬλ(λ―Έλ±λ‘ μ΄λ―Έ), etc.
>>> split_sentences("λΆλ λ§μλ¬΄κ° νμ΅μμ μ²μ²ν κ°μΈμ©~ λ λ°₯μ λ¨Ήλꡬλ μ λ§μ λ κ·Όλ° μ΄μ μ΄μ¬νμ κ·Έλ¬κ΅¬λ μ΄μ λ§μ§λ§μ μμ")
['λΆλ λ§μλ¬΄κ° νμ΅μμ', 'μ²μ²ν κ°μΈμ©~', 'λ λ°₯μ λ¨Ήλꡬλ', 'μ λ§μ λ κ·Όλ° μ΄μ μ΄μ¬νμ', 'κ·Έλ¬κ΅¬λ μ΄μ λ§μ§λ§μ', 'μμ']
- Boost segmentation speed by changing the morpheme analyzer backend to `mecab`.
>>> split_sentences("λΆλ λ§μλ¬΄κ° νμ΅μμ μ²μ²ν κ°μΈμ©~ λ λ°₯μ λ¨Ήλꡬλ μ λ§μ λ κ·Όλ° μ΄μ μ΄μ¬νμ κ·Έλ¬κ΅¬λ μ΄μ λ§μ§λ§μ μμ", backend="mecab")
['λΆλ λ§μλ¬΄κ° νμ΅μμ', 'μ²μ²ν κ°μΈμ©~', 'λ λ°₯μ λ¨Ήλꡬλ', 'μ λ§μ λ κ·Όλ° μ΄μ μ΄μ¬νμ', 'κ·Έλ¬κ΅¬λ μ΄μ λ§μ§λ§μ', 'μμ']
- You can turn this off by changing the morpheme analyzer backend to `none`.
>>> split_sentences("λΆλ λ§μλ¬΄κ° νμ΅μμ μ²μ²ν κ°μΈμ©~ λ λ°₯μ λ¨Ήλꡬλ μ λ§μ λ κ·Όλ° μ΄μ μ΄μ¬νμ κ·Έλ¬κ΅¬λ μ΄μ λ§μ§λ§μ μμ", backend="none")
['λΆλ λ§μλ¬΄κ° νμ΅μμ μ²μ²ν κ°μΈμ©~ λ λ°₯μ λ¨Ήλꡬλ μ λ§μ λ κ·Όλ° μ΄μ μ΄μ¬νμ κ·Έλ¬κ΅¬λ μ΄μ λ§μ§λ§μ μμ']
2. Support multiprocessing and batch processing
- You can pass `Tuple[str]` and `List[str]` as input text for batch processing.
>>> split_sentences(["μλ νμΈμ λ°κ°μμ", "λ°κ°μ΅λλ€. μ μ§λ΄μλμ?"])
[['μλ νμΈμ', 'λ°κ°μμ'], ['λ°κ°μ΅λλ€.', 'μ μ§λ΄μλμ?']]
- You can change the number of multiprocessing workers. The default is `-1` (max).
>>> split_sentences(["μλ νμΈμ λ°κ°μμ", "λ°κ°μ΅λλ€. μ μ§λ΄μλμ?"], num_workers=4)
[['μλ νμΈμ', 'λ°κ°μμ'], ['λ°κ°μ΅λλ€.', 'μ μ§λ΄μλμ?']]
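The batch and worker behaviour described above could be sketched roughly as follows. This is not how kss is implemented internally. `naive_split` is a hypothetical punctuation-based stand-in for the real splitter, and `split_batch` is an assumed wrapper name. Only the `num_workers` parameter and its `-1` default come from the notes above.

```python
import os
import re
from concurrent.futures import ProcessPoolExecutor
from typing import List, Sequence, Union


def naive_split(text: str) -> List[str]:
    """Hypothetical stand-in splitter: break after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_batch(
    text: Union[str, Sequence[str]], num_workers: int = -1
) -> Union[List[str], List[List[str]]]:
    """Split one string, or each string of a list/tuple, into sentences."""
    if isinstance(text, str):
        return naive_split(text)
    if not isinstance(text, (list, tuple)):
        raise TypeError("text must be str, List[str] or Tuple[str]")

    # A negative value is read as "use every available core" (assumed meaning of -1).
    workers = (os.cpu_count() or 1) if num_workers < 0 else max(1, num_workers)

    # A process pool only pays off when there is more than one item and worker.
    if workers == 1 or len(text) < 2:
        return [naive_split(t) for t in text]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(naive_split, text))
```

A single string yields a flat list, while a list or tuple yields one list of sentences per input, mirroring the shape of the outputs shown above. On platforms that start workers with `spawn`, the parallel call should sit under an `if __name__ == "__main__":` guard.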