Releases: hyunwoongko/kss

v3.2.0

09 Sep 13:35
  • Change the default value of use_quotes_brackets_processing to False.
    • This is for speed. Set it to True if you want to use this option.
  • Make the preprocessing step parallelizable.
    • Preprocessing is much faster now :-)
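
Since the option is now off by default, callers who relied on quote and bracket handling have to request it explicitly. A minimal sketch, assuming kss 3.2.x is installed and that split_sentences accepts the keyword named above; the wrapper name split_with_quote_handling is hypothetical and only used for illustration:

```python
def split_with_quote_handling(text):
    """Split `text` with quote/bracket processing switched back on.

    Hypothetical helper, not part of kss. It assumes kss 3.2.x, where
    use_quotes_brackets_processing defaults to False for speed.
    """
    # Imported lazily so this sketch can be loaded without kss installed.
    from kss import split_sentences

    return split_sentences(text, use_quotes_brackets_processing=True)
```

Expect this call to be slower than the default, which is the trade-off the release note describes.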

v3.1.0

19 Aug 02:30
  • Fix default rule using morpheme features.
    • previous version segmented "없다 거나" -> ["없다", "거나"]
    • current version doesn't segment these cases.
  • Remove the none backend option.
  • Segmentation error rate
    • 3.5%+ -> 1.3% (mecab) / 2.3% (pynori)

v3.0.3

18 Aug 16:27
  • Fix bug reported in #7

v3.0.2

18 Aug 15:34
94d7274
  • Hotfix for logging bugs on longer text.
  • Add memoization with an LRU cache for quote calibration.
    • The quote calibration algorithm has a time complexity of O(2^N).
    • That is far too slow, so I applied memoization with caching.
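
To illustrate why caching helps here, the sketch below uses a toy problem, not the actual kss calibration code: it counts the ways to label each quote character as opening or closing so that the quotes balance. The function name count_balanced_assignments is hypothetical. Without the cache the recursion branches twice per quote, giving O(2^N) calls; with functools.lru_cache each (index, depth) state is computed once, so the work drops to roughly O(N^2).

```python
from functools import lru_cache


def count_balanced_assignments(text, quote='"'):
    """Count open/close labelings of the quote characters that balance.

    Toy stand-in for a quote calibration step, used only to show the
    effect of memoization on a branching recursion.
    """
    n = sum(1 for ch in text if ch == quote)

    @lru_cache(maxsize=None)
    def solve(index, depth):
        if depth < 0:
            # A closing quote appeared before any matching opening quote.
            return 0
        if index == n:
            # All quotes labeled: valid only if every quote was closed.
            return 1 if depth == 0 else 0
        # Branch: treat this quote as opening (+1) or as closing (-1).
        return solve(index + 1, depth + 1) + solve(index + 1, depth - 1)

    return solve(0, 0)


print(count_balanced_assignments('"a" "b"'))  # 2
```

With 40 quote characters the uncached version would need on the order of 2^40 calls, while the cached one finishes immediately.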

v3.0.1

18 Aug 10:31
6993c16

1. Use morpheme features

  • Unlike 2.xx, sentences that end in an eomi (ending) not listed in the rules can also be segmented. The default backend is pynori.
  • e.g. ~소서 (honorific), ~세용 (neologism), ~했음/임 (transformative ending), ~구나 (unregistered ending), etc.
    >>> split_sentences("부디 만수무강 하옵소서 천천히 가세용~ 너 밥을 먹는구나 응 맞아 난 근데 어제 이사했음 그랬구나 이제 마지막임 응응")
    ['부디 만수무강 하옵소서', '천천히 가세용~', '너 밥을 먹는구나', '응 맞아 난 근데 어제 이사했음', '그랬구나 이제 마지막임', '응응']
  • Speed up segmentation by switching the morpheme analyzer backend to mecab.
    >>> split_sentences("부디 만수무강 하옵소서 천천히 가세용~ 너 밥을 먹는구나 응 맞아 난 근데 어제 이사했음 그랬구나 이제 마지막임 응응", backend="mecab")
    ['부디 만수무강 하옵소서', '천천히 가세용~', '너 밥을 먹는구나', '응 맞아 난 근데 어제 이사했음', '그랬구나 이제 마지막임', '응응']
  • You can turn this off by setting the morpheme analyzer backend to none.
    >>> split_sentences("부디 만수무강 하옵소서 천천히 가세용~ 너 밥을 먹는구나 응 맞아 난 근데 어제 이사했음 그랬구나 이제 마지막임 응응", backend="none")
    ['부디 만수무강 하옵소서 천천히 가세용~ 너 밥을 먹는구나 응 맞아 난 근데 어제 이사했음 그랬구나 이제 마지막임 응응']

2. Support multiprocessing and batch processing

  • You can pass a Tuple[str] or List[str] as the input text for batch processing.
    >>> split_sentences(["안녕하세요 반가워요", "반갑습니다. 잘 지내시나요?"])
    [['안녕하세요', '반가워요'], ['반갑습니다.', '잘 지내시나요?']]
  • You can change the number of multiprocessing workers with num_workers. The default is -1 (max).
    >>> split_sentences(["안녕하세요 반가워요", "반갑습니다. 잘 지내시나요?"], num_workers=4)
    [['안녕하세요', '반가워요'], ['반갑습니다.', '잘 지내시나요?']]
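
The batch behavior described above (a single string returns a flat list, a list or tuple returns one list per input, and -1 means "use every core") can be sketched roughly as follows. This is not how kss implements it: naive_split is a hypothetical punctuation-only splitter standing in for the real segmenter, and split_batch is an illustrative name.

```python
import os
from multiprocessing import Pool


def naive_split(text):
    """Hypothetical stand-in splitter: cut after '.', '?' or '!'."""
    sentences, current = [], []
    for ch in text:
        current.append(ch)
        if ch in ".?!":
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


def split_batch(texts, num_workers=-1):
    """Split one string, or each string in a list/tuple, into sentences."""
    if isinstance(texts, str):
        # Single input: return a flat list of sentences.
        return naive_split(texts)

    # A negative value means "use every available core".
    workers = (os.cpu_count() or 1) if num_workers < 0 else num_workers
    workers = min(workers, len(texts))

    if workers <= 1:
        # Not worth starting processes for a single worker or an empty batch.
        return [naive_split(text) for text in texts]

    with Pool(processes=workers) as pool:
        # map() preserves input order, so results line up with `texts`.
        return pool.map(naive_split, texts)


if __name__ == "__main__":
    print(split_batch(["Hello there. How are you?", "Fine!"], num_workers=2))
```

Capping the worker count at the batch size avoids spawning idle processes for small inputs.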