Comparing changes

base repository: hyunwoongko/kss
base: v4.5.1
head repository: hyunwoongko/kss
compare: main
Showing 188 changed files with 290,780 additions and 922 deletions.
  1. +24 −0 .github/workflows/test_macos.yaml
  2. +24 −0 .github/workflows/test_ubuntu.yaml
  3. +24 −0 .github/workflows/test_windows.yaml
  4. +15 −0 .gitignore
  5. +2 −2 LICENSE
  6. +12 −0 MANIFEST.in
  7. +1,202 −767 README.md
  8. +165 −0 bench/__init__.py
  9. +20,000 −0 bench/preprocessing/news.jsonl
  10. +145 −0 bench/preprocessing/test_preprocess.py
  11. +11,109 −0 bench/safety/hatescore.csv
  12. +18 −0 bench/safety/test_kss.py
  13. 0 bench/sentence_split/__init__.py
  14. 0 bench/{ → sentence_split}/metrics/em_problem.txt
  15. 0 bench/{ → sentence_split}/metrics/f1_problem.txt
  16. 0 bench/{ → sentence_split}/sentence_split.py
  17. 0 bench/{ → sentence_split}/test_baseline.py
  18. 0 bench/{ → sentence_split}/test_kiwi.py
  19. 0 bench/{ → sentence_split}/test_koalanlp.py
  20. +1 −1 bench/{ → sentence_split}/test_kss.py
  21. 0 bench/{ → sentence_split}/test_word_split.py
  22. 0 bench/{ → sentence_split}/testset/blogs_ko.txt
  23. 0 bench/{ → sentence_split}/testset/blogs_lee.txt
  24. 0 bench/{ → sentence_split}/testset/nested.txt
  25. 0 bench/{ → sentence_split}/testset/sample.txt
  26. 0 bench/{ → sentence_split}/testset/tweets.txt
  27. 0 bench/{ → sentence_split}/testset/v_ending.txt
  28. 0 bench/{ → sentence_split}/testset/wikipedia.txt
  29. +27 −0 bench/space/README.md
  30. +59 −0 bench/space/make_space_errors.py
  31. +145 −0 bench/space/space.py
  32. +81 −0 bench/space/testset.legacy/relabel_evidence.txt
  33. +200 −0 bench/space/testset.legacy/written.txt
  34. +50 −0 bench/space/testset/colloquial.txt
  35. +200 −0 bench/space/testset/written.txt
  36. +2 −0 csrc/__init__.py
  37. +19 −0 csrc/kss_cython.pyx
  38. +179 −0 csrc/sentence_splitter.cpp
  39. +224 −0 csrc/sentence_splitter.h
  40. +199 −3 kss/__init__.py
  41. +1 −1 kss/_elements/__init__.py
  42. +1 −1 kss/_elements/element.py
  43. +1 −1 kss/_elements/empty.py
  44. +8 −1 kss/_elements/subclasses.py
  45. +1 −1 kss/_modules/__init__.py
  46. +3 −0 kss/_modules/augmentation/__init__.py
  47. +203,905 −0 kss/_modules/augmentation/assets/wordnet.json
  48. +98 −0 kss/_modules/augmentation/augment.py
  49. +97 −0 kss/_modules/augmentation/distance.py
  50. +100 −0 kss/_modules/augmentation/replacement.py
  51. +53 −0 kss/_modules/augmentation/utils.py
  52. +3 −0 kss/_modules/collocation/__init__.py
  53. +78 −0 kss/_modules/collocation/collocate.py
  54. +3 −0 kss/_modules/g2p/__init__.py
  55. +136 −0 kss/_modules/g2p/assets/idioms.txt
  56. +240 −0 kss/_modules/g2p/assets/rules.txt
  57. +28 −0 kss/_modules/g2p/assets/table.csv
  58. +148 −0 kss/_modules/g2p/english.py
  59. +210 −0 kss/_modules/g2p/g2p.py
  60. +126 −0 kss/_modules/g2p/numerals.py
  61. +104 −0 kss/_modules/g2p/regular.py
  62. +173 −0 kss/_modules/g2p/special.py
  63. +300 −0 kss/_modules/g2p/utils.py
  64. +3 −0 kss/_modules/hangulization/__init__.py
  65. +123 −0 kss/_modules/hangulization/hangulization.py
  66. +87 −0 kss/_modules/hangulization/hangulize/__init__.py
  67. +342 −0 kss/_modules/hangulization/hangulize/hangul.py
  68. +51 −0 kss/_modules/hangulization/hangulize/langs/__init__.py
  69. +185 −0 kss/_modules/hangulization/hangulize/langs/aze/__init__.py
  70. +180 −0 kss/_modules/hangulization/hangulize/langs/bel/__init__.py
  71. +166 −0 kss/_modules/hangulization/hangulize/langs/bul/__init__.py
  72. +174 −0 kss/_modules/hangulization/hangulize/langs/cat/__init__.py
  73. +157 −0 kss/_modules/hangulization/hangulize/langs/ces/__init__.py
  74. +223 −0 kss/_modules/hangulization/hangulize/langs/cym/__init__.py
  75. +207 −0 kss/_modules/hangulization/hangulize/langs/deu/__init__.py
  76. +229 −0 kss/_modules/hangulization/hangulize/langs/ell/__init__.py
  77. +126 −0 kss/_modules/hangulization/hangulize/langs/epo/__init__.py
  78. +158 −0 kss/_modules/hangulization/hangulize/langs/est/__init__.py
  79. +138 −0 kss/_modules/hangulization/hangulize/langs/fin/__init__.py
  80. +217 −0 kss/_modules/hangulization/hangulize/langs/grc/__init__.py
  81. +130 −0 kss/_modules/hangulization/hangulize/langs/hbs/__init__.py
  82. +157 −0 kss/_modules/hangulization/hangulize/langs/hun/__init__.py
  83. +219 −0 kss/_modules/hangulization/hangulize/langs/isl/__init__.py
  84. +133 −0 kss/_modules/hangulization/hangulize/langs/ita/__init__.py
  85. +146 −0 kss/_modules/hangulization/hangulize/langs/jpn/__init__.py
  86. +165 −0 kss/_modules/hangulization/hangulize/langs/kat/__init__.py
  87. +171 −0 kss/_modules/hangulization/hangulize/langs/kat/narrow.py
  88. +133 −0 kss/_modules/hangulization/hangulize/langs/lat/__init__.py
  89. +167 −0 kss/_modules/hangulization/hangulize/langs/lav/__init__.py
  90. +164 −0 kss/_modules/hangulization/hangulize/langs/lit/__init__.py
  91. +179 −0 kss/_modules/hangulization/hangulize/langs/mkd/__init__.py
  92. +738 −0 kss/_modules/hangulization/hangulize/langs/nld/__init__.py
  93. +171 −0 kss/_modules/hangulization/hangulize/langs/pol/__init__.py
  94. +222 −0 kss/_modules/hangulization/hangulize/langs/por/__init__.py
  95. +221 −0 kss/_modules/hangulization/hangulize/langs/por/br.py
  96. +135 −0 kss/_modules/hangulization/hangulize/langs/ron/__init__.py
  97. +166 −0 kss/_modules/hangulization/hangulize/langs/rus/__init__.py
  98. +254 −0 kss/_modules/hangulization/hangulize/langs/slk/__init__.py
  99. +156 −0 kss/_modules/hangulization/hangulize/langs/slv/__init__.py
  100. +113 −0 kss/_modules/hangulization/hangulize/langs/spa/__init__.py
  101. +128 −0 kss/_modules/hangulization/hangulize/langs/sqi/__init__.py
  102. +225 −0 kss/_modules/hangulization/hangulize/langs/swe/__init__.py
  103. +141 −0 kss/_modules/hangulization/hangulize/langs/tur/__init__.py
  104. +167 −0 kss/_modules/hangulization/hangulize/langs/ukr/__init__.py
  105. +130 −0 kss/_modules/hangulization/hangulize/langs/vie/__init__.py
  106. +215 −0 kss/_modules/hangulization/hangulize/langs/wlm/__init__.py
  107. +494 −0 kss/_modules/hangulization/hangulize/models.py
  108. +40 −0 kss/_modules/hangulization/hangulize/normalization.py
  109. +90 −0 kss/_modules/hangulization/hangulize/processing.py
  110. +3 −0 kss/_modules/hanja/__init__.py
  111. +27,497 −0 kss/_modules/hanja/assets/table.yml
  112. +169 −0 kss/_modules/hanja/hanja.py
  113. +179 −0 kss/_modules/hanja/utils.py
  114. +3 −0 kss/_modules/jamo/__init__.py
  115. +468 −0 kss/_modules/jamo/_jamo.py
  116. +27 −0 kss/_modules/jamo/utils.py
  117. +3 −0 kss/_modules/josa/__init__.py
  118. +88 −0 kss/_modules/josa/josa.py
  119. +170 −0 kss/_modules/josa/utils.py
  120. +3 −0 kss/_modules/keywords/__init__.py
  121. +114 −0 kss/_modules/keywords/extract_keywords.py
  122. +533 −0 kss/_modules/keywords/utils.py
  123. +2 −1 kss/_modules/morphemes/__init__.py
  124. +5 −1 kss/_modules/morphemes/analyzers.py
  125. +14 −5 kss/_modules/morphemes/split_morphemes.py
  126. +6 −1 kss/_modules/morphemes/utils.py
  127. +3 −0 kss/_modules/paradigm/__init__.py
  128. +82 −0 kss/_modules/paradigm/paradigm.py
  129. +3 −0 kss/_modules/preprocessing/__init__.py
  130. +273 −0 kss/_modules/preprocessing/anonymize.py
  131. +399 −0 kss/_modules/preprocessing/clean_news.py
  132. +113 −0 kss/_modules/preprocessing/completed_form.py
  133. +478 −0 kss/_modules/preprocessing/filter_out.py
  134. +103 −0 kss/_modules/preprocessing/half2full.py
  135. +145 −0 kss/_modules/preprocessing/normalize.py
  136. +502 −0 kss/_modules/preprocessing/preprocess.py
  137. +233 −0 kss/_modules/preprocessing/reduce_repeats.py
  138. +200 −0 kss/_modules/preprocessing/remove_invisible_chars.py
  139. +3 −0 kss/_modules/qwerty/__init__.py
  140. +223 −0 kss/_modules/qwerty/qwerty.py
  141. +150 −0 kss/_modules/qwerty/utils.py
  142. +3 −0 kss/_modules/romanization/__init__.py
  143. +197 −0 kss/_modules/romanization/romanize.py
  144. +212 −0 kss/_modules/romanization/utils.py
  145. +3 −0 kss/_modules/safety/__init__.py
  146. +110 −0 kss/_modules/safety/check_safety.py
  147. +233 −0 kss/_modules/safety/utils.py
  148. +2 −1 kss/_modules/sentences/__init__.py
  149. +2 −1 kss/_modules/sentences/embracing_processor.py
  150. +2 −1 kss/_modules/sentences/sentence_postprocessor.py
  151. +3 −2 kss/_modules/sentences/sentence_preprocessor.py
  152. +13 −16 kss/_modules/sentences/sentence_processor.py
  153. +8 −7 kss/_modules/sentences/sentence_splitter.py
  154. +403 −0 kss/_modules/sentences/sentence_splitter_fast.py
  155. +43 −10 kss/_modules/sentences/split_sentences.py
  156. +3 −0 kss/_modules/spacing/__init__.py
  157. +174 −0 kss/_modules/spacing/correct_spacing.py
  158. +7,440 −0 kss/_modules/spacing/utils.py
  159. +2 −0 kss/_modules/summarization/__init__.py
  160. +0 −26 kss/_modules/summarization/sentence.py
  161. +16 −3 kss/_modules/summarization/summarize_sentences.py
  162. +28 −3 kss/_modules/summarization/utils.py
  163. +2 −1 kss/_utils/__init__.py
  164. +4 −11 kss/_utils/const.py
  165. +5 −4 kss/_utils/emojis.py
  166. +27 −0 kss/_utils/logger.py
  167. +0 −8 kss/_utils/logging.py
  168. +4 −1 kss/_utils/multiprocessing.py
  169. +143 −32 kss/_utils/sanity_checks.py
  170. +136 −9 setup.py
  171. 0 tests/__init__.py
  172. +11 −0 tests/test_augmentation.py
  173. +67 −0 tests/test_collocation.py
  174. +12 −0 tests/test_g2p.py
  175. +12 −0 tests/test_hangulize.py
  176. +31 −0 tests/test_hanja.py
  177. +55 −0 tests/test_jamo.py
  178. +22 −0 tests/test_josa.py
  179. +24 −0 tests/test_keywords.py
  180. +12 −0 tests/test_morphemes.py
  181. +115 −0 tests/test_paradigm.py
  182. +669 −0 tests/test_preprocessing.py
  183. +11 −0 tests/test_qwerty.py
  184. +11 −0 tests/test_romanize.py
  185. +11 −0 tests/test_safety.py
  186. +10 −0 tests/test_sentences.py
  187. +8 −0 tests/test_spacing.py
  188. +8 −0 tests/test_summarization.py
24 changes: 24 additions & 0 deletions .github/workflows/test_macos.yaml
@@ -0,0 +1,24 @@
name: macos-latest
on: push

jobs:
  test:
    runs-on: macos-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.10.11

      - name: Install kss locally
        run: |
          pip3 install -e .
      - name: Install pytest
        run: |
          python3 -m pip install pytest
      - name: Run the test suite
        run: |
          cd tests && pytest -v
24 changes: 24 additions & 0 deletions .github/workflows/test_ubuntu.yaml
@@ -0,0 +1,24 @@
name: ubuntu-latest
on: push

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.10.11

      - name: Install kss locally
        run: |
          pip3 install -e .
      - name: Install pytest
        run: |
          python3 -m pip install pytest
      - name: Run the test suite
        run: |
          cd tests && pytest -v
24 changes: 24 additions & 0 deletions .github/workflows/test_windows.yaml
@@ -0,0 +1,24 @@
name: windows-latest
on: push

jobs:
  test:
    runs-on: windows-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.10.11

      - name: Install kss locally
        run: |
          pip3 install -e .
      - name: Install pytest
        run: |
          python3 -m pip install pytest
      - name: Run the test suite
        run: |
          cd tests && pytest -v
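
The three workflows above differ only in the runner image and execute the same three commands. To reproduce them locally, a minimal Python sketch could look like the following. The names `ci_steps` and `run_ci` are hypothetical and not part of the repository, and the sketch uses `python -m pip` and `python -m pytest` in place of the `pip3` and `pytest` executables, which is an assumption made for portability (note that `python -m pytest` also puts the working directory on `sys.path`).

```python
import subprocess
import sys
from typing import Callable, List, Tuple

Step = Tuple[List[str], str]  # (command, working directory)


def ci_steps(python: str = sys.executable) -> List[Step]:
    """Return commands mirroring the three `run` steps of the workflows."""
    return [
        # "Install kss locally": pip3 install -e .
        ([python, "-m", "pip", "install", "-e", "."], "."),
        # "Install pytest": python3 -m pip install pytest
        ([python, "-m", "pip", "install", "pytest"], "."),
        # "Run the test suite": cd tests && pytest -v
        ([python, "-m", "pytest", "-v"], "tests"),
    ]


def run_ci(runner: Callable[..., object] = subprocess.run) -> None:
    """Run each step in order; check=True stops at the first failing step."""
    for command, cwd in ci_steps():
        runner(command, cwd=cwd, check=True)
```

Calling `run_ci()` from the repository root would then install the package in editable mode and run the tests; the `runner` parameter exists only so the command list can be inspected without executing anything.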
15 changes: 15 additions & 0 deletions .gitignore
@@ -129,3 +129,18 @@ dmypy.json

bench/.java/
.java/

csrc/kss_cython.cpp
csrc/sentence_splitter.o
DS_Store
*/DS_Store
*/*/DS_Store
*/*/*/DS_Store
*/*/*/*/DS_Store
*/*/*/*/*/DS_Store
*/*/*/*/*/*/DS_Store
*/*/*/*/*/*/*/DS_Store
*/*/*/*/*/*/*/*/DS_Store
*/*/*/*/*/*/*/*/*/DS_Store
*/*/*/*/*/*/*/*/*/*/DS_Store
*/*/*/*/*/*/*/*/*/*/*/DS_Store
4 changes: 2 additions & 2 deletions LICENSE
@@ -1,4 +1,4 @@
-Copyright (C) 2021 Hyunwoong Ko <kevin.ko@tunib.ai> and Sang-Kil Park <skpark1224@hyundai.com>
+Copyright (C) 2021 Hyunwoong Ko <kevin.brain@kakaobrain.com> and Sang-Kil Park <skpark1224@hyundai.com>
All rights reserved.

Redistribution and use in source and binary forms, with or without
@@ -23,4 +23,4 @@ SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGE.
+POSSIBILITY OF SUCH DAMAGE.
12 changes: 12 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,12 @@
exclude csrc/kss_cython.cpp
exclude csrc/sentence_splitter.o
include csrc/kss_cython.pyx
include csrc/sentence_splitter.cpp
include csrc/sentence_splitter.h
include csrc/__init__.py
include kss/_modules/g2p/assets/rules.txt
include kss/_modules/g2p/assets/idioms.txt
include kss/_modules/g2p/assets/table.csv
include kss/_modules/augmentation/assets/wordnet.json
include kss/_modules/hanja/assets/table.yml
include setup.py
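
The manifest above uses only the `include` and `exclude` commands: it excludes what appear to be build outputs (`csrc/kss_cython.cpp`, `csrc/sentence_splitter.o`) and includes the Cython/C++ sources and the data assets. As a rough illustration of how such a file can be read, here is a minimal Python sketch. The helper `parse_manifest` is hypothetical, handles only these two commands, and treats patterns as plain strings rather than expanding globs the way the packaging tools do.

```python
from typing import List, Tuple


def parse_manifest(text: str) -> Tuple[List[str], List[str]]:
    """Split MANIFEST.in text into (included, excluded) patterns.

    Only `include` and `exclude` are handled; any other command
    (recursive-include, graft, prune, ...) is ignored in this sketch.
    """
    included: List[str] = []
    excluded: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        command, *patterns = line.split()
        if command == "include":
            included.extend(patterns)
        elif command == "exclude":
            excluded.extend(patterns)
    return included, excluded


# The MANIFEST.in contents from the diff above.
MANIFEST = """\
exclude csrc/kss_cython.cpp
exclude csrc/sentence_splitter.o
include csrc/kss_cython.pyx
include csrc/sentence_splitter.cpp
include csrc/sentence_splitter.h
include csrc/__init__.py
include kss/_modules/g2p/assets/rules.txt
include kss/_modules/g2p/assets/idioms.txt
include kss/_modules/g2p/assets/table.csv
include kss/_modules/augmentation/assets/wordnet.json
include kss/_modules/hanja/assets/table.yml
include setup.py
"""

included, excluded = parse_manifest(MANIFEST)
```

Comparing `included` against the files on disk is one way to catch a packaged asset that was renamed without updating the manifest.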