Readme.md: 28 additions & 3 deletions
@@ -19,6 +19,21 @@ An important feature in the DeReKo/KorAP context is also that token character of
 - **`de`** (default): Modern German with support for gender-sensitive forms. Forms like `Nutzer:in`, `Nutzer/innen`, `Kaufmann/frau` are kept as single tokens.
 - **`de_old`**: Traditional German without gender-sensitive rules. These forms are split into separate tokens (e.g., `Nutzer:in` → `Nutzer` `:` `in`). Useful for processing older texts or when gender forms should not be treated specially.
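
The difference between the two German variants can be illustrated with a small, self-contained sketch. This is not the KorAP implementation (which is generated from a JFlex grammar); the class name `GenderFormSketch` and both regular expressions are assumptions made up for this illustration, and they are far cruder than the real rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenderFormSketch {
    // Hypothetical patterns, for illustration only.
    // "de"-like behaviour: a word may continue with ':' or '/' plus more letters.
    private static final Pattern KEEP = Pattern.compile("\\p{L}+(?:[:/]\\p{L}+)*|\\S");
    // "de_old"-like behaviour: letter runs and single punctuation marks are separate tokens.
    private static final Pattern SPLIT = Pattern.compile("\\p{L}+|\\S");

    static List<String> tokenize(String text, boolean keepGenderForms) {
        Matcher m = (keepGenderForms ? KEEP : SPLIT).matcher(text);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Nutzer:in", true));   // prints [Nutzer:in]
        System.out.println(tokenize("Nutzer:in", false));  // prints [Nutzer, :, in]
    }
}
```

Note that the `KEEP` pattern above would also glue together unrelated pairs such as `und/oder`; telling such cases apart is exactly the kind of work the real rule set has to do.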

+### Complexity and Performance
+
+Unlike simple script-based or regex-based tokenizers, the KorAP Tokenizer uses high-performance Deterministic Finite Automata (DFA) generated by JFlex. This allows for extremely high throughput (5-20 MB/s) while handling thousands of complex rules and abbreviations simultaneously (see Diewald/Kupietz/Lüngen 2022).
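
To make the DFA idea concrete, here is a hand-written toy scanner. It is only a sketch: the class `TinyDfaScanner`, its four states, and its three character classes are invented for this example and have nothing to do with the automata JFlex generates for KorAP. What it shows is the general principle that each input character costs one table lookup, regardless of how many rules were compiled into the table.

```java
import java.util.ArrayList;
import java.util.List;

public class TinyDfaScanner {
    // Character classes (columns of the transition table).
    private static final int LETTER = 0, DIGIT = 1, OTHER = 2;
    // States (rows). DEAD means "no transition": the current token ends here.
    private static final int START = 0, WORD = 1, NUMBER = 2, PUNCT = 3, DEAD = -1;

    private static final int[][] NEXT = {
        /* START  */ { WORD, NUMBER, PUNCT },
        /* WORD   */ { WORD, DEAD,   DEAD  },
        /* NUMBER */ { DEAD, NUMBER, DEAD  },
        /* PUNCT  */ { DEAD, DEAD,   DEAD  },
    };

    private static int classOf(char c) {
        if (Character.isLetter(c)) return LETTER;
        if (Character.isDigit(c)) return DIGIT;
        return OTHER;
    }

    static List<String> scan(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Whitespace is handled outside the table to keep the sketch small.
            if (Character.isWhitespace(text.charAt(i))) { i++; continue; }
            int state = START;
            int start = i;
            // Follow transitions until the automaton has no move left (longest match).
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                int next = NEXT[state][classOf(text.charAt(i))];
                if (next == DEAD) break;
                state = next;
                i++;
            }
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("Es kostet 20 Euro."));  // prints [Es, kostet, 20, Euro, .]
    }
}
```

A generated scanner works on the same principle, only with a much larger table and with extra machinery such as lookahead that this sketch leaves out.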
+
+The following table shows the complexity of the underlying automata for each language variant:
+
+| Language | DFA States | DFA Transitions (Edges) | Generated Java Code |
+The significant size of the German DFA is primarily due to the integrated list of over 5,000 specialized abbreviations and the complex lookahead rules for gender-neutral forms (e.g., handling `:in` vs. namespace colons).
+

 The included implementations of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
 and [`opennlp.tools.sentdetect.SentenceDetector`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html)
@@ -31,31 +46,40 @@ and some updates for handling computer mediated communication, optimized and tes


 ## Installation
+
 ```shell script
 mvn clean package
 ```
+
 #### Note
-Because of the large table of abbreviations, the conversion from the jflex source to java,
-i.e. the calculation of the DFA, takes about 20 to 40 minutes, depending on your hardware,
+
+Because of the complexity of the task and the large table of abbreviations, the conversion from the JFlex source to Java,
+i.e. the calculation of the DFA, takes about 15 to 60 minutes, depending on your hardware,
 and requires a lot of heap space.

 For development, you can disable the large abbreviation lists to speed up the build:
 ```shell script
 mvn clean generate-sources -Dforce.fast=true
 ```

+
+
 ## Examples Usage
+
 By default, KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operations.

 #### Split English text into tokens
+
 ```
 $ echo "It's working." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l en
 It
 's
 working
 .
 ```
+
 #### Split French text into tokens and sentences
+
 ```
 $ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \