Commit e030297

Update complexity info in Readme
Change-Id: I11a909c063bda51b7e11abcd3c8469c38479a439
1 parent 4d59ee4 commit e030297


Readme.md

Lines changed: 28 additions & 3 deletions
@@ -19,6 +19,21 @@ An important feature in the DeReKo/KorAP context is also that token character of
- **`de`** (default): Modern German with support for gender-sensitive forms. Forms like `Nutzer:in`, `Nutzer/innen`, `Kaufmann/frau` are kept as single tokens.
- **`de_old`**: Traditional German without gender-sensitive rules. These forms are split into separate tokens (e.g., `Nutzer:in` → `Nutzer` `:` `in`). Useful for processing older texts or when gender forms should not be treated specially (see the example below).
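For a quick impression of the difference, the two variants can be compared on the command line. This is only a sketch: it assumes the standalone jar built in the Installation section below and that the variant is selected with the same `-l` option used in the usage examples.

```shell script
# 'de' keeps the gender-sensitive form as a single token, e.g. "Nutzer:in"
echo "Die Nutzer:in ist zufrieden." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l de

# 'de_old' splits it into separate tokens, e.g. "Nutzer", ":", "in"
echo "Die Nutzer:in ist zufrieden." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l de_old
```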
### Complexity and Performance
Unlike simple script-based or regex-based tokenizers, the KorAP Tokenizer uses high-performance Deterministic Finite Automata (DFA) generated by JFlex. This allows for extremely high throughput (5-20 MB/s) while handling thousands of complex rules and abbreviations simultaneously (see Diewald/Kupietz/Lüngen 2022).
The following table shows the complexity of the underlying automata for each language variant:
| Language | DFA States | DFA Transitions (Edges) | Generated Java Code |
| :--- | :--- | :--- | :--- |
| **German** (`de`) | ~15,000 | 1,737,648 | ~67,000 lines |
| **German** (`de_old`) | ~15,000 | 1,669,140 | ~61,000 lines |
| **English** (`en`) | ~15,000 | 1,186,205 | ~38,000 lines |
| **French** (`fr`) | ~15,000 | 1,188,825 | ~38,000 lines |
The significant size of the German DFA is primarily due to the integrated list of over 5,000 specialized abbreviations and the complex lookahead rules for gender-neutral forms (e.g., handling `:in` vs. namespace colons).
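To get a rough throughput figure on your own hardware, you can time the tokenizer on a larger file. This is a sketch under assumptions: `large_corpus.txt` stands for any sizeable plain-text file of your own, and the jar path is the one produced by the build described below.

```shell script
# Rough throughput check: tokenize a large plain-text file and discard the output
time java -jar target/KorAP-Tokenizer-*-standalone.jar -l de < large_corpus.txt > /dev/null
```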

The included implementations of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
and [`opennlp.tools.sentdetect.SentenceDetector`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html)
@@ -31,31 +46,40 @@ and some updates for handling computer mediated communication, optimized and tes
## Installation
```shell script
mvn clean package
```

#### Note
- Because of the large table of abbreviations, the conversion from the jflex source to java,
- i.e. the calculation of the DFA, takes about 20 to 40 minutes, depending on your hardware,
+ Because of the complexity of the task and the large table of abbreviations, the conversion from the JFlex source to Java,
+ i.e. the calculation of the DFA, takes about 15 to 60 minutes, depending on your hardware,
and requires a lot of heap space.
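If the build runs out of memory, the heap available to Maven can be raised with the standard `MAVEN_OPTS` environment variable. The value below is only a suggestion, not a documented project requirement:

```shell script
# Give the JFlex-to-DFA conversion more heap before building
export MAVEN_OPTS="-Xmx8g"
mvn clean package
```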
For development, you can disable the large abbreviation lists to speed up the build:
```shell script
mvn clean generate-sources -Dforce.fast=true
```
## Usage Examples
By default, the KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.
#### Split English text into tokens
```
$ echo "It's working." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l en
It
's
working
.
```
#### Split French text into tokens and sentences
```
$ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
| java -jar target/KorAP-Tokenizer-*-standalone.jar -s -l fr
@@ -79,6 +103,7 @@ phrase
```

#### Print token character offsets
With the `--positions` option, for example, the tokenizer prints, for each token, the offset of its first character and the offset of the first character after it.
In order to end a text, flush the output, and reset the character position, an EOT character (0x04) can be used.
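As a sketch of how this can be driven from the shell (the exact offset output depends on the input and is not reproduced here), two texts can be sent in one stream, each terminated by an EOT byte:

```shell script
# \004 is the EOT byte; character positions are reset after each text
printf 'Erster Text.\004Zweiter Text.\004' \
| java -jar target/KorAP-Tokenizer-*-standalone.jar --positions
```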
@@ -125,7 +150,7 @@ Alternatively, you can also provide `KorAPTokenizer` implementations independent
**Contributor**:
* [Gregor Middell](https://github.com/gremid)

- Copyright (c) 2023-2025, [Leibniz Institute for the German Language](http://www.ids-mannheim.de/), Mannheim, Germany
+ Copyright (c) 2023-2026, [Leibniz Institute for the German Language](http://www.ids-mannheim.de/), Mannheim, Germany

This package is developed as part of the [KorAP](http://korap.ids-mannheim.de/)
Corpus Analysis Platform at the Leibniz Institute for German Language
