Commit e030297

Update complexity info in Readme
Change-Id: I11a909c063bda51b7e11abcd3c8469c38479a439
1 parent 4d59ee4 commit e030297


Readme.md

Lines changed: 28 additions & 3 deletions
@@ -19,6 +19,21 @@ An important feature in the DeReKo/KorAP context is also that token character of
- **`de`** (default): Modern German with support for gender-sensitive forms. Forms like `Nutzer:in`, `Nutzer/innen`, `Kaufmann/frau` are kept as single tokens.
- **`de_old`**: Traditional German without gender-sensitive rules. These forms are split into separate tokens (e.g., `Nutzer:in` → `Nutzer` `:` `in`). Useful for processing older texts or when gender forms should not be treated specially (see the example below).
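For a quick impression of the difference, the two variants can be compared on the command line. This is only a sketch: it assumes the standalone jar built in the Installation section below and that the variant is selected with the same `-l` option used in the usage examples.

```shell script
# 'de' keeps the gender-sensitive form as a single token, e.g. "Nutzer:in"
echo "Die Nutzer:in ist zufrieden." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l de

# 'de_old' splits it into separate tokens, e.g. "Nutzer", ":", "in"
echo "Die Nutzer:in ist zufrieden." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l de_old
```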
### Complexity and Performance
Unlike simple script-based or regex-based tokenizers, the KorAP Tokenizer uses high-performance Deterministic Finite Automata (DFA) generated by JFlex. This allows for extremely high throughput (5-20 MB/s) while handling thousands of complex rules and abbreviations simultaneously (see Diewald/Kupietz/Lüngen 2022).
The following table shows the complexity of the underlying automata for each language variant:
| Language | DFA States | DFA Transitions (Edges) | Generated Java Code |
| :--- | :--- | :--- | :--- |
| **German** (`de`) | ~15,000 | 1,737,648 | ~67,000 lines |
| **German** (`de_old`) | ~15,000 | 1,669,140 | ~61,000 lines |
| **English** (`en`) | ~15,000 | 1,186,205 | ~38,000 lines |
| **French** (`fr`) | ~15,000 | 1,188,825 | ~38,000 lines |
The significant size of the German DFA is primarily due to the integrated list of over 5,000 specialized abbreviations and the complex lookahead rules for gender-neutral forms (e.g., handling `:in` vs. namespace colons).
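To get a rough throughput figure on your own hardware, you can time the tokenizer on a larger file. This is a sketch under assumptions: `large_corpus.txt` stands for any sizeable plain-text file of your own, and the jar path is the one produced by the build described below.

```shell script
# Rough throughput check: tokenize a large plain-text file and discard the output
time java -jar target/KorAP-Tokenizer-*-standalone.jar -l de < large_corpus.txt > /dev/null
```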

The included implementations of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
and [`opennlp.tools.sentdetect.SentenceDetector`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html)
@@ -31,31 +46,40 @@ and some updates for handling computer mediated communication, optimized and tes
## Installation
```shell script
mvn clean package
```

#### Note
- Because of the large table of abbreviations, the conversion from the jflex source to java,
- i.e. the calculation of the DFA, takes about 20 to 40 minutes, depending on your hardware,
+ Because of the complexity of the task and the large table of abbreviations, the conversion from the JFlex source to Java,
+ i.e. the calculation of the DFA, takes about 15 to 60 minutes, depending on your hardware,
and requires a lot of heap space.
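If the build runs out of memory, the heap available to Maven can be raised with the standard `MAVEN_OPTS` environment variable. The value below is only a suggestion, not a documented project requirement:

```shell script
# Give the JFlex-to-DFA conversion more heap before building
export MAVEN_OPTS="-Xmx8g"
mvn clean package
```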
For development, you can disable the large abbreviation lists to speed up the build:
```shell script
mvn clean generate-sources -Dforce.fast=true
```
## Usage Examples
By default, the KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.
#### Split English text into tokens
```
$ echo "It's working." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l en
It
's
working
.
```
#### Split French text into tokens and sentences
```
$ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
| java -jar target/KorAP-Tokenizer-*-standalone.jar -s -l fr
@@ -79,6 +103,7 @@ phrase
```

#### Print token character offsets
With the `--positions` option, for example, the tokenizer prints, for each token, the offset of its first character and the offset of the first character after it.
In order to end a text, flush the output, and reset the character position, an EOT character (0x04) can be used.
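As a sketch of how this can be driven from the shell (the exact offset output depends on the input and is not reproduced here), two texts can be sent in one stream, each terminated by an EOT byte:

```shell script
# \004 is the EOT byte; character positions are reset after each text
printf 'Erster Text.\004Zweiter Text.\004' \
| java -jar target/KorAP-Tokenizer-*-standalone.jar --positions
```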
@@ -125,7 +150,7 @@ Alternatively, you can also provide `KorAPTokenizer` implementations independent
**Contributor**:
* [Gregor Middell](https://github.com/gremid)

- Copyright (c) 2023-2025, [Leibniz Institute for the German Language](http://www.ids-mannheim.de/), Mannheim, Germany
+ Copyright (c) 2023-2026, [Leibniz Institute for the German Language](http://www.ids-mannheim.de/), Mannheim, Germany

This package is developed as part of the [KorAP](http://korap.ids-mannheim.de/)
Corpus Analysis Platform at the Leibniz Institute for German Language
