# KorAP Tokenizer
Interface and implementation of a tokenizer and sentence splitter that can be used

* for German, English, and French, and with some limitations also for other languages
* as a standalone tokenizer and/or sentence splitter
* within the KorAP ingestion pipeline
* within the [OpenNLP tools](https://opennlp.apache.org) framework

The included implementations (`DerekoDfaTokenizer_de`, `DerekoDfaTokenizer_en`, `DerekoDfaTokenizer_fr`) are highly efficient DFA tokenizers and sentence splitters with character offset output, based on [JFlex](https://www.jflex.de/).
The `de` variant is used for the German Reference Corpus DeReKo. Being based on finite state automata,
the tokenizers are potentially not as accurate as language-model-based ones, but, at ~5 billion words per hour, typically more efficient.
An important feature in the DeReKo/KorAP context is that token character offsets can be reported, which makes it possible to apply standoff annotations.
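To illustrate why character offsets matter for standoff annotation: a consumer can attach information to `(start, end)` character spans without modifying or copying the source text. A minimal, stdlib-only sketch (the sentence and its offsets are made up for the example; they are merely in the start/end style a tokenizer with offset output can report):

```java
public class StandoffDemo {
    // Resolve one standoff span back to its surface form in the source text.
    public static String surface(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Das ist ein Satz.";
        // Hypothetical token offsets as (start, end) pairs into the original text.
        int[][] tokenOffsets = { {0, 3}, {4, 7}, {8, 11}, {12, 16}, {16, 17} };
        for (int[] s : tokenOffsets) {
            System.out.println(s[0] + "\t" + s[1] + "\t" + surface(text, s[0], s[1]));
        }
    }
}
```

Because the offsets refer to the untouched original text, any number of independent annotation layers can be stored separately and resolved this way.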

The included implementations of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
and [`opennlp.tools.sentdetect.SentenceDetector`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html)
interfaces and can thus be used as drop-in replacements in OpenNLP applications.

The underlying scanner is based on the Lucene scanner with modifications from [David Hall](https://github.com/dlwh).

Our changes mainly concern a good coverage of German and, optionally, of some English and French abbreviations,
as well as some updates for handling computer-mediated communication, optimized and tested, in the case of German, against the gold data from the [EmpiriST 2015](https://sites.google.com/site/empirist2015/) shared task (Beißwenger et al. 2016).

## Installation
```shell script
mvn clean install
```
#### Note
Because of the large table of abbreviations, the conversion from the JFlex source to Java,
i.e. the calculation of the DFA, takes about 5 to 30 minutes, depending on your hardware,
and requires a lot of heap space.

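If the build fails with a stack overflow or out-of-memory error during this step, the JVM limits can be raised via `MAVEN_OPTS` (the `-Xss2m` value appeared in an earlier revision of this README; the heap size here is illustrative):

```shell script
# Illustrative: raise JVM stack and heap for the JFlex DFA generation
MAVEN_OPTS="-Xss2m -Xmx4g" mvn clean install
```
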
## Usage Examples
By default, the KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.

#### Split English text into tokens
```
$ echo "It's working." | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -l en
It
's
working
.
```
#### Split French text into tokens and sentences
```
$ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
  | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -s -l fr
C'
est
une
phrase
.

Ici
,
il
s'
agit
d'
une
deuxième
phrase
.

```
### Adding Support for more Languages
To adapt the included implementations to more languages, take one of the `language-specific_<language>.jflex-macro` files as a template and
modify, for example, the macro for abbreviations `SEABBR`. Then add an `execution` section for the new language
to the jcp ([java-comment-preprocessor](https://github.com/raydac/java-comment-preprocessor)) artifact in `pom.xml`, following the example of one of the configurations there.
After building the project (see above), your added language-specific tokenizer / sentence splitter should be selectable with the `--language` option.

Alternatively, you can also provide `KorAPTokenizer` implementations independently on the class path and select them with the `--tokenizer-class` option.

## Development and License

**Authors**: