Commit 0abcb59: Merge "Update Readme.md"
kupietz authored, Gerrit Code Review committed
2 parents: ab9187d + 8c7488b

1 file changed: Readme.md, 46 additions & 49 deletions
@@ -1,76 +1,65 @@
# KorAP Tokenizer

Interface and implementation of a tokenizer and sentence splitter that can be used

* for German, English, and French, and with some limitations also for other languages
* as a standalone tokenizer and/or sentence splitter
* or within the KorAP ingestion pipeline
* or within the [OpenNLP tools](https://opennlp.apache.org) framework

The included implementations (`DerekoDfaTokenizer_de`, `DerekoDfaTokenizer_en`, `DerekoDfaTokenizer_fr`) are highly efficient DFA tokenizers and sentence splitters with character offset output, based on [JFlex](https://www.jflex.de/).
The `de` variant is used for the German Reference Corpus DeReKo. Being based on finite state automata, the tokenizers are potentially not as accurate as language-model-based ones, but, at ~5 billion words per hour, typically more efficient.
An important feature in the DeReKo/KorAP context is that token character offsets can be reported, which can be used for applying standoff annotations.
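As a minimal sketch of how such offset output might be consumed (the `start end` pairs below are made-up illustrative values, not actual tokenizer output), token surfaces can be recovered from the input text with standard tools:

```shell
# Illustrative sketch: hypothetical "start end" character offsets
# (0-based, end-exclusive) are used to cut token surfaces out of the text.
text="It's working."
printf '0 2\n2 4\n5 12\n12 13\n' | while read -r start end; do
  echo "$text" | cut -c"$((start + 1))-$end"   # cut uses 1-based, inclusive ranges
done
```

Because the offsets refer to the original text, annotations derived this way can be stored standoff, without modifying the input.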

The included implementations of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
and [`opennlp.tools.sentdetect.SentenceDetector`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html)
interfaces and can thus be used as drop-in replacements in OpenNLP applications.

The underlying scanner is based on the Lucene scanner with modifications from [David Hall](https://github.com/dlwh).

Our changes mainly concern good coverage of German and, optionally, of some English and French abbreviations, as well as some updates for handling computer-mediated communication, optimized and tested, in the case of German, against the gold data from the [EmpiriST 2015](https://sites.google.com/site/empirist2015/) shared task (Beißwenger et al. 2016).


## Installation
```shell script
mvn clean install
```
#### Note
Because of the large table of abbreviations, the conversion from the JFlex source to Java, i.e. the calculation of the DFA, takes about 5 to 30 minutes, depending on your hardware, and requires a lot of heap space.

## Example Usage
By default, the KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.

#### Split English text into tokens
```
$ echo "It's working." | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -l en
It
's
working
.

```
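Since the tokenizer emits one token per line, its output composes directly with standard text tools. In this sketch, `printf` stands in for the jar invocation above so that the snippet runs without a built jar (illustrative only):

```shell
# printf substitutes for the tokenizer's one-token-per-line output;
# wc -l then counts the tokens (here: 4).
printf "It\n's\nworking\n.\n" | wc -l
```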

#### Split French text into tokens and sentences
```
$ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
  | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -s -l fr
C'
est
une
phrase
.

Ici
,
il
s'
agit
d'
une
deuxième
phrase
.

```
@@ -105,6 +94,14 @@ echo -n -e ' This ist a start of a text. And this is a sentence!!! But what the
0 25
```

### Adding Support for more Languages
To adapt the included implementations to more languages, take one of the `language-specific_<language>.jflex-macro` files as a template and modify, for example, the abbreviation macro `SEABBR`. Then add an `execution` section for the new language to the jcp ([java-comment-preprocessor](https://github.com/raydac/java-comment-preprocessor)) artifact in `pom.xml`, following the example of one of the configurations there.
After building the project (see above), your added language-specific tokenizer / sentence splitter should be selectable with the `--language` option.

Alternatively, you can also provide `KorapTokenizer` implementations independently on the class path and select them with the `--tokenizer-class` option.

## Development and License

**Authors**:
