@@ -35,7 +35,7 @@ By default, KorAP tokenizer reads from standard input and writes to standard out

 #### Split English text into tokens
 ```
-$ echo "It's working." | java -jar target/KorAP-Tokenizer-2.2.0.9000-standalone.jar -l en
+$ echo "It's working." | java -jar target/KorAP-Tokenizer-2.2.2-standalone.jar -l en
 It
 's
 working
@@ -44,7 +44,7 @@ working
 #### Split French text into tokens and sentences
 ```
 $ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
-| java -jar target/KorAP-Tokenizer-2.2.0.9000-standalone.jar -s -l fr
+| java -jar target/KorAP-Tokenizer-2.2.2-standalone.jar -s -l fr
 C'
 est
 une
@@ -69,7 +69,7 @@ With the `--positions` option, for example, the tokenizer prints all offsets of
 In order to end a text, flush the output and reset the character position, an EOT character (0x04) can be used.
 ```
 $ echo -n -e 'This is a text.\x0a\x04\x0aAnd this is another text.\n\x04\n' |\
-java -jar target/KorAP-Tokenizer-2.2.0.9000-standalone.jar --positions
+java -jar target/KorAP-Tokenizer-2.2.2-standalone.jar --positions
 This
 is
 a
 #### Print token and sentence offset
 ```
 echo -n -e ' This ist a start of a text. And this is a sentence!!! But what the hack????\x0a\x04\x0aAnd this is another text.' |\
-java -jar target/KorAP-Tokenizer-2.2.0.9000-standalone.jar --no-tokens --positions --sentence-boundaries
+java -jar target/KorAP-Tokenizer-2.2.2-standalone.jar --no-tokens --positions --sentence-boundaries
 1 5 6 9 10 11 12 17 18 20 21 22 23 27 27 28 29 32 33 37 38 40 41 42 43 51 51 54 55 58 59 63 64 67 68 72 72 76
 1 28 29 54 55 76
 0 3 4 8 9 11 12 19 20 24 24 25