Docker image for Helmut Schmid's TreeTagger (based on Stefan Fischer's docker-treetagger) with support for input and output in CoNLL-U format.
Please read Helmut Schmid's license terms before using this Dockerfile.
From Docker Hub:
docker pull korap/conllu-treetagger

Or build from source:
git clone https://github.com/KorAP/conllu-treetagger-docker.git
cd conllu-treetagger-docker
make build-docker

$ docker run --rm -i korap/conllu-treetagger < goe.conllu | head -8
# foundry = tree_tagger
# filename = GOE/AGA/00000/base/tokens.xml
# text_id = GOE_AGA.00000
# start_offsets = 0 0 9 12
# end_offsets = 22 8 11 22
1 Campagne <unknown> _ NN _ _ _ _ _
2 in in _ APPR _ _ _ _ _
3 Frankreich Frankreich _ NE _ _ _ _ _

To output different POS/lemma interpretations with their probabilities, use the -p option. You can optionally specify a threshold with -t (default: 0.1):
$ docker run --rm -i korap/conllu-treetagger -p -t 0.01 < goe.conllu | head -8
# foundry = tree_tagger
# filename = GOE/AGA/00000/base/tokens.xml
# text_id = GOE_AGA.00000
# start_offsets = 0 0 9 12
# end_offsets = 22 8 11 22
1 Campagne <unknown> _ NN _ _ _ _ _
2 in in _ APPR _ _ _ _ _
3 Frankreich Frankreich _ NE|NN|ADJD _ _ _ _ 0.956|0.032|0.012
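
The alternative tags and their probabilities are joined with "|" in the same order, so they can be paired up again downstream. The following is a minimal sketch, not part of this image: the helper name tt_alternatives is hypothetical, and it assumes tab-separated CoNLL-U with the tags in column 5 and the probabilities in column 10, as in the example output above.

```shell
# Hypothetical helper (not shipped with the image): print one line per
# alternative interpretation as FORM <TAB> TAG <TAB> PROBABILITY.
# Assumption: input is tab-separated CoNLL-U, tags in column 5,
# probabilities in column 10, both separated by "|".
tt_alternatives() {
  awk -F'\t' '
    /^#/ || NF < 10 { next }   # skip comment lines and blank lines
    $10 == "_"      { next }   # token carries no probabilities
    {
      n = split($5, tags, "|")
      split($10, probs, "|")
      for (i = 1; i <= n; i++)
        printf "%s\t%s\t%s\n", $2, tags[i], probs[i]
    }'
}
```

It could be appended to the pipeline above, for example as `docker run --rm -i korap/conllu-treetagger -p -t 0.01 < goe.conllu | tt_alternatives`.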
korapxmltool, which includes korapxml2conllu as a shortcut, can be downloaded from https://github.com/KorAP/korapxmltool.
korapxml2conllu goe.zip | docker run --rm -i korap/conllu-treetagger -l german -p

korapxmltool -A "docker run --rm -i korap/conllu-treetagger" -t zip t24.zip

To avoid downloading the language model on every run, you can mount a local directory to /local/models:
korapxml2conllu goe.zip | docker run --rm -i -v /path/to/local/models:/local/models korap/conllu-treetagger -l german

For an overview of the available languages and models, run the following command:
docker run --rm -i korap/conllu-treetagger -L

Open a shell within the container:
docker run --rm -it --entrypoint /bin/bash korap/conllu-treetagger

The language can be specified with the -l option. Parameter files will be downloaded automatically from the tagger's website.
The following languages are available: Bulgarian, Catalan, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Middle High German, Greek, Ancient Greek, Ancient Greek (beta encoding), Italian, Korean, Latin, Norwegian (Bokmål), Polish, Portuguese, Portuguese (fine-grained tagset), Portuguese (alternative corpus), Romanian, Russian, Slovak, Slovenian, Spanish, Spanish (Ancora corpus), Swahili, Swedish.
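
If you tag with several languages regularly, the -l option and the /local/models mount can be combined in a small wrapper so that each parameter file is downloaded only once. This is a sketch under assumptions: the function name tt_tag, the TT_MODELS variable and the default cache path are hypothetical and not part of this image.

```shell
# Hypothetical wrapper (names and default path are assumptions):
# tag CoNLL-U from stdin with the given language, keeping downloaded
# parameter files in a local directory mounted at /local/models.
tt_tag() {
  lang=$1
  cache=${TT_MODELS:-$HOME/.cache/treetagger-models}
  mkdir -p "$cache"
  docker run --rm -i -v "$cache:/local/models" korap/conllu-treetagger -l "$lang"
}
```

A call would then look like `korapxml2conllu goe.zip | tt_tag german`.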