Extraction of plain text from TEI-XML generated using Grobid
tei2txt was built and tested with Python 3.9. It should work with Python >= 3.9, but it has not been tested with any version other than 3.9.
To create the virtual environment and install the dependencies (from requirements.txt), run:

    bash setup.sh

Usage:

    python tei2txt.py --help
    python tei2txt.py [options]

| Option | Default | Description |
|---|---|---|
| -i, --input | None | Required. XML input directory |
| -o, --output | None | Required. Text output directory |
| --force | False | Optional. Reprocess tei.xml input files |
| -f, --filter | None | Optional. Filter by language |
| -s, --stats | None | Optional. lang_id stats in JSON format |
| -S, --selector | "head, p" | Optional. XML CSS selector. Double quotes are mandatory |

Example with a custom selector:

    python tei2txt.py --input ./xml --output ./txt --selector "article-meta article-title, abstract, body title, body p"
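
To illustrate what the default selector "head, p" picks out of a TEI document, here is a minimal sketch that uses only the Python standard library. It is not tei2txt's actual implementation: the function `extract_text`, the helper `local_name`, and the sample document are hypothetical, and the selector handling is deliberately simplified to comma-separated tag names (descendant selectors such as `body p` are not supported here).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample resembling Grobid TEI output (assumption, for illustration only).
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head>
      <p>First paragraph.</p>
      <note>Skipped note.</note>
      <p>Second <hi>styled</hi> paragraph.</p>
    </div>
  </body></text>
</TEI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_text(xml_string, selector="head, p"):
    """Return the text of every element whose tag name is listed in `selector`.

    Simplification (assumption): only comma-separated tag names are handled,
    not full CSS selectors.
    """
    wanted = {name.strip() for name in selector.split(",") if name.strip()}
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():  # document order
        if local_name(element.tag) in wanted:
            # Join nested text (e.g. inside <hi>) and normalize whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                lines.append(text)
    return lines


if __name__ == "__main__":
    print("\n".join(extract_text(TEI_SAMPLE)))
```

With the sample above, the `note` element is skipped because it is not named in the selector, while text nested inside `hi` is kept as part of its enclosing `p`.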
To build the Docker image, navigate to the root directory of the project and run the following command:

    docker build -t tei2txt:1.0 .

Once the Docker image is built, you can use the following command to run the container:

    docker run --rm -v path/to/tei-xml/input:/app/input -v path/to/txt/output:/app/output tei2txt:1.0 -i ./input -o ./output

Example of running tei2txt using custom selectors:

    docker run --rm -v ./examples/tei-xml/:/app/input -v ./output:/app/in-container-output tei2txt:1.0 -i ./input -o ./in-container-output --selector "title,author"