GitHub - projecte-aina/tei2txt: Files to process pdf into txt, using grobid

Extraction of plain text from TEI-XML generated using Grobid

Install and run

Virtual environment

tei2txt was built and tested with Python 3.9. It should work with Python >= 3.9, but it has not been tested with any version other than 3.9.

To create the virtual environment and install the dependencies (from requirements.txt), run:

bash setup.sh

Help

python tei2txt.py --help

Usage

python tei2txt.py [options]

Option	Default	Description
`-i`, `--input`	`None`	Required. XML input directory
`-o`, `--output`	`None`	Required. Text output directory
`--force`	`False`	Optional. Reprocess tei.xml input files
`-f`, `--filter`	`None`	Optional. Filter by language
`-s`, `--stats`	`None`	Optional. lang_id stats in JSON format
`-S`, `--selector`	`"head, p"`	Optional. CSS selector applied to the XML. Double quotes are mandatory
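The options table can be read as an argparse definition. The following is a minimal sketch under that assumption; it is not taken from tei2txt.py, and the function name `build_parser` and the help strings are hypothetical. It also shows why the selector needs quoting on the command line: without quotes, the shell would split `head, p` into two separate arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table (not the actual tei2txt source)."""
    parser = argparse.ArgumentParser(prog="tei2txt.py")
    parser.add_argument("-i", "--input", required=True, help="XML input directory")
    parser.add_argument("-o", "--output", required=True, help="Text output directory")
    parser.add_argument("--force", action="store_true", help="Reprocess tei.xml input files")
    parser.add_argument("-f", "--filter", default=None, help="Filter by language")
    parser.add_argument("-s", "--stats", default=None, help="lang_id stats in JSON format")
    parser.add_argument("-S", "--selector", default="head, p", help="CSS selector for the XML")
    return parser


# Passing an explicit list stands in for what the shell would hand over.
args = build_parser().parse_args(["-i", "./xml", "-o", "./txt"])
print(args.selector)  # falls back to the default selector: head, p
```

Note that `-s` and `-S` are distinct options, since option names are case-sensitive.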

Example:

python tei2txt.py --input ./xml --output ./txt --selector "article-meta article-title, abstract, body title, body p"
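To illustrate what a selector such as `"head, p"` picks out of a TEI document, here is a simplified sketch using only the standard library. It is an assumption-laden stand-in, not the tool's implementation: it understands only a comma-separated list of tag names (no descendant selectors such as `body p`), and the sample document and the helper names `local_name` and `extract_text` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written TEI fragment, used only for illustration.
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head><p>First <hi>paragraph</hi>.</p></div>
  </body></text>
</TEI>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_text(xml_string: str, selector: str = "head, p") -> list[str]:
    """Return the text of every element whose tag name appears in the selector list."""
    wanted = {name.strip() for name in selector.split(",") if name.strip()}
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():  # document order
        if local_name(element.tag) in wanted:
            # Join nested text (e.g. inside <hi>) and normalise whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                lines.append(text)
    return lines


print(extract_text(TEI_SAMPLE))  # ['Introduction', 'First paragraph.']
```

A full CSS selector engine would also handle combinators like the `article-meta article-title` form used in the example above; this sketch deliberately leaves that out.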

Docker

Build Docker Image

To build the Docker image, navigate to the root directory of the project and run the following command:

docker build -t tei2txt:1.0 .

Run Docker Container

Once the Docker image is built, you can use the following command to run the container:

docker run --rm -v path/to/tei-xml/input:/app/input -v path/to/txt/output:/app/output tei2txt:1.0 -i ./input -o ./output

Example of running tei2txt using custom selectors

docker run --rm -v ./examples/tei-xml/:/app/input -v ./output:/app/in-container-output tei2txt:1.0 -i ./input -o ./in-container-output --selector "title,author"
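When the container is launched from a script, the same command can be assembled programmatically. The sketch below is one possible way to do that, assuming the image tag and the in-container paths `/app/input` and `/app/output` from the commands above; the function name `build_docker_command` is hypothetical. Host paths are resolved to absolute paths, which bind mounts accept regardless of the current working directory.

```python
import shlex
from pathlib import Path
from typing import Optional


def build_docker_command(
    input_dir: str,
    output_dir: str,
    image: str = "tei2txt:1.0",
    selector: Optional[str] = None,
) -> list[str]:
    """Assemble a `docker run` argument list equivalent to the commands above."""
    host_in = Path(input_dir).resolve()    # absolute host path for the input mount
    host_out = Path(output_dir).resolve()  # absolute host path for the output mount
    command = [
        "docker", "run", "--rm",
        "-v", f"{host_in}:/app/input",
        "-v", f"{host_out}:/app/output",
        image,
        "-i", "./input",
        "-o", "./output",
    ]
    if selector is not None:
        # As a list element the selector needs no extra quoting.
        command += ["--selector", selector]
    return command


command = build_docker_command("./examples/tei-xml", "./output", selector="title,author")
print(shlex.join(command))  # shell-safe rendering of the command
# To actually run it: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids the quoting issue entirely, because no shell is involved in splitting the selector.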
