Automated entity tagging pipeline for early modern Dutch PageXML documents using Republic project NER models trained on 17th-18th century Dutch States General resolutions.
- Pre-trained models: Uses Republic NER models specifically trained on historical Dutch administrative texts
- Multiple entity types: Supports persons (PER) and dates (DAT) with extensible architecture
- Transkribus integration: Direct import/export workflow with Transkribus PageXML format
- Collection-based processing: Automatically discovers and processes entire Transkribus export collections
- Interactive selection: Choose specific collections to process through an interactive interface
- Batch processing: Process multiple collections efficiently
- Isolated environments: Handles dependency conflicts through separate virtual environments
- Preserves structure: Maintains existing annotations, reading order, and XML structure
This repository contains scripts for tagging named entities in PageXML documents exported from Transkribus. The script is designed to work with Transkribus export structures (numbered folders containing page/ subdirectories with XML files). Tagged documents can be re-imported into Transkribus with entity annotations preserved in custom attributes, enabling further annotation, searching, and analysis of historical entities.
- PER (persoon) - Persons: names of individuals, titles, and roles
- DAT (datum) - Dates: temporal expressions in various historical formats
The modular architecture allows for future expansion to additional entity types such as organisations (ORG) and geographical locations (LOC).
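As a sketch of what such an extensible design can look like, the mapping below pairs each entity type with its Transkribus tag name and model file. The names `ENTITY_TYPES` and `tag_name` are illustrative, not the actual script's API; adding a new entity type would then amount to adding one entry and the matching model file.

```python
# Hypothetical registry of supported entity types; the real script's
# internals may be organised differently.
ENTITY_TYPES = {
    "PER": {"tag": "persoon", "model": "./models/best-model_per.pt"},
    "DAT": {"tag": "datum", "model": "./models/best-model_dat.pt"},
    # Future expansion could add, for example:
    # "ORG": {"tag": "organisatie", "model": "./models/best-model_org.pt"},
    # "LOC": {"tag": "locatie", "model": "./models/best-model_loc.pt"},
}

def tag_name(entity_type):
    """Return the Transkribus custom-attribute key for an entity type."""
    return ENTITY_TYPES[entity_type]["tag"]
```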
- Windows operating system (scripts use .bat launchers; for Unix systems, adaptation is required)
- Python 3.8-3.11 (Python 3.12+ is not compatible with the Republic models' installation requirements)
- Sufficient disk space (~2GB for models and embeddings)
- Internet connection for initial setup (downloading models and embeddings)
Clone this repository or download as ZIP and extract to your local machine:
git clone https://github.com/CARomein/Entity_Recognition_Resolutions.git
cd Entity_Recognition_Resolutions
The different NER models have incompatible dependencies. The DAT model requires specific Flair configurations that conflict with other models' requirements. To resolve this, each entity type uses an isolated virtual environment.
Open Command Prompt in the project directory and execute the following commands:
python -m venv venv_per
python -m venv venv_dat
Why separate environments?
- The Republic NER models were trained with different embedding configurations
- Dependency conflicts prevent simultaneous installation in a single environment
- Isolated environments ensure all models function correctly without mutual interference
Install the required libraries in each environment. Execute these commands one at a time:
venv_per\Scripts\activate
pip install -r requirements.txt
deactivate
venv_dat\Scripts\activate
pip install -r requirements.txt
deactivate
Note: The first run will download Flair embeddings (approximately 500MB), which may take several minutes depending on your connection speed. This is a one-time operation.
Download the required .pt model files from marijnkoolen's Hugging Face profile and place them in the models/ directory within this repository:
Required models:
- best-model_per.pt - Persons model
- best-model_dat.pt - Dates model
Create the models/ directory if it does not exist:
mkdir models
Then place the downloaded .pt files in this directory.
After completing these steps, your directory structure should resemble:
Entity_Recognition_Resolutions\
├── tag_entities.py # Main tagging script
├── run_per.bat # Launcher for PER tagging
├── run_dat.bat # Launcher for DAT tagging
├── requirements.txt # Python dependencies
├── venv_per\ # Virtual environment for PER
├── venv_dat\ # Virtual environment for DAT
├── models\ # Downloaded model files
│ ├── best-model_per.pt
│ └── best-model_dat.pt
├── README.md # This file
├── LICENSE # MIT License
└── .gitignore # Git ignore patterns
The script expects PageXML documents in a Transkribus export structure:
base_directory\
├── 12345\ # Collection number
│ └── page\
│ ├── 001.xml
│ ├── 002.xml
│ └── ...
├── 12346\ # Another collection
│ └── page\
│ ├── 001.xml
│ └── ...
└── ...
The script will automatically discover all numbered folders containing page/ subdirectories with XML files.
- Export collections from Transkribus to a base directory
- Run the appropriate entity tagging script
- Select collections to process through the interactive interface
- Review tagged output
- Import collections back into Transkribus
Double-click run_per.bat or run it from the command line. The script will prompt for the base directory:
run_per.bat
Example interaction:
========================================
Person Tagger (PER)
========================================
Enter directory to process: C:\transkribus_exports\resolutions
Activating environment...
======================================================================
Entity Recognition Tagger
======================================================================
Base directory: C:\transkribus_exports\resolutions
Entity type: PER
Model: ./models/best-model_per.pt
======================================================================
======================================================================
Available Collections:
======================================================================
1. 12345 ( 142 XML files)
2. 12346 ( 89 XML files)
3. 12347 ( 201 XML files)
======================================================================
Options:
- Type 'all' to process all collections
- Type 'include' to select specific collections
- Type 'exclude' to exclude specific collections
Your choice: include
Enter numbers to INCLUDE (space-separated, e.g. 1 3 5):
Numbers: 1 3
You selected 2 collection(s):
- 12345 (142 files)
- 12347 (201 files)
Confirm? (yes/no): yes
======================================================================
Loading PER model...
======================================================================
Model loaded: PER
Tag name: persoon
======================================================================
Processing: 12345 (142 files)
======================================================================
✓ 001.xml 12 tags
✓ 002.xml 8 tags
...
Collection 12345: 142 files processed
======================================================================
Processing: 12347 (201 files)
======================================================================
✓ 001.xml 15 tags
...
======================================================================
SUMMARY
======================================================================
Total files processed: 343
Total PER tags added: 2847
======================================================================
Done!
Double-click run_dat.bat or run it from the command line:
run_dat.bat
The interaction pattern is identical to the PER tagger, but it processes date entities instead.
To tag multiple entity types, run each script sequentially. Each script preserves existing tags, so running multiple scripts adds cumulative annotations:
run_per.bat
run_dat.bat
The script provides three selection modes:
- All collections: Processes every collection found in the base directory
- Include specific collections: Select collections by number (space-separated)
- Exclude specific collections: Process all except specified collections
This flexibility allows for efficient batch processing of large archives or selective processing of specific collections.
The script displays detailed progress information:
- Collection being processed
- Files processed with tag counts
- Summary statistics per collection
- Total files and tags across all collections
Files without any detected entities are not displayed, keeping the output focused on actual results.
PageXML files exported from Transkribus containing transcribed text. The script processes TextLine elements:
<TextLine id="line_1">
<Coords points="100,200 500,200 500,250 100,250"/>
<TextEquiv>
<Unicode>Op heden den 15 january 1650 compareerde Jan Smit</Unicode>
</TextEquiv>
</TextLine>
PageXML files with entity annotations embedded in TextLine custom attributes:
<TextLine id="line_1" custom="datum {offset:13;length:15;} persoon {offset:41;length:8;}">
<Coords points="100,200 500,200 500,250 100,250"/>
<TextEquiv>
<Unicode>Op heden den 15 january 1650 compareerde Jan Smit</Unicode>
</TextEquiv>
</TextLine>
The format follows Transkribus conventions for custom attributes:
- offset: character position where the entity begins (0-indexed)
- length: number of characters in the entity
Multiple entities on the same line are separated by spaces in the custom attribute.
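The offset/length bookkeeping for a single line can be sketched as follows (`format_entity` is an illustrative helper, not part of the repository; it locates the first occurrence of the entity text in the line):

```python
def format_entity(tag, text, entity_text):
    """Build one Transkribus custom-attribute entry for the first
    occurrence of entity_text in text (no spaces around ':' or ';')."""
    offset = text.index(entity_text)  # 0-indexed character position
    return f"{tag} {{offset:{offset};length:{len(entity_text)};}}"

line = "Op heden den 15 january 1650 compareerde Jan Smit"
print(format_entity("datum", line, "15 january 1650"))
# datum {offset:13;length:15;}
print(format_entity("persoon", line, "Jan Smit"))
# persoon {offset:41;length:8;}
```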
Tagged files are modified in place and can be directly re-imported into Transkribus for further annotation, analysis, or export.
Important: Before importing tagged PageXML files, ensure the entity type names are configured in your Transkribus collection:
- Open your collection in Transkribus
- Navigate to Collection → Tags
- Add custom tags matching your entity types:
  - persoon (for PER entities)
  - datum (for DAT entities)
- Assign colours for visualisation
- Save configuration
Without this configuration, tagged entities will not be visible in the Transkribus interface.
- Export from Transkribus:
  - Select collections to export
  - Choose PageXML format
  - Export maintains numbered folder structure
- Tag with this tool:
  - Run appropriate launcher scripts
  - Select collections to process
  - Verify output in console
- Import back to Transkribus:
  - Import entire collection folders
  - Verify entity tags appear correctly
  - Continue annotation or analysis
The scripts preserve existing custom attributes, allowing cumulative tagging:
- Tag persons first with run_per.bat
- Then tag dates with run_dat.bat
- Both entity types appear in the final PageXML
This approach enables flexible workflows and iterative refinement.
Problem: "ERROR: Directory not found"
Solutions:
- Verify the directory path is correct
- Use absolute paths if relative paths do not work
- Ensure the directory contains numbered folders with page/ subdirectories
- Check for trailing backslashes (automatically handled by the scripts)
Problem: "No collections found! Expected structure: base_dir/number/page/*.xml"
Solutions:
- Verify the directory structure matches Transkribus export format
- Ensure numbered folders contain page/ subdirectories
- Check that page/ folders contain .xml files
- Verify folder names are numbers (not text labels)
Problem: "Error: Model file not found: ./models/best-model_per.pt"
Solutions:
- Verify the .pt file exists in the models/ directory
- Check that the filename matches exactly (case-sensitive)
- Ensure the file downloaded completely (it should be several hundred MB)
- Create the models/ directory if it does not exist
Problem: Flair not installed in the virtual environment
Solutions:
- Ensure you ran pip install -r requirements.txt in the correct environment
- Verify you created both virtual environments
- Reinstall: pip install flair
- Check Python version compatibility (3.8-3.11)
Problem: "This version of Flair requires Python 3.8-3.11"
Solutions:
- Check your Python version: python --version
- Install Python 3.11 if necessary
- Create virtual environments with the correct Python version
- Use py -3.11 -m venv venv_per to specify the version
Problem: Script takes very long to process files
Solutions:
- First run downloads embeddings (~500MB) - this is normal and one-time only
- Subsequent runs should be faster (10-30 seconds per file)
- Large files with many TextLines naturally take longer
- Close other applications to free up RAM
- Process collections in batches if necessary
Problem: Script completes but reports 0 files processed
Solutions:
- Verify the model file is correct for the entity type
- Check that TextLine elements contain text (Unicode elements)
- Ensure text is in Dutch (models trained on Dutch)
- Try a different collection to verify the model works
- Check that the text is historical Dutch (modern Dutch may have lower recall)
- Verify XML files are well-formed
Problem: UnicodeDecodeError or similar encoding issues
Solutions:
- Ensure PageXML files are UTF-8 encoded
- Re-export from Transkribus if necessary
- Check for corrupted XML files
- Verify namespace declarations are correct
Problem: Imported files do not show entity tags in Transkribus interface
Solutions:
- Configure entity types in Collection → Tags with exact names (persoon, datum)
- Re-import the documents after configuring tags
- Verify the custom attribute format matches Transkribus conventions
- Check that the XML structure was preserved correctly
- Typical speed: 10-30 seconds per PageXML file
- Factors affecting speed:
- Number of TextLines per file
- Text density (characters per line)
- First run (downloads embeddings)
- System specifications
- Number of entities per file
- Small collection (100 pages): ~15-30 minutes
- Medium collection (500 pages): ~1-2 hours
- Large collection (1000 pages): ~2-5 hours
These estimates assume subsequent runs (embeddings already downloaded).
- RAM: 8GB recommended (4GB minimum)
- Storage: ~2GB for models and embeddings
- CPU: Standard processor sufficient (GPU not required)
- OS: Windows (scripts use .bat format)
- Process collections in batches rather than all at once
- Close unnecessary applications to free RAM
- Use SSD storage for faster file I/O
- After initial embedding download, processing is much faster
- Select specific collections to process rather than entire archives
- Language: Models trained specifically on historical Dutch; performance on other languages or modern Dutch may vary
- Time period: Optimised for 17th-18th century texts; earlier or later periods may have reduced accuracy
- Text type: Best suited for administrative/legal documents similar to States General resolutions
- Entity types: Currently limited to PER and DAT; other types require additional models
- ATR quality: Poor transcription quality affects entity recognition accuracy
- Windows-specific: Launcher scripts use Windows batch format (Unix adaptation required)
- Python version: Limited to Python 3.8-3.11 due to dependency requirements
- No GPU support: Models run on CPU only (sufficient for typical use cases)
- In-place modification: Original PageXML files are modified directly (backup recommended)
- Backup originals: Keep untagged PageXML copies before processing (files are modified in place)
- Test on sample first: Process a small collection before running on entire archive
- Verify quality: Manually review a representative sample of tagged entities
- Sequential processing: Run entity types one at a time, not simultaneously
- Monitor output: Check console output for errors or unexpected results
- Configure Transkribus first: Set up entity tags in collection before importing
- Document decisions: Keep notes on false positives/negatives for future reference
- Update models: Check for updated Republic models periodically
- Batch processing: Use the interactive selection to process related collections together
The separate virtual environments exist because different Republic NER models were trained with incompatible embedding configurations. Isolated environments ensure all models function correctly without mutual interference.
Specific conflicts:
- Different Flair versions required
- Incompatible embedding types
- Conflicting PyTorch dependencies
The script automatically discovers collections by:
- Scanning the base directory for subdirectories
- Checking each subdirectory for a page/ folder
- Verifying the page/ folder contains XML files
- Presenting all valid collections for selection
This approach handles various Transkribus export configurations and allows for flexible archive organisation.
The script:
- Processes XML files sequentially within each collection
- Preserves all existing XML structure and attributes
- Adds custom attributes to TextLine elements
- Maintains namespace declarations
- Handles multiple entities per line
- Skips files with no detected entities (for cleaner output)
Transkribus custom attributes follow strict formatting:
key {param1:value1;param2:value2;}
No spaces around colons or semicolons. Multiple attributes separated by spaces:
readingOrder {index:0;} persoon {offset:10;length:8;} datum {offset:0;length:9;}
The script preserves existing custom attributes and appends new entity tags.
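For downstream analysis of tagged files, the custom attribute can be split back into its components with a small parser (a sketch; `parse_custom` is a hypothetical helper, and real-world attributes may contain more parameter types than shown here):

```python
import re

# One entry looks like: key {param1:value1;param2:value2;}
CUSTOM_ENTRY = re.compile(r"(\w+)\s*\{([^}]*)\}")

def parse_custom(custom):
    """Split a Transkribus custom attribute into (key, {param: value}) pairs."""
    entries = []
    for key, body in CUSTOM_ENTRY.findall(custom):
        params = dict(p.split(":", 1) for p in body.strip(";").split(";") if p)
        entries.append((key, params))
    return entries

print(parse_custom("readingOrder {index:0;} persoon {offset:10;length:8;}"))
# [('readingOrder', {'index': '0'}), ('persoon', {'offset': '10', 'length': '8'})]
```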
This tool is part of a suite for working with PageXML annotations:
- Tag Transfer Tool: Transfer entity tags between old and new PageXML versions
- Repository: Transfer_Tags_To_New_PageXML
- Label Normalisation Tool: Normalise structural region labels across collections
- Repository: PageXML_regionname_normalisation
Potential improvements:
- Additional entity types (ORG, LOC)
- Unix/Linux launcher scripts
- Confidence threshold configuration
- Parallel processing for faster throughput
- GUI interface
- Integration with Transkribus API
- Support for Python 3.12+
- Dry-run mode (preview without modification)
- Detailed entity reports (CSV export)
Contributions are welcome. Areas for improvement:
- Additional entity type models
- Cross-platform launcher scripts
- Performance optimisation
- Documentation improvements
- Test suite development
- Error handling enhancements
If you use these scripts or models for academic work, please cite:
This repository:
Entity Recognition Resolutions Overijssel (2025)
Developed as part of the HAICu Project
https://github.com/CARomein/Entity_Recognition_Resolutions
Republic NER models:
Koolen, M., van Veen, T., & Brouwer, M. (2022)
Republic NER Models for Historical Dutch
Huygens Institute, KNAW Humanities Cluster
https://huggingface.co/marijnkoolen
MIT License - See LICENSE file for details.
This tool was developed within the context of the HAICu project on the Resoluties van de Staten van Overijssel (Resolutions of the States of Overijssel), funded by the Dutch Research Council/Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Nationale Wetenschapsagenda [NWA.1518.22.105].
Development was assisted by Claude (Anthropic) for code implementation and documentation.
Special thanks to:
- Marijn Koolen (KNAW Humanities Cluster) for developing and sharing the Republic NER models
- The Republic project (Huygens Institute, Amsterdam) for pioneering NER on historical Dutch texts
- The Flair NLP framework team for the underlying technology
- v2.0.0 (2025-01): Major architectural revision
- Collection-based processing with automatic discovery
- Interactive collection selection (all/include/exclude)
- Simplified command-line interface (no model path required)
- In-place modification of PageXML files
- Improved progress reporting
- Cleaner output (skips files without entities)
- v1.0.0 (2025-01): Initial release
- PER and DAT entity recognition
- Transkribus integration
- Batch processing support