A Python tool for analysing PageXML collections and generating comprehensive statistics about text region labelling. This tool provides insights into annotation progress, label distribution, and dataset composition without modifying any source files.
Historical document collections processed with PageXML require systematic annotation of text regions with structural labels (e.g., request, resolution, marginalia). During large-scale annotation projects, it becomes essential to monitor progress, identify unlabelled regions, and understand the composition of the dataset.
This tool addresses these needs by:
- Quantifying annotation progress across the entire collection
- Identifying files with incomplete labelling
- Revealing the distribution of document types and structural elements
- Providing quality control through example regions
- Enabling comparative analysis across different project stages
The Region Calculator operates in a read-only manner, making it safe to run repeatedly throughout the annotation workflow.
- Automatic label detection: Identifies all structure types present in the collection, regardless of labelling scheme
- Comprehensive statistics: Counts total regions, labelled regions, and unlabelled regions
- Label distribution analysis: Shows frequency and percentage of each label type
- Per-file breakdown: Generates detailed statistics for every PageXML file
- Example regions: Displays sample text for each label type for quality verification
- CSV export: Produces both summary and detailed statistics files for further analysis
- Additional metrics: Calculates regions per file, label diversity, and identifies files requiring attention
- Python 3.6 or higher
- pandas (for CSV export functionality)
Install dependencies:
pip install pandas
Download the script directly:
wget https://raw.githubusercontent.com/[YOUR_USERNAME]/pagexml-tools/main/region_calculator_tool.py
Or place the region_calculator_tool.py script in your working directory.
The tool expects Transkribus-style export structures with PageXML files organised as follows:
pagexml_collection/
├── 0018/
│   └── page/
│       ├── 0001.xml
│       ├── 0002.xml
│       └── 0003.xml
├── 0019/
│   └── page/
│       ├── 0001.xml
│       └── 0002.xml
└── 0020/
    └── page/
        ├── 0001.xml
        ├── 0002.xml
        └── 0003.xml
Each archief directory contains a page/ subdirectory with individual PageXML files. The script recursively processes all .xml files in this structure.
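The layout above can be matched with a single glob pattern. The following is a minimal sketch of such a file discovery step, not the tool's actual code; the function name `find_pagexml_files` is an assumption made for illustration.

```python
from pathlib import Path


def find_pagexml_files(root):
    """Return sorted paths matching <root>/<archief>/page/*.xml.

    Hypothetical helper: it mirrors the documented folder layout,
    but is not taken from region_calculator_tool.py.
    """
    return sorted(Path(root).glob("*/page/*.xml"))


# Example call (directory name taken from the layout above):
# for path in find_pagexml_files("pagexml_collection"):
#     print(path)
```

Running a check like this before the tool itself is a quick way to confirm that the directory passed on the command line is the collection root and not a page/ subdirectory.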
To analyse a directory of PageXML files with default settings:
python region_calculator_tool.py <directory>
For example, if your PageXML files are in a subdirectory called pagexml_v3_with_continuations_positions:
python region_calculator_tool.py pagexml_v3_with_continuations_positions
--output: Specify the filename for the summary statistics CSV (default: region_statistics.csv)
python region_calculator_tool.py <directory> --output my_statistics.csv
--details: Specify the filename for the detailed per-file statistics CSV (default: region_details.csv)
python region_calculator_tool.py <directory> --details my_details.csv
--no-examples: Suppress the display of example regions in the terminal output
python region_calculator_tool.py <directory> --no-examples
Options can be combined:
python region_calculator_tool.py pagexml_v3_with_continuations_positions --output region_stats.csv --details file_details.csv
The tool produces a comprehensive report in the terminal, including:
- Total number of files processed and regions found
- Overview of labelled versus unlabelled regions with percentages
- Complete breakdown of region counts by label type
- Example regions for each label type (up to 3 per type) with location and sample text
- Statistical analysis:
  - Distribution of regions per file (minimum, maximum, average)
  - Files with the most unlabelled regions
  - Label diversity metrics per file
Example output:
============================================================
SUMMARY STATISTICS
============================================================
Total files processed: 342
Total regions found: 5247
Labeling overview:
Labeled regions: 3456 (65.9%)
Unlabeled regions: 1791 (34.1%)
============================================================
BREAKDOWN BY LABEL TYPE
============================================================
prop_request_rekest : 1523 ( 29.0%)
attendance_list : 876 ( 16.7%)
resolution : 654 ( 12.5%)
marginalia : 403 ( 7.7%)
unlabeled : 1791 ( 34.1%)
The tool generates two CSV files for further analysis:
Summary statistics file (default: region_statistics.csv)
- One row per label type
- Columns: label_type, count, percentage
- Provides high-level overview of label distribution
Detailed statistics file (default: region_details.csv)
- One row per PageXML file
- Columns: archief, page, total_regions, labeled, unlabeled, label_[type] (one per label found)
- Enables file-by-file analysis and identification of files requiring further annotation
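As an illustration of the file-by-file analysis, the sketch below reads the detailed CSV with the standard library and lists the files with the most unlabelled regions. It assumes the column names given above (archief, page, unlabeled) and that counts are written as plain integers; the function name `files_by_unlabeled` is hypothetical.

```python
import csv


def files_by_unlabeled(details_csv, top=10):
    """Return the rows with the most unlabeled regions, highest first.

    Hypothetical helper; assumes an 'unlabeled' column holding integers.
    """
    with open(details_csv, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    # CSV values are read as strings, so convert before sorting numerically.
    rows.sort(key=lambda row: int(row["unlabeled"]), reverse=True)
    return rows[:top]


# Example call (default output filename of the tool):
# for row in files_by_unlabeled("region_details.csv", top=5):
#     print(row["archief"], row["page"], row["unlabeled"])
```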
The tool automatically detects all structure labels present in the PageXML files by searching for the pattern structure {type:...;} in the custom attributes of TextRegion elements. This makes the tool independent of the label vocabulary: it works with any labelling scheme, provided the labels are stored in this format.
Whether your labels are prop_request_rekest, attendance_list, marginalia, or any other custom type, the tool will detect and count them appropriately. Regions without a structure type in their custom attribute are classified as unlabeled.
- Directory scanning: Recursively identifies all XML files in the [archief]/page/*.xml structure
- XML parsing: Processes each file using ElementTree
- Region extraction: Finds all TextRegion elements
- Label identification: Extracts structure type from custom attributes using regular expressions
- Statistics accumulation: Counts occurrences of each label type
- Example collection: Stores sample regions for each label (first 3 encountered)
- Report generation: Produces terminal output and CSV files
The tool processes files sequentially, keeping memory usage minimal even for large collections.
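The parsing, extraction, and counting steps above can be sketched for a single document as follows. This is an illustration under stated assumptions, not the tool's source: the function name `count_labels` and the sample XML are invented, while the namespace and the regular expression are the ones documented for the tool.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}
STRUCTURE_TYPE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")


def count_labels(xml_text):
    """Count structure types over all TextRegion elements of one document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for region in root.iterfind(".//pc:TextRegion", PAGE_NS):
        match = STRUCTURE_TYPE.search(region.get("custom", ""))
        # Regions without a structure type are counted as "unlabeled".
        counts[match.group(1) if match else "unlabeled"] += 1
    return counts


# Invented sample document, reduced to the attributes that matter here.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1" custom="readingOrder {index:0;} structure {type:resolution;}"/>
    <TextRegion id="r2" custom="structure {type:marginalia;}"/>
    <TextRegion id="r3" custom="readingOrder {index:2;}"/>
  </Page>
</PcGts>"""

totals = Counter()
totals.update(count_labels(SAMPLE))  # in practice: one call per file
print(dict(totals))
```

Accumulating per-file counters into one running total means only one parsed document is held in memory at a time, which matches the sequential processing described above.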
Monitor annotation progress throughout a project by running the tool periodically. The percentage of labelled regions provides a clear metric for completion status.
Identify files with high numbers of unlabelled regions that may require attention. The detailed CSV output allows sorting by unlabelled count to prioritise annotation efforts.
Understand the composition of your collection by examining the distribution of document types or structural elements. This information supports analysis planning and can reveal patterns in the historical material.
Run the tool at different stages of the project to track changes in label distribution. Compare statistics across different PageXML collections to identify differences in composition or annotation approaches.
Use the example regions output to verify that labels are being applied correctly. Sample text for each label type enables quick quality checks without opening individual files.
Problem: "Total files to process: 0"
Solutions:
- Verify folder structure matches the [archief]/page/*.xml pattern
- Check that files have the .xml extension (not .XML or other variations)
- Ensure you provided the root directory containing archief folders, not the page/ directory itself
- Verify file permissions allow reading
- Check that the directory path is correct (use absolute path if relative path fails)
Problem: "Total regions found: 0"
Solutions:
- Verify files conform to PAGE 2013-07-15 schema
- Check XML namespace declarations
- Ensure files contain TextRegion elements
- Verify files are valid XML (not corrupted)
- Test with a single file first to isolate the problem
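A namespace mismatch is a common reason for finding no regions, and it can be checked on a single file. The sketch below reads the namespace from the root element; the helper names are hypothetical, and only the expected namespace URI comes from this document.

```python
import xml.etree.ElementTree as ET

EXPECTED_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def root_namespace(xml_data):
    """Return the namespace URI of the root element, or '' if it has none."""
    tag = ET.fromstring(xml_data).tag  # e.g. '{namespace-uri}PcGts'
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def uses_expected_namespace(xml_data):
    """True when the root element is in the PAGE 2013-07-15 namespace."""
    return root_namespace(xml_data) == EXPECTED_NS


# Example call on one file (path is illustrative):
# from pathlib import Path
# print(root_namespace(Path("0018/page/0001.xml").read_bytes()))
```

If the printed URI carries a different schema date, the files are valid PageXML but fall outside the namespace the tool expects.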
Problem: Every region is classified as unlabeled despite having labels
Solutions:
- Verify labels follow the format structure {type:label_name;}
- Check for extra spaces or formatting differences in custom attributes
- Ensure labels are on TextRegion elements, not other elements
- Examine a sample file manually to confirm label format
- Check that custom attribute exists and is populated
Problem: "Error processing [file]: [message]"
Solutions:
- Check that the specific file is valid XML
- Verify file encoding (should be UTF-8)
- Ensure file is not corrupted
- Check file permissions
- Try processing a smaller subset to identify problematic files
Problem: Terminal output appears but no CSV files generated
Solutions:
- Verify pandas is installed (pip install pandas)
- Check write permissions in the output directory
- Ensure sufficient disk space
- Look for error messages in terminal output
- Try specifying absolute paths for output files
Problem: Label counts differ from manual inspection
Solutions:
- Check for labels with slight spelling variations
- Look for extra whitespace in label names
- Verify that all archief directories are being processed
- Re-run with examples enabled to inspect sample regions
- Export detailed CSV and sort by specific label to investigate
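Labels that differ only in case or surrounding whitespace can be found mechanically. The sketch below groups label names by a normalised form; it is a hypothetical helper, it does not catch genuine misspellings, and the variant spellings in the example are invented.

```python
from collections import defaultdict


def find_label_variants(label_counts):
    """Group labels that differ only in case or surrounding whitespace."""
    groups = defaultdict(dict)
    for label, count in label_counts.items():
        groups[label.strip().lower()][label] = count
    # Keep only the normalised forms that occur in more than one spelling.
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Invented counts: "Marginalia" and "resolution " are hypothetical variants.
counts = {"marginalia": 403, "Marginalia": 2, "resolution": 654, "resolution ": 1}
print(find_label_variants(counts))
```

Because label matching is case-sensitive, each variant reported this way is counted as a separate label type in the statistics.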
- Processing speed: ~100-300 files per second (depending on file size and region count)
- Memory usage: Minimal (processes one file at a time)
- Scalability: Tested with collections of 1000+ files
Typical processing times:
- 100 files: <2 seconds
- 1,000 files: ~5-15 seconds
- 10,000 files: ~1-2 minutes
Performance depends on:
- Number of regions per file
- Complexity of XML structure
- Disk I/O speed
- System specifications
- Read-only analysis: Does not modify files or assist with labelling (by design)
- Structure format dependency: Assumes the structure {type:...;} format used by Transkribus
- Namespace requirement: Expects PAGE 2013-07-15 namespace
- Folder structure dependency: Requires [archief]/page/*.xml organisation
- Case-sensitive matching: Label names are case-sensitive
- No label validation: Does not verify that labels conform to any schema
- Single structure type: Assumes each region has at most one structure type
- Run regularly: Execute the tool at regular intervals throughout the annotation project to track progress
- Export both CSVs: Keep both summary and detailed statistics for comprehensive documentation
- Review examples: Always examine example regions to verify label quality
- Sort detailed CSV: Use spreadsheet software to sort by unlabeled count and prioritise files
- Track over time: Maintain statistics from different project stages for progress documentation
- Combine with visualisation: Import CSV data into data analysis tools for visual representations
- Document findings: Record observations about label distribution in project documentation
- Use for planning: Let statistics inform decisions about resource allocation and annotation priorities
The tool processes PageXML files conforming to the PAGE namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15. It extracts structure type information from the custom attribute of TextRegion elements using regular expressions that match the pattern structure\s*\{[^}]*type:([^;}]+).
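The behaviour of this pattern on a few custom attribute values can be checked directly. The sketch below applies the regular expression exactly as quoted; the wrapper function `structure_type` and the input strings are assumptions made for illustration, including the made-up `id` key in the second example.

```python
import re

STRUCTURE_TYPE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")


def structure_type(custom):
    """Return the captured structure type, or None when nothing matches."""
    match = STRUCTURE_TYPE.search(custom or "")
    return match.group(1) if match else None


# Plain case.
print(repr(structure_type("structure {type:resolution;}")))    # 'resolution'
# Other entries and keys before "type" are skipped ("id" is a made-up key).
print(repr(structure_type("readingOrder {index:3;} structure {id:s1; type:marginalia;}")))  # 'marginalia'
# A space after the colon becomes part of the captured label.
print(repr(structure_type("structure {type: marginalia;}")))   # ' marginalia'
# No structure entry at all.
print(repr(structure_type("readingOrder {index:3;}")))         # None
```

The third case shows why stray whitespace in a custom attribute can produce what looks like a duplicate label: the capture group takes every character up to the next semicolon or closing brace.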
The tool expects a directory structure where PageXML files are located in subdirectories following the pattern <archief_dir>/page/*.xml. All file processing is performed in read-only mode with no modifications made to source files.
For example region collection, the tool extracts the first two lines of text from each region using the TextLine/TextEquiv/Unicode path in the XML structure.
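Extracting the sample text can be sketched with ElementTree and the path named above. The function name `sample_text`, the " / " separator, and the sample region are assumptions; only the element path and the two-line limit come from this document.

```python
import xml.etree.ElementTree as ET

PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def sample_text(region, max_lines=2):
    """Join the text of the first `max_lines` non-empty lines of a region."""
    lines = []
    for unicode_el in region.iterfind("pc:TextLine/pc:TextEquiv/pc:Unicode", PAGE_NS):
        if unicode_el.text and unicode_el.text.strip():
            lines.append(unicode_el.text.strip())
        if len(lines) == max_lines:
            break
    return " / ".join(lines)  # separator chosen for this sketch


# Invented sample region with three lines of text.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine><TextEquiv><Unicode>First line of text</Unicode></TextEquiv></TextLine>
      <TextLine><TextEquiv><Unicode>Second line</Unicode></TextEquiv></TextLine>
      <TextLine><TextEquiv><Unicode>Third line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

region = ET.fromstring(SAMPLE).find(".//pc:TextRegion", PAGE_NS)
print(sample_text(region))
```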
- The script expects PageXML files in the structure: [archief_nummer]/page/[pagina_nummer].xml
- Only TextRegion elements are analysed
- The tool is safe to run multiple times as it does not modify any files
- Original file structure and content remain completely unchanged
This project is licensed under the MIT License.
- Email: c.a.romein@utwente.nl
This Region Calculator Tool was developed within the context of the HAICu project on the Resoluties van de Staten van Overijssel (Resolutions of the States of Overijssel), funded by the Dutch Research Council/Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Nationale Wetenschapsagenda [NWA.1518.22.105].
Development was assisted by Claude (Anthropic) for code implementation and documentation.
Version 1.0 (2025): Initial release with comprehensive region statistics, automatic label detection, and dual CSV export functionality.