Region Calculator Tool

A Python tool for analysing PageXML collections and generating comprehensive statistics about text region labelling. This tool provides insights into annotation progress, label distribution, and dataset composition without modifying any source files.

Purpose

Historical document collections processed with PageXML require systematic annotation of text regions with structural labels (e.g., request, resolution, marginalia). During large-scale annotation projects, it becomes essential to monitor progress, identify unlabelled regions, and understand the composition of the dataset.

This tool addresses these needs by:

Quantifying annotation progress across the entire collection
Identifying files with incomplete labelling
Revealing the distribution of document types and structural elements
Supporting quality control by showing example regions for each label
Enabling comparative analysis across different project stages

The Region Calculator operates in a read-only manner, making it safe to run repeatedly throughout the annotation workflow.

Features

Automatic label detection: Identifies all structure types present in the collection, regardless of labelling scheme
Comprehensive statistics: Counts total regions, labelled regions, and unlabelled regions
Label distribution analysis: Shows frequency and percentage of each label type
Per-file breakdown: Generates detailed statistics for every PageXML file
Example regions: Displays sample text for each label type for quality verification
CSV export: Produces both summary and detailed statistics files for further analysis
Additional metrics: Calculates regions per file and label diversity, and identifies files requiring attention

Requirements

Python 3.6 or higher
pandas (for CSV export functionality)

Install dependencies:

pip install pandas

Installation

Download the script directly:

wget https://raw.githubusercontent.com/[YOUR_USERNAME]/pagexml-tools/main/region_calculator_tool.py

Or place the region_calculator_tool.py script in your working directory.

PageXML Folder Structure

The tool expects Transkribus-style export structures with PageXML files organised as follows:

pagexml_collection/
├── 0018/
│   └── page/
│       ├── 0001.xml
│       ├── 0002.xml
│       └── 0003.xml
├── 0019/
│   └── page/
│       ├── 0001.xml
│       └── 0002.xml
└── 0020/
    └── page/
        ├── 0001.xml
        ├── 0002.xml
        └── 0003.xml

Each archief (archive) directory contains a page/ subdirectory holding the individual PageXML files. The script recursively processes all .xml files in this structure.
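The layout above can be checked before running the tool. The following is a minimal sketch, not the script's own code: the function names find_pagexml_files and group_by_archief are hypothetical, and the glob pattern assumes exactly one archief level above page/.

```python
from pathlib import Path


def find_pagexml_files(root):
    """Return sorted PageXML paths matching <root>/<archief>/page/*.xml."""
    return sorted(Path(root).glob("*/page/*.xml"))


def group_by_archief(paths):
    """Map each archief directory name to the page file names found in it."""
    grouped = {}
    for path in paths:
        # path is <root>/<archief>/page/<file>.xml, so parents[1] is the archief directory
        grouped.setdefault(path.parents[1].name, []).append(path.name)
    return grouped
```

Calling group_by_archief(find_pagexml_files("pagexml_collection")) gives a quick overview of which archief folders contain how many pages.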

Usage

Basic Usage

To analyse a directory of PageXML files with default settings:

python region_calculator_tool.py <directory>

For example, if your PageXML files are in a subdirectory called pagexml_v3_with_continuations_positions:

python region_calculator_tool.py pagexml_v3_with_continuations_positions

Command-Line Options

--output: Specify the filename for the summary statistics CSV (default: region_statistics.csv)

python region_calculator_tool.py <directory> --output my_statistics.csv

--details: Specify the filename for the detailed per-file statistics CSV (default: region_details.csv)

python region_calculator_tool.py <directory> --details my_details.csv

--no-examples: Suppress the display of example regions in the terminal output

python region_calculator_tool.py <directory> --no-examples

Complete Example

python region_calculator_tool.py pagexml_v3_with_continuations_positions --output region_stats.csv --details file_details.csv
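For readers who want to wrap or extend the command-line interface, the options documented above could be declared with argparse roughly as follows. This is a sketch under assumptions: the function name build_parser and the help texts are hypothetical, and the actual script may define its arguments differently. Only the option names and defaults are taken from this README.

```python
import argparse


def build_parser():
    """Declare the documented options; names and defaults follow this README."""
    parser = argparse.ArgumentParser(
        description="Count labelled and unlabelled TextRegion elements in PageXML files."
    )
    parser.add_argument("directory", help="root directory containing the archief folders")
    parser.add_argument("--output", default="region_statistics.csv",
                        help="filename for the summary statistics CSV")
    parser.add_argument("--details", default="region_details.csv",
                        help="filename for the detailed per-file statistics CSV")
    parser.add_argument("--no-examples", action="store_true",
                        help="suppress example regions in the terminal output")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration only.
args = build_parser().parse_args(
    ["pagexml_v3_with_continuations_positions", "--output", "region_stats.csv"]
)
```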

Output

Terminal Output

The tool produces a comprehensive report in the terminal, including:

Total number of files processed and regions found
Overview of labelled versus unlabelled regions with percentages
Complete breakdown of region counts by label type
Example regions for each label type (up to 3 per type) with location and sample text
Statistical analysis:
- Distribution of regions per file (minimum, maximum, average)
- Files with the most unlabelled regions
- Label diversity metrics per file

Example output:

============================================================
SUMMARY STATISTICS
============================================================
Total files processed: 342
Total regions found: 5247

Labeling overview:
  Labeled regions: 3456 (65.9%)
  Unlabeled regions: 1791 (34.1%)

============================================================
BREAKDOWN BY LABEL TYPE
============================================================
  prop_request_rekest           : 1523 ( 29.0%)
  attendance_list               :  876 ( 16.7%)
  resolution                    :  654 ( 12.5%)
  marginalia                    :  403 (  7.7%)
  unlabeled                     : 1791 ( 34.1%)
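The percentages in the example output are each count divided by the total number of regions. The sketch below recomputes them from the counts shown above. The helper names percentage and format_row are hypothetical, and the column widths are inferred from the example output rather than taken from the script.

```python
def percentage(count, total):
    """Share of count in total as a percentage; 0.0 for an empty collection."""
    return 100.0 * count / total if total else 0.0


def format_row(label, count, total):
    """Render one breakdown row in a layout resembling the example output."""
    return f"  {label:<30}: {count:>4} ({percentage(count, total):>5.1f}%)"


# Counts copied from the example output above.
counts = {
    "prop_request_rekest": 1523,
    "attendance_list": 876,
    "resolution": 654,
    "marginalia": 403,
    "unlabeled": 1791,
}
total = sum(counts.values())  # 5247, matching "Total regions found"

for label, count in counts.items():
    print(format_row(label, count, total))
```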

CSV Output Files

The tool generates two CSV files for further analysis:

Summary statistics file (default: region_statistics.csv)

One row per label type
Columns: label_type, count, percentage
Provides high-level overview of label distribution

Detailed statistics file (default: region_details.csv)

One row per PageXML file
Columns: archief, page, total_regions, labeled, unlabeled, label_[type] (one per label found)
Enables file-by-file analysis and identification of files requiring further annotation

How It Works

Label Detection

The tool automatically detects all structure labels present in the PageXML files by searching for the pattern structure {type:...;} in the custom attribute of TextRegion elements. This makes the tool independent of the labelling scheme: any label written in this format is detected and counted.

Whether your labels are prop_request_rekest, attendance_list, marginalia, or any other custom type, the tool will detect and count them appropriately. Regions without a structure type in their custom attribute are classified as unlabeled.
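As an illustration, label detection can be sketched with the regular expression quoted under Technical Details. The function name extract_label is hypothetical. Returning the captured text without stripping whitespace is an assumption, chosen because it shows why stray spaces can produce distinct labels.

```python
import re

# Pattern quoted in the Technical Details section of this README.
STRUCTURE_TYPE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")


def extract_label(custom):
    """Return the structure type from a custom attribute, or 'unlabeled'."""
    if not custom:
        return "unlabeled"
    match = STRUCTURE_TYPE.search(custom)
    # Assumption: the captured text is used as-is, so whitespace is preserved.
    return match.group(1) if match else "unlabeled"


print(extract_label("readingOrder {index:0;} structure {type:marginalia;}"))
print(extract_label("readingOrder {index:0;}"))
```

The first call prints marginalia; the second prints unlabeled because the attribute has no structure entry.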

Processing Workflow

Directory scanning: Recursively finds all XML files that match the [archief]/page/*.xml structure
XML parsing: Processes each file using ElementTree
Region extraction: Finds all TextRegion elements
Label identification: Extracts structure type from custom attributes using regular expressions
Statistics accumulation: Counts occurrences of each label type
Example collection: Stores sample regions for each label (first 3 encountered)
Report generation: Produces terminal output and CSV files

The tool processes files sequentially, keeping memory usage minimal even for large collections.
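The parsing, extraction, and counting steps above can be sketched as follows. This is a simplified illustration rather than the script itself: count_labels and count_collection are hypothetical names, and the error message only imitates the format shown under Troubleshooting.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}
STRUCTURE_TYPE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")


def count_labels(root):
    """Count structure labels on all TextRegion elements below root."""
    counts = Counter()
    for region in root.iterfind(".//pc:TextRegion", PAGE_NS):
        match = STRUCTURE_TYPE.search(region.get("custom", ""))
        counts[match.group(1) if match else "unlabeled"] += 1
    return counts


def count_collection(paths):
    """Process files one at a time so only a single parsed tree is held in memory."""
    totals = Counter()
    for path in paths:
        try:
            totals.update(count_labels(ET.parse(path).getroot()))
        except ET.ParseError as exc:
            print(f"Error processing {path}: {exc}")
    return totals
```

Because each tree is discarded before the next file is parsed, memory use depends on the largest single file rather than on the size of the collection.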

Use Cases

Progress Tracking

Monitor annotation progress throughout a project by running the tool periodically. The percentage of labelled regions provides a clear metric for completion status.

Quality Control

Identify files with high numbers of unlabelled regions that may require attention. The detailed CSV output allows sorting by unlabelled count to prioritise annotation efforts.
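Sorting the detailed CSV can also be done without a spreadsheet. The sketch below uses only the standard library and assumes the archief, page, and unlabeled columns described under CSV Output Files. The function name files_by_unlabeled is hypothetical.

```python
import csv


def files_by_unlabeled(details_csv, top=10):
    """Return (archief, page, unlabeled) tuples, most unlabelled regions first."""
    with open(details_csv, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    # int(float(...)) tolerates counts written as "3" or "3.0".
    rows.sort(key=lambda row: int(float(row["unlabeled"])), reverse=True)
    return [
        (row["archief"], row["page"], int(float(row["unlabeled"])))
        for row in rows[:top]
    ]
```

For example, files_by_unlabeled("region_details.csv", top=20) lists the twenty files with the most unlabelled regions.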

Dataset Analysis

Understand the composition of your collection by examining the distribution of document types or structural elements. This information supports analysis planning and can reveal patterns in the historical material.

Comparative Analysis

Run the tool at different stages of the project to track changes in label distribution. Compare statistics across different PageXML collections to identify differences in composition or annotation approaches.
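Two summary CSVs from different runs can be compared with a few lines of standard-library Python. This sketch assumes the label_type and count columns described under CSV Output Files. The function names load_summary and compare_summaries and the example file names are hypothetical.

```python
import csv


def load_summary(path):
    """Read a summary CSV into a {label_type: count} dictionary."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {
            row["label_type"]: int(float(row["count"]))
            for row in csv.DictReader(handle)
        }


def compare_summaries(earlier, later):
    """Return the change in count per label; labels missing from one run count as 0."""
    labels = sorted(set(earlier) | set(later))
    return {label: later.get(label, 0) - earlier.get(label, 0) for label in labels}
```

A call such as compare_summaries(load_summary("stats_stage1.csv"), load_summary("stats_stage2.csv")) shows which labels grew between two runs and by how much.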

Verification

Use the example regions output to verify that labels are being applied correctly. Sample text for each label type enables quick quality checks without opening individual files.

Troubleshooting

No files found

Problem: "Total files to process: 0"

Solutions:

Verify folder structure matches [archief]/page/*.xml pattern
Check that files have .xml extension (not .XML or other variations)
Ensure you provided the root directory containing archief folders, not the page/ directory itself
Verify file permissions allow reading
Check that the directory path is correct (use absolute path if relative path fails)

No regions found

Problem: "Total regions found: 0"

Solutions:

Verify files conform to PAGE 2013-07-15 schema
Check XML namespace declarations
Ensure files contain TextRegion elements
Verify files are valid XML (not corrupted)
Test with a single file first to isolate the problem
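The namespace and TextRegion checks above can be automated for a single file. The following is a diagnostic sketch with hypothetical function names (root_namespace, diagnose). It reports the namespace of the root element and counts TextRegion elements in any namespace, which helps to tell a namespace mismatch apart from a file that has no regions.

```python
import xml.etree.ElementTree as ET

EXPECTED_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def root_namespace(root):
    """Return the namespace URI of the root element, or '' if it has none."""
    tag = root.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def diagnose(root):
    """Summarise why a file might yield zero regions."""
    namespace = root_namespace(root)
    regions = sum(
        1 for el in root.iter()
        if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == "TextRegion"
    )
    return {
        "namespace": namespace,
        "namespace_matches": namespace == EXPECTED_NS,
        "text_regions_any_namespace": regions,
    }
```

Running diagnose(ET.parse("0018/page/0001.xml").getroot()) on one file shows whether regions exist but sit in a different PAGE namespace version.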

All regions show as unlabeled

Problem: Every region is classified as unlabeled despite having labels

Solutions:

Verify labels follow the format structure {type:label_name;}
Check for extra spaces or formatting differences in custom attributes
Ensure labels are on TextRegion elements, not other elements
Examine a sample file manually to confirm label format
Check that custom attribute exists and is populated

Processing errors

Problem: "Error processing [file]: [message]"

Solutions:

Check that the specific file is valid XML
Verify file encoding (should be UTF-8)
Ensure file is not corrupted
Check file permissions
Try processing a smaller subset to identify problematic files

CSV files not created

Problem: Terminal output appears but no CSV files are generated

Solutions:

Verify pandas is installed (pip install pandas)
Check write permissions in the output directory
Ensure sufficient disk space
Look for error messages in terminal output
Try specifying absolute paths for output files

Unexpected label counts

Problem: Label counts differ from manual inspection

Solutions:

Check for labels with slight spelling variations
Look for extra whitespace in label names
Verify that all archief directories are being processed
Re-run with examples enabled to inspect sample regions
Export detailed CSV and sort by specific label to investigate

Performance

Processing speed: ~100-300 files per second (depending on file size and region count)
Memory usage: Minimal (processes one file at a time)
Scalability: Tested with collections of 1,000+ files

Typical processing times:

100 files: <2 seconds
1,000 files: ~5-15 seconds
10,000 files: ~1-2 minutes

Performance depends on:

Number of regions per file
Complexity of XML structure
Disk I/O speed
System specifications

Limitations

Read-only analysis: Does not modify files or assist with labelling (by design)
Structure format dependency: Assumes structure {type:...;} format used by Transkribus
Namespace requirement: Expects PAGE 2013-07-15 namespace
Folder structure dependency: Requires [archief]/page/*.xml organisation
Case-sensitive matching: Label names are case-sensitive
No label validation: Does not verify that labels conform to any schema
Single structure type: Assumes each region has at most one structure type

Best Practices

Run regularly: Execute the tool at regular intervals throughout the annotation project to track progress
Export both CSVs: Keep both summary and detailed statistics for comprehensive documentation
Review examples: Always examine example regions to verify label quality
Sort detailed CSV: Use spreadsheet software to sort by unlabeled count and prioritise files
Track over time: Maintain statistics from different project stages for progress documentation
Combine with visualisation: Import CSV data into data analysis tools for visual representations
Document findings: Record observations about label distribution in project documentation
Use for planning: Let statistics inform decisions about resource allocation and annotation priorities

Technical Details

The tool processes PageXML files conforming to the PAGE namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15. It extracts structure type information by applying the regular expression structure\s*\{[^}]*type:([^;}]+) to the custom attribute of TextRegion elements.

The tool expects a directory structure where PageXML files are located in subdirectories following the pattern <archief_dir>/page/*.xml. All file processing is performed in read-only mode with no modifications made to source files.

For example region collection, the tool extracts the first two lines of text from each region using the TextLine/TextEquiv/Unicode path in the XML structure.
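Example text extraction along the TextLine/TextEquiv/Unicode path can be sketched as follows. The function name example_text is hypothetical, and joining the lines with a space and skipping empty lines are assumptions. The script may format its examples differently.

```python
import xml.etree.ElementTree as ET

PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def example_text(region, max_lines=2):
    """Return the text of the first max_lines non-empty lines of a TextRegion."""
    lines = []
    for unicode_el in region.iterfind("pc:TextLine/pc:TextEquiv/pc:Unicode", PAGE_NS):
        if unicode_el.text and unicode_el.text.strip():
            lines.append(unicode_el.text.strip())
        if len(lines) == max_lines:
            break
    # Assumption: lines are joined with a single space for display.
    return " ".join(lines)
```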

Notes

The script expects PageXML files in the structure: [archief_nummer]/page/[pagina_nummer].xml (archive number and page number)
Only TextRegion elements are analysed
The tool is safe to run multiple times as it does not modify any files
Original file structure and content remain completely unchanged

Licence

This project is licensed under the MIT License.

Contact

Email: c.a.romein@utwente.nl

Acknowledgements

This Region Calculator Tool was developed within the context of the HAICu project on the Resoluties van de Staten van Overijssel (Resolutions of the States of Overijssel), funded by the Dutch Research Council/Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Nationale Wetenschapsagenda [NWA.1518.22.105].

Development was assisted by Claude (Anthropic) for code implementation and documentation.

Version History

Version 1.0 (2025): Initial release with comprehensive region statistics, automatic label detection, and dual CSV export functionality.
