CARomein/PageXML_RegionCalculator

Region Calculator Tool

A Python tool for analysing PageXML collections and generating comprehensive statistics about text region labelling. This tool provides insights into annotation progress, label distribution, and dataset composition without modifying any source files.

Purpose

Historical document collections processed with PageXML require systematic annotation of text regions with structural labels (e.g., request, resolution, marginalia). During large-scale annotation projects, it becomes essential to monitor progress, identify unlabelled regions, and understand the composition of the dataset.

This tool addresses these needs by:

  • Quantifying annotation progress across the entire collection
  • Identifying files with incomplete labelling
  • Revealing the distribution of document types and structural elements
  • Providing quality control through example regions
  • Enabling comparative analysis across different project stages

The Region Calculator operates in a read-only manner, making it safe to run repeatedly throughout the annotation workflow.

Features

  • Automatic label detection: Identifies all structure types present in the collection, regardless of labelling scheme
  • Comprehensive statistics: Counts total regions, labelled regions, and unlabelled regions
  • Label distribution analysis: Shows frequency and percentage of each label type
  • Per-file breakdown: Generates detailed statistics for every PageXML file
  • Example regions: Displays sample text for each label type for quality verification
  • CSV export: Produces both summary and detailed statistics files for further analysis
  • Additional metrics: Calculates regions per file, label diversity, and identifies files requiring attention

Requirements

  • Python 3.6 or higher
  • pandas (for CSV export functionality)

Install dependencies:

pip install pandas

Installation

Download the script directly:

wget https://raw.githubusercontent.com/[YOUR_USERNAME]/pagexml-tools/main/region_calculator_tool.py

Or place the region_calculator_tool.py script in your working directory.

PageXML Folder Structure

The tool expects Transkribus-style export structures with PageXML files organised as follows:

pagexml_collection/
├── 0018/
│   └── page/
│       ├── 0001.xml
│       ├── 0002.xml
│       └── 0003.xml
├── 0019/
│   └── page/
│       ├── 0001.xml
│       └── 0002.xml
└── 0020/
    └── page/
        ├── 0001.xml
        ├── 0002.xml
        └── 0003.xml

Each archief (Dutch for "archive") directory contains a page/ subdirectory with individual PageXML files. The script recursively processes all .xml files in this structure.
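The layout above can be traversed with a single glob pattern. The following is a minimal sketch, not the script's actual code; the function name `find_pagexml_files` is hypothetical and introduced only for illustration.

```python
from pathlib import Path


def find_pagexml_files(root):
    """Return sorted .xml paths matching <root>/<archief>/page/*.xml."""
    # One wildcard level for the archief folder, then the fixed page/ folder.
    return sorted(Path(root).glob("*/page/*.xml"))
```

Calling `find_pagexml_files("pagexml_collection")` on the tree shown above would yield paths such as `pagexml_collection/0018/page/0001.xml`; `path.parent.parent.name` gives the archief folder and `path.stem` the page number.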

Usage

Basic Usage

To analyse a directory of PageXML files with default settings:

python region_calculator_tool.py <directory>

For example, if your PageXML files are in a subdirectory called pagexml_v3_with_continuations_positions:

python region_calculator_tool.py pagexml_v3_with_continuations_positions

Command-Line Options

--output: Specify the filename for the summary statistics CSV (default: region_statistics.csv)

python region_calculator_tool.py <directory> --output my_statistics.csv

--details: Specify the filename for the detailed per-file statistics CSV (default: region_details.csv)

python region_calculator_tool.py <directory> --details my_details.csv

--no-examples: Suppress the display of example regions in the terminal output

python region_calculator_tool.py <directory> --no-examples

Complete Example

python region_calculator_tool.py pagexml_v3_with_continuations_positions --output region_stats.csv --details file_details.csv

Output

Terminal Output

The tool produces a comprehensive report in the terminal, including:

  • Total number of files processed and regions found
  • Overview of labelled versus unlabelled regions with percentages
  • Complete breakdown of region counts by label type
  • Example regions for each label type (up to 3 per type) with location and sample text
  • Statistical analysis:
    • Distribution of regions per file (minimum, maximum, average)
    • Files with the most unlabelled regions
    • Label diversity metrics per file

Example output:

============================================================
SUMMARY STATISTICS
============================================================
Total files processed: 342
Total regions found: 5247

Labeling overview:
  Labeled regions: 3456 (65.9%)
  Unlabeled regions: 1791 (34.1%)

============================================================
BREAKDOWN BY LABEL TYPE
============================================================
  prop_request_rekest           : 1523 ( 29.0%)
  attendance_list               :  876 ( 16.7%)
  resolution                    :  654 ( 12.5%)
  marginalia                    :  403 (  7.7%)
  unlabeled                     : 1791 ( 34.1%)
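The percentages in this report are each count divided by the total number of regions. As a sketch of that arithmetic using the figures from the example output (the `percentage` helper and the format string are assumptions, not taken from the script):

```python
# Counts copied from the example output above.
counts = {
    "prop_request_rekest": 1523,
    "attendance_list": 876,
    "resolution": 654,
    "marginalia": 403,
    "unlabeled": 1791,
}
total = sum(counts.values())
labeled = total - counts["unlabeled"]


def percentage(count, total):
    """Share of `count` in `total`, rounded to one decimal place."""
    return round(100 * count / total, 1) if total else 0.0


for label, count in counts.items():
    # Hypothetical layout: left-aligned label, right-aligned count and share.
    print(f"  {label:<30}: {count:>4} ({percentage(count, total):5.1f}%)")
```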

CSV Output Files

The tool generates two CSV files for further analysis:

Summary statistics file (default: region_statistics.csv)

  • One row per label type
  • Columns: label_type, count, percentage
  • Provides high-level overview of label distribution

Detailed statistics file (default: region_details.csv)

  • One row per PageXML file
  • Columns: archief, page, total_regions, labeled, unlabeled, label_[type] (one per label found)
  • Enables file-by-file analysis and identification of files requiring further annotation
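The detailed file can also be inspected without a spreadsheet. This sketch uses only the standard library and assumes the column names listed above (`archief`, `page`, `unlabeled`); the function name `files_needing_attention` is hypothetical.

```python
import csv


def files_needing_attention(details_csv, top_n=10):
    """Return (archief, page, unlabeled) for the files with most unlabeled regions."""
    with open(details_csv, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    # int(float(...)) tolerates counts that were written as "3.0".
    rows.sort(key=lambda row: int(float(row["unlabeled"])), reverse=True)
    return [
        (row["archief"], row["page"], int(float(row["unlabeled"])))
        for row in rows[:top_n]
    ]
```

For example, `files_needing_attention("region_details.csv", top_n=5)` would list the five files that most need further annotation.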

How It Works

Label Detection

The tool automatically detects all structure labels present in the PageXML files by searching for the pattern structure {type:...} in the custom attributes of TextRegion elements. This makes the tool independent of the labelling scheme: it works with any set of label names, provided the collection meets the format assumptions listed under Limitations.

Whether your labels are prop_request_rekest, attendance_list, marginalia, or any other custom type, the tool will detect and count them appropriately. Regions without a structure type in their custom attribute are classified as unlabeled.
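To illustrate the detection step, here is a small sketch that applies the regular expression quoted under Technical Details to a custom attribute value. The function name `extract_label` and the whitespace stripping are assumptions made for this example; the script itself may handle these details differently.

```python
import re

# Pattern as quoted in the Technical Details section.
STRUCTURE_TYPE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")


def extract_label(custom):
    """Return the structure type in a custom attribute, or 'unlabeled'."""
    if not custom:
        return "unlabeled"
    match = STRUCTURE_TYPE.search(custom)
    # Stripping whitespace is an assumption of this sketch.
    return match.group(1).strip() if match else "unlabeled"


print(extract_label("readingOrder {index:0;} structure {type:marginalia;}"))
print(extract_label("readingOrder {index:1;}"))
```

The first call finds a structure type; the second has only a reading order, so the region counts as unlabeled.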

Processing Workflow

  1. Directory scanning: Recursively identifies all XML files in the [archief]/page/*.xml structure
  2. XML parsing: Processes each file using ElementTree
  3. Region extraction: Finds all TextRegion elements
  4. Label identification: Extracts structure type from custom attributes using regular expressions
  5. Statistics accumulation: Counts occurrences of each label type
  6. Example collection: Stores sample regions for each label (first 3 encountered)
  7. Report generation: Produces terminal output and CSV files

The tool processes files sequentially, keeping memory usage minimal even for large collections.
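Steps 2 to 5 of the workflow can be sketched for a single document as follows. This is an illustration under stated assumptions rather than the script's code: the name `count_labels` and the inline `SAMPLE` document are invented for the example, and the namespace and pattern are the ones given under Technical Details.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}
STRUCTURE_TYPE = re.compile(r"structure\s*\{[^}]*type:([^;}]+)")


def count_labels(xml_text):
    """Count structure types over all TextRegion elements in one document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for region in root.iterfind(".//pc:TextRegion", NS):
        match = STRUCTURE_TYPE.search(region.get("custom", ""))
        counts[match.group(1).strip() if match else "unlabeled"] += 1
    return counts


# Invented sample document, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1" custom="readingOrder {index:0;} structure {type:resolution;}"/>
    <TextRegion id="r2" custom="readingOrder {index:1;}"/>
    <TextRegion id="r3" custom="structure {type:resolution;}"/>
  </Page>
</PcGts>"""

totals = Counter()
totals.update(count_labels(SAMPLE))  # repeat per file, one file at a time
print(dict(totals))
```

Because only the running `Counter` is kept between files, memory use stays small regardless of collection size.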

Use Cases

Progress Tracking

Monitor annotation progress throughout a project by running the tool periodically. The percentage of labelled regions provides a clear metric for completion status.

Quality Control

Identify files with high numbers of unlabelled regions that may require attention. The detailed CSV output allows sorting by unlabelled count to prioritise annotation efforts.

Dataset Analysis

Understand the composition of your collection by examining the distribution of document types or structural elements. This information supports analysis planning and can reveal patterns in the historical material.

Comparative Analysis

Run the tool at different stages of the project to track changes in label distribution. Compare statistics across different PageXML collections to identify differences in composition or annotation approaches.
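One way to compare two runs is to load both summary CSV files and take the difference per label. The sketch below assumes the `label_type` and `count` columns described under CSV Output Files; the function names are hypothetical.

```python
import csv


def load_summary(path):
    """Read a summary CSV into a {label_type: count} dictionary."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {
            row["label_type"]: int(float(row["count"]))
            for row in csv.DictReader(handle)
        }


def count_changes(before, after):
    """Return the change in count per label (after minus before)."""
    labels = sorted(set(before) | set(after))
    return {label: after.get(label, 0) - before.get(label, 0) for label in labels}
```

A falling `unlabeled` value between two runs indicates annotation progress; labels that appear in only one run are treated as having a count of zero in the other.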

Verification

Use the example regions output to verify that labels are being applied correctly. Sample text for each label type enables quick quality checks without opening individual files.

Troubleshooting

No files found

Problem: "Total files to process: 0"

Solutions:

  • Verify folder structure matches [archief]/page/*.xml pattern
  • Check that files have .xml extension (not .XML or other variations)
  • Ensure you provided the root directory containing archief folders, not the page/ directory itself
  • Verify file permissions allow reading
  • Check that the directory path is correct (use absolute path if relative path fails)

No regions found

Problem: "Total regions found: 0"

Solutions:

  • Verify files conform to PAGE 2013-07-15 schema
  • Check XML namespace declarations
  • Ensure files contain TextRegion elements
  • Verify files are valid XML (not corrupted)
  • Test with a single file first to isolate the problem
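To check the namespace of a single file, the root element's tag can be inspected directly. This is a small diagnostic sketch, separate from the tool; `root_namespace` is a hypothetical helper and `EXPECTED_NS` is the namespace named under Technical Details.

```python
import xml.etree.ElementTree as ET

EXPECTED_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def root_namespace(path):
    """Return the namespace URI of the root element, or '' if it has none."""
    tag = ET.parse(path).getroot().tag
    # ElementTree writes namespaced tags as "{uri}LocalName".
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
```

If `root_namespace("0018/page/0001.xml") != EXPECTED_NS`, the file uses a different PAGE schema version and its TextRegion elements would not be matched.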

All regions show as unlabeled

Problem: Every region is classified as unlabeled despite having labels

Solutions:

  • Verify labels follow the format structure {type:label_name;}
  • Check for extra spaces or formatting differences in custom attributes
  • Ensure labels are on TextRegion elements, not other elements
  • Examine a sample file manually to confirm label format
  • Check that custom attribute exists and is populated

Processing errors

Problem: "Error processing [file]: [message]"

Solutions:

  • Check that the specific file is valid XML
  • Verify file encoding (should be UTF-8)
  • Ensure file is not corrupted
  • Check file permissions
  • Try processing a smaller subset to identify problematic files
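Problematic files can be located by attempting to parse each one and collecting the failures. The following sketch is independent of the tool and assumes the [archief]/page/*.xml layout; `find_unparseable` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_unparseable(root):
    """Return (path, error message) for every file that cannot be parsed."""
    problems = []
    for path in sorted(Path(root).glob("*/page/*.xml")):
        try:
            ET.parse(path)
        except (ET.ParseError, OSError) as exc:
            # ParseError covers malformed XML; OSError covers unreadable files.
            problems.append((path, str(exc)))
    return problems
```

An empty result means every file is well-formed XML and readable, so the cause lies elsewhere.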

CSV files not created

Problem: Terminal output appears but no CSV files generated

Solutions:

  • Verify pandas is installed (pip install pandas)
  • Check write permissions in the output directory
  • Ensure sufficient disk space
  • Look for error messages in terminal output
  • Try specifying absolute paths for output files

Unexpected label counts

Problem: Label counts differ from manual inspection

Solutions:

  • Check for labels with slight spelling variations
  • Look for extra whitespace in label names
  • Verify that all archief directories are being processed
  • Re-run with examples enabled to inspect sample regions
  • Export detailed CSV and sort by specific label to investigate

Performance

  • Processing speed: ~100-300 files per second (depending on file size and region count)
  • Memory usage: Minimal (processes one file at a time)
  • Scalability: Tested with collections of 1,000+ files

Typical processing times:

  • 100 files: <2 seconds
  • 1,000 files: ~5-15 seconds
  • 10,000 files: ~1-2 minutes

Performance depends on:

  • Number of regions per file
  • Complexity of XML structure
  • Disk I/O speed
  • System specifications

Limitations

  • Read-only analysis: Does not modify files or assist with labelling (by design)
  • Structure format dependency: Assumes structure {type:...;} format used by Transkribus
  • Namespace requirement: Expects PAGE 2013-07-15 namespace
  • Folder structure dependency: Requires [archief]/page/*.xml organisation
  • Case-sensitive matching: Label names are case-sensitive
  • No label validation: Does not verify that labels conform to any schema
  • Single structure type: Assumes each region has at most one structure type

Best Practices

  1. Run regularly: Execute the tool at regular intervals throughout the annotation project to track progress
  2. Export both CSVs: Keep both summary and detailed statistics for comprehensive documentation
  3. Review examples: Always examine example regions to verify label quality
  4. Sort detailed CSV: Use spreadsheet software to sort by unlabeled count and prioritise files
  5. Track over time: Maintain statistics from different project stages for progress documentation
  6. Combine with visualisation: Import CSV data into data analysis tools for visual representations
  7. Document findings: Record observations about label distribution in project documentation
  8. Use for planning: Let statistics inform decisions about resource allocation and annotation priorities

Technical Details

The tool processes PageXML files conforming to the PAGE namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15. It extracts structure type information from the custom attribute of TextRegion elements using regular expressions that match the pattern structure\s*\{[^}]*type:([^;}]+).

The tool expects a directory structure where PageXML files are located in subdirectories following the pattern <archief_dir>/page/*.xml. All file processing is performed in read-only mode with no modifications made to source files.

For example region collection, the tool extracts the first two lines of text from each region using the TextLine/TextEquiv/Unicode path in the XML structure.
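The sample-text extraction can be sketched as below. This is an illustration, not the script's code: the name `sample_text`, the choice to skip empty lines, and joining lines with a space are all assumptions, and the sample region is invented.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def sample_text(region, max_lines=2):
    """Join the text of the first `max_lines` non-empty lines of a region."""
    lines = []
    for unicode_el in region.iterfind("pc:TextLine/pc:TextEquiv/pc:Unicode", NS):
        text = (unicode_el.text or "").strip()
        if text:
            lines.append(text)
        if len(lines) == max_lines:
            break
    return " ".join(lines)


# Invented region with three lines, for illustration only.
REGION_XML = """<TextRegion xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" id="r1">
  <TextLine><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
  <TextLine><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
  <TextLine><TextEquiv><Unicode>third line</Unicode></TextEquiv></TextLine>
</TextRegion>"""

print(sample_text(ET.fromstring(REGION_XML)))
```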

Notes

  • The script expects PageXML files in the structure: [archief_nummer]/page/[pagina_nummer].xml
  • Only TextRegion elements are analysed
  • The tool is safe to run multiple times as it does not modify any files
  • Original file structure and content remain completely unchanged

Licence

This project is licensed under the MIT Licence.

Contact

Acknowledgements

This Region Calculator Tool was developed within the context of the HAICu project on the Resoluties van de Staten van Overijssel (Resolutions of the States of Overijssel), funded by the Dutch Research Council/Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Nationale Wetenschapsagenda [NWA.1518.22.105].

Development was assisted by Claude (Anthropic) for code implementation and documentation.

Version History

Version 1.0 (2025): Initial release with comprehensive region statistics, automatic label detection, and dual CSV export functionality.

About

A basic tool that calculates how many regions the PageXML files being checked contain, which labels have been applied, and how many regions have no label.
