A system for evaluating and visualizing the accuracy of HTR (Handwritten Text Recognition) JSON outputs against gold standard legal documents. It was developed for cases where legal documents can be grouped into types that share a common structure, which can be defined and extracted using vision-enhanced LLMs.
- Python 3.6+
- Node.js 14+ and npm 6+ (for dashboard visualization)
- Internet connection (for npm package installation)
First, make sure Python 3.6+ is installed:
python --version
# or
python3 --version
Clone this repository (or download and extract the ZIP file):
git clone https://github.com/username/htr-evaluation-system.git
cd htr-evaluation-system
Create and activate a virtual environment:
Linux/macOS:
python -m venv htr_llm_env
source htr_llm_env/bin/activate
Windows:
python -m venv htr_llm_env
htr_llm_env\Scripts\activate
# Core requirements
pip install -U pip
The core requirements (json, re, argparse, typing) are all part of the Python standard library, so they need no pip installation.
The dashboard requires Node.js and npm. Install them if you don't have them already:
Option 1: Download and install from nodejs.org
Option 2: Using package managers:
Linux (Ubuntu/Debian):
sudo apt update
sudo apt install nodejs npm
macOS (with Homebrew):
brew install node
Verify the installation:
node --version
npm --version
Run a simple test to ensure everything is set up correctly:
python -c "import json, re, argparse, os; print('All required packages are installed!')"
Run the evaluation on the example data:
python htr_evaluation.py exemple_data/mock-gold-json.json exemple_data/mock-llm-json.json
This will:
- Evaluate the predicted JSON against the gold standard
- Generate a dashboard in the default output directory (./output/)
- Launch the dashboard viewer automatically
python htr_evaluation.py exemple_data/mock-gold-json.json exemple_data/mock-llm-json.json --output_dir /custom/path
--output_dir: Custom directory to save evaluation results (default: ./output/)
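As an illustration of the command-line interface shown above, here is a minimal sketch of how such arguments could be parsed with argparse. The function name `build_parser` and the argument names `gold_json` and `predicted_json` are assumptions made for this example; only `--output_dir` and its default come from the commands above, and the actual script may be organized differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the commands above (not the script's actual code)."""
    parser = argparse.ArgumentParser(
        description="Evaluate a predicted HTR JSON file against a gold standard."
    )
    # Two positional paths, in the order used in the example commands.
    parser.add_argument("gold_json", help="Path to the gold standard JSON file")
    parser.add_argument("predicted_json", help="Path to the predicted JSON file")
    # Optional output directory; the default matches the documented ./output/.
    parser.add_argument(
        "--output_dir",
        default="./output/",
        help="Directory to save evaluation results",
    )
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(
    ["exemple_data/mock-gold-json.json", "exemple_data/mock-llm-json.json"]
)
print(args.output_dir)  # prints ./output/
```

Passing `--output_dir /custom/path` in the list would override the default in the same way as on the command line.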
The system:
- Flattens nested JSON structures for comparison
- Normalizes field names by replacing spaces with underscores
- Applies field-specific normalization:
  - Phone numbers: Removes non-digit characters
  - Numeric fields: Extracts digits
  - Dates: Extracts numeric components
- Computes string similarity using Levenshtein distance
- Weights fields based on importance (adjustable for each specific evaluation campaign)
- Categorizes errors into four types:
  - Critical Error (0%): Completely incorrect or missing values
  - Semantic Difference (50%): Similar meaning but significant differences
  - Minor Error (80%): Small differences
  - Perfect Match (100%): Exact match after normalization
The dashboard displays:
- Overall evaluation score
- Field coverage percentage
- Error distribution visualization
- Detailed table of all errors
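The comparison steps described earlier (flattening, field-name normalization, field-specific normalization, and Levenshtein-based similarity) can be sketched as follows. This is an illustrative sketch, not the repository's implementation: the helper names `flatten`, `normalize_value`, `levenshtein`, and `similarity`, the dotted key format, and the keyword rules used to detect phone, date, and numeric fields are all assumptions.

```python
import re


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {dotted_key: value}.

    Spaces in field names are replaced with underscores.
    """
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            name = str(key).replace(" ", "_")
            flat.update(flatten(value, f"{prefix}{name}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix[:-1]] = obj  # drop the trailing dot
    return flat


def normalize_value(field, value):
    """Apply field-specific normalization (keyword rules are hypothetical)."""
    text = "" if value is None else str(value)
    name = field.lower()
    if "phone" in name:
        return re.sub(r"\D", "", text)  # remove non-digit characters
    if "date" in name:
        return " ".join(re.findall(r"\d+", text))  # keep numeric components
    if "number" in name or "amount" in name:
        return "".join(re.findall(r"\d+", text))  # extract digits
    return text


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


gold = flatten({"client": {"signing date": "12/03/1921"}})
pred = flatten({"client": {"signing date": "12.03.1921"}})
for field, gold_value in gold.items():
    g = normalize_value(field, gold_value)
    p = normalize_value(field, pred.get(field))
    print(field, similarity(g, p))  # prints client.signing_date 1.0
```

Normalizing before comparing means that purely typographic differences, such as the date separators in the example, do not count as errors.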
After running the evaluation, the script will:
- Generate the dashboard files
- Install npm dependencies
- Start the Parcel development server
- Display access instructions
Open your browser and navigate to http://localhost:1234 to view the dashboard.
You can modify:
- Similarity thresholds in categorize_error()
- Field weights in get_field_weight()
- Field normalization in normalize_field()
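To give an idea of what adjusting thresholds and weights might involve, here is a hedged sketch. The function names match those listed above, but their signatures, the threshold table, the weight values, and the `weighted_score` helper are assumptions for illustration; in particular, the sketch reads the documented percentages (100%, 80%, 50%, 0%) as lower bounds on similarity, which may not be how the actual code uses them.

```python
# Hypothetical thresholds: (lower bound on similarity, category label),
# checked from the strictest to the loosest.
THRESHOLDS = [
    (1.0, "Perfect Match"),
    (0.8, "Minor Error"),
    (0.5, "Semantic Difference"),
]

# Hypothetical weights: fields whose name contains the keyword get that weight.
FIELD_WEIGHTS = {"name": 2.0, "date": 1.5}


def categorize_error(similarity, thresholds=THRESHOLDS):
    """Return the first category whose lower bound the similarity reaches."""
    for lower_bound, label in thresholds:
        if similarity >= lower_bound:
            return label
    return "Critical Error"


def get_field_weight(field, weights=FIELD_WEIGHTS, default=1.0):
    """Look up a field's weight by keyword, falling back to a default."""
    for keyword, weight in weights.items():
        if keyword in field.lower():
            return weight
    return default


def weighted_score(similarities):
    """Weighted average of per-field similarities ({field: similarity})."""
    total = sum(get_field_weight(field) for field in similarities)
    if total == 0:
        return 0.0
    weighted = sum(get_field_weight(f) * s for f, s in similarities.items())
    return weighted / total


print(categorize_error(0.85))  # prints Minor Error
print(categorize_error(0.20))  # prints Critical Error
```

Keeping thresholds and weights in plain data structures makes it easy to tune them for a given evaluation campaign without touching the comparison logic.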
Problem: ModuleNotFoundError: No module named 'xyz'
Solution: Install the missing package: pip install xyz
Problem: Dashboard doesn't launch
Solution:
- Make sure Node.js and npm are installed correctly
- Manually navigate to the dashboard directory and run:
cd output/[file_name]_dashboard
npm install
npm start
- Open your browser to http://localhost:1234
This code is distributed under the CC BY-NC-SA 4.0 license, developed by Mikhail Biriuchinskii, NLP Engineer at ObTIC, Sorbonne University.