This repository contains a research-oriented NLP pipeline for processing Armenian epigraphic corpora, extracting structured inscription data, and detecting inscription-type terminology using Named Entity Recognition (NER).
The project focuses on OCR-heavy, layout-complex historical sources and low-resource Armenian language settings.
- `Generated_data/`: Model outputs, predictions, and GT-labeled results
- `Inference/`: Contains `Inference.ipynb` for running trained models on unseen data
- `Notebooks/`: Experimental notebooks for training and evaluation; the main folder to look at, aggregating all the work done so far
- `dataset_formation/`: Dataset construction, IOB tagging, and train/dev/test splits
- `preprocessing/`: OCR cleaning, paragraph restructuring, and sentence splitting
- `images/`: Figures used for analysis and reporting
- `corpus_inscription_terms.xlsx`: Expert-curated list of Armenian inscription-type terminology
- `progressive_final.pt`: PyTorch weights (`.pt` file) for the fine-tuned NER model
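The IOB tagging step in `dataset_formation/` can be sketched as below. This is a minimal illustration, not the project's actual code: the label name `INSCR_TYPE`, the helper name, and the token-index span convention are all assumptions.

```python
# Minimal sketch: turning a tokenized sentence plus token-index term spans
# into IOB tags. INSCR_TYPE is an illustrative label, not the project's own.

def to_iob(tokens, term_spans, label="INSCR_TYPE"):
    """Tag tokens inside a term span as B-/I-label, everything else as O.

    term_spans are (start, end) token indices with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in term_spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Խաչքար", "կանգնեցված", "է", "1184", "թվականին"]
print(to_iob(tokens, [(0, 1)]))
# ['B-INSCR_TYPE', 'O', 'O', 'O', 'O']
```

Tagged sentences in this shape can then be split into train/dev/test partitions and fed to either the spaCy baseline or a transformer token-classification model.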
- OCR-aware text cleaning and restructuring
- Rule-based inscription extraction with metadata
- GPT-assisted ground-truth generation (JSON via Pydantic)
- NER training with:
- spaCy baseline
- Transformer models (mBERT, XLM-RoBERTa, Armenian-specific)
- Evaluation on seen vs unseen terminology
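The GPT-assisted ground-truth records are validated as JSON via Pydantic. The sketch below mirrors that idea using only the standard library (a dataclass stands in for the Pydantic model, to stay self-contained); every field name here is an illustrative assumption, not the project's actual schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Illustrative record shape; the project's real Pydantic schema may differ.
@dataclass
class InscriptionRecord:
    text: str                        # cleaned inscription text
    inscription_type: str            # term of the kind curated in corpus_inscription_terms.xlsx
    location: Optional[str] = None   # optional provenance metadata

# A model response would arrive as a JSON string like this:
raw = '{"text": "ՅԱՆՈՒՆ ԱՍՏՈՒԾՈՅ", "inscription_type": "խաչքար"}'
record = InscriptionRecord(**json.loads(raw))
print(asdict(record))
```

With Pydantic, the dataclass would become a `BaseModel` subclass, which additionally rejects responses with missing or mistyped fields instead of silently accepting them.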
- Preprocess OCR text (`preprocessing/`)
- Form NER-ready datasets (`dataset_formation/`)
- Train and evaluate models (`Notebooks/`)
- Run inference on new corpora (`Inference/`)
- Inspect outputs (`Generated_data/`)
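The seen-vs-unseen evaluation can be sketched as entity-level precision/recall/F1 computed separately over two partitions of the gold entities. The functions, the `(doc_id, term)` entity representation, and the partitioning criterion are assumptions for illustration only.

```python
# Entity-level P/R/F1, split by whether the gold term occurred in training.
# All names and the (doc_id, term) entity encoding are illustrative.

def prf(gold, pred):
    """Precision, recall, F1 over sets of entity tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def split_scores(gold, pred, train_terms):
    """Score entities whose term was seen in training separately from the rest."""
    out = {}
    for name, keep in (("seen", lambda t: t in train_terms),
                       ("unseen", lambda t: t not in train_terms)):
        g = {e for e in gold if keep(e[1])}
        p = {e for e in pred if keep(e[1])}
        out[name] = prf(g, p)
    return out

gold = {(0, "խաչքար"), (1, "արձանագրություն")}
pred = {(0, "խաչքար"), (1, "վիմագիր")}
print(split_scores(gold, pred, train_terms={"խաչքար"}))
```

Splitting the scores this way makes the gap between the two findings below measurable: recall on known terminology versus recall on terminology never seen during training.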
- Strong generalization to known inscription types in new contexts
- Limited discovery of entirely unseen terminology
- Designed as a research prototype, not a production system
- Digital epigraphy
- Cultural heritage digitization
- Low-resource NLP research
- Historical corpus structuring
See LICENSE for details.