This repository contains a research-oriented NLP pipeline for processing Armenian epigraphic corpora, extracting structured inscription data, and detecting inscription-type terminology using Named Entity Recognition (NER).
The project focuses on OCR-heavy, layout-complex historical sources and low-resource Armenian language settings.
- `Generated_data/`: Model outputs, predictions, and GT-labeled results
- `Inference/`: Contains `Inference.ipynb` for running trained models on unseen data
- `Notebooks/`: Experimental notebooks for training and evaluation; the main folder to look at, aggregating all the work done so far
- `dataset_formation/`: Dataset construction, IOB tagging, and train/dev/test splits
- `preprocessing/`: OCR cleaning, paragraph restructuring, and sentence splitting
- `images/`: Figures used for analysis and reporting
- `corpus_inscription_terms.xlsx`: Expert-curated list of Armenian inscription-type terminology
- `progressive_final.pt`: PyTorch weights (`.pt` file) for the fine-tuned NER model
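The IOB tagging step in `dataset_formation/` can be sketched as below. This is a minimal illustration, not the project's actual code: the label name `INSCR_TYPE`, the helper name, and the token-index span convention are all assumptions.

```python
# Minimal sketch: turning a tokenized sentence plus token-index term spans
# into IOB tags. INSCR_TYPE is an illustrative label, not the project's own.

def to_iob(tokens, term_spans, label="INSCR_TYPE"):
    """Tag tokens inside a term span as B-/I-label, everything else as O.

    term_spans are (start, end) token indices with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in term_spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Խաչքար", "կանգնեցված", "է", "1184", "թվականին"]
print(to_iob(tokens, [(0, 1)]))
# ['B-INSCR_TYPE', 'O', 'O', 'O', 'O']
```

Tagged sentences in this shape can then be split into train/dev/test partitions and fed to either the spaCy baseline or a transformer token-classification model.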
- OCR-aware text cleaning and restructuring
- Rule-based inscription extraction with metadata
- GPT-assisted ground-truth generation (JSON via Pydantic)
- NER training with:
- spaCy baseline
- Transformer models (mBERT, XLM-RoBERTa, Armenian-specific)
- Evaluation on seen vs unseen terminology
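The GPT-assisted ground-truth records are validated as JSON via Pydantic. The sketch below mirrors that idea using only the standard library (a dataclass stands in for the Pydantic model, to stay self-contained); every field name here is an illustrative assumption, not the project's actual schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Illustrative record shape; the project's real Pydantic schema may differ.
@dataclass
class InscriptionRecord:
    text: str                        # cleaned inscription text
    inscription_type: str            # term of the kind curated in corpus_inscription_terms.xlsx
    location: Optional[str] = None   # optional provenance metadata

# A model response would arrive as a JSON string like this:
raw = '{"text": "ՅԱՆՈՒՆ ԱՍՏՈՒԾՈՅ", "inscription_type": "խաչքար"}'
record = InscriptionRecord(**json.loads(raw))
print(asdict(record))
```

With Pydantic, the dataclass would become a `BaseModel` subclass, which additionally rejects responses with missing or mistyped fields instead of silently accepting them.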
- Preprocess OCR text (`preprocessing/`)
- Form NER-ready datasets (`dataset_formation/`)
- Train and evaluate models (`Notebooks/`)
- Run inference on new corpora (`Inference/`)
- Inspect outputs (`Generated_data/`)
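The seen-vs-unseen evaluation can be sketched as entity-level precision/recall/F1 computed separately over two partitions of the gold entities. The functions, the `(doc_id, term)` entity representation, and the partitioning criterion are assumptions for illustration only.

```python
# Entity-level P/R/F1, split by whether the gold term occurred in training.
# All names and the (doc_id, term) entity encoding are illustrative.

def prf(gold, pred):
    """Precision, recall, F1 over sets of entity tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def split_scores(gold, pred, train_terms):
    """Score entities whose term was seen in training separately from the rest."""
    out = {}
    for name, keep in (("seen", lambda t: t in train_terms),
                       ("unseen", lambda t: t not in train_terms)):
        g = {e for e in gold if keep(e[1])}
        p = {e for e in pred if keep(e[1])}
        out[name] = prf(g, p)
    return out

gold = {(0, "խաչքար"), (1, "արձանագրություն")}
pred = {(0, "խաչքար"), (1, "վիմագիր")}
print(split_scores(gold, pred, train_terms={"խաչքար"}))
```

Splitting the scores this way makes the gap between the two findings below measurable: recall on known terminology versus recall on terminology never seen during training.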
- Strong generalization to known inscription types in new contexts
- Limited discovery of entirely unseen terminology
- Designed as a research prototype, not a production system
- Digital epigraphy
- Cultural heritage digitization
- Low-resource NLP research
- Historical corpus structuring
See LICENSE for details.